Elevated Error Rate

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent a similar interruption in the future.

Impact

Between 16:47 UTC and 17:18 UTC on July 26, 2023, Close users may have experienced degraded performance and an elevated error rate when using the Close application or API.

Root Cause and Resolution

This incident began when our internal metrics service became degraded at 16:47 UTC on July 26, 2023. Our compute platform relies on this service to automatically determine how many resources are needed to run the Close application. With the metrics service degraded, the compute platform could no longer determine the correct amount. Over the next several minutes it deprovisioned resources, which left the system overloaded and performing poorly. When the internal metrics service was restored at 17:18 UTC, our compute platform provisioned the correct amount of resources and performance returned to normal.

Our compute platform has a built-in fail-safe intended to prevent exactly this situation, but it did not function. We are investigating why. We are also deploying improvements to our internal metrics service to make it more stable and to alert us sooner if it becomes unstable.
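
To illustrate the kind of fail-safe described above, here is a minimal sketch of a scaling decision that treats missing metrics differently from low load. It is not our compute platform's actual code: the function name, parameters, and default thresholds are all hypothetical and chosen only for illustration.

```python
import math
from typing import Optional


def desired_replicas(
    current: int,
    load_per_replica: Optional[float],
    target_load: float = 0.5,   # hypothetical target utilization per replica
    min_replicas: int = 2,      # hypothetical floor
    max_replicas: int = 50,     # hypothetical ceiling
) -> int:
    """Return how many replicas to run (illustrative only).

    load_per_replica is None when the metrics service is unavailable.
    """
    if load_per_replica is None:
        # Fail-safe: missing data is not the same as zero load,
        # so hold current capacity instead of scaling down.
        return max(current, min_replicas)

    # Proportional scaling: size the fleet so that the average
    # load per replica lands near the target.
    wanted = math.ceil(current * load_per_replica / target_load)
    return max(min_replicas, min(max_replicas, wanted))
```

In this sketch, an outage of the metrics source leaves capacity unchanged. A version that substituted zero for the missing value would instead shrink to the minimum, which resembles the behavior seen during this incident.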

Timeline

16:47 UTC - Our internal metrics service becomes degraded.
17:00 UTC - Without metrics available, our system assumes it is not under load and deprovisions resources. This causes performance to degrade.
17:09 UTC - The Engineering Team becomes aware of an issue with application performance.
17:18 UTC - Our internal metrics service is restored. This causes our system to provision the appropriate amount of resources. Application performance recovers.

Posted Jul 28, 2023 - 08:46 PDT

Resolved

Between 16:47 UTC and 17:18 UTC on July 26, 2023, Close users may have experienced degraded performance and an elevated error rate when using the Close application or API.

Posted Jul 26, 2023 - 09:30 PDT