Close sincerely apologizes for this interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent similar interruptions in the future.
On October 14, 2025, a small number of Close API requests failed over a 67-minute period from 13:20Z to 14:27Z. A larger number of failures followed over an 88-minute period from 14:27Z to 15:55Z. The failures impacted our API as well as various components of the Close App.
Unfortunately, a cascading set of events contributed to this failure. Three pull requests (PRs) were in flight during this incident. The first and third contained defects that impacted our API. The second had no defects, but it added to the complexity of troubleshooting the core issues caused by the other two.
The first PR changed our billing logic in a way that inadvertently made more requests than expected to an external provider. The extra load on the external provider caused a small number of our API requests to time out.
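As a simplified illustration of this failure mode (the function names, timings, and line-item loop below are hypothetical, not our actual billing code or provider integration), a change that makes additional external calls per request both adds load on the provider and adds latency to the request, and the accumulated latency can push the request past its timeout:

```python
import time

# Hypothetical names and timings for illustration only; this is not our
# actual billing code or external provider integration.
EXTERNAL_CALL_LATENCY = 0.3   # assumed seconds per call to the provider
REQUEST_TIMEOUT = 1.0         # assumed overall budget for the API request


def fetch_plan_details(account_id: str) -> dict:
    """Stand-in for a synchronous call to the external billing provider."""
    time.sleep(EXTERNAL_CALL_LATENCY)
    return {"account_id": account_id, "plan": "business"}


def handle_billing_request(account_id: str, line_items: list[str]) -> str:
    """Illustrative defect: the lookup moved inside a per-item loop, so one
    API request now makes len(line_items) external calls instead of one."""
    start = time.monotonic()
    for _item in line_items:
        fetch_plan_details(account_id)  # repeated external call per item
    elapsed = time.monotonic() - start
    return "ok" if elapsed < REQUEST_TIMEOUT else "timed out"


if __name__ == "__main__":
    # Four line items at ~0.3s per external call exceeds the 1s budget.
    print(handle_billing_request("acct_123", ["seat", "seat", "addon", "tax"]))
```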
The larger impact was caused by the third PR, which changed how some parts of our web requests are handled. We recently updated our dependencies and are in the process of evolving various parts of our codebase to better align with modern async processing techniques. This update appeared fine during normal testing but failed under production load: the changes drove very high utilization of our web workers’ thread pools, leaving some workers saturated and unable to process any new requests. As a result, a significant number of API requests timed out.
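The sketch below illustrates the general failure mode (the pool size, timings, and handler are hypothetical and greatly simplified, not our production setup): when slow blocking work ties up every thread in the bounded pool that an async worker relies on, requests that arrive afterward wait in the queue and exceed their timeout, even though each individual call would otherwise finish within budget:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative numbers only; the real worker counts, pool sizes, and
# timeouts in our infrastructure are not shown here.
POOL_SIZE = 4
CALL_DURATION = 1.5      # each blocking call is individually within budget
REQUEST_TIMEOUT = 2.0

executor = ThreadPoolExecutor(max_workers=POOL_SIZE)


def blocking_call() -> str:
    """A synchronous call that holds a pool thread for its full duration."""
    time.sleep(CALL_DURATION)
    return "ok"


async def handle_request(i: int) -> str:
    loop = asyncio.get_running_loop()
    try:
        # If every thread in the pool is already occupied, this request
        # waits in the executor queue and blows past its timeout.
        result = await asyncio.wait_for(
            loop.run_in_executor(executor, blocking_call), REQUEST_TIMEOUT
        )
        return f"request {i}: {result}"
    except asyncio.TimeoutError:
        return f"request {i}: timed out"


async def main() -> None:
    # 12 concurrent requests against a 4-thread pool: the first batch
    # succeeds, the queued ones time out.
    results = await asyncio.gather(*(handle_request(i) for i in range(12)))
    print("\n".join(results))


if __name__ == "__main__":
    asyncio.run(main())
```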
It took some time to identify which update was causing the API timeouts and to revert its changes. A flaky test further delayed our deployment process. Recovery was also prolonged because the failure affected our infrastructure's health check system: the same endpoints that were timing out were needed to validate and control our deployment changes.
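For illustration only (the endpoint, timeout, and tooling below are hypothetical, not our actual infrastructure), a readiness check that exercises the same request path as normal traffic will fail whenever that path is timing out, which makes it harder for the deployment system to confirm that a reverted build is healthy and to roll it out:

```python
import urllib.error
import urllib.request

# Hypothetical endpoint and timeout for illustration; the real health
# check configuration at Close is not shown here.
HEALTH_CHECK_URL = "http://localhost:8000/healthz"
CHECK_TIMEOUT = 3.0


def instance_is_healthy() -> bool:
    """A readiness check that goes through the same request path as normal
    API traffic. If that path is timing out, this check fails too, so the
    deployment system cannot cleanly validate and promote the fix."""
    try:
        with urllib.request.urlopen(HEALTH_CHECK_URL, timeout=CHECK_TIMEOUT) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False
```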
We are still reviewing all details related to this incident and are taking several actions to improve our response to complex incidents like this one. We are working to ensure the right resources are available to quickly diagnose issues, strengthening our CI/CD processes, and enhancing our internal QA procedures to better catch these complex defects during testing.