Partial outage - application not loading

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

On October 14, 2025, a small number of Close API requests failed for 67 minutes, from 13:20Z to 14:27Z. A larger number of requests failed for 88 minutes, from 14:27Z to 15:55Z. The failures impacted our API as well as various components of the Close App.

Root Cause and Resolution

Unfortunately, a cascading set of events contributed to this failure. Three Pull Requests (PRs) were in flight during this incident. The first and third had defects that impacted our API. The second had no defects of its own, but it added to the complexity of troubleshooting the issues caused by the other two.

The first PR made changes to our billing logic that inadvertently started making more requests than expected to an external provider. This extra load on the external provider caused a small number of our API requests to time out.
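
For illustration only, the general shape of this kind of regression looks like the hypothetical sketch below (the provider URL, function names, and per-item pattern are examples, not our actual billing code): a change that issues one external request per record instead of a single batched request multiplies load on the provider and slows every API request waiting on those calls.

    import requests

    PROVIDER_URL = "https://billing-provider.example.com"  # hypothetical external provider

    def sync_invoices_per_item(invoices):
        # Regression-style pattern: one external request per invoice. Under
        # production volume this multiplies calls to the provider and delays
        # the API requests waiting on those calls.
        for invoice in invoices:
            requests.post(f"{PROVIDER_URL}/invoices/{invoice['id']}/sync", timeout=5)

    def sync_invoices_batched(invoices):
        # Intended pattern: a single batched request per sync.
        ids = [invoice["id"] for invoice in invoices]
        requests.post(f"{PROVIDER_URL}/invoices/sync", json={"ids": ids}, timeout=5)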

The larger impact was caused by the third PR, which made changes to how some parts of our web requests are handled. We had recently updated our dependencies and are in the process of evolving various parts of our codebase to be better aligned with modern async processing techniques. This update appeared fine during normal testing but failed under our production load. The changes caused very high utilization of our web workers' thread pools, which left some of them saturated and unable to process any new requests. This caused a significant number of API requests to time out.
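
For readers interested in the failure mode, the following is a simplified, hypothetical sketch (not our code) of how offloading slow blocking work to a small, shared thread pool can saturate it under production concurrency, so that new requests queue up and time out even though the code passes lighter testing:

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical setup: a small, bounded thread pool shared by all web requests.
    executor = ThreadPoolExecutor(max_workers=8)

    def blocking_call():
        # Stands in for a synchronous call (database, external API, etc.) that is
        # quick in testing but slow under production load.
        time.sleep(1)

    async def handle_request(loop):
        # Each request offloads its blocking work to the shared pool. Once every
        # worker is busy, new submissions queue up behind them and the requests
        # waiting on the pool start to time out.
        await loop.run_in_executor(executor, blocking_call)

    async def main():
        loop = asyncio.get_running_loop()
        # A handful of concurrent requests finishes quickly; at production
        # concurrency the queue grows faster than the pool can drain it.
        await asyncio.gather(*(handle_request(loop) for _ in range(100)))

    asyncio.run(main())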

It took some time to identify which update was causing the API timeouts and revert its changes. A flaky test further delayed our deployment process. The recovery was prolonged because the failure affected our infrastructure's health check system. The same endpoints that were timing out were needed to validate and control our deployment changes.
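
To illustrate that coupling (a simplified, hypothetical example rather than our actual configuration), a deployment gate that waits on a status endpoint served by the same saturated workers as ordinary traffic cannot distinguish a healthy rollout from a failing one, so even a correct revert is slow to promote:

    import time
    import requests

    STATUS_URL = "https://app.example.com/status"  # hypothetical health-check endpoint

    def wait_for_healthy_rollout(timeout_s=300):
        # Hypothetical deploy gate: a rollout is only promoted once the status
        # endpoint responds successfully. When that endpoint is served by the
        # same saturated workers as ordinary traffic, even a correct revert
        # cannot pass the check promptly, which prolongs recovery.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                if requests.get(STATUS_URL, timeout=5).ok:
                    return True
            except requests.RequestException:
                pass
            time.sleep(10)
        return False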

We are still reviewing all details related to this incident and are taking several actions to improve our response to complex incidents such as these. We are working to ensure the right resources are available to quickly diagnose issues, strengthening our CI/CD processes, and enhancing our internal QA procedures to better catch these complex defects during testing.

Timeline

  • 13:20 UTC - Billing PR, merged earlier in the morning, starts causing a small number of API failures
  • 13:25 UTC - Engineering team alerted of API failures, response starts
  • 13:37 UTC - Async PR is merged, fails to deploy due to flaky e2e test
  • 13:45 UTC - Engineering team alerted of failing CI/CD deployments
  • 13:51 UTC - Revert billing PR, deploy fails due to flaky e2e test
  • 14:04 UTC - Unrelated PR merged which includes the billing change revert
  • 14:20 UTC - Both PRs’ changes deployed
  • 14:27 UTC - Large spike of API failures, followed by pods failing to start due to the failing status endpoint
  • 14:33 UTC - Start of main customer impact of the Async PR
  • 15:18 UTC - Revert Async PR
  • 15:55 UTC - Customer impact ends
Posted Oct 16, 2025 - 11:46 PDT

Resolved

This incident has been resolved.
Posted Oct 14, 2025 - 09:00 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 14, 2025 - 08:53 PDT

Update

We are currently applying fixes to restore operations and monitoring their impact.
Posted Oct 14, 2025 - 08:48 PDT

Investigating

Some users are experiencing downtime. Our engineers are currently investigating issues loading the Close application and pulling data from the Close API.
Posted Oct 14, 2025 - 08:26 PDT
This incident affected: Application UI, API and Search (Querying).