High database load causing app performance issues

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

Between 13:40 and 16:55 UTC on Wednesday, September 25, 2024, the Close App and API experienced degraded performance. Some users may have noticed the App UI and API responding sluggishly.

Concurrently, between 14:33 and 19:30 UTC, background task processing inside the Close app was disrupted. During this time, Workflows and Email sending may not have occurred on schedule.

Root Cause and Resolution

At 13:14 UTC on Wednesday, September 25, 2024, Close Engineering deployed an updated version of our browser application. A bug in this new version caused a large increase in high-impact requests sent to our back end system. By 14:00 UTC, the number of additional requests had grown to the point that our back end database was overloaded, causing poor application performance.

Close Engineering reverted the change to our browser application by 14:51 UTC. While waiting for all of our clients’ browsers to update to the fixed version of our app, Close Engineering took several steps between 14:30 UTC and 17:00 UTC to reduce the load on the overloaded database.

Disruption during this time also degraded our ability to collect runtime metrics on our background task processing system. Without those metrics, the background task processing system concluded that it was not under load and scaled down. Close Engineering fixed the issue with metrics gathering by 19:30 UTC, at which point background task processing returned to normal operation.
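
The failure mode described above, where an autoscaler reads missing metrics as "no load", can be illustrated with a minimal sketch. This is not Close's actual scaling logic: the function name `desired_workers`, its parameters, and the default thresholds are all hypothetical and chosen only for illustration. The idea is to treat an unavailable metric as "unknown" and hold the current size rather than scale down.

```python
from typing import Optional


def desired_workers(
    current_workers: int,
    queue_depth: Optional[int],
    tasks_per_worker: int = 100,  # hypothetical capacity of one worker
    min_workers: int = 2,         # hypothetical lower bound
    max_workers: int = 50,        # hypothetical upper bound
) -> int:
    """Return the worker count to scale to.

    queue_depth is None when the metric could not be collected.
    """
    if queue_depth is None:
        # Missing data is not the same as zero load: keep the current size
        # instead of scaling down on a reading we do not have.
        return current_workers

    # Ceiling division: workers needed to cover the queued tasks.
    needed = -(-queue_depth // tasks_per_worker)

    # Clamp the result to the configured bounds.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    print(desired_workers(20, None))  # metrics unavailable: hold at current size
    print(desired_workers(20, 0))     # genuinely idle: scale down to the minimum
    print(desired_workers(20, 950))   # real load: size to the queue
```

The design choice here is that only a real measurement of zero can trigger a scale-down, while a collection failure leaves capacity unchanged until metrics return.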

To prevent another incident like this from occurring, Close Engineering will audit our growing data stores for opportunities to better distribute load and prevent the database from becoming overloaded. We will also implement a training regimen for our incident responders to ensure more timely and consistent communication during future incidents.

Timeline

13:14 UTC - Close Engineering deploys an updated version of our browser application
13:59 UTC - Close Engineering is alerted to degraded performance of our system
14:30 UTC - Close Engineering identifies our back end database as overloaded
14:30 UTC - Close Engineering begins load shedding operations to preserve system performance
14:33 UTC - Disruption to background task processing begins
14:51 UTC - Close Engineering reverts the change to our browser application
17:31 UTC - Close Engineering begins to undo load shedding to restore normal operation
18:20 UTC - Close Engineering begins manual operations to restore background task processing
18:50 UTC - The back end database becomes overloaded once more
19:30 UTC - Close Engineering scales up the back end database
19:30 UTC - Background task processing returns to normal. All Close systems are functioning normally

Posted Sep 26, 2024 - 14:00 PDT

Resolved

This incident has been resolved.

Posted Sep 25, 2024 - 12:52 PDT

Update

The Close app is functioning normally. Some background task processing may be delayed. We are continuing to monitor for further issues.

Posted Sep 25, 2024 - 12:12 PDT

Update

We are continuing to monitor for any further issues.

Posted Sep 25, 2024 - 12:11 PDT

Update

The Close app is functioning normally. Some background task processing may be delayed. We are continuing to monitor for further issues.

Posted Sep 25, 2024 - 12:05 PDT

Update

We are continuing to monitor for any further issues.

Posted Sep 25, 2024 - 11:53 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 25, 2024 - 11:31 PDT

Update

The app is recovering and we are continuing to monitor the situation. Some background tasks could still be delayed.

Posted Sep 25, 2024 - 11:28 PDT

Investigating

We are currently investigating this issue.

Posted Sep 25, 2024 - 07:40 PDT

This incident affected: Application UI, API, Search (Indexing), and Email (Sending).