Application loading issues

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

All Close systems were disrupted for 11 minutes between 19:56 and 20:07 UTC on November 23rd, 2020.

Root Cause & Resolution

The Close application exhausted all available connections to a backend database, preventing the application from functioning normally. This was due to a code change that resulted in our indexing processes not releasing connections to the database.

The issue was caused by session management differences between some of our older and newer code. A change was deployed at 12:30 UTC that caused connections to be left open in certain situations where call paths crossed into older session management code. At 19:25 UTC, a combination of increased load and an accumulation of stale connections caused the database to run out of connections and trigger alerts. Temporary actions were taken to free connections until the problematic code was reverted.
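As an illustration of this failure mode, the following is a minimal sketch, not Close's actual code. It assumes a toy fixed-size pool; the names ConnectionPool, session, leaky_index_job and safe_index_job are hypothetical. It shows how a call path that acquires a connection without a guaranteed release gradually drains the pool, while a scoped session always returns the connection.

```python
import contextlib
import queue


class PoolExhausted(Exception):
    """Raised when no connection is available."""


class ConnectionPool:
    """Toy fixed-size pool; a stand-in for a real database connection pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for conn_id in range(size):
            self._free.put(conn_id)

    def acquire(self):
        try:
            return self._free.get_nowait()
        except queue.Empty:
            raise PoolExhausted("no free connections") from None

    def release(self, conn):
        self._free.put(conn)

    def available(self):
        return self._free.qsize()


@contextlib.contextmanager
def session(pool):
    """Check out a connection and return it even if the caller raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


def leaky_index_job(pool):
    # Hypothetical path that acquires a connection and never releases it.
    pool.acquire()


def safe_index_job(pool):
    # Scoped path: the connection goes back to the pool when the block exits.
    with session(pool):
        pass  # indexing work would happen here
```

With a pool of size 2, any number of safe_index_job calls leaves both connections available, whereas two leaky_index_job calls leave none and the next acquire raises PoolExhausted.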

Timeline

Nov 23 09:30 UTC - Deployed change that left connections open
Nov 23 19:25 UTC - Engineering team begins investigating system alerts
Nov 23 19:56 UTC - Close begins receiving reports of application instability
Nov 23 19:59 UTC - Engineering team begins taking temporary actions to free DB resources
Nov 23 20:07 UTC - Close application behavior returns to normal
Nov 23 20:30 UTC - Revert of problematic code deployed

Next Steps

Implement a warning system that detects when new and old code interact incorrectly (already done).
Continue refactoring our older code to use the new session management logic.
Implement alert rules that give early warning of this particular issue.
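
An early-warning rule of the kind described in the last item could look roughly like the sketch below. It is an assumption-laden illustration rather than the rule Close implemented: the function name connection_alert_level and the 70% and 90% thresholds are hypothetical, and a real rule would read the connection counts from the database or a monitoring system.

```python
def connection_alert_level(in_use, max_connections,
                           warn_ratio=0.7, critical_ratio=0.9):
    """Classify connection usage so an alert fires before the limit is hit.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = in_use / max_connections
    if usage >= critical_ratio:
        return "critical"
    if usage >= warn_ratio:
        return "warning"
    return "ok"
```

Under these assumed thresholds, 75 of 100 connections in use yields "warning", which would have left time to act before the pool was fully exhausted.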

Posted Nov 25, 2020 - 10:43 PST

Resolved

We are currently investigating this issue.

Posted Nov 23, 2020 - 09:00 PST