Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.
All Close systems were disrupted for 11 minutes between 19:56 and 20:07 UTC on November 23rd, 2020.
The Close application exhausted all available connections to a backend database, preventing the application from functioning normally. This was due to a code change that resulted in our indexing processes not releasing connections to the database.
The issue was caused by session management differences between some of our older and newer code. A change was deployed at 12:30 UTC that caused connections to be left open in certain situations where call paths crossed into older session management code. At 19:25 UTC, a combination of increased load and an accumulation of stale connections caused the database to run out of connections and trigger alerts. Temporary actions were taken to free connections until the problematic code was reverted.
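The failure mode described above can be illustrated with a minimal sketch. It is not Close's actual code: `ConnectionPool`, `session`, `leaky_index`, and `safe_index` are hypothetical names, and the pool is a toy stand-in for a real database connection pool. The sketch assumes the leak came from a code path that acquired a connection without a guaranteed release, and contrasts it with a scoped session that always returns the connection.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when every connection in the pool is already checked out."""


class ConnectionPool:
    """Toy bounded pool (hypothetical stand-in for a real database pool)."""

    def __init__(self, max_connections):
        self._lock = threading.Lock()
        self._max = max_connections
        self._in_use = 0

    def acquire(self):
        with self._lock:
            if self._in_use >= self._max:
                raise PoolExhausted("no free connections")
            self._in_use += 1
            return object()  # placeholder for a real connection handle

    def release(self, conn):
        with self._lock:
            self._in_use -= 1

    @property
    def in_use(self):
        with self._lock:
            return self._in_use


@contextmanager
def session(pool):
    """Scoped session: the connection is released even if the body raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


def leaky_index(pool):
    """Assumed shape of the bug: a connection is acquired and never released."""
    conn = pool.acquire()
    # ... indexing work would happen here; release() is never called ...


def safe_index(pool):
    """Same work, but the connection is returned when the block exits."""
    with session(pool) as conn:
        pass  # ... indexing work would happen here ...
```

With a pool of two connections, any number of `safe_index` calls leaves the pool empty, while two `leaky_index` calls hold both connections indefinitely and the next `acquire` raises `PoolExhausted`. That is the same pattern as stale connections accumulating until load pushes the database past its limit.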
Nov 23 09:30 UTC - Deployed change that left connections open
Nov 23 19:25 UTC - Engineering team begins investigating system alerts
Nov 23 19:56 UTC - Close begins receiving reports of application instability
Nov 23 19:59 UTC - Engineering team begins taking temporary actions to free DB resources
Nov 23 20:07 UTC - Close application behavior returns to normal
Nov 23 20:30 UTC - Revert of problematic code deployed