Application loading issues

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

All Close systems were disrupted for approximately 30 minutes between 2021-10-30 23:40 and 2021-10-31 00:11 UTC.

Root cause

Disk volume performance for two shards of our primary database cluster became severely degraded due to issues in our hosting provider's network. Unfortunately, the cluster did not fail over to healthy instances, because the failure caused significantly degraded performance rather than a complete failure of the instances.

Timeline

Oct 30 23:40 UTC - Instances for two shards of the primary database experience severely degraded disk volume performance
Oct 30 23:51 UTC - Engineering team begins investigating system alerts
Oct 31 00:03 UTC - Impacted instances are stopped to trigger election of new primaries for affected shards
Oct 31 00:07 UTC - App-facing services that did not recover automatically after the database cluster became healthy are restarted
Oct 31 00:24 UTC - Restart of all backend services is completed to ensure they all have healthy database connections

Next Steps

Investigate options to detect when parts of the database cluster are in a degraded state and trigger failing over to healthy instances if they are available.
Work with hosting provider to better understand the failure to determine if there are better ways to detect this type of issue.
Fix the issue that required some services to be restarted to reconnect to the healthy cluster.
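
As an illustration of the first item above, a degraded instance can be told apart from a failed one by tracking latency rather than liveness alone. The following is a minimal sketch under stated assumptions, not a description of Close's implementation: the class name, the function name, the thresholds, and the idea of sampling disk latency in milliseconds are all hypothetical.

```python
from collections import deque
from statistics import median


class DegradationDetector:
    """Flags an instance as degraded when latency stays high.

    Hypothetical example: names and default thresholds are assumptions
    made for illustration, not values taken from the incident.
    """

    def __init__(self, threshold_ms=50.0, window=5, min_breaches=3):
        self.threshold_ms = threshold_ms      # median latency considered unhealthy
        self.samples = deque(maxlen=window)   # rolling window of recent samples
        self.min_breaches = min_breaches      # consecutive bad windows required
        self.breaches = 0

    def record(self, latency_ms):
        """Add one latency sample; return True if the instance looks degraded."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        if median(self.samples) > self.threshold_ms:
            self.breaches += 1
        else:
            self.breaches = 0  # one healthy window resets the streak
        return self.breaches >= self.min_breaches


def should_fail_over(is_degraded, healthy_replica_available):
    """Fail over only when a healthy target exists to take over."""
    return is_degraded and healthy_replica_available
```

Using the median over a rolling window and requiring several consecutive breaches keeps a single slow sample from triggering a failover, while still catching the sustained slowdown that a simple up/down check would miss.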

Posted Nov 02, 2021 - 11:04 PDT

Resolved

This incident has been resolved.

Posted Oct 30, 2021 - 17:39 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 30, 2021 - 17:15 PDT

Investigating

We are currently investigating this issue.

Posted Oct 30, 2021 - 17:01 PDT

This incident affected: Application UI and API.