Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.
All Close systems were disrupted for approximately 30 minutes between 2021-10-30 23:40 and 2021-10-31 00:11 UTC.
The disk volume performance for two shards of our primary database cluster became severely degraded due to issues in our hosting provider's network. Unfortunately, the cluster did not fail over to healthy instances because the issue caused only significantly degraded performance, not a complete failure of the instances.
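As an illustration only (this is not Close's actual monitoring or failover tooling), the gap described above can be sketched as a health check that distinguishes "degraded" from "failed". The function names, the `None` convention for a timed-out probe, and the 100 ms threshold are all hypothetical assumptions chosen for the example.

```python
from statistics import median

def classify_instance(latencies_ms, degraded_ms=100.0, min_samples=5):
    """Classify an instance from recent disk latency probes.

    latencies_ms: list of probe results in milliseconds; None marks a
    probe that timed out (hypothetical convention for this sketch).
    """
    if not latencies_ms:
        return "unknown"  # no probes yet, so no decision can be made
    responded = [x for x in latencies_ms if x is not None]
    if not responded:
        return "failed"  # every probe timed out: a complete failure
    # The instance still answers, but too slowly to be useful.
    if len(responded) >= min_samples and median(responded) > degraded_ms:
        return "degraded"
    return "healthy"

def should_step_down(latencies_ms):
    # Treat sustained degradation like a failure, so that a slow primary
    # is replaced instead of being left in place.
    return classify_instance(latencies_ms) in ("failed", "degraded")
```

A check that reacts only to the "failed" state would have left the two affected primaries in place during this incident, because they continued to respond, just very slowly.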
Oct 30 23:40 UTC - Instances for two shards of the primary database cluster experience severely degraded disk volume performance
Oct 30 23:51 UTC - Engineering team begins investigating system alerts
Oct 31 00:03 UTC - Impacted instances are stopped to trigger election of new primaries for affected shards
Oct 31 00:07 UTC - Restarted app-facing services that didn’t automatically recover after the database cluster became healthy
Oct 31 00:24 UTC - Finished restarting all backend services to ensure they all had healthy database connections