Application not loading

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

The Close App and API were unavailable to all customers for 43 minutes, from 17:00 to 17:43 UTC, on Wednesday, July 22, 2020, due to the failure of a backend database. Severe degradation began at 16:53 UTC. The database was recovered and all services were restored by 17:43 UTC.

Root Cause & Resolution

One of our backend PostgreSQL databases became starved of available memory. This prevented the database from accepting new work, resulting in an interruption of service to the Close system. The issue was resolved by increasing the amount of memory available to the database.

Timeline (all times UTC)

15:57: First alarms begin to fire about delays in Email Sequences and PostgreSQL CPU usage

16:00: Close Infrastructure begins investigation

16:30: Memory pressure identified as the cause of alarms on the affected database

16:53: The affected database fails, causing the Close app and API to become unavailable

16:53: Decision made to scale the database to an instance class with more memory

17:04: The maintenance page is posted in preparation for the database scaling operation

17:04: Scaling operation begins on the affected database

17:32: Scaling operation completes

17:43: Application services are restored
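
The intervals between the timeline entries above can be worked out with a few lines of Python. This is only an illustrative sketch: the dictionary keys and the `minutes_between` helper are hypothetical names, and the timestamps are copied from the timeline, assuming all of them fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline; the key names are illustrative.
timeline = {
    "first_alarm": "15:57",
    "database_failed": "16:53",
    "scaling_started": "17:04",
    "scaling_completed": "17:32",
    "services_restored": "17:43",
}

# Time between the first alarm and the database failure.
warning_window = minutes_between(timeline["first_alarm"], timeline["database_failed"])
# Duration of the database scaling operation.
scaling_duration = minutes_between(timeline["scaling_started"], timeline["scaling_completed"])
# Time from the end of scaling until application services were restored.
redeploy_duration = minutes_between(timeline["scaling_completed"], timeline["services_restored"])

print(warning_window, scaling_duration, redeploy_duration)
```

Read this way, the scaling operation and the redeploy that followed it account for most of the time between the maintenance page going up and services being restored.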

Next Steps

To ensure that events such as this do not occur in the future, we are taking the following actions:

Enhance our monitoring so memory starvation can be proactively avoided
Enhance our incident response procedures to lessen the effect of database performance issues
Enhance our deployment automation so our systems recover more quickly from scaling operations
Enhance our database systems to be more robust during scaling operations
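
As a rough illustration of the first action, proactive memory monitoring could take the form of a check that alerts only when available memory stays low across several consecutive samples. The sketch below is not Close's actual monitoring: `MemorySample`, `first_sustained_breach`, the 15% threshold, the three-sample rule, and the readings are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MemorySample:
    minute: int           # minutes since monitoring started (hypothetical unit)
    free_fraction: float  # share of database memory still available, 0.0 to 1.0


def first_sustained_breach(
    samples: Iterable[MemorySample],
    threshold: float = 0.15,  # assumed warning level, not a value from the report
    sustain: int = 3,         # consecutive low samples required before alerting
) -> Optional[int]:
    """Return the minute at which an alert should fire, or None if it never does."""
    streak = 0
    for sample in samples:
        if sample.free_fraction < threshold:
            streak += 1
            if streak >= sustain:
                return sample.minute
        else:
            streak = 0  # a healthy sample resets the streak
    return None


# Made-up readings for illustration only.
readings = [
    MemorySample(0, 0.40),
    MemorySample(5, 0.22),
    MemorySample(10, 0.14),
    MemorySample(15, 0.12),
    MemorySample(20, 0.09),
    MemorySample(25, 0.05),
]
alert_minute = first_sustained_breach(readings)
print(alert_minute)
```

Requiring several consecutive low samples trades a little detection latency for fewer false alarms from short-lived spikes. The right threshold and window depend on the workload.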

Posted Jul 22, 2020 - 14:49 PDT

Resolved

After the database upgrade, the system is performing well and the problem is resolved. We'll post more details about this incident in an upcoming postmortem.

Posted Jul 22, 2020 - 11:28 PDT

Update

App is back up. We are closely monitoring and verifying all app components.

Posted Jul 22, 2020 - 10:44 PDT

Monitoring

App is back up. We are closely monitoring.

Posted Jul 22, 2020 - 10:43 PDT

Update

We performed an emergency database upgrade, which has finished, and we are beginning to redeploy the app.

Posted Jul 22, 2020 - 10:33 PDT

Update

Our engineers have identified a fix and are in the process of deploying it.

Posted Jul 22, 2020 - 10:26 PDT

Identified

Our engineers have identified the issue and are currently working on a fix.

Posted Jul 22, 2020 - 10:11 PDT

Update

We are continuing to investigate this issue.

Posted Jul 22, 2020 - 10:04 PDT

Investigating

We are currently investigating an issue causing the Close Application not to load properly for users.

Posted Jul 22, 2020 - 10:02 PDT

This incident affected: Application UI and API.