Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.
During three windows (Tuesday 10/21 14:46–17:24 UTC, Wednesday 10/22 7:50–17:00 UTC, and Thursday 10/23 8:00–15:00 UTC), customers relying on real-time updates in Close, which are critical for the proper and timely functioning of our calling features and the Dialer, experienced delays, outdated data on their screens, Dialer disconnections, and other unexpected behavior.
On Monday 10/20 at 20:09 UTC, a subtle bug was introduced that caused UI clients to gradually open more and more WebSocket subscriptions to our Data Monitoring Service, the component that powers real-time updates for our Activity Reporting pages.
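This report does not describe the bug itself, so the following is only an illustrative sketch of one general safeguard against this class of leak: a client-side registry that reuses an existing subscription for a given key instead of opening a new one each time. The class name, the `open_subscription` and `close_subscription` callbacks, and the key format are all hypothetical and are not taken from Close's codebase.

```python
class SubscriptionRegistry:
    """Hypothetical client-side guard: at most one open subscription per key."""

    def __init__(self, open_subscription, close_subscription):
        # Both callbacks are assumptions standing in for a real WebSocket client.
        self._open = open_subscription
        self._close = close_subscription
        self._entries = {}  # key -> (handle, reference count)

    def subscribe(self, key):
        """Return the handle for `key`, opening a subscription only if needed."""
        if key in self._entries:
            handle, refs = self._entries[key]
            self._entries[key] = (handle, refs + 1)
            return handle
        handle = self._open(key)
        self._entries[key] = (handle, 1)
        return handle

    def unsubscribe(self, key):
        """Release one reference; close the subscription when none remain."""
        if key not in self._entries:
            return
        handle, refs = self._entries[key]
        if refs > 1:
            self._entries[key] = (handle, refs - 1)
        else:
            del self._entries[key]
            self._close(handle)

    def active_count(self):
        """Number of distinct subscriptions currently open."""
        return len(self._entries)
```

With a guard of this kind, repeated calls to `subscribe` for the same key leave the number of open subscriptions unchanged, and `active_count` gives a simple figure that a client could report for monitoring.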
Over the next few days, the number of subscriptions kept rising. The rise was slow enough that it didn’t trigger our alerting, but fast enough to eventually put real strain on our infrastructure and saturate our processing capacity.
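A slow rise like this can stay under a fixed threshold for a long time. As a minimal sketch of one way to catch it, and not a description of Close's actual alerting, the check below fires either when the latest count crosses a static limit or when the fitted growth rate over a window exceeds a fraction of the starting count per hour. The function names, the sample format of `(hours, subscription_count)` pairs, and the thresholds are assumptions made for illustration.

```python
from statistics import mean


def growth_per_hour(samples):
    """Least-squares slope of (hours, count) samples, in subscriptions per hour."""
    xs = [t for t, _ in samples]
    ys = [c for _, c in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # a single sample (or identical timestamps) has no trend
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def should_alert(samples, static_limit, max_growth_fraction_per_hour):
    """Alert on an absolute limit OR on sustained relative growth.

    `samples` is a time-ordered list of (hours, count) pairs; both
    thresholds are hypothetical values chosen by the operator.
    """
    latest = samples[-1][1]
    if latest >= static_limit:
        return True
    baseline = samples[0][1]
    if baseline <= 0:
        return False
    return growth_per_hour(samples) / baseline >= max_growth_fraction_per_hour
```

For example, a count that climbs from 1,000 by 20 per hour over a day stays far below a static limit of 5,000, yet its trend of 2% of baseline per hour would trip a 1% growth threshold.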
Our first course of action was to scale up capacity to meet the demand, but the demand continued to rise more quickly than we could add capacity. Because unrelated rate-limiting changes appeared more directly responsible for the capacity constraints, our engineering efforts went in a few wrong directions until the root cause, the UI bug, was finally identified and fixed on Thursday 10/23 at 5:19 UTC. However, because the fix required clients to reload their UI, intermittent problems continued until enough clients had reloaded, and the problem was finally resolved on Thursday 10/23 at 15:00 UTC.
To prevent this problem from ever happening again, we are improving our monitoring, alerting, and testing, and, in the longer term, the architecture of our WebSocket subscription service.