Experiencing delays on Dialer Sessions

Incident Report for Close

Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

Impact

During three windows (Tuesday 10/21 14:46–17:24 UTC, Wednesday 10/22 7:50–17:00 UTC, and Thursday 10/23 8:00–15:00 UTC), customers relying on real-time updates in Close (which are critical for the proper and timely functioning of our calling features and the Dialer) experienced delays, outdated data on their screens, Dialer disconnections, and other unexpected behavior.

Root Cause and Resolution

On Monday 10/20 at 20:09 UTC, a subtle bug was introduced that caused UI clients to slowly open more and more WebSocket subscriptions to our Data Monitoring Service – a component that powers real-time updates for our Activity Reporting pages.
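
This report does not describe the mechanics of the bug, so the following is only a generic illustration of one common guard against this class of leak: a reference-counted registry that reuses a single subscription per topic and tears it down when the last consumer releases it. Every name here (`SubscriptionRegistry`, `open`, the topic string) is hypothetical and not taken from Close's codebase.

```typescript
// Hypothetical sketch: none of these names come from Close's codebase.
type Unsubscribe = () => void;

class SubscriptionRegistry {
  private entries = new Map<string, { refs: number; close: Unsubscribe }>();
  private open: (topic: string) => Unsubscribe;

  // `open` is a caller-supplied function that creates the real subscription
  // and returns a function that tears it down.
  constructor(open: (topic: string) => Unsubscribe) {
    this.open = open;
  }

  // Returns a release handle. Repeated calls for the same topic share one
  // underlying subscription instead of opening a new one each time.
  subscribe(topic: string): Unsubscribe {
    let entry = this.entries.get(topic);
    if (entry === undefined) {
      entry = { refs: 0, close: this.open(topic) };
      this.entries.set(topic, entry);
    }
    entry.refs += 1;

    let released = false;
    return () => {
      if (released) return; // Releasing the same handle twice is harmless.
      released = true;
      const current = this.entries.get(topic);
      if (current === undefined) return;
      current.refs -= 1;
      if (current.refs === 0) {
        current.close();
        this.entries.delete(topic);
      }
    };
  }

  // Number of distinct underlying subscriptions currently open.
  get activeCount(): number {
    return this.entries.size;
  }
}
```

With a registry like this, a page that mounts repeatedly keeps `activeCount` at one per topic, and the count itself becomes a convenient client-side metric to report.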

Over the next few days, the number of subscriptions kept rising. The rise was slow enough that it did not trigger our alerting, but fast enough to eventually put real strain on our infrastructure and saturate our processing capacity.
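
A slow, steady climb like this can stay under a static threshold for days. One way to catch it, shown here as a minimal sketch rather than a description of Close's actual alerting, is to fit a trend to recent samples and alert when the projected time to reach capacity falls inside a chosen horizon. The `Sample` shape, the function names, and all numbers are assumptions made for illustration.

```typescript
// Hypothetical sketch: sample shape, names, and thresholds are assumptions.
interface Sample {
  hour: number; // hours since the start of the observation window
  subscriptions: number; // subscription count observed at that time
}

// Ordinary least-squares slope, in subscriptions per hour.
function slopePerHour(samples: Sample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((sum, p) => sum + p.hour, 0) / n;
  const meanY = samples.reduce((sum, p) => sum + p.subscriptions, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.hour - meanX) * (p.subscriptions - meanY);
    den += (p.hour - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Hours until the fitted trend reaches `capacity`, measured from the latest
// sample. Returns Infinity when the metric is flat or falling.
function hoursUntilCapacity(samples: Sample[], capacity: number): number {
  const latest = samples[samples.length - 1];
  if (latest === undefined) return Infinity;
  const slope = slopePerHour(samples);
  if (slope <= 0) return Infinity;
  return Math.max(0, (capacity - latest.subscriptions) / slope);
}

// Alert when capacity would be reached within `horizonHours`.
function shouldAlert(
  samples: Sample[],
  capacity: number,
  horizonHours: number,
): boolean {
  return hoursUntilCapacity(samples, capacity) <= horizonHours;
}
```

For example, five hourly samples rising from 1000 to 1400 give a slope of 100 per hour. With an assumed capacity of 2000, the projection is 6 hours, so a 24-hour horizon would alert even though a static threshold at 1800 has not yet been crossed.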

Our first course of action was to scale the supply to meet the demand, but the demand kept rising faster than we could scale. Unrelated rate-limiting changes looked more directly responsible for the limited supply, so our engineering efforts went in a few wrong directions before the root cause – the UI bug – was finally identified and fixed on Thu 10/23 at 5:19 UTC. However, because the fix required clients to reload their UI, intermittent problems continued until enough clients had reloaded, and the problem was finally resolved on Thu 10/23 at 15:00 UTC.

To prevent this problem from happening again, we are improving our monitoring, alerting, and testing, and, in the longer term, the architecture of our WebSocket subscription service.

Timeline

Mon 10/20 20:09 UTC: A subtle bug is introduced and over the next few days keeps slowly increasing the number of WebSocket-based Data Monitoring Service subscriptions.
Tue 10/21 14:46 UTC: Close receives the first signals of Dialer problems, in which Users hear the callee many seconds before the corresponding Lead Page loads. Engineers immediately start looking into the problem.
Tue 10/21 17:24 UTC: First fixes aimed at increasing the system's allowed processing rate land in production. The situation appears resolved.
Wed 10/22 7:50 UTC: The problem comes back despite the fixes. Engineers scale up the system to meet the increasing demand, but the demand keeps saturating the added capacity. Status page is opened at 10:35 UTC (https://status.close.com/incidents/ydzwc4l6xmhc). Problems continue on and off until they fully subside around 17:00 UTC.
Thu 10/23 5:19 UTC: Engineers identify the root cause and fix it.
Thu 10/23 8:00 UTC: Concerning metrics continue to emerge despite the fix. Customer reports follow and Status page is opened at 11:40 UTC (https://status.close.com/incidents/j49qgd3m049n). Engineers continue working on impact mitigation and root cause diagnosis until the cause is found and fixed at 11:47 UTC (excessive WebSocket subscriptions to our real-time Data Monitoring Service, coming from UI clients that visited the Activity Reporting page). However, idle UI clients pick up the fix only after reloading, so demand stays saturated until 15:00 UTC despite scaling attempts.
Thu 10/23 15:00 UTC: The problem is officially resolved.

Posted Oct 27, 2025 - 10:07 PDT

Resolved

This incident has been resolved.

Posted Oct 23, 2025 - 08:52 PDT

Monitoring

We identified an issue causing Lead Pages to load slowly during Dialer sessions.

A fix has been implemented and will be deployed shortly. After deployment, we’ll monitor the system to ensure the issue is fully resolved.

New updates to follow.

Posted Oct 23, 2025 - 04:40 PDT

This incident affected: Phone (Dialer).