Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.
Impact
Close CRM did not load for a subset of our customers for 187 minutes, from 12:56 to 16:03 UTC on 2022-11-08. Users who had a bad version of our application cached were unable to load Close.
As a workaround during this period, affected users could switch to a different browser or clear their browser or native app cache, which fixed the issue immediately.
Root Cause and Resolution
One day before the downtime, we published a large change to the way we load our app in order to improve the app's performance for our customers. For approximately 17 hours, the change was live and nothing else was published.
When a small new change was published the next day, some customers were unable to open the app and were taken directly to an error screen.
The issue was caused by a misconfiguration of the software we use to generate the files that run our app's UI. Files with different content should always have different names, but because of this misconfiguration, some newly published files had exactly the same names as files from the previous release. Customers who had the previous version in their cache therefore weren’t able to load the app.
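To illustrate the naming rule described above, here is a minimal sketch of content-based file naming. It is not our actual build configuration: the function `hashed_name`, the file name `main.js`, and the choice of an 8-character SHA-256 prefix are all hypothetical, chosen only to show why a name derived from the file's content cannot be reused by different content.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(logical_name: str, content: bytes, digest_len: int = 8) -> str:
    """Return a file name that embeds a short hash of the file's content.

    Hypothetical helper for illustration; real build tools offer
    their own configuration for this.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(logical_name)
    # Insert the digest between the stem and the extension: main.js -> main.<digest>.js
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


# Two releases of the same logical file with different content.
release_1 = hashed_name("main.js", b"console.log('release 1');")
release_2 = hashed_name("main.js", b"console.log('release 2');")

# Republishing identical content yields the identical name, which is safe to cache.
release_1_again = hashed_name("main.js", b"console.log('release 1');")

print(release_1 != release_2)        # different content, different name
print(release_1 == release_1_again)  # same content, same name
```

With names built this way, a browser that cached the earlier release keeps requesting the earlier names, and the new release is fetched under new names, so the two can never be mixed.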
To resolve the issue, we reverted both changes to a previous working version. We also took the following steps to ensure this type of issue won’t occur again:
Timeline
- 2022-11-07 15:53 UTC - Published a large release to improve the UI performance of the app.
- 2022-11-08 12:55 UTC - Published a very small unrelated bug fix.
- 2022-11-08 12:56 UTC - Received the first support ticket from a customer who was unable to log in.
- 2022-11-08 12:56 UTC - Started investigating the issue and checked whether any coworker was facing the same issue, which would make debugging easier.
- 2022-11-08 13:12 UTC - After debugging the error, we found that clearing the browser or native app cache fixed the issue. We started asking customers to clear their cache as a workaround to regain access to Close in the meantime.
- 2022-11-08 13:16 UTC - Since the issue appeared to be cache-related, we decided to invalidate the cache on our CDN.
- 2022-11-08 13:31 UTC - Cache invalidation completed, but the problem persisted.
- 2022-11-08 14:08 UTC - After trying many different configuration changes in our CDN without success, we decided to go back to debugging the error directly in the browser.
- 2022-11-08 14:47 UTC - By comparing the files loaded by someone who was facing the issue with those loaded by someone who was able to use the app, we saw that both had a file with the same name, but the contents of that file differed between them.
- 2022-11-08 15:13 UTC - Uncovered the underlying issue with the configuration of the software that generates those files and started implementing a fix.
- 2022-11-08 15:31 UTC - Implemented the fix and started publishing the release.
- 2022-11-08 16:03 UTC - Published the release and the fix was live.
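
The comparison described in the 14:47 UTC entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the tooling we actually used: the function `colliding_names`, the file names, and the file contents are all made up. It takes the files loaded by two users, keyed by name, and reports every name that both users have but with different content.

```python
from __future__ import annotations

import hashlib


def digest(content: bytes) -> str:
    """SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def colliding_names(files_a: dict[str, bytes], files_b: dict[str, bytes]) -> list[str]:
    """Return names present in both file sets whose contents differ.

    Hypothetical helper for illustration only.
    """
    shared = files_a.keys() & files_b.keys()
    return sorted(
        name for name in shared if digest(files_a[name]) != digest(files_b[name])
    )


# Made-up example data: same file names, but one file's content differs.
affected_user = {"main.3f2a.js": b"old loader", "vendor.91bc.js": b"vendor code"}
working_user = {"main.3f2a.js": b"new loader", "vendor.91bc.js": b"vendor code"}

print(colliding_names(affected_user, working_user))  # ['main.3f2a.js']
```

The same check could in principle be run between two consecutive releases before publishing, since any name it reports is a file that cached clients would load with stale content.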