Some users cannot log into Close
Incident Report for Close
Postmortem

Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring. 

Impact

Close CRM did not load for a subset of our customers for 187 minutes, from 12:56 to 16:03 UTC. Users who had a bad version of our application cached were unable to load Close.

As a workaround during this period, affected users could switch to a different browser or clear their browser or native app cache, which fixed the issue immediately.

Root Cause and Resolution

One day before the downtime, we published a large change to the way we load our app in order to improve the app's performance for our customers. For approximately 17 hours, the change was live and nothing else was published.

When a new, small change was published the next day, some customers were unable to open the app and were taken directly to an error screen.

The issue was caused by a misconfiguration of the software we use to generate the files that run the UI of our app. Files with different content should always have different names, but because of this misconfiguration, some newly published files had exactly the same names as files from the previous release even though their contents had changed. Customers who had the previous version in cache therefore weren’t able to load the app.
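
To illustrate the naming rule, here is a minimal sketch in Python. It is not the actual build configuration: the function name `hashed_name`, the choice of SHA-256, and the 8-character digest length are all assumptions made for the example. The idea is simply that the file name is derived from the file's content, so any change in content produces a new name and a cached copy of the old file can never be mistaken for the new one.

```python
import hashlib


def hashed_name(stem: str, ext: str, content: bytes, length: int = 8) -> str:
    """Build a file name that embeds a digest of the file's content.

    Hypothetical helper for illustration; real bundlers have their own
    naming options and hash functions.
    """
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{stem}.{digest}.{ext}"


if __name__ == "__main__":
    release_1 = b"console.log('release 1');"
    release_2 = b"console.log('release 2');"

    # Different content yields different names, so caches cannot mix them up.
    print(hashed_name("main", "js", release_1))
    print(hashed_name("main", "js", release_2))

    # Identical content yields the identical name, so unchanged files stay cached.
    print(hashed_name("main", "js", release_1) == hashed_name("main", "js", release_1))
```

With a scheme like this, the failure described above would require two different files to produce the same digest, which is extremely unlikely for a cryptographic hash.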

In order to resolve the issue, we reverted both changes to a previous working version, and we also took the following steps to ensure this type of issue won’t occur anymore:

  • We compared each release file by file in order to get to the bottom of the issue.
  • Afterwards, we fixed the misconfiguration and, in our test environment, we:

    • Published the first release once again and thoroughly tested the app. We also made sure that our browsers and native apps were holding the files in cache.
    • Published the second release and once again thoroughly tested the app.
    • Finally, just to be 100% sure that things were fixed, we compared all files once again, and came to the conclusion that the problem was indeed solved.
    • As an extra step, we also released a few more different types of changes and compared files every time in order to guarantee that this issue wouldn’t repeat.
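
The file-by-file comparison described above can be sketched as a small script. This is an assumed approach, not the tooling that was actually used: the function names `snapshot` and `find_collisions` are hypothetical, and the sketch assumes each release's generated files are available in a local directory. It reports every file whose name appears in both releases but whose content differs, which is exactly the condition that broke cached clients.

```python
import hashlib
from pathlib import Path


def snapshot(release_dir: str) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its content."""
    root = Path(release_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_collisions(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return names present in both releases whose content differs."""
    shared_names = old.keys() & new.keys()
    return sorted(name for name in shared_names if old[name] != new[name])
```

A check like `find_collisions(snapshot("release-1"), snapshot("release-2"))` (directory names are placeholders) could run before each publish and block the release if it returns anything. Files that are intentionally served under a fixed name, such as an entry HTML page, would need to be excluded from the check.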

Timeline

  • 2022-11-07 15:53 UTC - Published a large release for improving the UI performance of the app.
  • 2022-11-08 12:55 UTC - Published a very small unrelated bug fix.
  • 2022-11-08 12:56 UTC - Received the first support ticket of a customer not being able to log in.
  • 2022-11-08 12:56 UTC - Started investigating the issue and checking whether any coworker was facing the same issue, to make debugging easier.
  • 2022-11-08 13:12 UTC - After debugging the error, we figured out that clearing the browser / native app cache fixed the issue. As a workaround in the meantime, we started asking customers to clear their browser / native app cache in order to regain access to Close.
  • 2022-11-08 13:16 UTC - Since the issue appeared to be cache-related, we decided to invalidate the cache on our CDN.
  • 2022-11-08 13:31 UTC - Cache invalidation completed, but the problem persisted.
  • 2022-11-08 14:08 UTC - After trying many different configuration changes in our CDN without success, we decided to go back to debugging the error directly in the browser.
  • 2022-11-08 14:47 UTC - By comparing files from someone who was facing the issue with the files of another person who was able to use the app, we saw that they both had a file with the same name, but the contents of that file were different for each of them.
  • 2022-11-08 15:13 UTC - Uncovered the underlying issue with the configuration of the software that generates those files and started implementing a fix.
  • 2022-11-08 15:31 UTC - Implemented the fix and started publishing the release.
  • 2022-11-08 16:03 UTC - Published the release and the fix was live.
Posted Nov 16, 2022 - 12:24 PST

Resolved
This incident has been resolved.
Posted Nov 08, 2022 - 08:51 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 08, 2022 - 08:10 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 08, 2022 - 07:48 PST
Investigating
We are currently investigating this issue.
Posted Nov 08, 2022 - 05:09 PST
This incident affected: Application UI.