Incorrect DOWN reports
Incident Report for Dead Man's Snitch
Postmortem

Postmortem: March 6th, 2015

On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized. I am incredibly sorry for the chaos and confusion this caused.

So what happened?

On March 6th at 9:30 EST we deployed a change that decoupled two of the models in our system (historical periods and check-ins). At 9:45 EST a user triggered an unscoped deletion of all historical period records when they changed the interval of their snitch.

We were alerted at 9:50 EST and immediately disabled our alerter process to avoid further confusion. We began diagnosing the cause and at 10:50 EST deployed a fix for the unscoped delete. Our next step was to restore the missing data from our backups. We decided to keep the system live, and because data written while it ran could conflict with the restore, we used a slower but more accurate restoration process.

At 17:30 EST we finished restoring most of the historical data and ran a set of data integrity checks to ensure everything was in a clean state. We sent out one final set of "reporting" alerts for any snitches that were healthy but had been marked as failed.

How did this happen?

We use a pull-request-based development process. Whenever a change is made, it is reviewed by another developer and then merged by the reviewer. It's common to make several revisions to a change before it is merged.

In this case, the unscoped deletion was introduced while implementing a review suggestion to reduce the number of queries made during an interval change. In making that change, the scoping that limited the deletion to a single snitch's periods was accidentally removed. The code was reviewed, but the scoping issue was missed in the final review.
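
To illustrate the shape of the mistake, here is a hedged sketch (not our actual code; the model and column names are hypothetical stand-ins):

    # Intended change: replace per-record destroys with one bulk query,
    # still scoped to the snitch whose interval changed.
    HistoricalPeriod.where(snitch_id: snitch.id).delete_all

    # What shipped: the where(...) scope was accidentally dropped, so the
    # delete ran against every historical period in the database.
    HistoricalPeriod.delete_all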

Additionally, we have an extensive test suite in place that gives us confidence when we make large changes to the system. Our tests did not uncover this issue because the unscoped delete still satisfied our testing conditions: we asserted that the intended records were removed, but not that unrelated records were left intact.
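
As a sketch of how a test can pass despite an unscoped delete (hypothetical RSpec, not our actual suite), an assertion about only the affected snitch is satisfied either way, while an assertion about other snitches' records would have caught the bug:

    # Passes whether the delete is scoped or not: it only inspects the
    # snitch under test.
    it "removes old periods when the interval changes" do
      snitch.update(interval: "hourly")
      expect(snitch.historical_periods).to be_empty
    end

    # The kind of case we have since added: assert that unrelated records
    # survive the destructive operation.
    it "leaves other snitches' periods intact" do
      expect { snitch.update(interval: "hourly") }
        .not_to change { other_snitch.historical_periods.count }
    end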

Our next steps

  1. We have reviewed our use of destructive operations that could be prone to scoping issues (e.g. Model.where(...).delete_all) and have found that this was the only instance of it left in our codebase.
  2. We have reviewed our tests around destructive behavior and have added cases to ensure they only affect the records they should.
  3. Our restore and recovery process took much longer than we would like. While we waited for the restore to finish, we developed a set of tools for checking data integrity; we will be fleshing these out further and making them part of our normal maintenance routine (a rough sketch of one such check follows this list). Lastly, we will be planning and running operations fire drills to improve our readiness for incidents like this.
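
As a rough illustration of the kind of integrity check we have in mind (the names and invariants here are hypothetical; our actual tools are more involved):

    # Hypothetical invariants: every snitch has at least one historical
    # period, and consecutive periods do not overlap.
    Snitch.find_each do |snitch|
      periods = snitch.historical_periods.order(:started_at).to_a
      warn "#{snitch.token}: no historical periods" if periods.empty?

      periods.each_cons(2) do |earlier, later|
        # Guard against the current, still-open period having no end time.
        if earlier.ended_at && earlier.ended_at > later.started_at
          warn "#{snitch.token}: overlapping periods"
        end
      end
    end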

Summary

Monitoring failures can mean lost sleep, lost time, and added stress to an already stressful job. As an operations person, I am well aware of the trouble a malfunctioning system can cause. I am very sorry for the chaos and confusion caused by our failings. We see Friday's issues as a failure of our development process and are taking steps to improve that process.

Should we have future issues, the best way to stay informed is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.

Chris Gaffney
Collective Idea

Posted Mar 09, 2015 - 16:12 EDT

Resolved
Systems appear stable, the root cause has been identified and patched, and we have sent out our final batch of status updates for alerted but healthy snitches. We will be following up with a full postmortem on March 9th.
Posted Mar 06, 2015 - 23:23 EST
Monitoring
The dashboard should be accurate for the current state of the system, but a small percentage of older historic periods are still being imported. We expect them to finish within the next hour or two. We are vigilantly monitoring the system to make sure everything is stable.
Posted Mar 06, 2015 - 17:37 EST
Update
Backfilling continues but will take some time to finish. Snitches will slowly update so their state is consistent with when they last checked in. We expect to be able to re-enable alerting in the next hour or two.
Posted Mar 06, 2015 - 14:44 EST
Update
Backfilling of data continues. We are at a point where we can begin to update snitches so their current state is correct.
Posted Mar 06, 2015 - 13:13 EST
Update
We are continuing to backfill historic periods and we will be reconciling snitches so they will show correctly once that process is complete. We will post our next update in an hour.
Posted Mar 06, 2015 - 12:03 EST
Update
We have identified and patched the root cause of the issue. We are currently backfilling old data but the process will take a while to run. We can confirm that no check-in data was lost but we have had to pull some cached state (historic periods) from a backup.
Posted Mar 06, 2015 - 10:58 EST
Update
Most customers will see correct data; for a handful it will be incorrect, and we're working to fix it. Underlying data is unaffected.
Posted Mar 06, 2015 - 10:24 EST
Identified
We've identified a problem with our alerting process that sent out far too many emails. Underlying check-in data is unaffected.
Posted Mar 06, 2015 - 10:01 EST
Investigating
You may have received many DOWN report emails around 14:47 UTC.

We're investigating the situation.
Posted Mar 06, 2015 - 09:52 EST