Elevated check-in error rate and timeouts

Incident Report for Dead Man's Snitch

Resolved

Error rates have returned to normal and all systems are green. We've identified a possible root cause: a bug in the version of the message bus client we were using, which could cause a deadlock in some cases. We've upgraded to the latest version of the client and rolled the update out to production.

Posted Jun 22, 2022 - 10:25 EDT

Monitoring

We've replaced the affected nodes, and error rates have returned to normal levels.

Posted Jun 22, 2022 - 10:07 EDT

Investigating

We're seeing an increase in timeouts for our check-in service. We've removed a pair of misbehaving nodes from our load balancer and we're investigating the root cause.

Any check-in that received a 408 Request Timeout error was not processed.
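
Because a 408 response means the check-in was not recorded, a client may want to retry it. The following is a minimal sketch of one way to do that, not an official client: the `check_in` and `http_status` helpers, the retry count, and the backoff delays are all assumptions made for illustration, and the URL you pass in should be your own snitch's check-in URL.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Send a GET request and return the HTTP status code.

    Connection failures and socket timeouts are not handled here and
    will propagate to the caller as exceptions.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still available.
        return err.code


def check_in(
    url: str,
    fetch: Callable[[str], int] = http_status,
    attempts: int = 3,          # hypothetical default, tune as needed
    base_delay: float = 1.0,    # seconds; doubles after each 408
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a check-in gets a 2xx response.

    Only 408 Request Timeout is retried, since that response means the
    check-in was not processed. Any other non-2xx status returns False
    immediately.
    """
    for attempt in range(attempts):
        status = fetch(url)
        if 200 <= status < 300:
            return True
        if status != 408:
            return False
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

The `fetch` and `sleep` parameters exist only so the retry logic can be exercised without network access or real delays. In normal use, `check_in(url)` with the defaults is enough.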

Posted Jun 22, 2022 - 09:34 EDT

This incident affected: Snitch Check-in Processing.