Resolved -
Error rates have returned to normal and all systems are green. We believe the root cause was a bug in the version of the message bus client we were using, which could cause a deadlock in some cases. We've upgraded to the latest version of the client and rolled it out to production.
Jun 22, 10:25 EDT
Monitoring -
We've replaced the affected nodes, and error rates have returned to normal levels.
Jun 22, 10:07 EDT
Investigating -
We're seeing an increase in timeouts for our check-in service. We've removed a pair of misbehaving nodes from our load balancer and are investigating the root cause.
Any check-in that received a 408 Request Timeout error was not processed.
Jun 22, 09:34 EDT