Error rates have returned to normal and all systems are green. We've identified the possible root cause: a bug in the version of the message bus client we were running that could cause a deadlock in some cases. We've upgraded to the latest version of the client and rolled it out to production.
Posted Jun 22, 2022 - 10:25 EDT
Monitoring
We've replaced the affected nodes, and error rates have returned to normal levels.
Posted Jun 22, 2022 - 10:07 EDT
Investigating
We're seeing an increase in timeouts for our check-in service. We've removed a pair of misbehaving nodes from our load balancer and we're investigating the root cause.
Any check-in that received a 408 Request Timeout error was not processed.
Posted Jun 22, 2022 - 09:34 EDT
This incident affected: Snitch Check-in Processing.
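
Because any check-in that received a 408 Request Timeout during this incident was not processed, a client may want to resend a check-in when it sees that status. The following is a minimal sketch of one way to do that, not an official client. The function names, the retry count, the backoff delays, and the example URL are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def send_check_in(url, timeout=10.0):
    """Send one check-in request and return the HTTP status code.

    Network-level failures (for example a connection error) are not
    handled here and will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the status code
        # is available on the exception.
        return err.code


def check_in_with_retry(url, send=send_check_in, attempts=3, delay=1.0,
                        sleep=time.sleep):
    """Resend a check-in while the server answers 408 Request Timeout.

    Waits delay, then 2 * delay, and so on between attempts, and returns
    the last status code seen. The defaults are illustrative assumptions.
    """
    status = None
    for attempt in range(attempts):
        status = send(url)
        if status != 408:
            return status
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    return status


if __name__ == "__main__":
    # Hypothetical URL; substitute the real check-in URL for your snitch.
    print(check_in_with_retry("https://example.com/check-in/TOKEN"))
```

The `send` and `sleep` parameters are injectable so the retry logic can be exercised without making network calls or waiting.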