At 6:25 AM UTC, one of our two collection instances failed and was removed from the load balancer. This is normally not a problem, as a single instance has more than enough capacity to handle all of our load. We have brought the failed instance back online, but it is sluggish and behaving strangely, so we plan to replace it today.
Between 9:55 AM UTC and 10:15 AM UTC, our remaining collection instance stopped receiving check-ins. We have isolated the issue to a slow but constant memory leak that left the system unresponsive until the OOM killer took action. We will be adding more metrics and alerts around memory usage and rotating our services more often.
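As a rough illustration of the kind of memory alert described above, the sketch below fits a straight line to recent memory samples and flags a process whose usage is projected to reach a limit within a chosen time window. It is not our actual monitoring code: the function names, the sample format, and the thresholds are all hypothetical, and a real deployment would feed it readings from whatever metrics system is in place.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (seconds since start, resident memory in bytes).
Sample = Tuple[float, float]


def leak_slope(samples: Sequence[Sample]) -> float:
    """Return the least-squares slope of memory over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def seconds_until_limit(samples: Sequence[Sample], limit_bytes: float) -> Optional[float]:
    """Project when memory reaches limit_bytes; None if usage is flat or falling."""
    slope = leak_slope(samples)
    if slope <= 0:
        return None
    current = samples[-1][1]
    return max(0.0, (limit_bytes - current) / slope)


def should_alert(samples: Sequence[Sample], limit_bytes: float, horizon_seconds: float) -> bool:
    """Alert when the projected time to the limit falls within the horizon."""
    remaining = seconds_until_limit(samples, limit_bytes)
    return remaining is not None and remaining <= horizon_seconds
```

Alerting on the projected trend rather than on a fixed usage threshold is one way to catch a slow leak well before the OOM killer does, at the cost of occasional false alarms when memory grows for legitimate reasons.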
Posted Oct 30, 2014 - 09:40 EDT