During the affected period, elevated request volume drove up CPU usage on our
database. As utilization climbed, queries took longer to run, and the
combination of more requests and higher database latency eventually caused our
API instances to exhaust their connection pools, at which point they became
unavailable to the load balancer.
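The failure mode above can be sketched in a small simulation: once query latency rises, each connection is held longer, and requests that cannot check out a connection before the pool timeout fail outright. The pool class, sizes, and timings below are illustrative, not our actual driver or configuration:

```python
import queue
import threading
import time

class ConnectionPool:
    """Minimal fixed-size connection pool (a stand-in for the pool our
    database driver provides; names and sizes are illustrative)."""
    def __init__(self, size, checkout_timeout):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")
        self._timeout = checkout_timeout

    def checkout(self):
        try:
            return self._conns.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("pool exhausted")

    def checkin(self, conn):
        self._conns.put(conn)

def simulate(latency, pool_size=2, n_requests=6):
    """Fire n_requests concurrent queries that each hold a connection for
    `latency` seconds; return how many never got a connection."""
    pool = ConnectionPool(pool_size, checkout_timeout=0.1)
    results = []

    def run_query():
        try:
            conn = pool.checkout()
            time.sleep(latency)      # connection is held for the query's duration
            pool.checkin(conn)
            results.append("ok")
        except TimeoutError:
            results.append("timeout")

    threads = [threading.Thread(target=run_query) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count("timeout")

print(simulate(latency=0.001))  # fast queries: pool turns over, no timeouts
print(simulate(latency=0.5))    # slow queries: waiting requests time out
```

With fast queries the two connections cycle through all six requests well within the timeout; once each query holds its connection for 500ms, the four waiting requests time out, which is the shape of the exhaustion we saw.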
This was similar to the incident the day before, but more severe.
We resolved the immediate issue by temporarily increasing our cache timeout for
certain device-state queries, which reduced load enough for the system to
recover; we then restored the normal timeouts. As a long-term mitigation, we've
added dedicated connection pools for the endpoints most likely to consume
connections under load. This has two benefits: if those endpoints exhaust their
connections, the core API instances remain unaffected, and overall connection
capacity is increased.
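The partitioning can be sketched as follows, assuming hypothetical pool sizes and endpoint groups (not our production configuration): each endpoint group draws only from its own pool, so a burst that exhausts one pool leaves the others able to serve requests.

```python
# Remaining free connections per pool, one pool per endpoint group
# (hypothetical groups and sizes, for illustration only).
pools = {"device_state": 2, "core_api": 4}

def handle_request(group):
    # Serve the request only if this group's own pool has a free connection.
    if pools[group] > 0:
        pools[group] -= 1   # connection stays checked out by a slow query
        return "served"
    return "rejected"

# A burst of slow device-state requests drains only its own pool...
burst = [handle_request("device_state") for _ in range(5)]
print(burst.count("rejected"))  # → 3

# ...while core-API requests are unaffected.
print(handle_request("core_api"))  # → served
```

The design choice is isolation over pooling efficiency: a shared pool uses connections more flexibly, but a single hot endpoint can then starve everything else, which is exactly what happened here.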