During the affected period, elevated request volume drove up CPU usage on our
database. As utilization climbed, queries took longer to run, and the
combination of more requests and higher database latency eventually caused our
API instances to exhaust their connection pools, at which point they became
unavailable to the load balancer.
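The failure mode above can be sketched in a small simulation: once query latency rises, each connection is held longer, and requests that cannot check out a connection before the pool timeout fail outright. The pool class, sizes, and timings below are illustrative, not our actual driver or configuration:

```python
import queue
import threading
import time

class ConnectionPool:
    """Minimal fixed-size connection pool (a stand-in for the pool our
    database driver provides; names and sizes are illustrative)."""
    def __init__(self, size, checkout_timeout):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")
        self._timeout = checkout_timeout

    def checkout(self):
        try:
            return self._conns.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("pool exhausted")

    def checkin(self, conn):
        self._conns.put(conn)

def simulate(latency, pool_size=2, n_requests=6):
    """Fire n_requests concurrent queries that each hold a connection for
    `latency` seconds; return how many never got a connection."""
    pool = ConnectionPool(pool_size, checkout_timeout=0.1)
    results = []

    def run_query():
        try:
            conn = pool.checkout()
            time.sleep(latency)      # connection is held for the query's duration
            pool.checkin(conn)
            results.append("ok")
        except TimeoutError:
            results.append("timeout")

    threads = [threading.Thread(target=run_query) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count("timeout")

print(simulate(latency=0.001))  # fast queries: pool turns over, no timeouts
print(simulate(latency=0.5))    # slow queries: waiting requests time out
```

With fast queries the two connections cycle through all six requests well within the timeout; once each query holds its connection for 500ms, the four waiting requests time out, which is the shape of the exhaustion we saw.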
This was similar to the incident the day before, but more severe.
We resolved the immediate issue by temporarily increasing our cache timeout for
certain device-state queries, which reduced load enough for the system to
recover; we then restored the normal timeouts. As a long-term mitigation, we've
added dedicated connection pools for the endpoints most likely to consume
connections under load. This has two benefits: if those endpoints exhaust their
connections, the core API instances remain unaffected, and overall connection
capacity is increased.
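The partitioning can be sketched as follows, assuming hypothetical pool sizes and endpoint groups (not our production configuration): each endpoint group draws only from its own pool, so a burst that exhausts one pool leaves the others able to serve requests.

```python
# Remaining free connections per pool, one pool per endpoint group
# (hypothetical groups and sizes, for illustration only).
pools = {"device_state": 2, "core_api": 4}

def handle_request(group):
    # Serve the request only if this group's own pool has a free connection.
    if pools[group] > 0:
        pools[group] -= 1   # connection stays checked out by a slow query
        return "served"
    return "rejected"

# A burst of slow device-state requests drains only its own pool...
burst = [handle_request("device_state") for _ in range(5)]
print(burst.count("rejected"))  # → 3

# ...while core-API requests are unaffected.
print(handle_request("core_api"))  # → served
```

The design choice is isolation over pooling efficiency: a shared pool uses connections more flexibly, but a single hot endpoint can then starve everything else, which is exactly what happened here.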