Resolved -
On February 29, 2024, between 9:32 and 11:54 UTC, queuing in our background job service caused processing delays to Webhooks, Actions, and Issues. Nearly 95% of delays occurred between 11:05 and 11:27 UTC, with the remaining 5% spread across the rest of the incident. Customer impact was as follows: for Webhooks, 50% experienced delays of up to 5 minutes and 1% experienced delays of 17 minutes at peak; for Actions, 7% of customers experienced delays on average, with a peak of 44%; and many Issues were delayed in appearing in search results. At 9:32 UTC our automated failover successfully routed traffic to a secondary cluster. However, an improper restoration to the primary cluster at 10:32 UTC caused a significant increase in queued jobs until 11:21 UTC, when a correction was made and healthy services began burning down the backlog until full resolution.
We have made improvements to the automation and reliability of our fallback process to prevent recurrence. We also have larger work already in progress to improve the overall reliability of our job processing platform.
Feb 29, 12:27 UTC
Update -
We're seeing recovery and are going to take time to verify that all systems are back in a working state.
Feb 29, 12:21 UTC
Update -
Issues is operating normally.
Feb 29, 12:19 UTC
Update -
Webhooks is operating normally.
Feb 29, 12:18 UTC
Update -
We're continuing to investigate delayed background jobs. We've seen partial recovery for Issues, and there is ongoing impact to Actions, Notifications, and Webhooks.
Feb 29, 11:05 UTC
Update -
Actions is experiencing degraded performance. We are continuing to investigate.
Feb 29, 10:58 UTC
Update -
We're seeing issues related to background jobs, which are causing delays for webhook delivery, search indexing, and other updates.
Feb 29, 10:36 UTC
Investigating -
We are investigating reports of degraded performance for Issues and Webhooks.
Feb 29, 10:33 UTC