
June 25th, 2023 Deno Deploy Postmortem

On June 25th, 2023, at 09:42 UTC, a large portion of Deno Deploy services experienced a disruption that lasted until approximately 10:26 UTC (44 minutes). During this period, users could not access projects hosted on Deno Deploy, reach Deno web properties such as deno.com, or download modules hosted on deno.land.

Customers making use of Deno Deploy Subhosting were not affected by this outage.

Providing a stable and robust platform to our users is our topmost priority. We deeply regret this incident and sincerely apologize for any disruption caused. This report provides an overview of the event, the cause of the outage, and the measures we plan to take to prevent such incidents in the future.

Impact

During a 44-minute period, users experienced a service disruption and could not access key Deno web properties and deployments on Deno Deploy, including deno.com and deno.land.

Timeline of Events

All times in UTC, on June 25th 2023.

  • 09:42 - DDoS attack begins
  • 09:43 - Alerts are triggered and teams are paged
  • 09:46 - Team members get online to investigate the root cause.
  • 09:47 - Logs indicate that queries to the database are being rate limited.
  • 10:14 - DDoS attack ends
  • 10:26 - All alerts are resolved. The majority of our systems have recovered.
  • 11:02 - Incident is marked as resolved.

We estimate a downtime of approximately 44 minutes from when our systems first started failing until full recovery was achieved.

Root cause

On Sunday, June 25, a DDoS attack was mounted against one of the web properties hosted on Deno Deploy. During the attack we saw a request rate roughly 10 times higher than normal for that time of day. Our systems responded by initially rejecting requests to the affected domain, and by scaling up. Due to the attack volume, a large number of processing nodes had to be brought online at the same time. Each of these nodes requires a small amount of metadata that is pulled from a central database. Unfortunately, our database was unable to handle the resulting query load. Some nodes came online much more slowly than expected, while others timed out waiting for their start-up payload and rebooted themselves, which exacerbated the load on our database even further.
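
To make the failure mode concrete, here is a minimal sketch (in TypeScript) of the kind of boot sequence described above. It is not Deno Deploy's actual code: the metadata endpoint, the `fetchBootPayload` helper, and the timeout value are all hypothetical. What matters is the feedback loop in `bootNode`: a node that times out while waiting for its start-up payload reboots and immediately queries the database again.

```ts
// Hypothetical sketch of the boot sequence described above. The endpoint,
// names, and timeout are illustrative only, not Deno Deploy internals.

const BOOT_TIMEOUT_MS = 5_000; // assumed timeout, for illustration

async function fetchBootPayload(nodeId: string): Promise<unknown> {
  // Stand-in for the query against the central metadata database.
  const res = await fetch(`https://metadata.example.internal/boot/${nodeId}`, {
    signal: AbortSignal.timeout(BOOT_TIMEOUT_MS),
  });
  if (!res.ok) throw new Error(`boot payload request failed: ${res.status}`);
  return await res.json();
}

async function bootNode(nodeId: string): Promise<void> {
  while (true) {
    try {
      const payload = await fetchBootPayload(nodeId);
      console.log(`node ${nodeId} booted with payload`, payload);
      return;
    } catch {
      // A timed-out node "reboots" and immediately tries again, so every
      // slow response from the database turns into yet another query.
    }
  }
}
```

With many nodes starting at once, every retry adds to the query load on a database that is already struggling, which makes the next round of requests even more likely to time out.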

What’s next?

We’re taking some steps to prevent this problem in the future:

  • The processing capacity of the central database has been increased significantly.
  • The start-up payload will be cached across the globe to remove the single-point-of-failure dependency on our central database.
  • Processing nodes will be allowed to operate from (slightly) stale data in case the latest boot payload is unavailable for any reason (a sketch of this fallback follows this list).
  • In the coming weeks, we will conduct additional load tests to ensure that our systems can withstand a sudden load spike of the magnitude that we saw during this event.
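
The sketch below shows how the second and third measures could work together: try the central database first, cache any payload that comes back, and fall back to the cached (possibly slightly stale) copy when the database is slow or unreachable. The names and the in-memory `Map` are assumptions made for illustration; as noted above, the real cache will be distributed across the globe rather than kept in process memory.

```ts
// Minimal sketch of a cached, stale-tolerant boot path. An in-memory Map
// stands in for the planned globally replicated cache; all names are
// illustrative, not Deno Deploy internals.

interface CachedPayload {
  fetchedAt: number;
  data: unknown;
}

const payloadCache = new Map<string, CachedPayload>();

async function loadBootPayload(nodeId: string): Promise<CachedPayload> {
  try {
    const res = await fetch(`https://metadata.example.internal/boot/${nodeId}`, {
      signal: AbortSignal.timeout(5_000),
    });
    if (!res.ok) throw new Error(`boot payload request failed: ${res.status}`);
    const fresh: CachedPayload = { fetchedAt: Date.now(), data: await res.json() };
    payloadCache.set(nodeId, fresh); // refresh the cache on every successful fetch
    return fresh;
  } catch (err) {
    // If the central database is slow or unreachable, fall back to the most
    // recent cached copy rather than rebooting, even if it is slightly stale.
    const stale = payloadCache.get(nodeId);
    if (stale) return stale;
    throw err; // nothing cached yet, so the failure still surfaces
  }
}
```

Serving slightly stale data keeps node start-up from depending on the central database being healthy at that exact moment, which removes the single point of failure described above.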

Have questions, suggestions or other thoughts? Feel free to drop us a line.