12/2019: Service downtime

Dec 16, 2019

At 3PM CEST (9PM UTC), GitBook experienced 6 hours of downtime. The downtime was caused by congestion on our servers, which were handling too many parallel task executions. As a result, our users received 429 HTTP errors (Too Many Requests) and our services were completely interrupted.

Timeline

As soon as our monitoring system alerted us at 3PM CEST, our engineering team started investigating the root cause of the issue. Our metrics and logs showed that our CPU execution quotas had been exceeded because of higher-than-expected task usage. We reached out to our service provider (Google) for a temporary solution, and they responded by raising our quotas, which resulted in a near-immediate recovery of our services (see chart below).

In green: the active instances that caused the CPU quota overruns.

Resolution

Raising the quotas was only a temporary fix. To prevent similar issues, we radically changed how task executions happen on our servers so that too many tasks are never run in parallel. This will ensure more linear handling of requests in the future and much more predictable behaviour.
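
As an illustration only, capping parallel task execution can be sketched in Python with a semaphore. This is a minimal sketch of the general technique, not the actual change made to GitBook's servers: the names `run_bounded`, `fake_task`, and `MAX_PARALLEL_TASKS`, and the limit of 4, are hypothetical.

```python
import asyncio

# Hypothetical cap; the real limit is not stated in this post.
MAX_PARALLEL_TASKS = 4


async def run_bounded(jobs, limit=MAX_PARALLEL_TASKS):
    """Run coroutine factories with at most `limit` executing at once.

    Returns the results in submission order, plus the peak number of
    jobs that were running at the same time.
    """
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def guarded(job):
        nonlocal running, peak
        # Jobs beyond the limit wait here instead of piling onto the CPU.
        async with semaphore:
            running += 1
            peak = max(peak, running)
            try:
                return await job()
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_task(n):
    """Stand-in for real work (illustrative only)."""
    await asyncio.sleep(0.01)
    return n * 2


def demo():
    # 20 jobs are submitted, but no more than MAX_PARALLEL_TASKS run together.
    jobs = [lambda n=n: fake_task(n) for n in range(20)]
    return asyncio.run(run_bounded(jobs))
```

With a cap like this, a burst of work turns into a queue rather than a spike in concurrent executions, which is what makes resource usage more predictable.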

Providing a reliable service that users can count on is core to our mission. Whenever we fail to deliver on that promise, it's important for us to be transparent and publish a post-mortem on the issue.

We're sorry for any inconvenience this downtime may have caused you. We're thankful for your trust, and we strive to live up to it by delivering an excellent product that you can count on.