On the morning of 4 June 2020, our alerting system started notifying the team of problems accessing GitBook content. We watched as GitBook content rapidly became unavailable. By midday, the majority of our content was inaccessible, to the point where we couldn't even send or receive emails using our @gitbook.com addresses.
This post-mortem aims to explain what happened, how we acted, and what we're doing to prevent it happening again.
Google Domains suspended our domains due to phishing activity on a site we had already suspended.
We put a workaround in place for custom domains, and had a day of sporadic communication with the Google Domains support team until they eventually lifted the suspension.
Moving forward, we will be improving our malicious content detection & the speed at which we suspend accounts.
We will also be moving our domains to CloudFlare Registrar.
With our services themselves reporting OK, we realised quite quickly that the problem was either with our DNS provider or the GitBook domains.
At 06:48 UTC, we received our first communication from the Google Domains team telling us that all GitBook domains had been suspended due to reports of phishing activities.
GitBook is a content creation platform. We allow anyone to write content and publish it to their team or to the public.
The vast majority of our users are using GitBook to write better content and share knowledge. However, a minority of (typically robotic) users abuse the GitBook platform to create spam content or content that attempts to impersonate others in order to phish information from visitors.
We've been dealing with this kind of abuse since day 1. We ask users to verify their accounts with an e-mail and a phone number so that we know they're genuine, we have the ability to remove content and ban users the moment we see they're abusing the platform, and our support team are diligent in responding to these kinds of requests.
Over the past few weeks, we've seen an increase in phishing content on the GitBook platform. With the help of some of our diligent users, we identified the content and banned all accounts involved. Google had also identified this content internally and flagged it against us, which eventually led to a blanket ban across all our domains.
We actively design our entire system to be able to handle things failing. We use a serverless architecture where possible, and employ a multi-node failover strategy whenever we need to deploy our own servers.
Unfortunately, a domain registrar is a single point-of-failure you can't mitigate against. If your registrar decides to stop handling "gitbook.com", you're completely at their mercy. DNS and domain handling is the cause of a huge number of outages, particularly in recent years.
We were immediately all hands on deck to get things back up and running. Our initial response was to open up a line of communication with the Google Domains support team to find out what content had been reported and what we could do to get things back to normal.
At 08:06 UTC, we received an email informing us of the specific domain that had been reported (and notably, that we had already banned from GitBook) and the steps necessary to unblock our domain (which involved a third-party scan of the content and a vulnerability report of GitBook as a whole, clearly not something attainable in the short term).
As it was our entire domain that had been suspended, there was no solution for the GitBook domains besides trying desperately to find someone on the Google support team to handle our case.
Many of our users have connected their custom domains to point to their GitBook content. For these users, we put in place a temporary working domain and asked our users to modify their CNAME entries to our new domain. We managed to spin up a CloudFlare worker and get things running around 09:30 UTC, and started letting our custom domain clients know how to use the workaround.
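As a rough illustration of the custom domain workaround, here is a minimal sketch of the host rewriting a proxy worker could perform once a customer's CNAME points at the temporary domain. This is not the code we deployed: `ORIGIN_HOST`, `toUpstream` and the idea of passing the original hostname along (for example in an `X-Forwarded-Host` header) are assumptions made for the example.

```typescript
// Hypothetical origin that still serves GitBook content. The real hostname
// is not given in this post; this value is an assumption for the sketch.
const ORIGIN_HOST = "origin.example.net";

interface UpstreamTarget {
  // URL the worker would fetch from the origin.
  url: string;
  // Custom domain the visitor requested, so the origin can pick the right
  // content (e.g. sent along as an X-Forwarded-Host header).
  forwardedHost: string;
}

// Rewrite an incoming custom-domain URL so that it targets the origin,
// keeping the path and query string untouched.
function toUpstream(requestUrl: string, originHost: string = ORIGIN_HOST): UpstreamTarget {
  const incoming = new URL(requestUrl);
  const upstream = new URL(incoming.toString());
  upstream.hostname = originHost;
  upstream.protocol = "https:";
  return { url: upstream.toString(), forwardedHost: incoming.hostname };
}
```

In a worker, the fetch handler would call `toUpstream(request.url)`, fetch the returned URL with the forwarded host attached, and stream the response back to the visitor.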
Around 11:00 UTC, we managed to get in touch with someone from the Google Domains support team via Live Chat. They told us that the case had been escalated to the compliance team and that we'd have their response via e-mail in 24h-48h... sent to an e-mail address on a domain they had just suspended...
With stress rising to insurmountable levels, we split our time frantically between trying to get in touch with literally anyone who could look at our case, responding on Twitter and Intercom to our customers, and trying to brainstorm ways of fixing the problem without relying on Google's response.
Finally, at 15:05 UTC, we received the following nonchalant e-mail:
... and our services started coming back after 8 hours of downtime.
27 May 2020
The malicious GitBook is detected by our content filter and the account suspended for phishing activity
04 June 2020
06:40 UTC - our alerting system tells us that certain GitBook services are inaccessible.
06:48 UTC - we receive an e-mail from Google telling us our domains have been suspended.
08:06 UTC - we receive a follow-up e-mail from Google telling us which content caused the suspension.
09:30 UTC - a mitigation for custom domain customers is made available.
11:00 UTC - we manage to get in touch with someone from Google Domains support via Live Chat.
15:05 UTC - Google lifts the suspension on our domains.
No customer data was lost during the downtime. Blocking our domains meant that our services were inaccessible, but they were still working correctly behind closed doors.
We accept full responsibility for moderating the content on GitBook. This situation has highlighted that we need to get better at it. Our initial steps include:
Detecting & blocking outgoing links to risky URLs as they're published in GitBook.
Detecting & blocking malicious content as it's published. Over the past few weeks our ability to detect spam has improved drastically, but we're still only doing it reactively. We'll be passing all GitBook sites through our content filter as they're published.
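The two steps above could be sketched roughly as follows: pull the outgoing links out of a page at publish time and check each hostname against a blocklist. This is only an illustration under stated assumptions, not our content filter. `RISKY_HOSTS`, `extractLinks`, `isRiskyHost` and `findRiskyLinks` are hypothetical names, and a real system would more likely query a reputation service than a hard-coded set.

```typescript
// Hypothetical blocklist of hostnames known to serve phishing pages.
const RISKY_HOSTS = new Set<string>(["phish.example", "login-verify.example"]);

// Collect http(s) links from published content with a simple pattern.
function extractLinks(content: string): string[] {
  const matches = content.match(/https?:\/\/[^\s)<>"']+/g);
  return matches ?? [];
}

// A host is risky if it, or any parent domain, is on the blocklist.
function isRiskyHost(hostname: string, blocklist: Set<string>): boolean {
  const labels = hostname.toLowerCase().split(".");
  for (let i = 0; i < labels.length; i++) {
    if (blocklist.has(labels.slice(i).join("."))) {
      return true;
    }
  }
  return false;
}

// Return every outgoing link whose host is on the blocklist.
function findRiskyLinks(content: string, blocklist: Set<string> = RISKY_HOSTS): string[] {
  const risky: string[] = [];
  for (const link of extractLinks(content)) {
    let hostname: string;
    try {
      hostname = new URL(link).hostname;
    } catch {
      continue; // Skip anything that does not parse as a URL.
    }
    if (isRiskyHost(hostname, blocklist)) {
      risky.push(link);
    }
  }
  return risky;
}
```

A publish hook could call `findRiskyLinks(pageContent)` and hold the page for review whenever the result is non-empty, which moves the check from reactive to publish time.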
Ultimately, we feel that what was originally a phishing problem was turned into a complete disaster by how it was handled at the domain level. Malicious content is, and will always be, a problem for products that allow user-generated content. We need to find a domain provider who understands this and works with us rather than against us, so we intend to move away from Google Domains. Here are our reasons why:
The first e-mail we received from Google was to tell us that they had suspended our domains. We didn't receive a single warning that they had received a report or that they were investigating.
The GitBook site that was originally reported is a phishing account that's been known to us for over a week now. The content had been scanned, identified as malicious, and immediately banned.
That means that the suspension we received on the 4th June was based on information received by Google at least a week earlier. In this case, we had acted correctly and suspended the account from our side, but Google didn't check again before issuing the ban (or even tell us that a report had been made).
Google Domains isn't suitable for sites hosting user content. As soon as a single URL on the site is detected as malicious, the entire domain is banned. One of the funnier (although certainly not at the time) exchanges with the Google Domains team was around whether Google themselves would use their domain service for their own products. Imagine the entirety of Blogger or YouTube going down due to a single piece of spam content.
Overall, communication with the support team was poor. We received very little from their side, and what little we did receive was unhelpful. At one point we were so desperate for a point of contact that we resorted to using our Talent Manager to try and source Google Domains engineers on LinkedIn.
As a product company, one of the most frustrating feelings is being powerless to fix something when you know it's harming your users. Many GitBook clients are using our product to provide content to their own users, and today we put them in the same situation as we were in.
We want to thank our users for their patience and understanding whilst we worked through this. There was a general feeling of support from the majority of our users despite everyone being in a difficult situation. Rest assured we're going to do everything we can to learn from our mistakes and ensure this series of events doesn't repeat itself.