GitLab, a startup with $25 million in funding, is having a “very bad day,” as Interim VP of Marketing Tim Anglade put it to Business Insider, after a series of human errors caused the service to go down overnight.
GitLab provides a virtual workspace where programmers work on their code together, merging individual projects into a cohesive whole. It’s a fast-growing alternative to the market leader, GitHub, a high-profile $2 billion Silicon Valley startup.
And as of Wednesday morning, GitLab was only just starting to come back online. But even worse than the embarrassment of such major downtime, the company now has to warn a handful of its customers that some of their data might be gone forever.
A bad day
The bad day started on Tuesday evening, when a GitLab system administrator tried to fix a slowdown on the site by clearing out the backup database and restarting the copying process. Unfortunately, the admin accidentally typed the command to delete the primary database instead, according to a blog entry.
By the time he noticed and scrambled to stop the deletion, “of around 300 GB only about 4.5 GB is left,” the blog explains. Oops. The site had to be taken down for emergency maintenance while the team figured out what to do, keeping users apprised via the company blog, Twitter, and a Google Doc that the GitLab team updated as new developments arose.
Making matters worse, they couldn’t just restore from a backup: “Out of 5 backup/replication techniques deployed none are working reliably or set up in the first place,” the blog said. “We ended up restoring a 6 hours old backup.” Which means that any data created in that six-hour window may be lost forever, Anglade says.
Bad news, good news
While the team restored that older version of the database, the site was completely down for at least six hours, Anglade says. Worse, intermittent failures as the service came back online lasted several more hours, with everything only starting to get back to normal on Wednesday morning.
The good news, says Anglade, is that the affected database didn’t actually contain anyone’s code, just things like comments and bug reports. Furthermore, Anglade says that the many customers who installed GitLab’s software on their own servers weren’t affected, since those installations don’t connect to GitLab.com. And most paying customers weren’t affected at all, the company said, which minimizes the financial impact.
The outage is bad, as is the looming possibility that some of that data might be gone, Anglade acknowledges, but nobody is going to have to start rewriting their software from scratch, and only around 1% of GitLab’s users will see any lasting effects from this incident.
As for the systems administrator who made the mistake, Anglade is hesitant to place blame, since it was really the whole team’s fault that none of their other backup systems were working. “It’s fair to say it’s more than one employee making a mistake,” he says.