If aliens were to invade earth, or an earthquake lopped California into the ocean, Google wants you to know that it will still be there for you.
That’s because a group of 10 Google engineers go around purposely breaking things at the company, under the most dire circumstances they can imagine, just to see what happens.
Through this process, the team makes sure that Google can keep itself running no matter what happens.
Sometimes this team makes Martians invade the earth. Sometimes they feign an earthquake or ecological disaster.
They strike their internal Google targets without warning.
This is Google’s renowned DiRT (Disaster Recovery Testing) team, led by director Kripa Krishnan, who has perfected the art of breaking Google over the past nine years.
Angry, tense and caffeinated
Despite the humorous scenarios she inflicts on her co-workers, the tests themselves are very, very serious, she tells us.
For one thing, the team is taking down actual, live systems, sometimes whole data centres. And if they bork things up too badly for too long and cause Google to have a serious outage, there will be money lost and hell to pay.
Before each test, people gather in various “war rooms,” sometimes located all over the world, Krishnan tells Business Insider.
“We have intense situations in the war room. There’s like 20-30 people sitting there. The room is warm. Everyone is very, very high on caffeine, ready to go and everybody’s angry all the time,” laughs Krishnan, who, during this interview, is the opposite of tense or angry in every way. She’s cheerful. She’s cracking jokes. She’s telling us funny stories of how her small band roams Google, wreaking havoc, with the help of hundreds of other Google experts who are called upon to work on the tests as needed.
But, if things go wrong, which they often do (because making things go wrong is the whole point), the stress in the war room really heats up, she says with a smile.
For instance, in the middle of one massively orchestrated test on the network that could have taken down a big chunk of Google, the team noticed that a popular app used by millions of people was slowing down.
They didn’t think their test would cause that. But they didn’t know for sure and they wondered if they should abort the test, which would also be dangerous at that particular moment, she recalls.
Within 15 minutes they determined their test was not the problem.
“But for those 15 minutes, we were just yipping at each other. Shouting, tears in the war room,” she recalls.
Googlers don’t phone home
A funnier situation involved the incident command team trying to communicate with each other. They were planning a scenario where the internet was down, making things like chat rooms or Google Hangouts unavailable. So the plan was that everyone would just go retro and dial into an old-fashioned telephone bridge.
Only it turns out Googlers “don’t like phones,” she said.
People couldn’t find the bridge phone number, and they didn’t know basic things like how to dial a phone to get an outside line.
Worst of all, once they all did get on the line, some folks would put the phone on hold, forcing the rest of the team to stop talking and listen to hold music.
“We didn’t know how to work the bridge to boot those people out,” she laughs.
Ditto for satellite phones. Years ago, Google bought its top engineers and execs satellite phones so they could always be reached in case of a disaster.
When the DiRT team tested that idea with a fake earthquake, 100% of the people with satellite phones failed to use them. They couldn’t find the phones, or they weren’t charged, or the only way to get a satellite signal was to “climb onto the roof,” and “that’s not a safe thing to do,” especially after an earthquake, she said.
“If we hadn’t tested this, we would never have known. We would have kept investing in satellite phones,” she says.
Beg, borrow, and credit card
One time, Krishnan and her team put a data centre to the test by telling its staff that a flood had forced the facility off the grid and onto a backup generator running on diesel fuel. The staff were instructed to purchase a massive quantity of fuel.
She was trying to get them to activate Google’s procedure to release emergency funds. Instead, no matter what the DiRT team threw at them (a bigger flood, a fire in another room), the engineers came up with “creative solutions to find money.”
The engineers asked their local community to donate or lend them supplies, with fees to be paid later. Someone even offered the use of a credit card with an enormous credit line to pay for things.
They never called the person who would send them the emergency fund money. But they never let the Google site go down, either.
“They take it seriously, this role playing stuff,” she says.
Another time, the team tested the HR department. The scenario was that a meteor had crashed into earth, stranding employees all over the world.
“The whole point of it was to completely bombard the HR department,” Krishnan says, particularly with questions it wouldn’t know how to answer, such as whether to approve expensing a $15,000 flight home or buying clothes to replace lost luggage. The HR team “shocked” the DiRT team by quickly organising itself and handling the onslaught very well, she says.
Next up: automated ‘chaos engineering’
Google isn’t the only giant internet company in the Valley that needs to make sure it never goes down.
So a small team of testers from other big companies have started to work together to share best practices. They call their young discipline “chaos engineering,” Krishnan says.
These folks are currently working on ways to automate some of the tests.
“Right now, scale is our problem. We are doing hundreds of tests, but I cannot scale my team to hundreds of people. So we are exploring automating some of this. How do you constantly cause damage so systems are constantly recovering?”
That kind of thing has never been done before, but after nearly ten years, she’s used to that.
All the years of testing have taught Krishnan one important thing: it’s not enough to have disaster plans and backup technology in place. People must test, change and perfect them.
“We want people to practice enough so they get the right concepts. And then we trust them to wing it. Give them a lot of space for solving the problem,” she says.