When you’re running at Facebook’s scale, you’re going to run into problems that no other tech company has ever encountered before.
That means it falls to Facebook itself to build the tools it needs to handle the massive amounts of data it has to crunch every day.
Enter Facebook Presto, a data-crunching tool built in-house at the social network.
When Presto was first revealed in 2013, Facebook’s analysts and engineers were using it to ask questions of the company’s data warehouse, then 300 petabytes in size, and get answers fast.
Released by Facebook as open-source code, the technology has spread beyond the social network’s confines and into major organisations such as Netflix and NASDAQ, which value the tool’s flexibility when dealing with mountains of data. Its rapid adoption highlights Facebook’s growing influence and ability to shape the cutting-edge technology that powers today’s internet economy.
More than 90 outside developers have volunteered their time to improve Presto over the last two years, bolstering Facebook’s in-house efforts, according to a blog post released today.
The magic of Presto is that it presents a massively more efficient way to deal with data at large scales, says Jay Tang, who leads Facebook’s “interactive analytics infrastructure.”
Hot open source technologies like Apache Hadoop and Apache Hive sparked the so-called “big data” revolution, giving companies a vastly more efficient way to process large quantities of information.
Facebook uses both of those technologies, Tang told Business Insider. But the problem is that Hadoop and Hive are optimised for reliability — not speed.
“Running a query,” the technical term for asking a question of a database, isn’t impossible on Hive, but it often requires copying the data elsewhere and processing it to make it more digestible by data experts.
Given how much emphasis Facebook puts on “moving fast,” it really harshes the vibe when engineers can only run a few queries a day on their data.
“Presto is trying to solve a very specific problem,” says Tang.
Tang emphasises Presto’s “very unique architecture,” which brings Mohammed to the mountain, so to speak: the query goes to the data, not the other way around.
Rather than shuffling the data around, Presto can read data in Hadoop, Hive, and other databases right where it sits, letting researchers use the SQL querying language they already know.
“Presto gives you the ability to query data wherever it lives,” Tang says.
Beyond Silicon Valley
When Facebook first released Presto, Tang says, its main appeal was to those few developers on the bleeding edge.
But thanks in large part to the mobile revolution, companies of all sizes are dealing with ever-growing sets of data, and are starting to run into the same problems that Facebook solved years ago.
“A lot of companies are facing the same problem,” Tang says.
For example, Airbnb has turned to Presto to build Airpal, a tool that puts access to data right in front of employees. Gree, a Japanese social gaming giant, uses Presto because it integrates more smoothly with Hadoop and the other data center infrastructure it has in place.
And NASDAQ and Netflix have combined Presto with Amazon Web Services to make more efficient use of their cloud infrastructure.
Tang promises that really large “Fortune 50” companies are using Presto, too, but they’re gun-shy about sharing the details.
But companies like Teradata and MicroStrategy recently announced support for Presto in their commercial data software offerings, building out the stuff that can make it more appealing to the largest enterprises.
Crucially, Tang says, they contribute back the data connectors that they develop for Presto under that open source model, improving the core project and furthering its overall usefulness. Thanks to the Presto community’s efforts, it now has a “set of rich connectors,” Tang says.
“You definitely need a vibrant, open community,” Tang says.