A Quick History of Big Data & Hadoop
- Facebook, Yahoo! and Google found themselves collecting data on an unprecedented scale. They were the first massive companies collecting tons of data from millions of users.
- They quickly overwhelmed traditional data systems and techniques like Oracle and MySql. Even the best, most expensive vendors using the biggest hardware could barely keep up and certainly couldn’t give them tools to powerfully analyze their influx of data.
-
In the early 2000’s their armies of PhDs developed new techniques like MapReduce, BigTable and Google File System to handle their big data. Initially these techniques were held proprietary. But…
- Around 2005 Facebook, Yahoo! and Google started sharing whitepapers describing their big data technologies.
- In 2006 Doug Cutting starts the Hadoop project as an open source version of these technologies.
- Companies in every industry now find themselves with big data problems because their ability to collect data grows every day.
- A thriving ecosystem of companies, projects and individuals has emerged to tackle big data problems.
What is Big Data?
Big Data generally means having so much data that you overwhelm your traditional systems and techniques. Systems that worked last year, and that felt nimble when launched, suddenly feel sluggish as the burden of massive data loads crushes them. Systems work… engineers make heroic efforts to guarantee that they do. But they never feel agile or responsive again.
Although big data’s often in the many-terabyte, petabyte and exabyte range, there is no official size threshold. In fact, some of the best big data problems don’t involve massive amounts of data… they just require massive amounts of processing on that data.
Signs of a Big Data Problem
- Batch jobs that take too long to run… what if you had that business intelligence in a matter of minutes?
- CPU-bound database or datawarehouse servers
- Repeated emergency meetings to discuss scaling of the data systems
- Long waits just to move data around
- Business managers asking for insight that IT can’t provide
What is Hadoop?
Hadoop is an open source software platform that makes big data look like normal data. It makes it possible to do very complex analysis against very large data sets that would overwhelm even the biggest and most expensive database installations. The problem with traditional databases and techniques is that they invariably centralize data. A massive Java/MySQL app is architected such that many Java computing machines sit around a central MySQL machine (or even a MySQL cluster.) Scaling to thousands of Java compute machines means that your central MySQL installation gets hammered by requests. In essence you’re running a distributed denial of service (ddos) attack on your own systems! At a fundamental level it’s the disk IO bottleneck that prevents your system from scaling.
Hadoop changes the core architecture of computing problems. Instead of a centralized data store it chunks the massive data sets and stores those chunks all across a cluster of machines. Then, and this is key, it sends compute jobs *out to* the data! So the compute jobs run where the data is. This system leverages disk IO across the entire cluster by putting the data close to the CPU that needs it.
Hadoop is built on two main components… MapReduce and the Hadoop File System. MapReduce is what chunks processing code out to the cluster. The Hadoop File System is what chunks data out to the cluster.
MapReduce is a way of programming computational problems. While MapReduce jobs can be written in many languages, most are written in Java. So MapReduce isn’t a language. It’s a way of thinking about computing problems. It’s another way to skin a cat. If you have business computations encoded in PL/SQL, Java, stored procedures or some arcane XML BI syntax, MapReduce can accomplish the same task.
The Hadoop File System (HDFS) allows tens, hundreds or thousands of servers to share files. It’s like creating one hard drive from thousands. It’s redundant… mission-critical data is always stored on at least three machines… if one machine goes down HDFS automatically shuffles a new copy of your data to another machine. And it’s smart… HDFS knows how to move segments of data closer to computing processes that need it.
Hadoop provides the framework to use MapReduce and HDFS to run massive compute jobs. When you stand up a cluster of 1000 machines, Hadoop keeps track of which one is running which job, where its data is stored, etc.
Massive Investment Momentum in the Big Data Space
Big deals are flowing within the Big Data space as enterprises across all industries encounter the same data problems that Facebook, Yahoo! and Google did ten years ago. This is good for all enterprises as it means that the tools will continue to mature and industry-specific solutions will emerge.
- EMC to spend $3B on big data in 2011 after spending $3B in 2010
- IBM invests $100M in big data
- Yahoo! Spins HortonWorks out @ $200M valuation
- HP Acquires Vertica
The list of deals, big and small, goes on and on. Venture capital is pursuing the space more aggressively than it did the social media space because big data points of pain aren’t tied to discretionary marketing budgets… they’re core to an enterprise’s existence.
The space is still nascent with many investments being made in toolsets which will compete and often lose against open source community-developed solutions. SocketWare chooses a strategy of deep industry insight to create hard-to-replicate products.