You can’t have a discussion about big data or data science these days without mentioning Hadoop, but when you ask people what Hadoop is, they struggle to define it. I define Hadoop as a software framework for harnessing the power of distributed, parallel computing to store, process, and analyze large volumes of structured, semi-structured, and unstructured data.
Imagine having to write a 30-chapter book. You have two options: write it all yourself, or write the one chapter you know the most about and farm out the other 29 chapters to experts on those topics. Which option would deliver the best book in the least amount of time? Distributing the work among 30 experts would expedite the process and result in a higher quality product. The same is true with distributed, parallel computing; instead of having one computer do all the work, you distribute tasks to dozens, hundreds, or even thousands of inexpensive “commodity” servers that perform the work as a collective. This group of servers is often referred to as a Hadoop cluster, and Hadoop coordinates their efforts.
Three Key Features
Hadoop has three key features that differentiate it from traditional relational databases and make it an attractive option for working with big data:
- Support for semi-structured and unstructured data: With a traditional relational database, any incoming data must be structured to fit the database’s existing structure (or schema). With Hadoop, data can be imported without consideration of structure. The structure is applied when the data is read, an approach known as schema on read.
- Hadoop Distributed File System (HDFS): Instead of storing data in interrelated tables, Hadoop stores it as files, splits each file into large blocks, distributes the blocks across different nodes, and replicates each block (three copies by default). Storing identical blocks on separate nodes facilitates distributed, parallel computing while making the system more fault tolerant: if a node fails, another node that holds a copy can pick up the slack.
- MapReduce: MapReduce is a programming model for distributed, parallel computing; Hadoop’s implementation of it is written in Java. It assigns work to various nodes in a Hadoop cluster. The work is broken down into two procedures, map and reduce. Map filters and transforms the input into key-value pairs, the framework sorts and groups those pairs by key, and reduce then pulls together the results for each key and summarizes them.
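To make the map and reduce procedures concrete, here is a minimal sketch of the classic word count, simulated in plain Python on a single machine. It does not use Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration. On a real cluster, the map and reduce calls would run in parallel on different nodes, and the framework would handle the grouping step.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    # Sorting the keys mirrors the sorted order in which reducers see them.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Hypothetical sample input: each string stands in for one line of a large file.
lines = ["big data needs big clusters", "hadoop processes big data"]
counts = word_count(lines)
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, the work can be split across many servers without the tasks needing to talk to one another.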
Advantages of Hadoop
Hadoop has several advantages over traditional relational databases, including the following:
- Flexible: Hadoop can store data of virtually any type or format, so organizations can load diverse data into their data warehouse without spending a lot of time transforming it to match an existing schema.
- Scalable: If more storage or compute resources are required, more servers can be added to the cluster to handle the load.
- Fast: By distributing workloads to numerous servers and coordinating their parallel processing, Hadoop significantly increases the speed of processing and analytics. Also, with the ability to add Hadoop clusters, performance doesn’t suffer when multiple users need to access the same data or when the system is engaged in different processes, such as loading data versus analyzing data.
- Economical: Cloud storage and processing are quickly becoming inexpensive commodities, and companies that sell these commodities often charge according to the amount of storage and compute resources used. With Hadoop, usage can be scaled up and down at a moment’s notice to increase performance when needed and reduce costs when less storage and compute are needed.
- Fault resistant: Because storage and compute are distributed, if one server fails, its workload can be transferred to another server, typically with no loss of data and little impact on performance.
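The fault resistance described above comes from splitting files into blocks and keeping copies of each block on several servers. The following sketch simulates that idea in plain Python. It is an illustration under stated assumptions, not how HDFS actually places blocks: real HDFS uses a rack-aware placement policy, whereas this sketch uses a simple round-robin scheme. The function names, node names, and the 1,000 MB file size are hypothetical.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return how many blocks a file needs (ceiling division).

    128 MB is the default block size in Hadoop 2.x and later.
    """
    return -(-file_size_mb // block_size_mb)


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Assumption: simple round-robin placement, used here for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + offset) % len(nodes)] for offset in range(replication)]
        for block in range(num_blocks)
    }


def available_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical four-node cluster
num_blocks = split_into_blocks(1000)           # a 1,000 MB file needs 8 blocks
placement = place_blocks(num_blocks, nodes)

survivors = available_blocks(placement, failed_nodes={"node2"})
print(f"{len(survivors)} of {num_blocks} blocks still readable after node2 fails")
```

With three copies of every block, losing any single node (or even two) in this sketch leaves every block readable. Data becomes unavailable only when all the nodes holding a given block fail at once.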
Disadvantages of Hadoop
Hadoop certainly helps to overcome many of the challenges of big data, but it is not, in itself, an ideal solution. It is weak in the following areas:
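- Small files: Hadoop is geared to handle large files. HDFS handles large numbers of small files inefficiently when they are significantly smaller than the default storage block (64MB in Hadoop 1.x, 128MB in later versions), because the metadata for every file and block is held in memory on a single master node, the NameNode.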
- Security: Hadoop wasn’t designed with security in mind. Security features were added later, but the built-in security features aren’t considered adequate for most enterprises, especially in industries that require strict security and privacy, such as healthcare and finance.
- Data consolidation: Many organizations that run Hadoop essentially have two data warehouses: one for internal (structured) data and another for external (semi-structured and unstructured) data. They struggle to combine data sets from the two warehouses to run queries or analytics.
- Ease of use: Most database users are familiar with Structured Query Language (SQL), which enables them to easily query data in traditional databases. With Hadoop alone, users need to hand-code each operation as a MapReduce program, which is significantly more complicated. However, add-ons are available to simplify the process, such as Hive, which provides a SQL-like query language, and Pig, which provides a higher-level scripting language.
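To put rough numbers on the small-files weakness listed above, the sketch below estimates how much metadata a cluster's master node would track for the same amount of data stored as many small files versus fewer large ones. It relies on two assumptions that are not from this article: a commonly cited rule of thumb of about 150 bytes of memory per file or block object, and a 128MB block size. The function names are hypothetical, and the results are order-of-magnitude estimates only.

```python
BYTES_PER_OBJECT = 150  # assumption: commonly cited rule of thumb, not an exact figure


def metadata_objects(total_mb, file_size_mb, block_size_mb=128):
    """Count the file and block objects needed to store `total_mb` of data."""
    num_files = -(-total_mb // file_size_mb)             # ceiling division
    blocks_per_file = -(-file_size_mb // block_size_mb)  # at least one block per file
    return num_files * (1 + blocks_per_file)             # one file object plus its blocks


def metadata_memory_mb(total_mb, file_size_mb, block_size_mb=128):
    """Estimate the memory used for metadata, in megabytes."""
    objects = metadata_objects(total_mb, file_size_mb, block_size_mb)
    return objects * BYTES_PER_OBJECT / (1024 * 1024)


total_mb = 1024 * 1024  # 1 TB of data, expressed in MB

small = metadata_objects(total_mb, file_size_mb=1)     # stored as 1 MB files
large = metadata_objects(total_mb, file_size_mb=1024)  # stored as 1 GB files

print(f"1 MB files: {small:,} objects, ~{metadata_memory_mb(total_mb, 1):.1f} MB of metadata")
print(f"1 GB files: {large:,} objects, ~{metadata_memory_mb(total_mb, 1024):.1f} MB of metadata")
```

Under these assumptions, the same terabyte of data requires a few hundred times more metadata when it is stored as 1MB files, which is why small files are usually combined into larger ones before being loaded.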
Commercial distributions of Hadoop, including Amazon Web Services Elastic MapReduce, Cloudera, Hortonworks, and IBM InfoSphere BigInsights, are available to help overcome these and other limitations of the open source Apache Hadoop.