You are about to become an expert on all that is Hadoop. I am going to explain the technology and then the main Apache Software Foundation projects and components around it, including HDFS, NameNode, DataNode, MapReduce, Hive, Impala, Drill, Parquet, Spark, Pig, Mahout, YARN, ZooKeeper, Sqoop, Flume, and Oozie.
The two most important concepts about Hadoop are commodity hardware servers and the Hadoop Distributed File System (HDFS).
Hadoop is built around the concept of commodity hardware servers. These are not sophisticated and expensive servers that you would buy from Teradata, Oracle, Microsoft, or IBM. These are inexpensive Linux servers that have cheap disk storage. You will often put 10 servers in a rack. Now imagine you have 10 racks in a computer center somewhere. You’d have 100 servers networked together!
The great thing about Hadoop is that you don’t need all your servers in one place. You might have a series of 100 racks (1000 servers) in different parts of the country. Now imagine that you have 500 racks totaling 5000 servers spread around the world. This is the hardware that comprises your entire Hadoop system, which is referred to as a cluster. The concept of Hadoop is to store and query massive amounts of data reliably and cheaply.
One of your servers will be the NameNode. All the other nodes are referred to as DataNodes. The NameNode is also known as the master. The NameNode only stores the metadata of HDFS, which is really just the directory tree of all files in the file system. The NameNode tracks these files across the cluster like a bloodhound. The NameNode is responsible for knowing what files (tables) are on what DataNode servers. The DataNodes are where the data is stored and queried.
The Hadoop Distributed File System is designed to simply write data in either 64 MB blocks or 128 MB blocks to a specific server. Most relational databases like Teradata write their data in 1 MB blocks. This is done so Teradata can quickly insert, update, delete, or index a row. Hadoop is writing its data in blocks designed to be 128 times as big, but Hadoop isn’t going to allow any updates, and all inserts are appended at the end of the last block written.
When a 128 MB data block is filled on Hadoop, another block is opened and begins filling up. This continues as the file/table continues to load data. If you had a file that was 276 MB in size, then it would contain two blocks of 128 MB and another smaller block containing 20 MB. The NameNode would know that these three blocks were stored on servers one, five, and 5000 (for example).
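The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 276 MB example, not how HDFS itself is implemented; the function name split_into_blocks and the block_locations dictionary are made up for this sketch, and the server numbers are the ones from the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of this size would fill."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(276)
print(blocks)  # [128, 128, 20]

# Toy stand-in for the NameNode's metadata: block number -> server number.
# The servers (1, 5, 5000) are just the example values from the text.
block_locations = dict(zip(range(len(blocks)), [1, 5, 5000]))
print(block_locations)  # {0: 1, 1: 5, 2: 5000}
```

Passing block_size_mb=64 shows how the same file would be laid out with the smaller block size.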
Then, each DataNode that contains a block of data copies that block to two other servers in case the node or disk fails. You can choose how many copies you want for redundancy, but the automatic default is three copies. This means that if a file is 50 GB, then it really occupies 150 GB of disk because of the redundancies.
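Here is a minimal sketch of the replication idea, assuming the default of three copies. The helper names raw_storage_gb and place_replicas are hypothetical, and picking the extra servers at random is a deliberate simplification: real HDFS uses a rack-aware placement policy that this sketch does not model.

```python
import random


def raw_storage_gb(file_size_gb, replication_factor=3):
    """Disk space actually consumed once every block has all its copies."""
    return file_size_gb * replication_factor


def place_replicas(primary_node, all_nodes, replication_factor=3, seed=None):
    """Pick the servers holding one block: the primary plus the extra copies.

    Simplification: the extra servers are chosen at random here, whereas
    real HDFS takes rack locations into account.
    """
    rng = random.Random(seed)
    others = [node for node in all_nodes if node != primary_node]
    if len(others) < replication_factor - 1:
        raise ValueError("not enough servers for the requested copies")
    return [primary_node] + rng.sample(others, replication_factor - 1)


print(raw_storage_gb(50))  # 150

# One block whose first copy lives on server 1 in a 10-server cluster.
print(place_replicas(1, range(1, 11), seed=0))
```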
Hadoop was first designed not to use SQL but instead to use something called MapReduce. Let me give you a good picture of MapReduce. I want you to imagine that you have an Insurance Claims table that consists of 100 blocks on 100 servers. Now imagine you want to summarize all the claim amounts per state. In the original version of MapReduce, a JobTracker on the master node would get your job and learn from the NameNode which servers held the Claims data. The JobTracker would then contact and monitor the MapReduce tasks on each node. Each DataNode involved would have a TaskTracker that reports back to the JobTracker until the job was finished.
The first part of the MapReduce function would be to Map, and the second part would be to Reduce. The Map portion would command that all 100 nodes each go through their data block and summarize the Claims data they hold per state. Each DataNode would then be contacted to move its intermediate results to a different DataNode per state. Therefore, all the Ohio data would go to a specific server and the California data would go to another server until all 50 states were on 50 different servers. Then, the Reduce part takes over and each DataNode totals up all the claims it has for its state. That is MapReduce.
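The claims-per-state example can be mimicked on one machine with plain Python. This is a toy sketch of the map, shuffle, and reduce steps described above, not Hadoop's actual Java API; the function names and the three small sample blocks are invented for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Map: each node summarizes the claims in its own block per state."""
    partial = defaultdict(int)
    for state, amount in block:
        partial[state] += amount
    return dict(partial)


def shuffle_phase(partials):
    """Shuffle: send every state's partial totals to one place per state."""
    grouped = defaultdict(list)
    for partial in partials:
        for state, subtotal in partial.items():
            grouped[state].append(subtotal)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: total up everything that arrived for each state."""
    return {state: sum(subtotals) for state, subtotals in grouped.items()}


# Three made-up blocks of (state, claim amount) rows, one per "server".
blocks = [
    [("OH", 100), ("CA", 250), ("OH", 50)],
    [("CA", 75), ("TX", 300)],
    [("OH", 25), ("TX", 100)],
]

partials = [map_phase(block) for block in blocks]
totals = reduce_phase(shuffle_phase(partials))
print(totals)  # {'OH': 175, 'CA': 325, 'TX': 400}
```

On a real cluster each call to map_phase would run on a different server at the same time, which is where the speed comes from.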
Hive is the first SQL-on-Hadoop project, and it was incubated at Facebook and given to the Apache Software Foundation. It is written in Java, and all Hive SQL is translated under the hood into MapReduce. It has a Metastore (a database) to store table schemas, partitions, and locations, and it organizes datasets so they have the look and feel of database/table/view conventions. Hive provides an excellent batch processing solution, but its high latency makes interactive querying slow.
Impala was built by Cloudera, one of the biggest Hadoop vendors on the market, which has contributed it to the Apache Software Foundation. Like Hive, it is an SQL engine, but it does not query HDFS using MapReduce. Impala takes SQL and queries HDFS like a traditional massively parallel processing (MPP) system by installing its own set of execution daemons alongside each DataNode. It is written in C++ and uses a lot of RAM, so queries can be anywhere from 5% to 80% faster than Hive. It is best implemented with Parquet, a columnar storage format for HDFS.
When you hear the word Parquet, it means that you are turning your HDFS files into columnar storage. When a Parquet file is built, the values of each column are stored together in their own chunks rather than row by row. When a normal row-oriented HDFS file/table is queried, the entire block must be moved into memory even if you are only asking for a few columns. With a columnar design like Parquet, only the columns needed to satisfy the query are moved into memory, thus making many queries much faster. Most Cloudera implementations recommend using the Parquet columnar storage, but you can also use Parquet on a Hive system.
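To see why a columnar layout helps, here is a small Python sketch that counts how many values must be read for a query that only needs two columns. It is a conceptual illustration only and does not use or imitate the real Parquet file format; the sample rows and function names are made up, and the sketch assumes every row has the same columns.

```python
def to_columns(rows):
    """Turn a list of row dictionaries into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_layout(rows):
    """Row layout: every field of every row is loaded, wanted or not."""
    return sum(len(row) for row in rows)


def values_read_column_layout(columns, wanted):
    """Columnar layout: only the requested columns are loaded."""
    return sum(len(columns[name]) for name in wanted)


rows = [
    {"claim_id": 1, "state": "OH", "amount": 100, "adjuster": "Smith"},
    {"claim_id": 2, "state": "CA", "amount": 250, "adjuster": "Jones"},
    {"claim_id": 3, "state": "OH", "amount": 50, "adjuster": "Lee"},
]
columns = to_columns(rows)

# A "claims per state" query only needs the state and amount columns.
print(values_read_row_layout(rows))                             # 12
print(values_read_column_layout(columns, ["state", "amount"]))  # 6
```

The wider the table and the fewer columns a query touches, the bigger the gap between the two numbers becomes.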
Here is the part that is really going to make things clear. Many companies use both Hive and Impala together on one Hadoop cluster, but here is the shocking part: the data underneath for both Hive and Impala is still the Hadoop Distributed File System (HDFS). You can load data onto Hadoop using Hive or Impala, and then you can delete the table on either Hive or Impala. Either way the table is gone! Neither Hive nor Impala is a database. The database is HDFS, but Hive and Impala both query and present the HDFS tables differently!
You will also hear about another Apache project called Apache Drill. Drill is a SQL engine like Hive and Impala, but where Hive and Impala are designed as SQL engines for querying HDFS, Drill can go further. Besides HDFS files, Drill can query Excel, CSV, Parquet, JSON, Linux files, Amazon S3 or Azure files, MongoDB or other NoSQL stores, and relational tables, all in a single query. Drill requires no data modeling or pre-defined schemas. You merely install Drill on all or just some of your Hadoop DataNodes. Drill is fast, flexible, and extremely versatile in querying or joining data across different platforms. Drill is designed for speed because its in-memory execution engine processes data in a columnar fashion.
Another major Apache project is Apache Spark. Spark is not a version of Hadoop but a processing engine that can run on a Hadoop cluster. It relies heavily on processing data in memory, and it is best used for analytic computing. Spark is also used for big data workloads because it utilizes in-memory caching and optimized execution for fast performance. It supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries. It provides high-level APIs in Scala, Java, Python, and R.
Sqoop stands for SQL-to-Hadoop. This is how you load data from relational databases into Hadoop HDFS or HBase. While Teradata uses TPT, and Oracle and SQL Server use Bulk Copy as their data movement tools, Hadoop uses Sqoop.
Flume is used to load log files into Hadoop from devices in the IoT world. A flume in the real world is a man-made water channel used to float logs to a sawmill, where they are cut up and processed. Flume for Hadoop transports computer logs in data streams directly from a wide variety of gadgets. Where Sqoop moves data into Hadoop from relational databases, Flume similarly moves data into Hadoop from devices.
Oozie is a Java-based application that acts as a workflow scheduler to manage Hadoop jobs.
Mahout is a distributed and scalable project that builds machine learning algorithms on Hadoop. For example, it can provide recommendations based on users’ tendencies found in the data you provide to the Hadoop system.
ZooKeeper is the Apache project that provides a centralized coordination service for large implementations of Hadoop commodity servers, enabling synchronization across large clusters. Distributed applications require coordination services, such as naming services that allow one node to find a specific server in a cluster of thousands of servers, or services that serialize updates.
YARN is a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.
What is the best tool to query, convert, and move data to and from Hadoop and share results with other users? The Nexus Query Chameleon – which has been used in production by many of the largest corporations in the world.
Nexus has developed four different foundations that build upon themselves to provide large enterprises the perfect enterprise data strategy.
- Nexus converts table structures and data types between all systems automatically so anyone can move a single table or an entire database between any systems, whether on-premises or in the cloud.
- Nexus shows tables/views visually and their relationships with other tables and builds the SQL automatically as users point-and-click. Since Nexus can also automatically convert and move tables between systems, this allows users to perform automatic cross-system joins between any combination of systems, including Hadoop. Nexus even allows the user to choose on which system they want the cross-system joins processed.
- Nexus takes every returned answer set and places it in the Garden of Analysis where a user can join answer sets or re-query them with point-and-click templates to get analytics, graphs and charts, and additional reports that are processed inside the user’s PC. This allows a user to query a system once and then generate up to 50 additional reports with sub-second response time because all of the processing is done on the user’s PC.
- Nexus has BizStar, which allows users to receive reports, Excel spreadsheets, Word documents, videos, and unstructured data so data can be shared. One person can run a query and thousands can see the results! BizStar also has a series of menus that allow users to run queries with a single click of a button. BizStar even has the Multi-Step Process Builder so users can perform a wide variety of repetitive tasks on data and automate the entire process from start to finish. This is the 8th wonder of the world!
It is the combination of these four foundations that provides the world’s largest financial institution, the world’s most successful PC maker, and the most prominent telecommunications company with the most sophisticated data strategy.
Imagine doing a cross-system join between any combination of on-premises or cloud computing systems made up of Teradata, Aster Data, Oracle, SQL Server, DB2, Amazon Redshift, Azure SQL Data Warehouse, Hive, Impala, and even Excel in a single query that has been automatically built by merely pointing on the tables needed and selecting the columns desired for the report.
Now imagine taking that answer set and using the Garden of Analysis to generate another 50 reports in minutes and then sharing those reports with hundreds or even thousands of team members in their BizStar menus.
Now imagine setting up this entire process in the BizStar Multi-Step Process Builder so it is automated and can be scheduled to run.
Our advice: Integrate your on-premises traditional systems with your cloud strategy and combine Hadoop among them all. Automate as much as possible and process additional analytics, graphs and charts, and reports on the user’s PC when it makes sense. And share information across entire teams so everyone can work as one cohesive unit. That is 22nd-century processing in the 21st century!