Coffing Data Warehousing Software

Hadoop Explained

Tera-Tom here! Experts say that up to 75% of the entire world's data will be stored on Hadoop by the year 2020. Yes, that's right, 75%! That is a huge influx of data over the next several years, and a huge opportunity for your organization: data that goes uncaptured today will become commonplace.

What is Hadoop, where did it come from, and why is it so popular?

Hadoop grew out of work at Google, whose engineers published the papers on the Google File System and MapReduce that showed how to analyze massive amounts of web data. Doug Cutting and Mike Cafarella created Hadoop as an open-source implementation of those ideas, and Yahoo! adopted and expanded the project, working with the open source community. Some of the engineers at Yahoo! eventually left and founded Hortonworks.

The name Hadoop came from Doug Cutting, chief architect of Cloudera and one of the original creators. Cutting’s son, then 2, was just beginning to talk and called his beloved stuffed yellow elephant “Hadoop”.

Hadoop is so popular because it is all about unstructured data. As much as 80 percent of the data created each day is unstructured, and as a result impossible to mine with traditional tools. Hadoop brings structure to the chaos, storing these data sets across distributed clusters of servers at a much lower cost than legacy servers.

Here is the bottom line of how it works. Hadoop is all about parallel processing, full table scans, unstructured data and commodity hardware. A single server called the "NameNode" keeps track of all of the data files stored on the "DataNodes". Each DataNode sends the NameNode a heartbeat every few seconds; a node that stops checking in is deemed dead.
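That liveness check can be sketched in a few lines of Python. This is a toy model, not HDFS itself; the class and method names are invented for illustration, while the three-second heartbeat and roughly ten-minute dead-node timeout follow HDFS's defaults. The DataNodes report in, and the NameNode notices which ones have gone silent.

```python
import time

HEARTBEAT_INTERVAL = 3   # seconds between DataNode heartbeats (HDFS default)
DEAD_AFTER = 10 * 60     # a node silent this long is declared dead (~HDFS default)

class NameNode:
    """Toy model of the NameNode's liveness bookkeeping."""
    def __init__(self):
        self.last_heartbeat = {}   # data node id -> last time it checked in

    def receive_heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items() if now - t > DEAD_AFTER]

# Usage: "dn2" stops reporting and is eventually declared dead.
nn = NameNode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_heartbeat("dn1", now=700)   # dn1 keeps checking in; dn2 goes silent
print(nn.dead_nodes(now=700))          # ['dn2']
```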

When a user wants to create a table, the table definition is copied to each data node. Data is loaded in small blocks directly onto the data nodes, with each node holding a portion of the data. Just like a card dealer deals cards to the players, with each player holding an equal number of different cards, the data nodes each hold their share of data blocks. When a user queries or mines the data, each node reads its data blocks in parallel and an answer set is returned. For recovery purposes, each data node copies its data blocks to two other nodes, so all data exists in triplicate in case a node is deemed dead.
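Here is a minimal Python sketch of that card-dealing idea. It illustrates the technique rather than actual Hadoop code; the node names and function names are invented, and it uses the HDFS default of three copies per block (one primary plus two replicas).

```python
from concurrent.futures import ThreadPoolExecutor

REPLICATION = 3   # HDFS default: every block lives on three nodes

def deal_blocks(blocks, nodes):
    """Deal blocks round-robin, like cards: one node gets the primary copy,
    and the next two nodes each get a replica for recovery."""
    primary = {n: [] for n in nodes}
    replicas = {n: [] for n in nodes}
    for i, block in enumerate(blocks):
        primary[nodes[i % len(nodes)]].append(block)
        for r in (1, 2):
            replicas[nodes[(i + r) % len(nodes)]].append(block)
    return primary, replicas

def parallel_scan(primary, predicate):
    """Each node scans only its own blocks; partial answers are merged."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda blks: [b for b in blks if predicate(b)],
                         primary.values())
    return [b for part in parts for b in part]

# Usage: four nodes, eight blocks; find every block greater than 5.
primary, replicas = deal_blocks(list(range(8)), ["dn1", "dn2", "dn3", "dn4"])
print(parallel_scan(primary, lambda b: b > 5))   # [6, 7]
```

Because each node scans only its own share, adding nodes shortens the scan, which is the whole point of Hadoop's parallelism.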

All of this is done with commodity hardware and cheap disks called JBOD (Just a Bunch Of Disks). This low-cost, high-powered parallel processing capability makes it cost effective to store and process all kinds of data, from social media feeds to the sensor streams of "The Internet of Things" (IoT), the growing network of connected devices that experts estimate will consist of up to 50 billion objects by 2020. This is an example of the new terminology associated with Hadoop; I'll use several more emerging terms in the following section.

Companies will use Sqoop to transfer data between Hadoop and legacy databases like Oracle, Teradata, SQL Server and DB2. Companies will use Flume to gather weblogs from places like Twitter, Facebook and LinkedIn, or machine-generated logs (e.g., trucking logs or smart thermostats). This allows a company to mix structured data from legacy systems with unstructured data from websites, social media and logs. "Sentiment" is a Hadoop data-type term that describes how your customers "feel"; companies can analyze sentiment by tracking Twitter feeds to find out what society thinks of their branding. "Sensor and machine" is a Hadoop data-type term for discovering patterns in data streaming from remote sensors and machines; companies can track their vehicles' driving patterns from the logs associated with each vehicle. The possibilities are endless!
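As an illustration of the Sqoop side, importing a legacy table into Hadoop is a single command. The connection string, credentials, table name and target directory below are all hypothetical:

```shell
# Import one Oracle table into HDFS files (hostname, user, table and path are made up)
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username analyst -P \
  --table CUSTOMERS \
  --target-dir /data/customers \
  --num-mappers 4
```

Sqoop's `export` command moves data the other direction, from HDFS back into a relational table.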

Fifteen years ago I took a calculated risk and focused my resources on a vision of the future: as hardware costs decreased, massive amounts of data would need to be synthesized across open systems, and as data gained power, new thinking would be needed to enable new computing capabilities. To support this vision, we created new capabilities in our software solution, which I named "The Nexus" because it represents the intersection of all databases, creating a nexus of business intelligence capabilities. Hadoop has evolved into one of the key strategic capabilities built into our Nexus software, helping companies take full advantage of Hadoop and the cloud in conjunction with their legacy systems. Here's a preview of what this enables for you and your company:

By the end of the year Nexus users will be able to join Oracle, SQL Server, Teradata, DB2, Hadoop, Amazon Redshift and Microsoft Azure SQL Data Warehouse tables in seconds! Nexus allows users to click on the tables and columns they want on their report from across all these systems and Nexus builds the SQL automatically. Nexus converts the table structures and data types and moves the data in the most efficient manner to produce the answer sets. Users can move data between any of these systems with the click of a button, and then graph and chart each answer set! It took 15 years to create these capabilities, yet there is tremendous satisfaction in having built what customers call the best tool available today.

Please feel free to respond and I will send you the Nexus technical document and the Nexus video. I am also available to show you a Nexus demo over the web. Remember: up to 75% of the data captured by 2020 will be unstructured. If your current tools aren't ready to handle this revolution, I've got your back!



Tom Coffing
CEO, CoffingDW
Direct: 513 300-0341