The concept behind Teradata began almost 50 years ago, when researchers from the California Institute of Technology and Citibank’s advanced technology group began kicking ideas around. That was in the 1970s. By 1979, a few of them formed Teradata as an incorporated company in Brentwood, CA. Like another great tech company four decades earlier, they began working out of a residential car garage.
They named the company Teradata to reflect their ambition. They aimed high, with a revolutionary plan for managing huge amounts – a trillion bytes – of data. By 1980, a seed round and venture capital round allowed the founders to put an R&D team in place. They began to turn concepts into reality. Just before 1983 drew to a close, they shipped their first beta system to Wells Fargo.
From then on, the list of awards and “firsts” kept growing. A few of the early high points are noted below:
- 1986 – Fortune magazine names Teradata “Product of the Year.”
- 1992 – The first system over 1 terabyte goes into production for Wal-Mart.
- 1996 – The Data Warehousing Institute awards Teradata its Best Practices Award in data warehousing. A Teradata database sets a record for storage at 11 terabytes.
- 1997 – Teradata receives The Data Warehousing Institute’s Best Practices Award and the DBMS Readers’ Choice Award. This year, a Teradata customer’s 24 terabyte database sets the record for the world’s largest production database.
- 1999 – A Teradata customer sets a new world record for the largest production database at 130 terabytes.
- 2000 through 2006 – busy years of acquisitions, mergers, and new product launches.
- 2007 – Intelligent Enterprise magazine names Teradata the best global data warehouse business intelligence appliance.
You get the idea. Teradata quickly established itself as a unique innovator, especially in terms of scalability. In 1992, they made it possible to process a terabyte of data for the first time, and within 7 years were processing 130 terabytes. That’s scalability.
What is the kernel of genius that makes Teradata’s accomplishments possible? Let’s look a little deeper, beyond the Teradata basics, to see why it’s such a powerful solution.
Early computer systems worked on a very simple model. A CPU processed data. The data was stored on disk platters. When a user requested information, the data was read into the CPU and processed. When processing was complete, modified data was written back to storage.
The system worked fine until stores of data grew too large to handle. Massive amounts of data meant long waits as the system read it in. Massive amounts of data could also overwhelm a single processor. Queries could run for days or weeks. Sometimes, they failed to return answers at all.
Teradata solved those problems by designing systems that use multiple, parallel processors. The parallel processors are called Access Module Processors (AMPs). Teradata distributes table rows across multiple processors. Input/output times improve vastly because each processor reads in less data. If you have three processors instead of one, and the rows are spread evenly, input/output time is cut to about a third of the original time.
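To make the three-processor arithmetic concrete, here is a minimal Python sketch. The row count and the simple modulo placement rule are my own assumptions for illustration; they are not how Teradata actually places rows.

```python
# Toy model: spread table rows over AMPs and count how many rows each must read.
# NUM_ROWS and the modulo placement rule are illustrative assumptions.

def distribute(row_ids, num_amps):
    """Assign each row to an AMP and return one list of rows per AMP."""
    amps = [[] for _ in range(num_amps)]
    for row_id in row_ids:
        amps[row_id % num_amps].append(row_id)
    return amps


NUM_ROWS = 900_000
one_amp = distribute(range(NUM_ROWS), 1)
three_amps = distribute(range(NUM_ROWS), 3)

# In a parallel scan, elapsed time is set by the busiest AMP.
rows_read_single = max(len(rows) for rows in one_amp)
rows_read_parallel = max(len(rows) for rows in three_amps)

print(rows_read_single)    # 900000
print(rows_read_parallel)  # 300000
```

Each of the three AMPs reads a third of the rows, and because they read at the same time, the scan finishes in roughly a third of the time.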
Parallel processing is absolutely Teradata’s most important concept.
Even better, when you add more AMPs, you achieve predictable, linear system improvements. We call this linear scalability. Its limits, if any, are unknown. Today, systems with 4,000 AMPs are in production.
Every time you invest in hardware and add it to a Teradata system, you receive a predictable ROI. When you add hardware, you can handle the same amount of data faster. Or you can handle more data without any system degradation.
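The two promises above (same data faster, or more data at the same speed) can be written as a small idealized model. This is a sketch only: the read rate is a made-up number, and the model assumes perfectly even row distribution and no coordination overhead.

```python
def scan_seconds(total_rows, num_amps, rows_per_amp_per_second):
    """Idealized elapsed time for a scan with rows spread evenly over the AMPs."""
    rows_per_amp = total_rows / num_amps
    return rows_per_amp / rows_per_amp_per_second


RATE = 1_000_000  # hypothetical rows one AMP can read per second

baseline = scan_seconds(4_000_000_000, 100, RATE)
double_amps = scan_seconds(4_000_000_000, 200, RATE)
double_amps_double_data = scan_seconds(8_000_000_000, 200, RATE)

print(baseline)                 # 40.0
print(double_amps)              # 20.0  (same data, twice the AMPs, half the time)
print(double_amps_double_data)  # 40.0  (twice the data, no slowdown)
```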
Parallel Processing, Then and Now
I remember walking into a huge data processing center 20 years ago. It was almost the size of a football field, with hardware everywhere!
All around me, hundreds of cabinets held platters of disks. The room was silent until, suddenly, the green lights on all the disks came on. The disks rumbled together, shaking the room, then stopped.
Those were the days when we could see and hear parallel processing in action. A mainframe sent a request for information, and the AMPs all worked together to process their share of the data.
But those days are long gone. True to its innovative origins, Teradata aggressively evolved, embracing the VM revolution. Now, that room full of disks is held on something the size of a laptop. I’ll have more to say about that in the Teradata Architecture – Deep Dive section below.
Basic Teradata Architecture
I just mentioned AMPs, one of the three main elements of basic Teradata architecture.
Here are the elements on a high, conceptual level:
- Parsing Engine (PE): The parsing engine is the brains behind successful processing of a query. Parsing engines:
  - Accept an SQL user query.
  - Make sure the user has privileges to run the query.
  - Check the SQL query syntax.
  - Use primary indexes to allocate table rows to AMPs. More on this important function in the following section.
- BYNETs: BYNETs are the message-passing layer. They are part software and part hardware. The software element controls communications. The hardware (Enterprise Serial Bus) carries the communication.
Teradata systems always have two BYNETs – BYNET 0 and BYNET 1. Two BYNETs allow for a faster system and provide coverage if one fails.
- Access Module Processors (AMPs): AMPs are the processing power in the system. They accept data from the BYNET, process it, and return results via the BYNET. Each AMP has its own disk space for processing.
AMPs don’t share data or memory with each other. This is a “Shared Nothing Architecture.”
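Here is how the three elements fit together in a minimal Python sketch. Every class and method name is hypothetical, the modulo placement rule is a placeholder for real hashing, and the BYNET is reduced to "use the first path that is up." A real system uses both BYNETs at once for speed.

```python
class Amp:
    """One AMP: private rows that no other AMP can see (shared nothing)."""

    def __init__(self, amp_id):
        self.amp_id = amp_id
        self.rows = {}  # primary index value -> row

    def store(self, key, row):
        self.rows[key] = row

    def fetch(self, key):
        return self.rows.get(key)


class Bynet:
    """Stand-in for one message-passing path between the PE and the AMPs."""

    def __init__(self, name, amps):
        self.name = name
        self.amps = amps
        self.up = True

    def deliver(self, amp_id, action, *args):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return getattr(self.amps[amp_id], action)(*args)


class ParsingEngine:
    def __init__(self, bynets, num_amps, authorized_users):
        self.bynets = bynets
        self.num_amps = num_amps
        self.authorized_users = authorized_users

    def _check_privileges(self, user):
        if user not in self.authorized_users:
            raise PermissionError(f"{user} may not run this query")

    def _amp_for(self, key):
        # Placeholder placement rule for integer keys; the real system hashes.
        return key % self.num_amps

    def _send(self, amp_id, action, *args):
        # Use the first BYNET that is up; the second covers a failure.
        for bynet in self.bynets:
            if bynet.up:
                return bynet.deliver(amp_id, action, *args)
        raise ConnectionError("no BYNET available")

    def insert(self, user, key, row):
        self._check_privileges(user)
        self._send(self._amp_for(key), "store", key, row)

    def select(self, user, key):
        self._check_privileges(user)
        return self._send(self._amp_for(key), "fetch", key)


amps = [Amp(i) for i in range(4)]
bynet0, bynet1 = Bynet("BYNET 0", amps), Bynet("BYNET 1", amps)
pe = ParsingEngine([bynet0, bynet1], len(amps), authorized_users={"tom"})

pe.insert("tom", 1001, {"Emp_No": 1001, "Dept_No": 10})
bynet0.up = False              # simulate losing one BYNET
print(pe.select("tom", 1001))  # still answered, via BYNET 1
```

Notice that the parsing engine never touches an AMP's rows directly. Everything goes through a BYNET, and each AMP only ever sees its own data.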
The following graphic shows how the basic architecture works. The lines between the parsing engine and the AMPs represent the BYNETs. Note that you can see the two BYNETs.
All About Primary Indexes
Almost every Teradata table has a primary index, which you define when you set up the table (the exception, the No PI table, is covered below). The primary index is very important because the parsing engine needs to hash it. Then, the parsing engine uses the hash results to find your data.
This is a critical feature of Teradata. The parsing engine can quickly hash a primary index and target the requested data. It finds the right AMP and the right row. One access. Less than a second, even if you have one of those 4,000 AMP warehouses. This feature is what makes Teradata capable of analysis that is impossible on other systems.
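The hash-then-target idea can be sketched in a few lines of Python. Treat this as a simplified illustration under my own assumptions: SHA-256 stands in for Teradata's hashing algorithm, and the 65,536-bucket map and the way buckets are dealt out to AMPs are invented for the example.

```python
import hashlib

NUM_AMPS = 4
NUM_BUCKETS = 2 ** 16  # assumed bucket count for this sketch

# Hash map: every bucket is owned by exactly one AMP.
HASH_MAP = [bucket % NUM_AMPS for bucket in range(NUM_BUCKETS)]


def row_hash(pi_value):
    """32-bit stand-in for the row hash (SHA-256 here, not Teradata's own)."""
    digest = hashlib.sha256(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def amp_for(pi_value):
    bucket = row_hash(pi_value) >> 16  # high-order 16 bits pick the bucket
    return HASH_MAP[bucket]


# The same function is used to store and to find, so a lookup visits one AMP.
storage = {amp: {} for amp in range(NUM_AMPS)}


def insert(pi_value, row):
    storage[amp_for(pi_value)][pi_value] = row


def lookup(pi_value):
    return storage[amp_for(pi_value)].get(pi_value)


insert(1001, ("Smith", 20))
print(lookup(1001))  # ('Smith', 20)
```

The lookup never asks the other AMPs anything. That is why the access stays fast no matter how many AMPs the system has.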
If you forget to set up a primary index, Teradata will probably use your first column as a non-unique primary index (more on that later). So, don’t forget. Make a choice so you don’t need to live with the default.
The primary index comes in four different forms.
Unique Primary Index (UPI)
A unique primary index (UPI) is exactly what it sounds like. It’s a primary index that’s unique in your table. Specify UNIQUE when you set up your SQL table. Then, you can’t enter another record with the same index. The system rejects your attempt to enter a duplicate UPI and sends an error message.
The graphic below shows a common example. Emp_No (employee number) is the unique primary index. Notice that the table rows are spread evenly over all the AMPs. That’s another characteristic of UPIs. Because every value is different, hashing gives a very even distribution of rows.
When a user enters a SQL query, the parsing engine goes into action. It hashes the Emp_No in the query. The process works as described above and returns the correct row within a second.
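In Teradata DDL, a UPI is declared with the UNIQUE PRIMARY INDEX clause, for example UNIQUE PRIMARY INDEX (Emp_No). The Python sketch below imitates the two behaviors just described: duplicates are rejected, and a lookup goes to a single AMP. The class name, the AMP count, and the SHA-256 stand-in hash are assumptions for illustration.

```python
import hashlib

NUM_AMPS = 4


def amp_for(pi_value):
    """Pick an AMP from a stand-in hash (SHA-256 here, not Teradata's own)."""
    digest = hashlib.sha256(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


class UpiTable:
    """Toy table whose primary index (Emp_No) must be unique."""

    def __init__(self):
        self.amps = [{} for _ in range(NUM_AMPS)]

    def insert(self, emp_no, row):
        amp = self.amps[amp_for(emp_no)]
        if emp_no in amp:  # the same value always hashes to the same AMP
            raise ValueError(f"duplicate primary index value {emp_no}")
        amp[emp_no] = row

    def select(self, emp_no):
        return self.amps[amp_for(emp_no)].get(emp_no)

    def rows_per_amp(self):
        return [len(amp) for amp in self.amps]


employees = UpiTable()
for emp_no in range(1, 1001):
    employees.insert(emp_no, {"Emp_No": emp_no})

print(employees.rows_per_amp())  # rows held by each of the four AMPs
try:
    employees.insert(42, {"Emp_No": 42})
except ValueError as err:
    print(err)  # duplicate primary index value 42
```

Only one AMP has to be checked to detect the duplicate, because a repeated value can only ever land where the first one did.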
Non-Unique Primary Index (NUPI)
What if you don’t look for employees based on their unique employee number? Maybe you want to query and report by department number. Department number isn’t unique: lots of employees work in the same department. In this case, you need a non-unique primary index (NUPI).
In this example, a user submits a SQL query that contains Dept_No. Because the parsing engine hashed each row’s department number when the row was stored, all rows with the same department sit on the same AMP. The rows are stored together, so you can find them in a single-AMP retrieve, just like a UPI. In this case, the query returns multiple rows.
When you create a NUPI SQL table, just leave out the UNIQUE modifier. Notice another difference between the UPI above and the NUPI below. In the NUPI, rows aren’t spread evenly across each AMP. They can’t be, because departments are different sizes.
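A small Python sketch shows both sides of a NUPI: every row for a department lands on one AMP, and a query on Dept_No searches only that AMP. The employee rows, the AMP count, and the SHA-256 stand-in hash are invented for illustration.

```python
import hashlib
from collections import defaultdict

NUM_AMPS = 4


def amp_for(pi_value):
    """Pick an AMP from a stand-in hash (SHA-256 here, not Teradata's own)."""
    digest = hashlib.sha256(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


# (Emp_No, Name, Dept_No): hypothetical sample rows
employees = [
    (1001, "Smith", 10), (1002, "Jones", 10), (1003, "Lee", 10),
    (1004, "Patel", 20), (1005, "Garcia", 30), (1006, "Chen", 10),
]

# Non-unique primary index on Dept_No: duplicate values are allowed.
amps = defaultdict(list)
for emp_no, name, dept_no in employees:
    amps[amp_for(dept_no)].append((emp_no, name, dept_no))


def select_by_dept(dept_no):
    """Single-AMP retrieve: only the AMP that owns this hash is searched."""
    return [row for row in amps[amp_for(dept_no)] if row[2] == dept_no]


print(select_by_dept(10))
# [(1001, 'Smith', 10), (1002, 'Jones', 10), (1003, 'Lee', 10), (1006, 'Chen', 10)]
```

Department 10 has four of the six rows, so whichever AMP owns it carries most of the table. That is the uneven distribution at work.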
But what happens if a NUPI results in wildly uneven distribution?
Multi-Column Primary Index
You can combine more than one column to create a multi-column primary index.
Here’s an example of a distribution that’s so uneven, it could begin to offset the benefits of Teradata design. Suppose last name alone is the primary index. If your parsing engine hashes on Smith, it sends all Smiths to one AMP.
Change that to a combination of last name and first name. Smiths now hash into smaller groups. John Smiths go to one AMP and Mary Smiths to a different AMP. Distribution becomes more even.
Multi-column primary indexes are also helpful if you tend to query on more than one field. Imagine that you often query by department and shift (day-night, for example).
There’s a small penalty associated with multi-column indexes. If you want that wonderful single-AMP retrieve, you’ll need both pieces of information (columns) for your SQL query.
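Both effects, smoother distribution and the two-column requirement, can be seen in a short Python sketch. The names, the AMP count, and the SHA-256 stand-in hash are assumptions of mine, and the exact spread you see depends on that stand-in hash.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4


def amp_for(*pi_columns):
    """Hash all primary index columns together (stand-in hash, not Teradata's)."""
    key = "|".join(str(col) for col in pi_columns)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


# (Last_Name, First_Name): hypothetical sample rows
people = [
    ("Smith", "John"), ("Smith", "Mary"), ("Smith", "Ana"), ("Smith", "Raj"),
    ("Smith", "Lena"), ("Smith", "Omar"), ("Jones", "Lee"), ("Chen", "Wei"),
]

by_last_name = Counter(amp_for(last) for last, _ in people)
by_full_name = Counter(amp_for(last, first) for last, first in people)

print("PI (Last_Name):            ", dict(by_last_name))
print("PI (Last_Name, First_Name):", dict(by_full_name))


def amps_to_search(last, first=None):
    """Which AMPs a query must visit when the PI is (Last_Name, First_Name)."""
    if first is None:
        return list(range(NUM_AMPS))  # hash cannot be computed: ask every AMP
    return [amp_for(last, first)]     # both columns known: single-AMP retrieve


print(amps_to_search("Smith", "John"))  # one AMP
print(amps_to_search("Smith"))          # [0, 1, 2, 3]
```

With last name alone, all six Smiths pile onto one AMP. With both columns, each Smith hashes independently, but a query that supplies only the last name has to visit every AMP.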
No Primary Index (No PI)
Last, and certainly least, is the no primary index (No PI) table.
If you set up a No PI table, rows are evenly distributed over the AMPs. It’s a perfect distribution every time. But without a primary index, the parsing engine can’t hash to find rows. To run a query, you need a full table scan.
That makes No PI impractical for production. DBAs use it sometimes in staging and columnar design.
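In Teradata DDL, this kind of table is created with the NO PRIMARY INDEX clause. The Python sketch below shows the trade-off. I use simple round-robin placement as an assumption to get the even spread; the real placement mechanism is different, but the consequence for queries is the same.

```python
from itertools import cycle

NUM_AMPS = 4
amps = [[] for _ in range(NUM_AMPS)]
next_amp = cycle(range(NUM_AMPS))  # assumed round-robin placement


def insert(row):
    """No hashing: each row simply goes to the next AMP in turn."""
    amps[next(next_amp)].append(row)


for emp_no in range(1, 1001):
    insert({"Emp_No": emp_no})


def select(emp_no):
    """Without a primary index, every AMP has to scan all of its rows."""
    amps_searched = 0
    matches = []
    for amp_rows in amps:
        amps_searched += 1
        matches.extend(row for row in amp_rows if row["Emp_No"] == emp_no)
    return matches, amps_searched


print([len(amp_rows) for amp_rows in amps])  # [250, 250, 250, 250]
print(select(42))                            # ([{'Emp_No': 42}], 4)
```

Loading is fast and perfectly balanced, which is why the design suits staging tables. Finding one row costs a visit to all four AMPs.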
Teradata Architecture: Deep Dive
As I mentioned, the days of data centers the size of football fields are gone. Today, Teradata uses “nodes” as building blocks.
Teradata Node Architecture
A node combines four parsing engines and 40 AMPs! Each AMP gets the storage it needs, because it owns a virtual disk on a disk farm. Everything is laid out nice and even, so we call it a symmetric multiprocessing (SMP) node.
To sum up, even though the AMPs are placed together in a node, each AMP has its own central memory, its own processing capabilities, and its own disk space in the disk farm. Modern Teradata architecture remains a shared-nothing system.
Teradata stores nodes in what they call “cabinets.” They actually are about the size of a kitchen cabinet, yet they hold hundreds of times the processing power we had in the football-field size installations 20 years ago.
How do we scale this architecture? When you need to upgrade, use BYNETs to combine SMPs. You now have massive processing power – a massively parallel processing (MPP) system.
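As a quick sizing sketch in Python, here is what combining nodes buys you, using the per-node figures quoted above (four parsing engines and 40 AMPs). The function names are my own, and real configurations can differ.

```python
AMPS_PER_NODE = 40            # per-node figures quoted in the text above
PARSING_ENGINES_PER_NODE = 4


def mpp_capacity(num_nodes):
    """Totals for an MPP system built from identical SMP nodes."""
    return {
        "nodes": num_nodes,
        "amps": num_nodes * AMPS_PER_NODE,
        "parsing_engines": num_nodes * PARSING_ENGINES_PER_NODE,
    }


def nodes_needed(target_amps):
    """Smallest number of nodes that provides at least target_amps AMPs."""
    return -(-target_amps // AMPS_PER_NODE)  # ceiling division


print(mpp_capacity(1))     # {'nodes': 1, 'amps': 40, 'parsing_engines': 4}
print(mpp_capacity(10))    # {'nodes': 10, 'amps': 400, 'parsing_engines': 40}
print(nodes_needed(4000))  # 100
```

By this arithmetic, a 4,000-AMP system like the ones mentioned earlier would be about 100 nodes tied together by the BYNETs.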
Inside Teradata Nodes
A node is a server. That server contains:
- a Linux operating system.
- PDE – parallel database extensions, which control the BYNETs.
- memory – which holds the parsing engines and AMPs. Each one runs as a vproc (virtual processor).
In the following graphic, you see that the node is attached to a mainframe. Unlike early architectures, the Teradata node is also attached to a LAN. Most queries will come from users, not the mainframe. Remember that each node contains four parsing engines? Well, each parsing engine can handle 120 users.
Here’s an important concept about parsing engines. A parsing engine needs to access every AMP that holds table rows. Think about performing a full table scan. The parsing engine responsible for that scan must control every AMP. The graphic below is a reminder of the basic architecture that makes this massively parallel processing possible.
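A full table scan can be sketched in Python as a scatter-gather: one parsing engine hands the same work to every AMP, each AMP scans only its own rows, and the parsing engine merges the answers. The table contents, the AMP count, and the use of a thread pool to imitate parallel AMPs are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_AMPS = 4

# Each AMP holds its own slice of a hypothetical (Emp_No, Dept_No) table.
amps = [[] for _ in range(NUM_AMPS)]
for emp_no in range(1, 101):
    dept_no = 10 if emp_no % 5 == 0 else 20
    amps[emp_no % NUM_AMPS].append((emp_no, dept_no))


def amp_scan(rows, dept_no):
    """Work done by one AMP: scan only the rows it owns."""
    return [row for row in rows if row[1] == dept_no]


def full_table_scan(dept_no):
    """Parsing engine side: send the scan to every AMP, then merge the results."""
    with ThreadPoolExecutor(max_workers=NUM_AMPS) as pool:
        partials = pool.map(amp_scan, amps, [dept_no] * NUM_AMPS)
        return sorted(row for part in partials for row in part)


result = full_table_scan(10)
print(len(result))  # 20
print(result[:3])   # [(5, 10), (10, 10), (15, 10)]
```

No AMP can be skipped, because any of them might hold matching rows. That is why the parsing engine running the scan has to reach all of them.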
Are you ready to harness the genius of Teradata for your company?
For more information on Teradata, check out our Teradata books (the Genius series). You’ll learn exactly what you need whether you’re a business user, developer, DBA, or executive.
If you’re interested in Teradata education or online courses, here’s a link to the full playlist for my Teradata Basics online training.
And of course, I’m always happy to come to your company and meet with you. I have been providing world-class Teradata training for over two decades.
Book me (TeraTom) directly by reaching out using the information below.
CEO, Coffing Data Warehousing
Direct: 513 300-0341