Jerry Nixon @Work: The ETL between OLAP and OLTP

Jerry Nixon on Windows

Thursday, February 26, 2009

The ETL between OLAP and OLTP

I am an armchair data modeler. That makes me about as dangerous as a cotton ball. Most developers never think about modeling. I'm trained, read-up, and think hard about modeling concepts. Now that some of my hair-brained ideas have made it into production, I have some strong opinions.

OLTP is about data-write performance. Yet, it is typically judged by its performance to read data. That's because 99% of applications running reports against the OLTP model. Sometimes it's from ignorance, but sometimes it from the desire for "real-time" numbers.

The fallacy of real-time is ridiculous. About .0000000001% of applications really require real-time. If NASA can operate a $5B land rover with two hours latency, analysts can handle a 15 or 30 minute lag.

This is where the gap between OLAP and OLTP comes to play.

ETL moves and transforms our OLTP data to our report-friendly OLAP store. But the greater the gap between OLTP and OLAP, the greater the ETL execution cost. The greater that cost, the less often it can be executed. The less often it's executed, the greater the report latency. And so, here we are.

What creates the gap between OLTP and OLAP models? Can you create a good OLTP model that is "nearly" an OLAP model? No. But, yes. And in this post, I am going to share with you one excellent tip that will speed up ETL, increase reporting options, and cost hardly anything.

Let's take a simple OLTP table called users. Here's a typical structure:

UserId int primary key identity(1, 1),
FirstName varchar(50),
LastName varchar(50)

There is nothing wrong with this structure. However, it provides no information to the downstream ETL. As a result, we must build time-based snapshots to understand changes. Instead, a simple schema change can simplify the ETL, shrinking that gap between the OLTP and OLAP models.

Let's take that OLTP table and update its structure:

UserId int primary key identity(1, 1),
FirstName varchar(50),
LastName varchar(50),
CreatedOn DateTime,
UpdatedOn DateTime


With two new columns, the ETL comparison logic can easily identify new rows, updated rows, and quickly scan for deleted rows. I know ETL frequency determines slowly changing dimension granularity. But, things can get too granular. Spread your ETL executions and you improve OLTP performance and make negligible impact on report currency.

When you begin to reduce the OLTP and OLAP gap, you improve ETL performance (and complexity) – and, for the first time, you have the strategic opportunity to increase ETL frequency for high-risk periods of the year. Then, dial it back to something reasonable the rest of the year.

The answer to ETL performance cannot be hardware because databases scale up, not out. Storage, opposed to popular myth, isn't cheap. High-availability, high-performance, secure storage is still expensive to procure, operate, and maintain. That may not change.

The answer to ETL performance is reducing the contention between the OLTP model and the OLAP. You can't do everything, but you can do a lot. I've seen ELT processes reduce from 10 major steps to 8. That's huge. With some more techniques, I think we could see another 20% improvement.

But why?

Until it's fast and easy to move OLTP data to OLAP, businesses will run reports and analysis against the OLTP store. Once a reasonable frequency is established in your environment, reports will perform better (against the OLAP store) and your application will perform better (against the OLTP store).

Two closing remarks

1. CreatedOn and UpdatedOn columns are useless until populated. When the warehouse cannot trust their values, all is lost, and ETL developers will return to their evil ways. Build this into your testing practice, or, like I do, build it into your DAL so you can't forget.

2. ETL itself can negatively impact OLTP performance. I know that. But techniques like this and refusing to calculate anything until you're in the warehouse stage makes it an easy trade off from the myriad reports killing the store when it needs to handle transactions.