Saturday, February 1, 2020

Better ETL design approach - Avoid overusing ETL tool

ETL plays a critical role in DW/BI project. Having witnessed many data warehousing projects failed duo to ETL implementation, I have realized that the most appropriate approach is to minimize the usage of ETL components. We should use only these ETL components if we can't perform tasks with SQL queries or stored procedures.

There are two categories of functions in ETL: 1. Functions that cannot be realized by relative simple SQL queries or stored procedures, and 2. Functions that CAN be achieved by relative simple SQL queries or stored procedures.

First of all, we need to define what are relative simple SQL queries or stored procedures.  These simple SQL queries or stored procedures should not use cursors, and they should not use many temporary tables. 

For case 1, we should use ETL components, such as workflow components. Some applicable situations are listed below:
1. Load data from different databases, such as load oracle database into SQL server.
2. Load data from data other than a database, such as web services, social network, XML files.
3. Transform data with complicated logic, such as fuzzy lookup, data mining, logic to be programmed.

For case 2, we should use SQL queries or stored procedures. However, we can embed these SQL queries or stored procedures into the ETL process to have the whole workflow managed. The most applicable situation is that both source and target are in the same database.

This approach has the following advantages:
1. SQL is a widely known skill; it is far easier to develop and to maintain. The high-level graphical flow of ETL is nice for documentation, but if someone needs to get into the guts, they'd better know ETL well.
2. SQL runs much faster if SQL queries have no cursors, no looping structures. The performance should be at least the same as ETL, as SQL is what the database engine specially designs to perform.
3. We don't have to spend all our time right-clicking and working in little bids windows trying to follow the flow of logic.

In summary, we use the combined method to get the best of two worlds.  We use ETL to get data from external sources and load it in parallel into the staging database; then, we use ETL packages to orchestrate pipelines and trigger appropriate queries or stored procedures inside control flow.  Any transformation logic is encapsulated into queries and stored procedures.

Based on my many years of consulting experience, most people who are heavy pro-ETL, or get all logic into GUI are inexperienced developers who have grown up with graphical tools.

1 comment: