Sunday, July 1, 2018

Is SCD2 really needed?

SCD 2 is naturally applied in data warehousing design, also Bitemporal design approach is also sometimes applied. When talking about high level picture, all these concepts are intended to resolve the relationship between dimension and fact.  Based on my experiences from many practical data warehouse projects, I am trying to ask myself whether the SCD 2, or even surrogate keys is really necessary or not.  This document is to argue that surrogate key, SCD2 and Bitemporal are not needed.

Referring back to the two design goals of a dimensional model per Kimball; create an easy to understand data structure while providing fast query response times.

First of all, let address the performance issue. Fast response times often mean efficient joins between facts and dimensions.  By using surrogate keys we can select a data type, integer, which offers high performance. However, the performance is insignificant given that many serious data warehousing use Teradata or Netezza.  To apply for the surrogate keys and SCD2, the development and maintenance cost are much higher than simple relationship. We need either to write complicated SQL, and to use ETL tool to generate both dimensions and facts during development. It is also difficult to query fact data as we need to join dimension tables to get data understandable. Because of  incremental IT personnel costs, and additional soft costs, there is no justification for using surrogate key and SCD2 for better performance.

Secondly, with respect to create an easy to understand data structure, the actually data structure without surrogate keys and SCD2 is still illustrated as star scheme, which still can be easily modeled with different BI tools. 

Below are a few methods to model data structures:

1. Simple attribute(s) ( or SCD0): 
we can simply consider these attributes as fact dimension. normally, the fact table has primary key, we can either leave these attributes in fact table, or separate them as fact dimension without any changes. If modeling it in Cognos, we can either model an individual attribute as a corresponded dimension, or make the all attribute as one fact dimension. The latter is better approach.

2. Dimension (SCD1):
 If we don’t use surrogate key, then we don’t need to process fact and dimension, or put surrogate key in fact and insert surrogate key in dimension.  We can simply keep the dimension related column as it in fact. With respect to dimension, there are two different cases:

2.1 Create a static reference table like dimension, with or without different hierarchies.

2.2 If this dimension is dynamic, then we can consider this column as distinct value as independent dimension. the dimension will be automatically extracted from fact.

3. Dimension (SCD2):
For the sake of discussion, we use customers dimension as sample. This table has snapshot month and customer ID with a lot of dimension attributes.  All change history of attributes in customer are automatically preserved. There are two ways to proceed:

3.1 Keep the customer dimension as is, then create relationship between customer and fact with two columns joined: customer ID and Snapshot month.
  • Advantage: once modeled, you will have all possible columns from customer for reports, and t is fully flexible
  • Disadvantage: the customer dimension could be huge, but only small percentage of customer data is used, this will result in a poor performance.

3.2 Bring only applied customer related attribute into fact table. for example, marriage status. We can add this attribute into fact table, and then apply the concept SCD0 and SCD1
  • Advantage: Very good performance and easy to understand.
  • Disadvantage:  It is not flexible, we need to bring new columns into fact table if we need to add new customer related new attribute.

No comments:

Post a Comment