Friday, August 10, 2012

How to handle “Late Arriving Dimension” – preserve the “natural key”

This is very general practice in warehouse project: Sometimes we will get the transactional data without dimensional data. It is called as “early arriving facts” or “Late Arriving Dimension”. There are a few of good solutions though, all of which are handled during ETL:
  1. Hold onto the fact until all dimensions arrive
  2. Create a dimension called ‘unknown’ or ‘not available’ with a primary key of -1
  3. Insert a dummy row in the dimension and populate it with whatever you can

For the sake of easy explanation, dimension location in project is used to demonstrate this idea.    For this project, approach 1) and 3) are applied.  The bottom line is that all Locations must be mapped before sales fact is populated into production database. In case when newly Location  map can’t be found, a dummy location map is generated. This case is very reasonable, as the sales arrive earlier than the outlet file (called banner file) arrives.  Sales data could be loaded 4 weeks earlier than banner file. Therefore, when banner file arrives, we need to have a process to allow user to resolve
1) dummy outlet, and
2) remap existed map when the map is confirmed wrong.  

The issue here is how to reprocess data.  Based on current ETL framework (or dataflow) below

INBOUND table == > HIST table == > BASE table == > AGGREAGTE table

There are two approaches:  Reload files, or reprocess impacted records

  1. Reload files
INBOUND table == > HIST table == > BASE table == > AGGREAGTE table

This approach has a lot of disadvantages:

  1. Difficult to identify all impacted files, currently based on business user’s expertise
  2. There are too many files, each week has 11 wholesalers
  3. Process is very heavy

  1. Reprocess impacted records only without reloading files
INBOUND table == > HIST table == > BASE table == > AGGREAGTE table

This is an obviously better idea, however, this approach can only work when the HIST table contains original location identifiers or natural key from each wholesaler.  

Detail for reprocessing outlets:
  • Based on newly maps and  HIST table (with original location identifier column using normalized column name, or natural key) , the incremental dataset are identified.
  • Rerun the incremental load, starting from
== > BASE table (with surrogate key)
== > AGGREAGTE table (with surrogate key)


The bottom line is that natural key (or original identifier) must be preserved for transaction reprocessing. The natural key should be saved into HIST table.

No comments:

Post a Comment