Wednesday, July 1, 2020

bitemporal vs. scd2 vs. accumulating snapshot

This is a very complicated topic to explore. The document itself is intended to explain the bitemporal model, and compare it with scd2 and accumulating snapshot.
Bitemporal model is clearly explained from the blog https://blog.cloudera.com/blog/2017/05/bi-temporal-data-modeling-with-envelope/
The concept is illustrated as below.


Customer ID1 has credit score changed to 500, 600,700 with different date ranges, respectively. The date range is defined with two columns AS_OF_START_DATE and
,AS_OF_END_DATE. We can get score based on any given date

SELECT * FROM FACT WHERE {Any given date} BETWEEN AS_OF_START_DATE AND AS_OF_END_DATE;

While this data model now captures the best known truth for credit scores over time, we have lost the information about the correction. This might not be important, and in that case the data model could stop here. But it is easy to imagine a scenario where this is important:
1. On Jun 15th credit score is reported as 600.
2. On Jun 18th a manager with access to the report declines a loan application because the credit score wasn’t 650 or above.
3. On Jun 19th the score for November 15th onwards is corrected to 650
4. In December an auditor at a financial regulator sees that the manager denied the loan despite meeting the criteria…

To track how the credit scores changed in the table, not just in the real world, we can also implement the same two field pattern to represent the range of system time that the record was considered true. This allows us to roll back the clock and see the full state of the table at any point in the past, and with only a straightforward filter query. This tracking of both the event and system time ranges is what is known as bi-temporal data modeling.

SELECT * FROM FACT WHERE {Any given date} BETWEEN AS_OF_START_DATE AND AS_OF_END_DATE AND {Any given date} BETWEEN SYSTEM_START_DATE and SYSTEM_END_DATE;

In data warehousing terminology this is a Type 2 slowly changing dimension, where new records are added not just for changes in the real world, but also for changes to the table itself. however, these columns are added to fact side, not in dimension side. and, there is no inferred member concept in bitemporal model.

With bi-temporal data modeling our table would now look like the below representation. We can find the customer credit scores across event time and across system time. For the regulatory scenario the auditor would be able to filter the system start and end dates by the date of the loan denial to see the report as it would have looked to the manager when they made the decision, and so avoiding unnecessary consequences for the bank.

However, it is very challenging to implement Bitemporal model in data warehousing, as it is not easy to aggregate. In my opinion, it is closer to operational side than data ware house side.

In real word, Accumulating snapshots could be a very good choice to resolve similar issues.  Accumulating snapshots work best where the milestones are generally linear and predicable.  If they are not, the design and maintenance will be significantly more complex.
    

No comments:

Post a Comment