Context
Many articles discuss the modern Data Warehouse / Business Intelligence concept. At a high level, we load source data into the data lake, process it with ELT into the data warehouse, and generate visualizations with analytics tools such as Power BI. The most advanced concept is probably a lakehouse instead of a data warehouse.
However, a large share of data is well structured, with sources that are mostly databases or spreadsheets. This kind of data is not really big data. Do we need to move it into a data lake? The answer is probably no, as we do not want to turn well-structured data into unstructured files. On the other hand, should we build a traditional Kimball-style data warehouse with ETL? Again, the answer is probably no, as we want to avoid the disadvantages of a conventional data warehouse.
This article shares a pragmatic modern DW/BI solution for well-structured data. The diagram below illustrates the concept.
The critical points of this concept are: 1) move to the cloud; 2) get rid of the data mart; 3) move business logic to the semantic layer; 4) move from ETL to ELT.
Move BI solutions to the cloud (Azure)
It is evident that the cloud has many advantages over on-premises solutions, and there are plenty of articles on the internet in this regard. Nevertheless, I want to highlight three points:
1) Low cost: You pay for what you use, with little to no upfront setup cost. A larger number of users means a lower cost per user.
2) Scalable: Cloud BI applications are easily scaled to accommodate the growth of the business and its number of users. Hosting BI in the cloud enables organizations to cope with growing data volumes without upgrading or migrating to other systems.
3) Power BI is cloud-native BI: Much of Power BI's popularity comes from the fact that it was built for the cloud. Other BI tools, such as Cognos, can be hosted in the cloud, but only with considerable effort and limitations. With a cloud-native tool, we also worry much less about system stability.
Get rid of the data mart (the warehouse layer based on Kimball's surrogate keys)
We are all familiar with the traditional data warehouse architecture. At a high level, it consists of two major steps: 1) extract data from different data sources into staging tables; and 2) transform and load the data into a data mart with surrogate keys.
The data mart layer originally provided two significant advantages: 1) better performance, because the surrogate keys make database joins cheaper; and 2) support for type 2 slowly changing dimensions, for example tracking an employee who changes sales department when we care about sales by department.
However, these two advantages are no longer very compelling. 1) Surrogate keys are not necessary for performance, given that databases such as Netezza and Azure Synapse keep getting more powerful. In addition, once the data is loaded into an in-memory or columnar engine, such as Power BI, Tableau, or Parquet files, surrogate keys become useless. 2) A type 2 slowly changing dimension can often be turned into type 1 if needed. For example, the employee dimension is type 2 if we need to track department changes. One solution is simply to add a department column to the sales fact table; the employee dimension, without a department column, can then be treated as type 1. Another solution is to build the data as periodic snapshots, which automatically track all department changes. A minimal sketch of the first approach is shown below.
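As a rough illustration of the first approach, the hedged DAX sketch below groups sales by a department column carried on the fact table itself, so no type 2 employee dimension is required. The table and column names (Sales, Sales[Department], Sales[Amount]) are hypothetical and exist only for this sketch.

// Hypothetical model: the Sales fact table carries the department that was
// valid at the time of each sale, so department history is preserved
// without a type 2 employee dimension.
EVALUATE
SUMMARIZECOLUMNS (
    Sales[Department],
    "Total Sales", SUM ( Sales[Amount] )
)
ORDER BY [Total Sales] DESC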
There are at least three advantages to this new concept:
1) Removing the data mart eliminates a significant loading step; we can reasonably expect it to cut processing time by about 50%.
2) Compared with a data mart, the new process is much easier to maintain and debug. We all know how difficult it is to troubleshoot a data flow when we must join all the dimensions back together just to see meaningful data.
3) A data mart definition is rigid and difficult to change. For example, adding a new column or a new table can destabilize the whole data flow. The data mart concept is like the old OLAP (multidimensional/MDX) model; we need to move to the more flexible tabular model.
Move the business logic to the semantic layer (VertiPaq engine)
The BI semantic layer has been around for a long time, and almost all BI tools apply this concept. Meanwhile, the semantic layer keeps getting more powerful. Based on the five criteria below, Power BI comes out as the winner in this area:
1) Shared Business Logic: A semantic layer contains the core logic required for business analysis, transforming the underlying data model into familiar business definitions (dimensions, measures, hierarchies) and easy-to-understand terms. It can contain commonly used derived measures, such as year-over-year, month-over-month, and month-to-date. Users can consume these calculated measures directly and reuse the semantics in different downstream applications (a hedged DAX sketch follows the list of criteria below). Most BI tools satisfy this requirement.
2) Unified Security Policy: A semantic layer ensures that user and data access controls are applied uniformly through row-level security, so IT does not need to configure data access control for each individual downstream system (a minimal row-level security sketch appears at the end of this section). Most BI tools satisfy this requirement.
3) High-Performance Backend Engines: The semantic layer must have a powerful built-in in-memory engine. A unified semantic layer gives businesses a more comprehensive view of their data and lets them analyze massive, detailed datasets, which cannot be done without a powerful backend engine. Probably only three tools satisfy this requirement: Power BI, Tableau, and QlikView.
4) Behaves as a Database that Supports Multiple Applications: This requirement is satisfied by the Power BI dataset (the Power BI semantic layer), which can be used as a database through XMLA. We can consume the Power BI dataset in the following ways:
a. We can develop reports based on this shared dataset;
b. We can create an Excel spreadsheet directly on top of this dataset;
c. We can use SQL Server Management Studio or Azure Data Studio to query this dataset;
d. We can use DAX to query the dataset (see the sketch after the list of criteria);
e. Power BI Premium provides open-platform connectivity for Power BI datasets, enabling customers to leverage a semantic layer compatible with a wide range of data-visualization tools from different vendors. This means any third-party tool, such as Tableau, can consume (read) the Power BI semantic layer through its read/write XMLA endpoints.
5) Reusable: Making a semantic layer reusable is much like object-oriented design; the semantic layer, or data model, gives us the equivalent of polymorphism, encapsulation, data abstraction, and inheritance. Power BI has released a new feature called DirectQuery for Power BI datasets and Analysis Services, a fundamental change for Power BI model design that significantly impacts processing and data flow. This new feature brings several advantages, as listed below:
1. It leverages the existing central model
2. It ensures more consistency
3. It reuses existing models instead of rebuilding them
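To make criterion 1 (and item d above) concrete, here is a minimal, hedged DAX sketch of the kind of shared logic a Power BI dataset can expose. The table and column names (Sales[Amount], 'Date'[Date], 'Date'[Year], 'Date'[Month]) are hypothetical; the query itself could be run from SQL Server Management Studio, a report, or any client connected to the dataset's XMLA endpoint.

// Hypothetical shared measures defined once in the semantic layer,
// then queried by any downstream client.
DEFINE
    MEASURE Sales[Total Sales] =
        SUM ( Sales[Amount] )
    MEASURE Sales[Sales MTD] =
        TOTALMTD ( [Total Sales], 'Date'[Date] )
    MEASURE Sales[Sales YoY %] =
        VAR PriorYear =
            CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
        RETURN
            DIVIDE ( [Total Sales] - PriorYear, PriorYear )
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Year],
    'Date'[Month],
    "Total Sales", [Total Sales],
    "Sales MTD", [Sales MTD],
    "Sales YoY %", [Sales YoY %]
)
ORDER BY 'Date'[Year], 'Date'[Month]

In a production model, these measures would be defined once in the dataset itself rather than in the query, so every report, Excel workbook, or third-party client sees exactly the same numbers.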
Comparing the many tools available, Power BI lets us move all of this logic into the semantic layer. Power Query is a very powerful tool for transforming data to build that semantic layer, and the resulting layer can serve both enterprise reporting and self-service BI.
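As mentioned under criterion 2, row-level security in Power BI is defined as a DAX filter expression on a table in the model. The sketch below assumes a hypothetical SalespersonEmail column on the Sales table; it only shows the shape of such a rule, not a production design.

// Hypothetical row filter for a "Salesperson" role on the Sales table:
// each signed-in user sees only the rows tied to their own account.
Sales[SalespersonEmail] = USERPRINCIPALNAME ()

Because the rule lives in the semantic layer, it applies automatically to every report, Excel workbook, or XMLA client that connects to the dataset.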
Summary
The newly proposed concept simplifies the overall data flow and gives us a flexible, agile BI system.
1) It is simple: We only need to extract and load data into the staging area, then transform it into the semantic layer with a focus on end users.
2) It supports both enterprise reporting and self-service BI: As the semantic layer (the Power BI dataset) is reusable, we can use it in different ways as needed.
3) It is agile: If we want to enhance or change the BI system, we can react quickly, as there are only two major steps to achieve our goal, with little impact on other parts.
4) It is low cost: We use the cloud and simplify data flow.
5) It is low risk because we can change systems easily.
Note: I will continue to enhance this article as I get further involved in BI solutions.