Power BI incremental refresh speeds up processing large datasets by reloading only new or changed data instead of the entire table. It works by partitioning a table based on a date/time column and a user-defined policy. You configure this by creating RangeStart and RangeEnd parameters in Power Query, filtering the table, and then setting the refresh policy in the Power BI service. Subsequent refreshes then only process the data in the "refresh period" (e.g., the last 5 days), while the "archive period" holds static historical data.
Key concepts
Partitioning: Power BI automatically divides a table into smaller segments (partitions), typically by date (days, months, years), to manage and refresh data more efficiently.
Refresh Policy: You define a policy that specifies how much historical data to keep (archive period) and how much recent data to refresh each time (refresh period).
RangeStart and RangeEnd Parameters: These are special, case-sensitive parameters you create in Power Query to define the boundaries for filtering the data. The Power BI service uses these to dynamically filter the data during a refresh.
Rolling Window: With each refresh, the window of data being refreshed moves forward in time. The oldest data in the refresh window is then dropped, and new data is loaded, keeping the dataset up to date without re-processing everything.
Key concepts in detail:
1.The connection model must be import, while direct Query is not applied.
It also supports Hybrid Tables, letting Power BI store old data in Import mode and keep recent data in DirectQuery mode, giving you fast historical reporting and real-time updates in the same table.Hybrid Tables let Power BI store old data in Import mode and keep recent data in DirectQuery mode, giving you fast historical reporting and real-time updates in the same table.
2.Data source coming from a data source must support query folding.
There are 4 cases: 1) SQL Table – Connect directly to the table; query folding works automatically, so incremental refresh is fast. 2) SQL View – Recommended; all logic stays in SQL, and folding is preserved, giving strong performance. 3)M Steps That Still Fold – Simple Power Query steps (filter, rename, remove columns) still fold into SQL, so incremental refresh remains efficient. 4) M Steps That Break Folding – Complex M logic stops folding; Power BI must scan the whole table, so incremental refresh works but becomes slow.
3.The two parameters of RangeStart and RangeEnd must be setup.
The parameters must be used in a filter on the incremental-refresh table; otherwise refresh fails.
4.Incremental refresh applies for an individual table but shares with RangeStart and RangeEnd.
But RangeStart/RangeEnd are shared globally in the model.
5.The initial value for RangeStart and RangeEnd are used in desktop only, while the values of RangeStart and RangeEnd are dynamically determined with Power bi services.
Desktop: Uses sample values. Service: Automatically substitutes real values based on policy. Important clarification: Desktop preview typically returns only the last 5 days by default, unless using Power BI Desktop Preview mode with “Allow large dataset storage format”.
6.Users cannot download the PBIX file, if dataset is loaded with the incremental refresh.
The same behaviour applies also for "large models", the loaded file is only for "thin client", or report only.
How it works in practice
Initial Setup:
- In Power Query, create two parameters named RangeStart and RangeEnd with a date/time data type.
- Filter your table using these parameters so that rows are only included where the date column is greater than or equal to RangeStart AND less than RangeEnd.
- Publish the dataset to the Power BI service.
Defining the Policy:
- In the Power BI service, select your dataset and go to the incremental refresh settings.
- Set the "Archive data" period (e.g., "keep 1 year") and the "Incremental refresh" period (e.g., "refresh last 5 days").
- The service will now create partitions for your data based on these settings.
Subsequent Refreshes:
- When a refresh is triggered, Power BI will only execute queries for the data within the defined refresh period (e.g., the last 5 days).
- It will reload the data from the last 5 days, potentially updating existing rows, and adding new ones.
- The older data (e.g., older than 5 days) is not re-processed but remains in its existing partition.
Sample settings:
1.Identify Date column called OrderDate and filter data with RangeStart and RangeEnd, and then enable incremental refresh.
2.Specify stored rows with 5 years, and refresh rows with 10 days. Suppose that today Oct 1, 2020, then
2.1. Any data between Sep 21, 2020 and Oct 1, 2015 will be removed.
2.2 Any data change from Oct 1, 2015 to Sep 21, 2020, will NOT be refreshed, even if there are any changes, assuming that you don't enable Detect data changes, or you manually trigger historical refresh, or a model change forces a full refresh.
2.3. Any data change from Sep 21, 2020 to Oct 1, 2020 will be refreshed.
2.4 Behind the scene, Power BI issues 10 queries to database as
where OrderDate >='2020-09-30' and OrderDate < '2020-10-01'
where OrderDate >='2020-09-29' and OrderDate < '2020-09-30'
3.Enable Data Change detection, Data change detection does NOT refresh only changed rows. It refreshes whole partitions if Power BI detects a difference.
3.1. In traditional ETL, if you have a modified Date, then the process of incremental refresh can monitor that field, and only get rows that their date/time is after the latest date/time in that field in the previous refresh. This approach is still valid for a data warehouse as Power BI data source.
3.2. We need to specify the modified date from the table. BTW, This modified date can be derived from another a few columns. Notice that this is different from the OrderDate.
3.3 The logic is to check each day as a block, day by day. If the maximum last modified date from source is greater than the maximum last modified date from power BI cache, then all records will be refreshed, regardless of whether the last modified date is greater than the last modified date from power BI cache or not
4.Enable Only Refresh Complete Period: it depends on the period you selected when you set up the incremental refresh, you can only refresh when the period is complete. In this case, it is by day. Date time will be ignored.
5. More notes:
5.1.When data in the past has been deleted, there could be different solutions, such as using the soft delete or simply refresh the whole dataset.
5.2.When data model has been changed, a full refresh is an easy option to proceed. However, if the data amount is huge, then we need to figure out a better way to promote the dataset.
6. Further study:
Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh
No comments:
Post a Comment