While we know how to handle incremental refresh from the Microsoft documentation, the concept is more powerful than documented. We can apply incremental refresh in many cases that the documentation does not (or did not) cover, including native queries, queries without query folding, text files, and SharePoint lists and files, as long as you don't need DirectQuery. Note, however, that your query must be foldable if you need real-time analytics.
The key is the two date/time parameters RangeStart and RangeEnd. Based on my experience, we can do incremental refresh as long as we can write an M query that uses these two parameters. This solution works like magic for many big-data problems, regardless of whether the data comes from databases, files, or SharePoint. The article Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh is very helpful for a deeper understanding.
The basic concepts of incremental refresh:
1. The two parameters RangeStart and RangeEnd must be set up.
2. Incremental refresh is configured per individual table, but all tables share the same RangeStart and RangeEnd; we can have different policies for different tables.
3. The initial values of RangeStart and RangeEnd are used in Power BI Desktop only, while in the Power BI service their values are determined dynamically based on the defined policy.
4. The connection mode can be extended to DirectQuery besides Import.
5. Incremental refresh can work even if the data source doesn't support query folding. If you use a native query, you don't need to turn it into a view to resolve this issue.
6. Users cannot download the PBIX file once the dataset has been loaded with incremental refresh.
The normal settings:
1. Identify the date column, here called OrderDate, filter it with RangeStart and RangeEnd, and then enable incremental refresh.
2. Specify storing rows for the last 5 years and refreshing rows for the last 10 days. Suppose that today is Oct 1, 2020; then:
2.1. Any data before Oct 1, 2015 will be removed.
2.2. Any data from Oct 1, 2015 to Sep 21, 2020 will NOT be refreshed, even if it has changed.
2.3. Any data from Sep 21, 2020 to Oct 1, 2020 will be refreshed.
2.4. Behind the scenes, Power BI issues 10 queries to the database, one per day, such as:
where OrderDate >= '2020-09-30' and OrderDate < '2020-10-01'
where OrderDate >= '2020-09-29' and OrderDate < '2020-09-30'
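The full set of daily filters can be sketched in M. This is only an illustration of what the service generates, using the 10-day window and the Oct 1, 2020 anchor date from the example above:

```m
// Sketch only: generate the 10 daily WHERE clauses Power BI would issue,
// assuming today is Oct 1, 2020 and the refresh window is 10 days.
let
    RefreshDays = 10,
    Anchor = #date(2020, 10, 1),
    Days = List.Dates(Date.AddDays(Anchor, -RefreshDays), RefreshDays, #duration(1, 0, 0, 0)),
    Filters = List.Transform(Days, each
        "where OrderDate >= '" & Date.ToText(_, "yyyy-MM-dd") &
        "' and OrderDate < '" & Date.ToText(Date.AddDays(_, 1), "yyyy-MM-dd") & "'")
in
    Filters
```

The first item covers Sep 21 to Sep 22 and the last covers Sep 30 to Oct 1, matching the two sample clauses above.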
3. Enable "Detect data changes":
3.1. In traditional ETL, if you have a last-modified-date column, the incremental refresh process can monitor that field and fetch only rows whose date/time is after the latest value seen in the previous refresh. This approach is still valid for a data warehouse as a Power BI data source. The major difference is that the detection only applies to the refresh period, here Sep 21, 2020 to Oct 1, 2020.
3.2. We need to specify the modified-date column from the table. This modified date can also be derived from a few other columns. Notice that it is different from OrderDate.
3.3. The logic checks each day as a block, day by day. If the maximum last modified date from the source is greater than the maximum last modified date in the Power BI cache, then all records of that day are refreshed, regardless of whether each individual record's last modified date is greater than the cached maximum or not.
4. Enable "Only refresh complete periods": depending on the period you selected when you set up the incremental refresh, only complete periods are refreshed. In this case the period is a day, so the time portion is ignored.
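The effect of the complete-period option can be sketched as truncating the upper refresh boundary to midnight (the datetime value below is hypothetical):

```m
// Sketch: with "only refresh complete days" enabled, the in-progress day is
// skipped because the effective upper boundary is truncated to midnight.
let
    Now = #datetime(2020, 10, 1, 14, 30, 0),           // hypothetical refresh time
    EffectiveRangeEnd = DateTime.From(Date.From(Now))  // midnight of Oct 1, 2020
in
    EffectiveRangeEnd
```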
5. More notes:
5.1. When data in the past has been deleted, there are different solutions, such as using soft deletes or simply refreshing the whole dataset.
5.2. When the data model has changed, a full refresh is the easy option. However, if the data volume is huge, we need to figure out a better way to promote the dataset.
Incremental refresh with a large amount of data from databases
RangeStart
#datetime(2022, 7, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime", IsParameterQueryRequired=true]
RangeEnd
#datetime(2022, 7, 31, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime", IsParameterQueryRequired=true]
DateKey
let
Source = (x as datetime) => Date.Year(x)*10000 + Date.Month(x)*100 + Date.Day(x)
in
Source
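A quick check of what the DateKey function defined above returns:

```m
// Usage sketch: DateKey turns a datetime into a YYYYMMDD integer,
// i.e. 2022*10000 + 7*100 + 1.
let
    Example = DateKey(#datetime(2022, 7, 1, 0, 0, 0))  // 20220701
in
    Example
```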
TableX - with DateKey
let
    StartDateKey = DateKey(RangeStart),
    EndDateKey = DateKey(RangeEnd),
    SQLstmt = "SELECT COLUMNS...,YYYYMMDD FROM TableX where YYYYMMDD >= " & Text.From(StartDateKey) & " and YYYYMMDD < " & Text.From(EndDateKey),
    Source = Sql.Database("SERVERNAME", "DATABASENAME", [Query=SQLstmt])
in
    Source
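As a hypothetical alternative, where the connector supports it, the native query can be wrapped with Value.NativeQuery and EnableFolding = true, so the RangeStart/RangeEnd filter is applied as a normal step and still folds into the SQL statement:

```m
// Hypothetical variant; assumes the connector supports folding over Value.NativeQuery.
let
    Source = Value.NativeQuery(
        Sql.Database("SERVERNAME", "DATABASENAME"),
        "SELECT COLUMNS...,YYYYMMDD FROM TableX",   // same elided column list as above
        null,
        [EnableFolding = true]),
    Filtered = Table.SelectRows(Source, each
        [YYYYMMDD] >= DateKey(RangeStart) and [YYYYMMDD] < DateKey(RangeEnd))
in
    Filtered
```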
TableX - without DateKey
let
    SQLstmt = "SELECT
    Columns ...
    ,PartitionDate
    FROM TableX "
    & " WHERE cast(PartitionDate as datetime) >= '" & DateTime.ToText(RangeStart)
    & "' and cast(PartitionDate as datetime) < '" & DateTime.ToText(RangeEnd) & "'",
    Source = Sql.Database("SERVERNAME", "DATABASENAME", [Query=SQLstmt])
in
    Source
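A hypothetical variant of the same query uses query parameters instead of string concatenation, which sidesteps quoting and date-format issues (@RangeStart and @RangeEnd are parameter names supplied through the record argument):

```m
// Hypothetical sketch using Value.NativeQuery with named parameters.
let
    SQLstmt = "SELECT Columns..., PartitionDate FROM TableX " &
              "WHERE cast(PartitionDate as datetime) >= @RangeStart " &
              "and cast(PartitionDate as datetime) < @RangeEnd",
    Source = Value.NativeQuery(
        Sql.Database("SERVERNAME", "DATABASENAME"),
        SQLstmt,
        [RangeStart = RangeStart, RangeEnd = RangeEnd])
in
    Source
```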
Incremental refresh with many large csv files
TableX
let
Source = Folder.Files(FILEFOLDER),
#"Filter File Name+Extension" = Table.SelectRows(Source, each Text.StartsWith([Name], FileType) and ([Extension] = ".csv") ),
#"Added Column FileDate YYYYMMDD" = Table.AddColumn(#"Filter File Name+Extension", "YYYYMMDD", each Text.Range([Name],Text.Length(FileType),8),Int64.Type),
#"Grouped Rows" = Table.Group(#"Added Column FileDate YYYYMMDD", {"YYYYMMDD"}, {{"FileWithTopVersion", each List.Max([Name]), type text}, {"All", each _, type table [Content=binary, Name=text, Extension=text, Date accessed=nullable datetime, Date modified=nullable datetime, Date created=nullable datetime, Attributes=nullable record, Folder Path=text, YYYYMMDD=number]}}),
#"Expanded All" = Table.ExpandTableColumn(#"Grouped Rows", "All", {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}, {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
#"Files with Top Versions" = Table.SelectRows(#"Expanded All", each ([FileWithTopVersion] = [Name])),
#"Changed Type0" = Table.TransformColumnTypes(#"Files with Top Versions",{{"YYYYMMDD", Int64.Type}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type0", each [YYYYMMDD] > DateKey(RangeStart) and [YYYYMMDD] <= DateKey(RangeEnd)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Content"}),
...
#"Invoke Custom Function" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File", each FUNCTIONX([Content])),
...
#"Merged Queries" = Table.NestedJoin
...
#"Keep columns as specified" = Table.SelectColumns ...
in
#"Keep columns as specified"
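The folder query above assumes a file-name convention of FileType prefix, then an 8-digit YYYYMMDD, then an optional version suffix. A small sketch of the extraction step (the prefix and file name below are hypothetical):

```m
// Sketch: extract the YYYYMMDD part that the folder query filters on.
let
    FileType = "Sales",                          // assumed prefix
    Name = "Sales20220701_v2.csv",               // hypothetical file name
    YYYYMMDD = Number.FromText(Text.Range(Name, Text.Length(FileType), 8))  // 20220701
in
    YYYYMMDD
```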
Questions and Answers:
Q1: What is the difference between Power BI Desktop and Power BI services in terms of incremental loading?
1. Power BI Desktop uses the specified values of the RangeStart and RangeEnd parameters, which gives developers a huge advantage in managing the pbix file:
1) Dramatically reduced refresh time, by 10-100 times.
2) Dramatically reduced file size, by 10-100 times.
3) Dramatically reduced publish time, by 10+ times depending on your network speed.
2. The Power BI service uses the policy defined on RangeStart and RangeEnd, which also brings many advantages compared with a full refresh:
1) After publishing, the dataset is fully refreshed once. However, the SQL statements sent to the database are partitioned, which reduces the database workload. (see picture below)
2) Manual or scheduled refreshes load only the incremental data defined by the policy, which is much smaller, such as the last 2 months instead of the last 5 years. (see picture below)
Q2: How is the last modified date/time used?
First of all, please see the picture below. A partition is refreshed only if the maximum modified date/time from the database is greater than the maximum modified date/time held in Power BI memory.
When data can be modified far in the past, such as 5 years ago, we have to define the incremental refresh window as 5 years. The incremental refresh then effectively becomes a "FULL" refresh, except that it runs with partitions. The last modified date/time can be used to reduce such a refresh from all periods to a few periods. For example, if all modifications happened within the last year, we only need to refresh this year instead of all years.
The first time:
1. Promote DatasetX from DEV to TEST, configure the Gateway with the DEV connection, and keep the DBConnection parameter the same as in DEV.
4. Change the Gateway to the correct connection and refresh.