Monday, August 1, 2022

Unleash the power of Power BI incremental refresh


While we know how to handle incremental refresh from the Microsoft documentation, this concept can be much more powerful than documented. For example, we can apply incremental refresh in many cases that are (or were) not covered by the documentation, including native queries, queries without query folding, text files, and SharePoint lists and files, as long as you don't need DirectQuery. However, your query must be foldable if you need real-time analytics.

The key is the two datetime parameters RangeStart and RangeEnd. Based on my experience, we can do an incremental refresh as long as we can write an M query that uses these two parameters. This solution works like magic for many big-data refresh problems, regardless of whether the data comes from databases, files, or SharePoint. The post Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh is very helpful for a deeper understanding.
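For example, here is a minimal sketch of such a filter step over a database table (SERVERNAME, DATABASENAME, and TableX are the placeholder names used later in this post; the dbo schema is an assumption; OrderDate is the date column used throughout):

let
    Source = Sql.Database("SERVERNAME", "DATABASENAME"),
    TableX = Source{[Schema = "dbo", Item = "TableX"]}[Data],
    // The half-open filter (>= RangeStart, < RangeEnd) is exactly what makes a
    // table eligible for incremental refresh; here it also folds to the server
    Filtered = Table.SelectRows(TableX, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    Filtered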

The basic concepts of incremental refresh:

1. The two datetime parameters RangeStart and RangeEnd must be set up.

2. Incremental refresh is configured per table, but all tables share the same RangeStart and RangeEnd parameters; we can have different policies for different tables.

3. The initial values of RangeStart and RangeEnd are used in Power BI Desktop only, while in the Power BI service their values are determined dynamically based on the defined policy.

4. Besides Import, the storage mode can be extended to DirectQuery for the latest partition (real-time data).

5. Incremental refresh still works when the data source doesn't support query folding. If you use a native query, you don't need to turn it into a view to work around the folding limitation.

6. Users cannot download the PBIX file once the dataset has been refreshed with an incremental refresh policy in the service.

The normal settings:

1. Identify the date column, here called OrderDate, filter it with RangeStart and RangeEnd, and then enable incremental refresh.

2. Specify "store rows in the last 5 years" and "refresh rows in the last 10 days". Suppose today is Oct 1, 2020; then

2.1. Any data before Oct 1, 2015 will be removed.

2.2. Any data from Oct 1, 2015 to Sep 21, 2020 will NOT be refreshed, even if it has changed.

2.3. Any data change from Sep 21, 2020 to Oct 1, 2020 will be refreshed.

2.4. Behind the scenes, Power BI issues 10 queries (one per day) to the database, such as:

where OrderDate >='2020-09-30' and OrderDate < '2020-10-01' 

where OrderDate >='2020-09-29' and OrderDate < '2020-09-30' 

3. Enable "detect data changes":

3.1. In traditional ETL, if you have a last-modified date column, the incremental process can monitor that column and pick up only rows whose date/time is later than the latest value seen in the previous refresh. This approach is still valid when a data warehouse is the Power BI data source. The major difference is that here the detection only applies to the refresh window, Sep 21, 2020 to Oct 1, 2020.

3.2. We need to specify the last-modified date column on the table; this column can be derived from a few other columns. Notice that it is different from OrderDate.

3.3. The logic checks each day as a block, day by day: if the maximum last-modified date from the source is greater than the maximum last-modified date in the Power BI cache, then all records of that day are refreshed, regardless of whether each individual record's last-modified date is newer than the cached one (see the sketch after this list).

4. Enable "only refresh complete periods": depending on the period granularity selected in the incremental refresh policy, a period is refreshed only once it is complete. In this case the period is a day, so the current, incomplete day is excluded; the time portion is ignored.

5. More notes:

5.1. When data in the past has been deleted, there are different solutions, such as using soft deletes in the source or simply refreshing the whole dataset.

5.2. When the data model has been changed, a full refresh is the easy option. However, if the data volume is huge, we need to figure out a better way to promote the dataset.
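As mentioned in 3.3, the service decides per partition whether to refresh. Conceptually, the polling query it issues per day can be sketched in M like this (a simplification for illustration; the LastModifiedDate column and the dbo schema are assumptions):

let
    Source = Sql.Database("SERVERNAME", "DATABASENAME"),
    TableX = Source{[Schema = "dbo", Item = "TableX"]}[Data],
    // One day's partition, bounded by the policy-generated RangeStart/RangeEnd
    Partition = Table.SelectRows(TableX, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd),
    // The service compares this value with the one cached at the previous
    // refresh; if it is greater, the whole partition is refreshed
    MaxModified = List.Max(Partition[LastModifiedDate])
in
    MaxModified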

Incremental refresh with a large amount of data from databases

RangeStart

#datetime(2022, 7, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime", IsParameterQueryRequired=true]


RangeEnd

#datetime(2022, 7, 31, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime", IsParameterQueryRequired=true]


DateKey

let
    // Convert a datetime into an integer date key in YYYYMMDD form,
    // e.g. #datetime(2022, 7, 1, 0, 0, 0) -> 20220701
    Source = (x as datetime) => Date.Year(x) * 10000 + Date.Month(x) * 100 + Date.Day(x)
in
    Source

TableX  - with DateKey

let
    // Translate the policy window into integer date keys
    StartDateKey = DateKey(RangeStart),
    EndDateKey = DateKey(RangeEnd),
    // Half-open filter (>= start, < end) so that partitions never overlap
    SQLstmt = "SELECT COLUMNS..., YYYYMMDD FROM TableX WHERE YYYYMMDD >= "
        & Text.From(StartDateKey) & " AND YYYYMMDD < " & Text.From(EndDateKey),
    Source = Sql.Database("SERVERNAME", "DATABASENAME", [Query = SQLstmt])
in
    Source

TableX  - without DateKey

let
    // Build the native query directly from the datetime parameters; format
    // them explicitly so SQL Server can parse the literals
    SQLstmt = "SELECT
    Columns ...
    ,PartitionDate
    FROM TableX "
        & " WHERE CAST(PartitionDate AS datetime) >= '" & DateTime.ToText(RangeStart, [Format = "yyyy-MM-dd HH:mm:ss"])
        & "' AND CAST(PartitionDate AS datetime) < '" & DateTime.ToText(RangeEnd, [Format = "yyyy-MM-dd HH:mm:ss"]) & "'",
    Source = Sql.Database("SERVERNAME", "DATABASENAME", [Query = SQLstmt])
in
    Source
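If you want to keep a native query and still get query folding (which the real-time/DirectQuery case requires), Value.NativeQuery with the EnableFolding option is worth trying, as shown in the query-folding video in the references. A sketch, assuming PartitionDate is a datetime column:

let
    Source = Sql.Database("SERVERNAME", "DATABASENAME"),
    // EnableFolding = true asks Power Query to fold later steps on top of the
    // native query instead of treating its result as opaque
    Native = Value.NativeQuery(Source, "SELECT Columns..., PartitionDate FROM TableX", null, [EnableFolding = true]),
    Filtered = Table.SelectRows(Native, each [PartitionDate] >= RangeStart and [PartitionDate] < RangeEnd)
in
    Filtered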

Incremental refresh with many large CSV files

TableX


let
    Source = Folder.Files(FILEFOLDER),
    // Keep only the CSV files whose names start with the FileType prefix
    #"Filter File Name+Extension" = Table.SelectRows(Source, each Text.StartsWith([Name], FileType) and ([Extension] = ".csv")),
    // The 8 characters after the prefix carry the file date as YYYYMMDD
    #"Added Column FileDate YYYYMMDD" = Table.AddColumn(#"Filter File Name+Extension", "YYYYMMDD", each Text.Range([Name], Text.Length(FileType), 8), type text),
    // Per date, keep only the file with the highest (latest) version name
    #"Grouped Rows" = Table.Group(#"Added Column FileDate YYYYMMDD", {"YYYYMMDD"}, {{"FileWithTopVersion", each List.Max([Name]), type text}, {"All", each _, type table [Content = binary, Name = text, Extension = text, #"Date accessed" = nullable datetime, #"Date modified" = nullable datetime, #"Date created" = nullable datetime, Attributes = nullable record, #"Folder Path" = text, YYYYMMDD = text]}}),
    #"Expanded All" = Table.ExpandTableColumn(#"Grouped Rows", "All", {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}, {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
    #"Files with Top Versions" = Table.SelectRows(#"Expanded All", each ([FileWithTopVersion] = [Name])),
    #"Changed Type0" = Table.TransformColumnTypes(#"Files with Top Versions", {{"YYYYMMDD", Int64.Type}}),
    // Only files whose date key falls into the refresh window are loaded
    #"Filtered Rows" = Table.SelectRows(#"Changed Type0", each [YYYYMMDD] > DateKey(RangeStart) and [YYYYMMDD] <= DateKey(RangeEnd)),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows", {"Content"}),
    ...
    #"Invoke Custom Function" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File", each FUNCTIONX([Content])),
    ...
    #"Merged Queries" = Table.NestedJoin
    ...
    #"Keep columns as specified" = Table.SelectColumns ...
in
    #"Keep columns as specified"


Questions and Answers:

Q1: What is the difference between Power BI Desktop and the Power BI service in terms of incremental loading?

1. Power BI Desktop uses the specified initial values of RangeStart and RangeEnd, which gives developers a huge advantage in managing the PBIX file:

1) Dramatically reduced refresh time, by 10-100 times.

2) Dramatically reduced file size, by 10-100 times.

3) Dramatically reduced publish time, by 10+ times depending on your network speed.

2. The Power BI service uses the policy defined for RangeStart and RangeEnd, which also brings many advantages compared with a full refresh:

1) On publishing, the Power BI dataset is fully refreshed. However, the SQL statements sent to the database are partitioned, which reduces the database workload (see the picture below).

2) A manual or scheduled refresh loads only the incremental data defined by the policy, which is much smaller, such as the last 2 months instead of the last 5 years (see the picture below).

[Screenshots: partitioned refresh queries issued by the Power BI service]
Q2: How is the last-modified datetime used?

First of all, please see the picture below. A partition is refreshed only if the maximum last-modified datetime from the database is greater than the maximum last-modified datetime cached in Power BI memory.

When data in past periods, even 5 years back, can be modified, we have to define the incremental refresh window as 5 years. The incremental refresh then effectively becomes a "FULL" refresh, just split into partitions. The last-modified datetime can reduce such a refresh from all periods to a few: for example, if all modifications happen within the current year, we only need to refresh this year instead of all years.

[Screenshot: comparing the maximum last-modified datetime per partition]

Q3: What does "only refresh complete periods" mean?
We refresh data only for complete periods. For example, with monthly periods, if we are in the middle of December, December is not yet complete, so its data is not refreshed until the month ends.
Q4: How do we determine the date column used to refresh and the other date column used to detect data changes?
Refreshing means deleting data from Power BI memory and inserting data into it. Two columns are involved in incremental refresh:
1) The partition column: the partition date, such as the order date, created date, or snapshot date, which should never be updated by the source system.
2) The detection column: the data used to decide whether a partition needs to be refreshed. The last-modified date is the latest datetime at which a record changed; it can be derived from multiple dates in the source, as sketched below. You can also use the ETL load or update date.
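A minimal sketch of deriving the last-modified column from several source dates (the inline table and its CreatedDate/UpdatedDate columns are hypothetical):

let
    // Hypothetical source with several change-tracking dates
    Source = #table(
        type table [CreatedDate = datetime, UpdatedDate = nullable datetime],
        {{#datetime(2020, 1, 1, 0, 0, 0), #datetime(2020, 6, 1, 0, 0, 0)}}
    ),
    // Derive the last-modified datetime as the maximum of the candidate dates
    AddLastModified = Table.AddColumn(Source, "LastModified",
        each List.Max({[CreatedDate], [UpdatedDate]}), type datetime)
in
    AddLastModified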

Q5: Can incremental refresh be supported with a native query from SQL Server?
No, if you want to support real-time data: according to the Microsoft documentation, real-time data is retrieved with dynamically generated SQL statements, so the query must be foldable. If the query is not foldable, we can't retrieve real-time data.
Yes, if we don't need to support real-time data, which covers most cases. Microsoft might enhance this functionality; this is the status as of the end of 2022.

Q6: Can incremental refresh be supported with the deployment pipeline?
No, if we want to change the database connection with rules, such as from DEV to TEST and then to the production environment. We can see it in Power BI Desktop: the database connection rule can't be applied because the data source shown below can't be changed.

[Screenshot: deployment pipeline data source rules]
Yes, if we use a parameterized database connection to specify the connection dynamically, as sketched after these steps.
The first time:
1. Promote DatasetX from DEV to TEST; configure the gateway with the DEV connection and keep the DBConnection parameter the same as DEV.
2. Change DBConnection to the TEST server name, apply it, and save it in the deployment pipeline as a RULE. (Trick: change the data source first, then change the parameter, then revert the data source change, and save.)
3. Redeploy DatasetX from DEV to TEST.
4. Change the gateway to the correct connection and refresh.
The second time, just redeploy DatasetX from DEV to TEST; no changes are needed.
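A sketch of the parameterized connection (DEVSERVER and DATABASENAME are placeholders; DBConnection follows the same parameter pattern as RangeStart and RangeEnd):

DBConnection

"DEVSERVER" meta [IsParameterQuery=true, Type="Text", IsParameterQueryRequired=true]

TableX - source step

let
    // The deployment pipeline rule overrides the DBConnection parameter per
    // stage, so the same dataset points to DEV, TEST, or production unchanged
    Source = Sql.Database(DBConnection, "DATABASENAME")
in
    Source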

Reference:
https://www.thepoweruser.com/2020/01/19/incremental-refresh-for-files-in-a-folder-or-sharepoint-power-bi/ - Incremental Refresh for Files in a Folder or SharePoint in Power BI
https://www.youtube.com/watch?v=x7q1DHf8wE4 - Native Queries or ODBC with Incremental Refresh in Power BI
https://www.youtube.com/watch?v=8hjdOCni_ZY - Enable QUERY FOLDING for native queries in Power Query / Power BI
https://www.youtube.com/watch?v=5AWt6ijJG94 - Avoid the full refresh with Incremental Refresh in Power BI (Premium)
https://www.youtube.com/watch?v=JsJWBr1_ktQ - Power BI Incremental Refresh - Understanding Detect Data Changes
