Executive Summary
Organizations increasingly rely on Power BI to analyze data stored in shared folders containing hundreds or even thousands of flat files produced by modern Big Data platforms. While this file-based architecture supports strong data management practices—such as lineage, auditability, and replayability—it creates a major refresh performance challenge: Power BI tries to read every file during every refresh unless incremental refresh is implemented correctly.
This blog provides:
- A complete, practical blueprint for files-based incremental loading
- Clear explanations of the three types of refresh behaviors
- Deep insight into why performance issues occur
- A comprehensive list of common pitfalls, with reasons and consequences
When implemented properly, incremental refresh transforms a slow, hour-long refresh into a fast, stable, 5–10 minute refresh—even with years of historical files.
1. Why Do Big Data Platforms Produce So Many Files?
Modern data platforms are designed around append-only ingestion. Instead of updating existing data, they continuously write small, time-stamped files—daily, hourly, or by batch. This pattern:
- Ensures auditability
- Preserves full historical lineage
- Allows efficient reprocessing and rollback
- Aligns with cloud data lake best practices
The result is inevitable: massive file counts in shared locations accessed by Power BI.
To make Power BI perform well against such sources, incremental loading is essential.
2. What Is Files-Based Incremental Loading?
Databases support incremental refresh naturally because filters on date columns can be folded back to the source and served by indexes and partitions. Flat files offer neither.
To mimic database-like behavior for folder sources, Power BI must:
- Extract a date value from each file name (FileDateTime).
- Use RangeStart and RangeEnd parameters to select only the relevant files.
- Apply incremental refresh policies that freeze historical data and refresh only recent partitions.
This reduces unnecessary work and provides consistent, efficient refresh cycles.
3. High-Level Architecture
The architecture follows a clean, layered design:
- Folder Source: Power BI connects to a shared folder containing many files.
- Staging Layer: Extracts a date value from each file name, then filters files using RangeStart and RangeEnd, before expanding file contents.
- Fact Layer: Expands and transforms only the filtered files.
- Incremental Refresh Engine: Power BI Service partitions and manages historical vs. incremental data.
This separation ensures Power BI reads only what is needed.
4. Core Implementation Pattern
4.1 Create RangeStart and RangeEnd parameters
These DateTime parameters define the boundaries of the incremental refresh window. Desktop uses them literally; the Service replaces them dynamically.
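As a sketch, the two parameters correspond to the following M. Each parameter is its own query in Power BI (created via Manage Parameters); they are shown together here, and the dates are placeholders for a deliberately small development window. Note the type must be DateTime, not Date.

```m
// Parameter query "RangeStart": must be type DateTime (not Date).
// The value is a placeholder; the Service overrides it at refresh time.
#datetime(2023, 1, 1, 0, 0, 0)
    meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]

// Parameter query "RangeEnd": kept two days after RangeStart so Desktop
// only ever loads a handful of files during development.
#datetime(2023, 1, 3, 0, 0, 0)
    meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]
```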
4.2 Derive FileDateTime from the file name
For example, if a file name begins with a date such as “20230102,” interpret it as January 2, 2023. This becomes the date used for filtering and partitioning.
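A minimal sketch of that step in Power Query M, assuming file names begin with a yyyyMMdd prefix such as 20230102_sales.csv (the column name FileDateTime and the name pattern are illustrative):

```m
// "Source" is the table returned by Folder.Files; its [Name] column
// holds file names like "20230102_sales.csv".
AddFileDateTime = Table.AddColumn(
    Source,
    "FileDateTime",
    each #datetime(
        Number.FromText(Text.Range([Name], 0, 4)),  // year  "2023"
        Number.FromText(Text.Range([Name], 4, 2)),  // month "01"
        Number.FromText(Text.Range([Name], 6, 2)),  // day   "02"
        0, 0, 0),
    type datetime)
```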
4.3 Filter early, before expanding file contents
Filter the list of files so only those whose FileDateTime is within the desired window are expanded and processed.
The correct pattern is:
- FileDateTime greater than or equal to RangeStart
- FileDateTime strictly less than RangeEnd
4.4 Apply incremental refresh
Define how many years of history to keep and how many days/weeks/months to refresh.
Power BI then creates partitions and ensures only recent partitions are refreshed going forward.
5. Understanding the Three Refresh Scenarios
Power BI handles refreshes differently in Desktop and in the Service. Misunderstanding these differences often leads to wrong expectations or incorrect troubleshooting.
Scenario 1 — Power BI Desktop: Development Mode
What actually happens
Desktop loads exactly the files that fall between the RangeStart and RangeEnd parameters you define.
It does not create partitions.
It does not perform dynamic date substitution.
It does not optimize file scanning.
Reason
Desktop is built for development and debugging—not for processing large historical datasets.
What to expect
- Small date window → fast and smooth
- Large date window → very slow, memory-heavy, or crashes
Result if misused
Developers mistakenly think incremental refresh “doesn’t work” because Desktop behaves like a full refresh when too many files are included.
Best practice
Use only a small date testing window (1–3 days).
Scenario 2 — Power BI Service: Initial Full Load (First Refresh After Publishing)
What actually happens
Power BI Service completely ignores the Desktop parameter values and instead loads:
- The entire historical dataset required by your incremental refresh policy
- All partitions needed for historical storage
This is the only time Power BI performs a complete historical load.
Reason
Power BI must build the entire partition structure for incremental refresh to work.
What to expect
- This refresh may take a long time, depending on data volume.
- It prepares the model for efficient future incremental refresh cycles.
Result if misunderstood
Teams conclude that incremental refresh is “slow,” not realizing this first run is a required, one-time setup step.
Republishing the dataset from Desktop resets the partition structure and forces another full load.
Best practice
Allow the first refresh to complete and avoid unnecessary republishing.
Scenario 3 — Power BI Service: Ongoing Incremental Refresh (Normal Operation)
What actually happens
After partitions are built:
- Only recent partitions (for example, the last 7 days) are refreshed
- Older partitions remain frozen
- Only new files are processed
Reason
Incremental refresh is designed to process only new data (“delta” refresh).
What to expect
- Large performance improvement
- Stable refresh durations
- Minimal system load
Result if implemented incorrectly
If filtering is incorrect, file structures are bad, or partitions are too granular, Power BI may still scan all files or reprocess excessive partitions.
Best practice
Optimize filtering, folder structure, and partition granularity to maximize efficiency.
6. Benefits of Files-Based Incremental Loading
- Massively reduced refresh time
- Stable, predictable refresh performance
- Historical data becomes immutable and reliable
- Reduced workload on gateways and file systems
- More efficient troubleshooting and data governance
7. Common Pitfalls & How to Avoid Them
Pitfall 1 — Incorrect Filter Pattern (Not Left-Inclusive, Right-Exclusive)
Reason
If the filter allows FileDateTime values on both boundaries, partitions overlap.
Result
- Duplicate rows
- Incorrect totals
- Reprocessed files
- Wrong business results
Fix
Always include the lower boundary and exclude the upper boundary.
Pitfall 2 — Too Many Partitions (Daily/Weekly/Monthly)
Reason
Power BI creates one partition per time unit, and each partition's query enumerates the source folder independently.
If you define sixty daily partitions, Power BI scans the entire folder sixty times.
Result
- Extremely long refresh
- Heavy load on gateway and file server
- Incremental refresh becomes slower than full refresh
Fix
Use weekly or monthly partitions for large file volumes, and keep the refresh window short.
Pitfall 3 — Keeping All Files in One Large Flat Folder
Reason
File systems do not support partition-aware scanning.
Power BI must enumerate every file repeatedly across partitions.
Result
- Very slow metadata scanning
- Refresh performance degrades as the folder grows
- Service and gateway load grows with the file count
- Incremental refresh loses effectiveness
Fix
Organize files into year/month subfolders so Power BI can skip entire folders.
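For example, a year/month layout might look like this (the share path and file names are illustrative):

```
\\server\share\Sales\
    2022\
        01\  20220101_sales.csv ... 20220131_sales.csv
        02\  ...
    2023\
        01\  20230101_sales.csv ...
```

One design note: Folder.Files still enumerates every file recursively, so to actually skip whole folders the staging query can use Folder.Contents instead, navigating only into the year and month folders that fall inside the refresh window.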
Pitfall 4 — Filtering Files Too Late in the Query
Reason
If filtering is applied after expanding file contents, Power BI must load every file before deciding which ones to discard.
Result
- Essentially becomes a full refresh
- Very slow processing
- High memory usage
Fix
Apply FileDateTime filtering immediately after reading folder metadata.
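Putting the pattern together, a staging query might look like the following sketch. The share path, yyyyMMdd name prefix, and CSV content are assumptions; the key point is that step 3 runs before any [Content] is touched, so excluded files are never opened.

```m
let
    // 1. Enumerate folder metadata only; no file contents are read yet
    Source = Folder.Files("\\server\share\Sales"),

    // 2. Derive FileDateTime from the yyyyMMdd prefix of the file name
    AddFileDateTime = Table.AddColumn(Source, "FileDateTime",
        each #datetime(
            Number.FromText(Text.Range([Name], 0, 4)),
            Number.FromText(Text.Range([Name], 4, 2)),
            Number.FromText(Text.Range([Name], 6, 2)),
            0, 0, 0),
        type datetime),

    // 3. Filter BEFORE expanding: only files in the window survive
    FilteredFiles = Table.SelectRows(AddFileDateTime,
        each [FileDateTime] >= RangeStart and [FileDateTime] < RangeEnd),

    // 4. Only the surviving files are parsed and combined
    Parsed = Table.AddColumn(FilteredFiles, "Data",
        each Table.PromoteHeaders(Csv.Document([Content]))),
    Combined = Table.Combine(Parsed[Data])
in
    Combined
```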
Pitfall 5 — Large Parameter Window in Desktop
Reason
Desktop uses the parameter values literally and loads all files in the range.
Result
- Desktop freezes or crashes
- Development slows dramatically
Fix
Keep Desktop ranges small (1–3 days).
8. Summary
Folder-based incremental refresh is powerful—but delicate.
When implemented correctly, it resolves one of the biggest performance barriers in Power BI: efficiently handling very large collections of flat files.
Success depends on:
- Using the correct filtering logic
- Managing partition granularity
- Designing a smart folder structure
- Filtering early
- Using small development windows
- Understanding Desktop vs. Service refresh behavior
With these principles, even large multi-year file archives refresh efficiently, reliably, and predictably.