Executive Summary
Power BI makes it easy to ingest data from folders containing many files, but when the volume reaches thousands of files and multiple gigabytes, refresh performance becomes a serious concern. A common misunderstanding is assuming that when multiple Power Query objects reference the same base folder query, Power BI automatically “reuses” that data. Unfortunately, this is not how the Power BI refresh engine works.
When downstream queries depend on a Power Query folder source—and all are enabled for load—Power BI evaluates each query independently. This means the folder is scanned multiple times during refresh, even though the queries appear to share the same base logic. The impact can be dramatic: refreshes become slow, unstable, and unnecessarily heavy on the underlying storage system.
This document explains why this happens, what the risks are, and how to redesign the dataset so the folder is loaded once, while all other logic is rebuilt efficiently inside the model using DAX calculated tables. This approach eliminates redundant IO, improves refresh performance, and clarifies a widely misunderstood aspect of Power Query behavior.
Requirements
The dataset contains three Power Query components:
- Query 1
  - Reads from a folder containing ~2,000 files totaling ~4 GB.
  - Represents the detailed transactional dataset.
- Query 2
  - References Query 1, applying additional transformations.
- Query 3
  - Also references Query 1, extracting a different view of the same data.
All three queries are currently enabled for load into Power BI.
The main requirement is:
Refresh the dataset efficiently and ensure the large folder is not scanned multiple times.
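As a sketch, the current setup looks roughly like this in M. The source does not show the actual transformation steps, so the filter and column selection below are illustrative assumptions only:

```
// Query 2 – references Query 1; the filter step is hypothetical
let
    Source = #"Query 1",
    Transformed = Table.SelectRows(Source, each [Amount] > 0)
in
    Transformed

// Query 3 – also references Query 1; the column selection is hypothetical
let
    Source = #"Query 1",
    View = Table.SelectColumns(Source, {"Date", "Customer", "Amount"})
in
    View
```

Both queries start from `#"Query 1"`, which is exactly the pattern that looks like reuse but, as discussed below, is not.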
Analysis
Although Queries 2 and 3 are created by referencing Query 1 in the Power Query editor, referencing does not imply reuse or caching during refresh. Power BI treats every loaded query as an independent evaluation path.
Behind the scenes:
- Each loaded query is evaluated separately
- Each query's full dependency chain is executed
- Folder connectors do not share cached results
This means:
- Query 1 triggers a full folder scan
- Query 2 triggers another full folder scan
- Query 3 triggers yet another full folder scan
Total effect:
- Three full scans of ~2,000 files
- Three ingestion cycles of ~4 GB each
- ~12 GB of IO per refresh instead of ~4 GB
This leads to:
- Slow refresh durations
- Higher chance of refresh timeouts
- Increased load on upstream storage
- Poor scalability as logic grows
This is a common misunderstanding: referencing a query in Power Query does not guarantee shared evaluation.
Solution
To eliminate redundant folder scans, the dataset should be redesigned so that:
✔ Only one Power Query object loads the large folder
This ensures the folder is scanned once.
✔ Downstream logic is removed from Power Query
Queries derived from the folder source should not be loaded, preventing multiple evaluations.
✔ All additional “views” are rebuilt inside the model using DAX calculated tables
These tables mimic the original Power Query outputs but are computed in-memory, without touching the folder source again.
This design achieves:
- A single ingestion point
- Faster refresh performance
- Reduced infrastructure load
- A cleaner, more maintainable model
Implementation
1. Keep Query 1 as the only loaded Power Query object
Query 1 remains the sole location where the folder is read:
- Query 1 → Enable Load = On
This becomes the single ingestion pipeline for the dataset.
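A minimal sketch of such a folder query, assuming CSV content and a hypothetical path (the actual connector settings and file format are not given in the source):

```
// Query 1 – the only query with Enable Load = On
let
    // Hypothetical folder path; ~2,000 files, ~4 GB in total
    Source = Folder.Files("C:\Data\Transactions"),
    // Parse each file's binary content as CSV (headers not yet promoted)
    Parsed = List.Transform(Source[Content], each Csv.Document(_)),
    // Append all file tables into one transactional table
    Combined = Table.Combine(Parsed)
in
    Combined
```

Because this is the only loaded query, the folder is enumerated and read exactly once per refresh.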
2. Disable load for Query 2 and Query 3
These downstream queries should not trigger additional refresh paths.
In Power Query:
- Right-click Query 2 → uncheck Enable Load
- Right-click Query 3 → uncheck Enable Load
They remain as reference definitions, but no longer evaluate during refresh.
3. Rebuild the logic of Query 2 and Query 3 using DAX calculated tables
After Query 1 loads into the model, replicate the logic of the original Power Query transformations using DAX calculated tables.
This includes:
- Summaries
- Filters
- “Latest record” logic
- Grouping or aggregation
- Business-specific views
These DAX tables are:
- Computed once after refresh
- Built from the in-memory data of Query 1
- Functionally identical to the old Power Query outputs
- Much faster and more reliable
- Not dependent on folder scans
Once created, they replace the old Query 2 and Query 3 in the model.
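Two hedged examples of what such calculated tables might look like. The table name 'Query 1' matches the loaded query; the column names (Customer, Amount, Date) are assumptions, since the source does not describe the schema:

```
-- Summary view (e.g., replacing a grouping step from Query 2)
Sales Summary =
SUMMARIZECOLUMNS (
    'Query 1'[Customer],                          -- hypothetical column
    "Total Amount", SUM ( 'Query 1'[Amount] )     -- hypothetical column
)

-- "Latest record" view (e.g., replacing Query 3)
Latest Records =
VAR MaxDate = MAX ( 'Query 1'[Date] )             -- hypothetical column
RETURN
    FILTER ( 'Query 1', 'Query 1'[Date] = MaxDate )
```

Both tables are recomputed from the in-memory copy of Query 1 after each refresh, so they add no folder IO.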
4. Remap the data model and reports
With the new DAX tables in place:
- Update relationships to point to the new calculated tables
- Adjust measures and calculated columns if needed
- Update report visuals to reference the new tables
- Remove Query 2 and Query 3 from the model after migration is complete
From the user’s perspective, all visuals and logic behave exactly the same.
From the engine’s perspective, refresh is now far more efficient.
5. Validate the refresh and reporting behavior
After implementation:
- Refresh the dataset
- Confirm that refresh time is significantly reduced
- Verify that the folder is scanned only once
- Ensure all reports continue to function correctly
- Finalize model cleanup
This completes the migration from a multi-scan design to a single-scan, optimized architecture.
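One quick sanity check is a DAX query (run from DAX query view or DAX Studio) comparing row counts between the base table and a replacement table. 'Summary View' below is a placeholder name for whichever calculated table replaced the old query output:

```
EVALUATE
ROW (
    "Base rows", COUNTROWS ( 'Query 1' ),
    "Replacement rows", COUNTROWS ( 'Summary View' )   -- hypothetical table name
)
```

If the counts match expectations from the old design, the migration preserved the outputs.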
Conclusion
A major misunderstanding in Power BI development is assuming that referencing a Power Query object automatically reuses previously loaded data. When dealing with large folder-based sources, this misunderstanding can lead to multiple unnecessary full-folder scans and severe performance degradation.
The correct architectural approach is:
- Load the folder once
- Disable downstream Power Query loads
- Rebuild the additional views as DAX calculated tables
- Remap the model and reports accordingly
This ensures a scalable, efficient, and maintainable dataset that refreshes quickly and avoids redundant IO.