It is often required to retrieve most recent record when building
ETL to create data mart. The source table is normally very big, which make us pay
attention to performance .The performance could be an issue even with data warehousing
appliance like Netezza database. Three methods are introduced in this document.
Though many people use the method 1 (Use
a Complete Subquery), it is not a good solution, as this method could results
in duplicated records, and poor performance. Based on my experience, the method
3 (Use Window Functions) is best option,
as it provides you no duplicates and best performance.
SELECT
F.*
FROM SourceFact F
INNER
JOIN (SELECT
SourceID,Max(TIMESTAMP_SUBMITTED)
AS LAST_TIMESTAMP_SUBMITTED
FROM SourceFact
GROUP BY SourceID) F1
ON
F.SourceID = F1.SourceID
AND
F.TIMESTAMP_SUBMITTED = F1.last_TIMESTAMP_SUBMITTED;
if more than one sourceID with the same TIMESTAMP_SUBMITTED,
then
more than records for this sourceID will
be selected, which breaks the uniqueness of this query. Also, the performance is
very poor when the table SourceFact is very big.
Use
Correlated Subqueries
Correlated subqueries are subqueries that depend on the
outer query. It’s like a for loop in SQL. The subquery will run once for each
row in the outer query:
SELECT
*
FROM SourceFact
WHERE ( SourceID, TIMESTAMP_SUBMITTED )
IN
(SELECT SourceID,Max(TIMESTAMP_SUBMITTED) AS LAST_TIMESTAMP_SUBMITTED
FROM SourceFact
GROUP BY SourceID);
Use
Window Functions
Window function is little complicated, but it yields much
better performance, 10-100 times for big table. In addition, it will grantee
the uniqueness, which is extremely important in data warehouse design.
SELECT
F.*
FROM (SELECT
ROW_NUMBER() OVER
(
PARTITION BY
SourceID
ORDER BY
TIMESTAMP_SUBMITTED desc) LAST_TIMESTAMP_SUBMITTED,
*
FROM SourceFact) F
WHERE F.last_TIMESTAMP_SUBMITTED = 1;
No comments:
Post a Comment