Wednesday, February 1, 2017

how to write a effective SQL query to retrieve most recent records

It is often required to retrieve most recent record when building ETL to create data mart. The source table is normally very big, which make us pay attention to performance .The performance could be an issue even with data warehousing appliance like Netezza database. Three methods are introduced in this document.  Though many people use the method 1 (Use a Complete Subquery), it is not a good solution, as this method could results in duplicated records, and poor performance. Based on my experience, the method 3 (Use Window Functions)  is best option, as it provides you no duplicates and best performance.

 Use a Complete Subquery

SELECT F.*

FROM   SourceFact F

INNER JOIN (SELECT SourceID,Max(TIMESTAMP_SUBMITTED) AS LAST_TIMESTAMP_SUBMITTED

            FROM   SourceFact

            GROUP  BY SourceID) F1

ON F.SourceID = F1.SourceID

AND F.TIMESTAMP_SUBMITTED = F1.last_TIMESTAMP_SUBMITTED;

 

if more than one sourceID with the same TIMESTAMP_SUBMITTED, then more than  records for this sourceID will be selected, which breaks the uniqueness of this query. Also, the performance is very poor when the table SourceFact is very big.

 
Use Correlated Subqueries
Correlated subqueries are subqueries that depend on the outer query. It’s like a for loop in SQL. The subquery will run once for each row in the outer query:

SELECT *

FROM   SourceFact

WHERE  ( SourceID, TIMESTAMP_SUBMITTED )

IN (SELECT SourceID,Max(TIMESTAMP_SUBMITTED) AS LAST_TIMESTAMP_SUBMITTED

    FROM   SourceFact

    GROUP  BY SourceID);

 This solution is even worse for big table, as it result in very poor performance.
 

Use Window Functions
Window function is little complicated, but it yields much better performance, 10-100 times for big table. In addition, it will grantee the uniqueness, which is extremely important in data warehouse design.

SELECT F.*

FROM   (SELECT ROW_NUMBER() OVER (

        PARTITION BY SourceID

ORDER BY TIMESTAMP_SUBMITTED desc)      LAST_TIMESTAMP_SUBMITTED,

          *

        FROM   SourceFact) F

WHERE  F.last_TIMESTAMP_SUBMITTED = 1;  

No comments:

Post a Comment