Monday, December 23, 2013

ROW_NUMBER(), RANK(), and DENSE_RANK() – Flexibility at a Price

One of the most handy features introduced in SQL 2005 were the ranking functions; ROW_NUMBER(), RANK(), and DENSE_RANK(). For anyone who hasn’t been introduced to these syntactic gems, here’s a quick rundown (for those of you who are very familiar with these functions already, feel free to read through, or skip right down to “There’s No Such Thing as a Free Ride” below).
OK – so the general syntax for any one of these commands is more or less the same:
ROW_NUMBER() OVER ([<partition_by_clause>] <order_by_clause>) RANK() OVER ([<partition_by_clause>] <order_by_clause>) DENSE_RANK() OVER ([<partition_by_clause>] <order_by_clause>) 
So the PARTITION BY part is optional, but everything else is required. An example of a non partitioned, and then a partitioned ROW_NUMBER() clause are listed below:
ROW_NUMBER() OVER (ORDER BY TotalDue DESC) ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC) 
The difference between the three functions is best explained using an example. Here’s the data I’m using for this example, in case you want to follow the bouncing ball at home ;-)
CREATE TABLE OrderRanking

   (

   OrderID INT IDENTITY(1,1) NOT NULL,

   CustomerID INT,

   OrderTotal decimal(15,2)

   )

   INSERT OrderRanking (CustomerID, OrderTotal)
SELECT 1, 1000
UNION 
SELECT 1, 500
UNION 
SELECT 1, 650
UNION 
SELECT 1, 3000
UNION 
SELECT 2, 1000
UNION 
SELECT 2, 2000
UNION 
SELECT 2, 500
UNION 
SELECT 2, 500
UNION 
SELECT 3, 500
I’ll use the following (admittedly ugly) query to demonstrate the difference between each function:
SELECT  *,

        ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS RN,

        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RNP,

        RANK() OVER (ORDER BY OrderTotal DESC) AS R,

        RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RP,

        DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS DR,

        DENSE_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS DRP
FROM    OrderRanking
ORDER BY OrderTotal DESC
Excuse the terrible aliases. Anything longer and the code snippets and output in this blog entry get really, really ugly. When we run the query, this is what we get:




So from the example above, we can see that:
  • ROW_NUMBER() assigns sequential numbers to each partition in a result set (an unpartitioned result set simply has a single partition), based upon the order of the results as specified in the ORDER BY clause. If you look carefully, you’ll see that the values in column RN are based upon a simple sort of TotalDue, while the values in Column RNP (Row_Number partitioned) are first partitioned or “grouped” by CustomerID, and then numbered by TotalDue, with the row number resetting on change of customer.
  • Contrary to popular belief, RANK() does not sort rows based upon how bad they smell. RANK() does much the same thing as ROW_NUMBER(), only it acknowledges ties in the columns specified in the ORDER BY clause, and assigns them the same rank. Where a tie occurs (as was the case for orders 6/3, and 1/5/8), the numbers that would otherwise have been “used up” are skipped, and numbering resumes at the next available number. As you can see, RANK() leaves a gap whenever there is a tie.
  • DENSE_RANK() doesn’t like gaps. It’s more of an Abercrombie & Fitch kind of function (ba-dum-ching!). Ohhhhh…that was terrible. My sense of humour may give me up for lent. You might follow it. Anyway….DENSE_RANK() “fills in the gaps”. It starts from the next number after a tie occurs, so instead of 1, 2, 3, 3, 5 you get 1, 2, 3, 3, 4.
There’s No Such Thing as a Free Ride
Ranking functions are not only useful for simple ranking – they’re also great for solving complex problems. In fact, once you get to know them, you’ll find that you’re using them for waaaaaay more than just ranking. In anything from splitting strings to deleting duplicates, ranking functions are the cat’s meow.
But just like most things in life SQL Server, less is more, when you can get away with it. For example, let’s say that you need to get the top order (by TotalDue) for each Customer in AdventureWorks (AdventureWorks2008 in my examples below). You can definitely use ROW_NUMBER (or any of the other ranking functions, for that matter) to do this:
SELECT soh.*
FROM   (SELECT CustomerID, SalesOrderID, TotalDue, 
               ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC) AS RowNumber

       FROM    Sales.SalesOrderHeader) AS soh
WHERE  soh.RowNumber = 1
The WHERE soh.RowNumber = 1 restricts our results to the top order for each customer. Lovely. And the really beautiful thing about this is, if you need the top 2 orders for each customer, or 3, or 4, or x, all you need to do is replace the = 1 with <=2 (for example), and you’re good to go. Now, with that in mind, let’s look at this query:
SELECT soh.CustomerID, soh.TotalDue
FROM   Sales.SalesOrderHeader soh
JOIN   (SELECT     CustomerID, MAX(TotalDue) AS MaxTotalDue

       FROM        Sales.SalesOrderHeader

       GROUP BY CustomerID) AS ttls   ON soh.CustomerID = ttls.CustomerID

                                       AND soh.TotalDue = ttls.MaxTotalDue
If you plug this bad boy in, and run it, you might be surprised by the outcome. It’s actually about half as expensive as the Row_Number solution – but why? Well, as you may or may not know, sorts can be very, very expensive in SQL Server. If we’re only fetching the highest $ sales order for a given customer, the MAX solution does it without a sort, whereas the ROW_NUMBER solution needs to sort (the ORDER BY clause is mandatory, remember).
But there are some caveats to the MAX solution – most notably, how in the world can we get the top 5 orders for each customer? Well…the short answer is, we can’t. We need to change the query up, and in doing so, we’re once again going to incur a sort. Once we get beyond a query that the MAX or MIN tricks can satisfy – for instance, if we need to fetch the top 5 orders for each customer, we may as well take advantage of the ease of coding, and the improved readability of the Row_Number solution. If we want a solution for the “top 5” problem without invoking a ranking function, we’re going to end up with something like this:
SELECT soh.CustomerID, soh.TotalDue
FROM   Sales.SalesOrderHeader soh
WHERE  soh.SalesOrderID IN  
       (SELECT     TOP 5 SalesOrderID

       FROM        Sales.SalesOrderHeader soh2

       WHERE       soh2.CustomerID = soh.CustomerID

       ORDER BY TotalDue DESC)
Which in this case is a very, very crappy alternative to a ranking function. Not only is it uglier, but the query plan isn’t nearly as efficient, and the execution times were consistently about 20% longer in my tests.
Now that said, my tests are against a single data set only, and based upon the nature of your data, your mileage may vary. As a general rule, I would use aggregate functions if I’m only looking for the highest or lowest data point in a series, and a ranking function for anything that can’t be solved by simple aggregation.

No comments:

Post a Comment