MySQL - Select Top N Rows Per Group
It is a very common task to select top N rows from each group in MySQL when doing data analysis and reporting. You might use this to get the top 3 selling products for each product category, get the highest paid employee in each department, find out the top 10 popular posts in your blog, and so on. We will go over modern SQL techniques of answering top N rows per group questions using Row_Number, Rank, and Dense_Rank window functions.
Example Data
A school has collected student score data for Math and English classes. We would like to perform analysis of top student scores in each class.
create table student_score (
class_name VARCHAR(9),
student_login VARCHAR(50),
score INT
);
insert into student_score (class_name, student_login, score) values ('math', 'gacreman0', 100);
insert into student_score (class_name, student_login, score) values ('english', 'ewoodington1', 87);
insert into student_score (class_name, student_login, score) values ('math', 'ctilliard2', 72);
insert into student_score (class_name, student_login, score) values ('english', 'gzorer3', 81);
insert into student_score (class_name, student_login, score) values ('english', 'swyre4', 75);
insert into student_score (class_name, student_login, score) values ('math', 'ppadgett8', 82);
insert into student_score (class_name, student_login, score) values ('math', 'mreynalds9', 100);
insert into student_score (class_name, student_login, score) values ('math', 'dlettena', 100);
insert into student_score (class_name, student_login, score) values ('math', 'dbartellib', 86);
insert into student_score (class_name, student_login, score) values ('english', 'tlorinezc', 87);
insert into student_score (class_name, student_login, score) values ('english', 'ddeftied', 81);
Use
DB-Fiddle to execute SQL scripts on sample data.
Select Top N Rows Per Group with Row_Number
If the business requirements are to always return exactly 3 top students in each class, Row_Number
window function should be used.
When there are tied scores, sort student_login ascendingly to determine which student_login to pick.
Since the PARTITION BY clause is present, the rankings are reset for rows in each class.
SELECT *
,row_number() OVER (
PARTITION BY class_name ORDER BY score DESC
,student_login ASC
) AS row_num
FROM student_score
Here is the completed script to get exactly 3 top students in each class with Row_Number
.
WITH score_analysis
AS (
SELECT *,
row_number() OVER (
PARTITION BY class_name ORDER BY score DESC,
student_login ASC
) AS row_num
FROM student_score
)
SELECT class_name,
student_login,
score
FROM score_analysis
WHERE row_num <= 3;
Here is the output of the query.
class_name | student_login | score |
---|---|---|
english | ewoodington1 | 87 |
english | tlorinezc | 87 |
english | ddeftied | 81 |
math | dlettena | 100 |
math | gacreman0 | 100 |
math | mreynalds9 | 100 |
Select Top N Rows Per Group with Dense_Rank
If the business requirements are to always
report a first, second, and third place for each class, regardless of score ties for each place,
Dense_Rank
window function is required to achieve this. Each place could have 1 or multiple students.
If two or more rows tie, there will be no gap in the sequence of ranked values. Since the PARTITION BY clause is present, the rankings are reset for rows in each class.
SELECT *
,dense_rank() OVER (
PARTITION BY class_name ORDER BY score DESC
) AS dense_rank_num
FROM student_score
Here is the completed script to always have first, second, and third place students in each class with Dense_Rank
.
WITH score_analysis
AS (
SELECT *,
dense_rank() OVER (
PARTITION BY class_name ORDER BY score DESC
) AS dense_rank_num
FROM student_score
)
SELECT class_name,
student_login,
score
FROM score_analysis
WHERE dense_rank_num <= 3;
Here is the output of the query.
class_name | student_login | score |
---|---|---|
english | ewoodington1 | 87 |
english | tlorinezc | 87 |
english | gzorer3 | 81 |
english | ddeftied | 81 |
english | swyre4 | 75 |
math | gacreman0 | 100 |
math | mreynalds9 | 100 |
math | dlettena | 100 |
math | dbartellib | 86 |
math | ppadgett8 | 82 |
Select Top N Rows Per Group with Rank
If the business requirements are to allow the possibility of no second or third place, Rank
function can be used to achieve this. Each place could have 0, 1, or multiple students.
Rows with equal scores for the ranking criteria receive the same rank. The next rank is calculated by adding the number of tied rows to the tied rank. Therefore, Rank
function will leave gaps in the ranking values.
Since the PARTITION BY clause is present, the rankings are reset for rows in each class.
SELECT *
,rank() OVER (
PARTITION BY class_name ORDER BY score DESC
) AS rank_num
FROM student_score
Example 1: English class has 2 scores tied at first place so there is no second place. But there is a third place with 2 scores tied.
Example 2: Math class has 3 scores tied at first place, there will be no second or third place.
Here is the completed script to always have first, second, and third place students in each class with Dense_Rank
.
WITH score_analysis
AS (
SELECT *,
rank() OVER (
PARTITION BY class_name ORDER BY score DESC
) AS rank_num
FROM student_score
)
SELECT class_name,
student_login,
score
FROM score_analysis
WHERE rank_num <= 3;
Here is the output of the query.
class_name | student_login | score |
---|---|---|
english | ewoodington1 | 87 |
english | tlorinezc | 87 |
english | gzorer3 | 81 |
english | ddeftied | 81 |
math | gacreman0 | 100 |
math | mreynalds9 | 100 |
math | dlettena | 100 |
Conclusion
Hopefully through the examples discussed above, you can see that when a tie of two or more scores occurs, both Rank
and Dense_Rank
assign the same rank number to all records with the same value.
Where they differ is that Rank
function continues the rank value in line with the Row_Number
series so it skips the next available ranking value. This results in gaps in the ranking values.
Dense_Rank
will instead use the next chronological ranking value so it leaves no gaps in the ranking values.
As you can see that Row_Number
, Rank
, and Dense_Rank
window functions are very useful in analyzing data and answering Top N Rows Per Group types of questions.
It's good to have a good understanding on each of them so you can determine which one to use based on the business requirements.