MySQL - Island Analysis - Find Islands in Number Sequence
Understanding how to identify islands is useful in data analysis.
In this post, we will go over different techniques of using Self Join
and Window Functions Lead()
and Lag()
to identify islands in a number sequence.
Islands identified will be presented as a range.
Problem
The sample Test numbers table contains 4 islands. We need to create a SQL script to identify them.
number |
---|
1 |
2 |
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
20 |
Execution of the SQL script should output lower and upper bound of each island.
island_start | island_end |
---|---|
1 | 2 |
5 | 10 |
12 | 15 |
20 | 20 |
Example Data
create table Test (
number INT
);
-- Island 1
insert into Test (number) values (1);
insert into Test (number) values (2);
-- Island 2
insert into Test (number) values (5);
insert into Test (number) values (6);
insert into Test (number) values (7);
insert into Test (number) values (8);
insert into Test (number) values (9);
insert into Test (number) values (10);
-- Island 3
insert into Test (number) values (12);
insert into Test (number) values (13);
insert into Test (number) values (14);
insert into Test (number) values (15);
-- Island 4
insert into Test (number) values (20);
Use
DB-Fiddle to execute SQL scripts on sample data.
Solution - Find Islands Using Self Join
Find the Start Number of Each Island
We can left join
the table with itself on a join key offset by 1 to check if the current number is the start of an island.
This is true if we cannot find a previous number that's exactly 1 less than the current number. prev_num
column would be null
in this case.
SELECT l.number AS curr_num
,r.number AS prev_num
FROM Test l
-- Do we have a previous number in the sequence that's 1 less than the current number?
LEFT JOIN Test r ON l.number = r.number + 1
ORDER BY l.number
In the following illustration, we can see that curr_num
is the start number of an island if prev_num
is null.
Find the End Number of Each Island
We can left join
the table with itself on a join key offset by 1 to check if the current number is the end of an island.
This is true if we cannot find a next number that's exactly 1 greater than the current number. next_num
column would be null
in this case.
SELECT l.number AS curr_num
,r.number AS next_num
FROM Test l
-- Do we have a next number in the sequence that's 1 greater than the current number?
LEFT JOIN Test r ON l.number = r.number - 1
ORDER BY l.number
In the following illustration, we can see that curr_num
is the end number of an island if next_num
is null.
Present Islands as Range
The first CTE island_start_nums
stores start numbers of islands. Each number is sorted ascendingly and assigned an unique row_num
value.
The second CTE island_end_nums
stores end numbers of islands. Each number is sorted ascendingly and assigned an unique row_num
value.
Finally, we inner join the first and second CTE on row_num
key to present islands as range.
WITH island_start_nums
AS (
SELECT l.number AS island_start
,row_number() OVER (
ORDER BY l.number ASC
) AS row_num
FROM Test l
LEFT JOIN Test r ON l.number = r.number + 1
WHERE r.number IS NULL
)
,island_end_nums
AS (
SELECT l.number AS island_end
,row_number() OVER (
ORDER BY l.number ASC
) AS row_num
FROM Test l
LEFT JOIN Test r ON l.number = r.number - 1
WHERE r.number IS NULL
)
SELECT s.island_start
,e.island_end
FROM island_start_nums s
INNER JOIN island_end_nums e ON s.row_num = e.row_num;
The following screenshot illustrates that island_start_nums
CTE inner join island_end_nums
CTE on row_num
Key to present islands as range.
Here is the query output with the number range of each island.
island_start | island_end |
---|---|
1 | 2 |
5 | 10 |
12 | 15 |
20 | 20 |
Solution - Find Islands Using Row_Number
The first column number
contains a list of numbers that we need to find islands.
The second column row_num
contains the row number which makes use of the Row_Number
window function to generate a sequential numbers for our numbers.
The third column diff_group
contains the difference between column number
and column row_num
.
Each time the difference increases, we have identified a new island as the row number falls further behind our number.
SELECT number
,row_number() OVER (
ORDER BY number
) AS row_num
,number - row_number() OVER (
ORDER BY number
) AS diff_group
FROM Test t
The following screenshot illustrates the logics we described to identify islands in our number sequence.
We have successfully identified all the number islands. To present islands as range, we simply group by diff_group
and use Min
function to get the lower bound of each island and use Max
function to get the upper bound of each island.
WITH island_analysis
AS (
SELECT number
,row_number() OVER (
ORDER BY number
) AS row_num
,number - row_number() OVER (
ORDER BY number
) AS diff_group
FROM Test t
)
SELECT min(number) AS island_start
,max(number) AS island_end
FROM island_analysis i
GROUP BY diff_group;
Here is the same output with the number range of each island.
island_start | island_end |
---|---|
1 | 2 |
5 | 10 |
12 | 15 |
20 | 20 |
Solution - Find Islands Using Dense_Rank
The Row_Number
window function is useful in identifying islands within a number sequence that does NOT have duplicate numbers.
However, our results would be incorrect if we have duplicate numbers in our number sequence as you can see in the following illustration.
This is because Row_Number
always increments each row by 1 even for identical numbers.
This screenshot illustrates that our island results is now incorrect when a duplicate number 7 is introduced in the number sequence. Number 7 now belongs to two different islands! How do we adjust our solution to manage duplicates in the number sequence?
We could remove duplicates first before performing island analysis.
Alternatively, we can utilize the Dense_Rank
window function that returns the rank of each row with no gaps in the ranking values.
The rank of a specific row is one plus the number of distinct rank values that come before that specific row.
SELECT number
,dense_rank() OVER (
ORDER BY number
) AS row_num
,number - dense_rank() OVER (
ORDER BY number
) AS diff_group
FROM Test t
The screenshot illustrates that by simply substituting Row_Number
window function with Dense_Rank
window function, our island results will no longer be impacted by duplicates!
With the help of Dense_Rank
window function, we have again successfully identified all the number islands even if duplicate numbers.
To present islands as range, we simply group by diff_group
and use Min
function to get the lower bound of each island and use Max
function as before to get the upper bound of each island.
WITH island_analysis
AS (
SELECT number
,dense_rank() OVER (
ORDER BY number
) AS row_num
,number - dense_rank() OVER (
ORDER BY number
) AS diff_group
FROM Test t
)
SELECT min(number) AS island_start
,max(number) AS island_end
FROM island_analysis i
GROUP BY diff_group;
Here is the same output again with the number range of each island and we will be able to manage duplicates in case there is any in the number sequence.
island_start | island_end |
---|---|
1 | 2 |
5 | 10 |
12 | 15 |
20 | 20 |