Skip to main content

MySQL - Island Analysis - Find Islands in Number Sequence

Understanding how to identify islands is useful in data analysis. In this post, we will go over different techniques of using Self Join and Window Functions Lead() and Lag() to identify islands in a number sequence. Islands identified will be presented as a range.

Problem


The sample Test numbers table contains 4 islands. We need to create a SQL script to identify them.

number
1
2
5
6
7
8
9
10
12
13
14
15
20

Execution of the SQL script should output lower and upper bound of each island.

island_startisland_end
12
510
1215
2020

Example Data


create table Test (
number INT
);

-- Island 1
insert into Test (number) values (1);
insert into Test (number) values (2);

-- Island 2
insert into Test (number) values (5);
insert into Test (number) values (6);
insert into Test (number) values (7);
insert into Test (number) values (8);
insert into Test (number) values (9);
insert into Test (number) values (10);

-- Island 3
insert into Test (number) values (12);
insert into Test (number) values (13);
insert into Test (number) values (14);
insert into Test (number) values (15);

-- Island 4
insert into Test (number) values (20);

Use

DB-Fiddle to execute SQL scripts on sample data.

Solution - Find Islands Using Self Join


Find the Start Number of Each Island

We can left join the table with itself on a join key offset by 1 to check if the current number is the start of an island.

This is true if we cannot find a previous number that's exactly 1 less than the current number. prev_num column would be null in this case.

SELECT l.number AS curr_num
,r.number AS prev_num
FROM Test l
-- Do we have a previous number in the sequence that's 1 less than the current number?
LEFT JOIN Test r ON l.number = r.number + 1
ORDER BY l.number

In the following illustration, we can see that curr_num is the start number of an island if prev_num is null. Find the start of each island using self join

Find the End Number of Each Island

We can left join the table with itself on a join key offset by 1 to check if the current number is the end of an island.

This is true if we cannot find a next number that's exactly 1 greater than the current number. next_num column would be null in this case.

SELECT l.number AS curr_num
,r.number AS next_num
FROM Test l
-- Do we have a next number in the sequence that's 1 greater than the current number?
LEFT JOIN Test r ON l.number = r.number - 1
ORDER BY l.number

In the following illustration, we can see that curr_num is the end number of an island if next_num is null. Find the end of each island using self join

Present Islands as Range

The first CTE island_start_nums stores start numbers of islands. Each number is sorted ascendingly and assigned an unique row_num value.

The second CTE island_end_nums stores end numbers of islands. Each number is sorted ascendingly and assigned an unique row_num value.

Finally, we inner join the first and second CTE on row_num key to present islands as range.

WITH island_start_nums
AS (
SELECT l.number AS island_start
,row_number() OVER (
ORDER BY l.number ASC
) AS row_num
FROM Test l
LEFT JOIN Test r ON l.number = r.number + 1
WHERE r.number IS NULL
)
,island_end_nums
AS (
SELECT l.number AS island_end
,row_number() OVER (
ORDER BY l.number ASC
) AS row_num
FROM Test l
LEFT JOIN Test r ON l.number = r.number - 1
WHERE r.number IS NULL
)
SELECT s.island_start
,e.island_end
FROM island_start_nums s
INNER JOIN island_end_nums e ON s.row_num = e.row_num;

The following screenshot illustrates that island_start_nums CTE inner join island_end_nums CTE on row_num Key to present islands as range. Inner Join island_start_nums and island_end_nums CTE on row_num Key to Present Islands as Range

Here is the query output with the number range of each island.

island_startisland_end
12
510
1215
2020

Solution - Find Islands Using Row_Number


The first column number contains a list of numbers that we need to find islands.

The second column row_num contains the row number which makes use of the Row_Number window function to generate a sequential numbers for our numbers.

The third column diff_group contains the difference between column number and column row_num. Each time the difference increases, we have identified a new island as the row number falls further behind our number.

SELECT number
,row_number() OVER (
ORDER BY number
) AS row_num
,number - row_number() OVER (
ORDER BY number
) AS diff_group
FROM Test t

The following screenshot illustrates the logics we described to identify islands in our number sequence. Inner Join island_start_nums and island_end_nums CTE on row_num Key to Present Islands as Range

We have successfully identified all the number islands. To present islands as range, we simply group by diff_group and use Min function to get the lower bound of each island and use Max function to get the upper bound of each island.

WITH island_analysis
AS (
SELECT number
,row_number() OVER (
ORDER BY number
) AS row_num
,number - row_number() OVER (
ORDER BY number
) AS diff_group
FROM Test t
)
SELECT min(number) AS island_start
,max(number) AS island_end
FROM island_analysis i
GROUP BY diff_group;

Here is the same output with the number range of each island.

island_startisland_end
12
510
1215
2020

Solution - Find Islands Using Dense_Rank


The Row_Number window function is useful in identifying islands within a number sequence that does NOT have duplicate numbers.

However, our results would be incorrect if we have duplicate numbers in our number sequence as you can see in the following illustration. This is because Row_Number always increments each row by 1 even for identical numbers.

This screenshot illustrates that our island results is now incorrect when a duplicate number 7 is introduced in the number sequence. Number 7 now belongs to two different islands! Duplicates affecting island results when using Row_Number function How do we adjust our solution to manage duplicates in the number sequence?

We could remove duplicates first before performing island analysis.

Alternatively, we can utilize the Dense_Rank window function that returns the rank of each row with no gaps in the ranking values. The rank of a specific row is one plus the number of distinct rank values that come before that specific row.

SELECT number
,dense_rank() OVER (
ORDER BY number
) AS row_num
,number - dense_rank() OVER (
ORDER BY number
) AS diff_group
FROM Test t

The screenshot illustrates that by simply substituting Row_Number window function with Dense_Rank window function, our island results will no longer be impacted by duplicates! Duplicates no longer affecting island results when using Dense_Rank function

With the help of Dense_Rank window function, we have again successfully identified all the number islands even if duplicate numbers. To present islands as range, we simply group by diff_group and use Min function to get the lower bound of each island and use Max function as before to get the upper bound of each island.

WITH island_analysis
AS (
SELECT number
,dense_rank() OVER (
ORDER BY number
) AS row_num
,number - dense_rank() OVER (
ORDER BY number
) AS diff_group
FROM Test t
)
SELECT min(number) AS island_start
,max(number) AS island_end
FROM island_analysis i
GROUP BY diff_group;

Here is the same output again with the number range of each island and we will be able to manage duplicates in case there is any in the number sequence.

island_startisland_end
12
510
1215
2020