MySQL - Gap Analysis - Find Gaps in Number Sequence
Understanding how to identify gaps is useful in data analysis.
In this post, we will go over different techniques of using Self Join
and Window Functions Lead()
and Lag()
to identify gaps in a number sequence.
Instead of simply outputting a list of missing values, gaps will be presented as a range so it is more useful.
Problem
The sample Test numbers table contains 3 gaps. We need to create a SQL script to identify them.
number |
---|
1 |
2 |
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
20 |
Execution of the SQL script should output lower and upper bound of each gap.
gap_start | gap_end |
---|---|
3 | 4 |
11 | 11 |
16 | 19 |
Example Data
create table Test (
number INT
);
insert into Test (number) values (1);
insert into Test (number) values (2);
-- Gap 1
insert into Test (number) values (5);
insert into Test (number) values (6);
insert into Test (number) values (7);
insert into Test (number) values (8);
insert into Test (number) values (9);
insert into Test (number) values (10);
-- Gap 2
insert into Test (number) values (12);
insert into Test (number) values (13);
insert into Test (number) values (14);
insert into Test (number) values (15);
-- Gap 3
insert into Test (number) values (20);
Use
DB-Fiddle to execute SQL scripts on sample data.
Solution - Find Gaps Using Self Join
Our solution should identify 3 gaps in the test data for rows where a row value does not sequentially follow another.
1
2
5
6
7
8
9
10
12
13
14
15
20
We can left join the table with itself on a join key offset by 1 to check if there is a gap between the current number and the next number.
SELECT l.number
,r.number
FROM Test l
LEFT JOIN Test r ON l.number = r.number - 1 -- Do we have a next number in the sequence that's 1 greater than the current number?
WHERE l.number != (
SELECT max(number) -- Gap will never start after the maximum sequence number
FROM Test
)
AND r.number IS NULL -- Gap identified
We can left join the table with itself again on a join key offset by 1 to check if there is a gap between the current number and the previous number.
SELECT l.number
,r.number
FROM Test l
LEFT JOIN Test r ON l.number = r.number + 1 -- Do we have a previous number in the sequence that's 1 less than the current number?
WHERE l.number != (
SELECT min(number) -- Gap will never end before the minimum sequence number
FROM Test
)
AND r.number IS NULL -- Gap identified
After getting gap start points and end points, we can use a subquery to put together the upper bound and the lower bound of each gap in the same row.
WITH gap_starts
AS (
-- The end of consecutive numbers + 1 is the start of the gap
SELECT l.number + 1 AS gap_start
FROM Test l
LEFT JOIN Test r ON l.number = r.number - 1
WHERE l.number != (
SELECT max(number)
FROM Test
)
AND r.number IS NULL
)
,gap_ends
AS (
-- The start of consecutive numbers - 1 is the end of the gap
SELECT l.number - 1 AS gap_end
FROM Test l
LEFT JOIN Test r ON l.number = r.number + 1
WHERE l.number != (
SELECT min(number)
FROM Test
)
AND r.number IS NULL
)
SELECT *
,(
SELECT e.gap_end
FROM gap_ends e
WHERE e.gap_end >= s.gap_start
ORDER BY e.gap_end limit 1
) AS gap_end
FROM gap_starts s;
Solution - Find Gaps Using Lead
Window Function
Using Window function to identify gaps can make your codes more intuitive, succint, and readable to other developers.
By appying Lead()
function we can have a pair of the current number and the next number in the sequence for each row.
SELECT t.number
,lead(t.number, 1) OVER (
ORDER BY t.number
) AS next_number
FROM Test t
A gap is identified when the difference between the current and the next number is greater than 1.
By filtering only pairs where the difference between the two is greater than 1, add 1 to the current number and substract 1 from the next number to generate gap information.
WITH gap_analysis
AS (
SELECT t.number
,lead(t.number, 1) OVER (
ORDER BY t.number
) AS next_number
FROM Test t
)
SELECT g.number + 1 AS gap_start
,g.next_number - 1 AS gap_end
FROM gap_analysis g
WHERE g.next_number - g.number > 1;
Solution - Find Gaps Using Lag
Window Function
We could modify the above query slightly to use Lag()
function to produce a pair of the current number
and the previous number in the sequence for each row.
SELECT t.number
,lag(t.number, 1) OVER (
ORDER BY t.number
) AS prev_number
FROM Test t
A gap is identified when the difference between the current and the previous number is greater than 1.
By filtering only pairs where the difference between the two is greater than 1, add 1 to the previous number and substract 1 from the current number to generate gap information.
WITH gap_analysis
AS (
SELECT t.number
,lag(t.number, 1) OVER (
ORDER BY t.number
) AS prev_number
FROM Test t
)
SELECT g.prev_number + 1 AS gap_start
,g.number - 1 AS gap_end
FROM gap_analysis g
WHERE g.number - g.prev_number > 1;