MySQL - Find and Remove Duplicate Rows Based on Multiple Columns
It is common to face the issue of duplicate rows in database tables. Duplicate rows could exist in the data source we are importing the data from. It's best practice to perform pre-processing validation checks during import to detect and eliminate these dupliacte rows. In addition, we should set up appropriate table constraints to check and prevent duplicate records.
However, if we already have duplicate rows in existing tables, we'll need to clean them up to ensure data quality as the data will likely be used for downstream reporting.
This post will go through the different ways we can use to find and remove duplicate data from the MySQL EventTracking table we'll be using for this demo. The same techniques can be applied to other relational database management systems such as SQL Server, PostgreSQL, or Oracle.
Problem
We have an EventTracking table that tracks different customer actions so we can calculate various conversion rates such as install/download, purchase/install, purchase/download ratios.
The worker process that writes event logs into the EventTracking has a bug that intermittently insert duplicate rows for any of the customer actions. Fortunately, we can detect duplicate rows based on customer_id, event_name, and event_datetime columns.
For examples, 2 rows with exactly the same values in customer_id
, event_name
, and event_datetime
columns are considered as duplicates.
We have to clean up the table so we can produce accurate conversion rate reports.
Example Data
create table EventTracking (
event_tracking_id INT, -- Primary key
customer_id INT,
event_name VARCHAR(8),
event_datetime DATETIME
);
-- Duplicate rows are highlighted
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (1, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (2, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (3, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (4, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (5, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (6, 1, 'purchase', '2021-08-02 11:20:53');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (7, 2, 'download', '2022-02-02 11:20:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (8, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (9, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (10, 2, 'purchase', '2022-02-02 11:30:32');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (11, 2, 'purchase', '2022-02-02 11:30:32');
Use
DB-Fiddle to execute SQL scripts on sample data.
The sample EventTracking table contains duplicate rows for download, install and purchase events for both customers.
event_tracking_id | customer_id | event_name | event_datetime |
---|---|---|---|
1 | 1 | download | 2021-08-02 09:10:11 |
2 | 1 | download | 2021-08-02 09:10:11 |
3 | 1 | install | 2021-08-02 09:30:25 |
4 | 1 | install | 2021-08-02 09:30:25 |
5 | 1 | install | 2021-08-02 09:30:25 |
6 | 1 | purchase | 2021-08-02 11:20:53 |
7 | 2 | download | 2022-02-02 11:20:11 |
8 | 2 | install | 2022-02-02 11:22:54 |
9 | 2 | install | 2022-02-02 11:22:54 |
10 | 2 | purchase | 2022-02-02 11:30:32 |
11 | 2 | purchase | 2022-02-02 11:30:32 |
We will walk through 3 different ways to find and remove duplicate rows from the table so it will turn out like this.
event_tracking_id | customer_id | event_name | event_datetime |
---|---|---|---|
1 | 1 | download | 2021-08-02 09:10:11 |
3 | 1 | install | 2021-08-02 09:30:25 |
6 | 1 | purchase | 2021-08-02 11:20:53 |
7 | 2 | download | 2022-02-02 11:20:11 |
8 | 2 | install | 2022-02-02 11:22:54 |
10 | 2 | purchase | 2022-02-02 11:30:32 |
Solution: Delete Duplicate Rows Using Group By
Firstable, we create a virtual table with CTE to hold a list of event_tracking_id values we want to keep.
MIN()
function is used to get the minimum event_tracking_id of each group (customer_id, event_name, and event_datetime
).
For example, the exact same download was logged twice for customer_id 1. We will remove event_tracking_id 2 as a duplicate and only keep event_tracking_id 1.
Next, we execute a DELETE
statement to remove duplicate rows.
WITH tracking_ids_to_keep
AS (
SELECT MIN(event_tracking_id)
FROM EventTracking
GROUP BY customer_id
,event_name
,event_datetime
)
DELETE
FROM EventTracking
WHERE event_tracking_id NOT IN (
SELECT *
FROM tracking_ids_to_keep
);
Solution: Delete Duplicate Rows Using Row_Number()
Firstable, we create a virtual table with CTE that holds results of duplicate row detection.
We use Row_Number()
function to generate a row number for each row along with a PARTITION BY
clause on [customer_id], [event_name] and [event_datetime]
columns.
Row number restarts from 1 when the combined value of these 3 columns changes. Row number greater than 1 is then flagged as duplicate rows.
Next, we execute a DELETE
statement to remove rows flagged as duplicate.
WITH Duplicate_Rows_Detection
AS (
SELECT *
,CASE
WHEN (
row_number() OVER (
PARTITION BY customer_id
,event_name
,event_datetime
)
) > 1
THEN TRUE
ELSE FALSE
END AS is_duplicate
FROM EventTracking
)
DELETE et
FROM EventTracking et
INNER JOIN Duplicate_Rows_Detection dup ON et.event_tracking_id = dup.event_tracking_id
WHERE dup.is_duplicate = TRUE;
Solution: Remove Duplicate Rows Using Self Join
Another common solution is to self join on [customer_id], [event_name] and [event_datetime]
columns from the same EventTracking table
and then use the WHERE
clause condition to find event_tracking_id values to delete.
This is a good solution if you are using a MySQL version earlier than 8. It does not require CTE or Row_Number() Window Function that's supported only in MySQL version 8.
DELETE L
FROM EventTracking L
INNER JOIN EventTracking R ON L.customer_id = R.customer_id
AND L.event_name = R.event_name
AND L.event_datetime = R.event_datetime
WHERE L.event_tracking_id > R.event_tracking_id;