Skip to main content

MySQL - Find and Remove Duplicate Rows Based on Multiple Columns

It is common to face the issue of duplicate rows in database tables. Duplicate rows could exist in the data source we are importing the data from. It's best practice to perform pre-processing validation checks during import to detect and eliminate these dupliacte rows. In addition, we should set up appropriate table constraints to check and prevent duplicate records.

However, if we already have duplicate rows in existing tables, we'll need to clean them up to ensure data quality as the data will likely be used for downstream reporting.

This post will go through the different ways we can use to find and remove duplicate data from the MySQL EventTracking table we'll be using for this demo. The same techniques can be applied to other relational database management systems such as SQL Server, PostgreSQL, or Oracle.

Problem


We have an EventTracking table that tracks different customer actions so we can calculate various conversion rates such as install/download, purchase/install, purchase/download ratios.

The worker process that writes event logs into the EventTracking has a bug that intermittently insert duplicate rows for any of the customer actions. Fortunately, we can detect duplicate rows based on customer_id, event_name, and event_datetime columns.

For examples, 2 rows with exactly the same values in customer_id, event_name, and event_datetime columns are considered as duplicates.

We have to clean up the table so we can produce accurate conversion rate reports.

Example Data


Duplicate Rows Highlighted
create table EventTracking (
event_tracking_id INT, -- Primary key
customer_id INT,
event_name VARCHAR(8),
event_datetime DATETIME
);
-- Duplicate rows are highlighted
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (1, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (2, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (3, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (4, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (5, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (6, 1, 'purchase', '2021-08-02 11:20:53');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (7, 2, 'download', '2022-02-02 11:20:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (8, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (9, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (10, 2, 'purchase', '2022-02-02 11:30:32');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (11, 2, 'purchase', '2022-02-02 11:30:32');

Use

DB-Fiddle to execute SQL scripts on sample data.

The sample EventTracking table contains duplicate rows for download, install and purchase events for both customers.

event_tracking_idcustomer_idevent_nameevent_datetime
11download2021-08-02 09:10:11
21download2021-08-02 09:10:11
31install2021-08-02 09:30:25
41install2021-08-02 09:30:25
51install2021-08-02 09:30:25
61purchase2021-08-02 11:20:53
72download2022-02-02 11:20:11
82install2022-02-02 11:22:54
92install2022-02-02 11:22:54
102purchase2022-02-02 11:30:32
112purchase2022-02-02 11:30:32

We will walk through 3 different ways to find and remove duplicate rows from the table so it will turn out like this.

event_tracking_idcustomer_idevent_nameevent_datetime
11download2021-08-02 09:10:11
31install2021-08-02 09:30:25
61purchase2021-08-02 11:20:53
72download2022-02-02 11:20:11
82install2022-02-02 11:22:54
102purchase2022-02-02 11:30:32

Solution: Delete Duplicate Rows Using Group By


Firstable, we create a virtual table with CTE to hold a list of event_tracking_id values we want to keep. MIN() function is used to get the minimum event_tracking_id of each group (customer_id, event_name, and event_datetime).

For example, the exact same download was logged twice for customer_id 1. We will remove event_tracking_id 2 as a duplicate and only keep event_tracking_id 1.

Solution: Delete Duplicate Rows Using Group By

Next, we execute a DELETE statement to remove duplicate rows.

Delete Duplicate Rows Group_By
WITH tracking_ids_to_keep
AS (
SELECT MIN(event_tracking_id)
FROM EventTracking
GROUP BY customer_id
,event_name
,event_datetime
)
DELETE
FROM EventTracking
WHERE event_tracking_id NOT IN (
SELECT *
FROM tracking_ids_to_keep
);

Solution: Delete Duplicate Rows Using Row_Number()


Firstable, we create a virtual table with CTE that holds results of duplicate row detection. We use Row_Number() function to generate a row number for each row along with a PARTITION BY clause on [customer_id], [event_name] and [event_datetime] columns. Row number restarts from 1 when the combined value of these 3 columns changes. Row number greater than 1 is then flagged as duplicate rows.

Solution: Delete Duplicate Rows Using Row_Number()

Next, we execute a DELETE statement to remove rows flagged as duplicate.

Delete Duplicate Rows with Row_Number()
WITH Duplicate_Rows_Detection
AS (
SELECT *
,CASE
WHEN (
row_number() OVER (
PARTITION BY customer_id
,event_name
,event_datetime
)
) > 1
THEN TRUE
ELSE FALSE
END AS is_duplicate
FROM EventTracking
)
DELETE et
FROM EventTracking et
INNER JOIN Duplicate_Rows_Detection dup ON et.event_tracking_id = dup.event_tracking_id
WHERE dup.is_duplicate = TRUE;

Solution: Remove Duplicate Rows Using Self Join


Another common solution is to self join on [customer_id], [event_name] and [event_datetime] columns from the same EventTracking table and then use the WHERE clause condition to find event_tracking_id values to delete.

Solution: Remove Duplicate Rows Using Self Join

This is a good solution if you are using a MySQL version earlier than 8. It does not require CTE or Row_Number() Window Function that's supported only in MySQL version 8.

Delete Duplicate Rows with Self Join
DELETE L
FROM EventTracking L
INNER JOIN EventTracking R ON L.customer_id = R.customer_id
AND L.event_name = R.event_name
AND L.event_datetime = R.event_datetime
WHERE L.event_tracking_id > R.event_tracking_id;