MySQL - Find and Remove Duplicate Rows Based on Multiple Columns

It is common to face the issue of duplicate rows in database tables. Duplicate rows could exist in the data source we are importing the data from. It's best practice to perform pre-processing validation checks during import to detect and eliminate these dupliacte rows. In addition, we should set up appropriate table constraints to check and prevent duplicate records.

However, if we already have duplicate rows in existing tables, we'll need to clean them up to ensure data quality as the data will likely be used for downstream reporting.

This post will go through the different ways we can use to find and remove duplicate data from the MySQL EventTracking table we'll be using for this demo. The same techniques can be applied to other relational database management systems such as SQL Server, PostgreSQL, or Oracle.

Problem

We have an EventTracking table that tracks different customer actions so we can calculate various conversion rates such as install/download, purchase/install, purchase/download ratios.

The worker process that writes event logs into the EventTracking has a bug that intermittently insert duplicate rows for any of the customer actions. Fortunately, we can detect duplicate rows based on customer_id, event_name, and event_datetime columns.

For examples, 2 rows with exactly the same values in customer_id, event_name, and event_datetime columns are considered as duplicates.

We have to clean up the table so we can produce accurate conversion rate reports.

Example Data

Duplicate Rows Highlighted
create table EventTracking (
  	event_tracking_id INT, -- Primary key
	customer_id INT,
	event_name VARCHAR(8),
	event_datetime DATETIME
);
-- Duplicate rows are highlighted
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (1, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (2, 1, 'download', '2021-08-02 09:10:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (3, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (4, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (5, 1, 'install', '2021-08-02 09:30:25');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (6, 1, 'purchase', '2021-08-02 11:20:53');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (7, 2, 'download', '2022-02-02 11:20:11');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (8, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (9, 2, 'install', '2022-02-02 11:22:54');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (10, 2, 'purchase', '2022-02-02 11:30:32');
insert into EventTracking (event_tracking_id, customer_id, event_name, event_datetime) values (11, 2, 'purchase', '2022-02-02 11:30:32');

Use

DB-Fiddle to execute SQL scripts on sample data.

The sample EventTracking table contains duplicate rows for download, install and purchase events for both customers.

event_tracking_id	customer_id	event_name	event_datetime
1	1	download	2021-08-02 09:10:11
2	1	download	2021-08-02 09:10:11
3	1	install	2021-08-02 09:30:25
4	1	install	2021-08-02 09:30:25
5	1	install	2021-08-02 09:30:25
6	1	purchase	2021-08-02 11:20:53
7	2	download	2022-02-02 11:20:11
8	2	install	2022-02-02 11:22:54
9	2	install	2022-02-02 11:22:54
10	2	purchase	2022-02-02 11:30:32
11	2	purchase	2022-02-02 11:30:32

We will walk through 3 different ways to find and remove duplicate rows from the table so it will turn out like this.

event_tracking_id	customer_id	event_name	event_datetime
1	1	download	2021-08-02 09:10:11
3	1	install	2021-08-02 09:30:25
6	1	purchase	2021-08-02 11:20:53
7	2	download	2022-02-02 11:20:11
8	2	install	2022-02-02 11:22:54
10	2	purchase	2022-02-02 11:30:32

Solution: Delete Duplicate Rows Using Group By

Firstable, we create a virtual table with CTE to hold a list of event_tracking_id values we want to keep. MIN() function is used to get the minimum event_tracking_id of each group (customer_id, event_name, and event_datetime).

For example, the exact same download was logged twice for customer_id 1. We will remove event_tracking_id 2 as a duplicate and only keep event_tracking_id 1.

Solution: Delete Duplicate Rows Using Group By

Next, we execute a DELETE statement to remove duplicate rows.

Delete Duplicate Rows Group_By
WITH tracking_ids_to_keep
AS (
	SELECT MIN(event_tracking_id)
	FROM EventTracking
	GROUP BY customer_id
		,event_name
		,event_datetime
	)
DELETE
FROM EventTracking
WHERE event_tracking_id NOT IN (
		SELECT *
		FROM tracking_ids_to_keep
		);

Solution: Delete Duplicate Rows Using Row_Number()

Firstable, we create a virtual table with CTE that holds results of duplicate row detection. We use Row_Number() function to generate a row number for each row along with a PARTITION BY clause on [customer_id], [event_name] and [event_datetime] columns. Row number restarts from 1 when the combined value of these 3 columns changes. Row number greater than 1 is then flagged as duplicate rows.

Solution: Delete Duplicate Rows Using Row_Number()

Next, we execute a DELETE statement to remove rows flagged as duplicate.

Delete Duplicate Rows with Row_Number()
WITH Duplicate_Rows_Detection
AS (
	SELECT *
		,CASE
			WHEN (
					row_number() OVER (
						PARTITION BY customer_id
						,event_name
						,event_datetime
						)
					) > 1
				THEN TRUE
			ELSE FALSE
			END AS is_duplicate
	FROM EventTracking
	)
DELETE et
FROM EventTracking et
INNER JOIN Duplicate_Rows_Detection dup ON et.event_tracking_id = dup.event_tracking_id
WHERE dup.is_duplicate = TRUE;

Solution: Remove Duplicate Rows Using Self Join

Another common solution is to self join on [customer_id], [event_name] and [event_datetime] columns from the same EventTracking table and then use the WHERE clause condition to find event_tracking_id values to delete.

Solution: Remove Duplicate Rows Using Self Join

This is a good solution if you are using a MySQL version earlier than 8. It does not require CTE or Row_Number() Window Function that's supported only in MySQL version 8.

Delete Duplicate Rows with Self Join
DELETE L
FROM EventTracking L
INNER JOIN EventTracking R ON L.customer_id = R.customer_id
	AND L.event_name = R.event_name
	AND L.event_datetime = R.event_datetime
WHERE L.event_tracking_id > R.event_tracking_id;

Problem​

Example Data​

Solution: Delete Duplicate Rows Using Group By​

Solution: Delete Duplicate Rows Using Row_Number()​

Solution: Remove Duplicate Rows Using Self Join​

Problem

Example Data

Solution: Delete Duplicate Rows Using Group By

Solution: Delete Duplicate Rows Using Row_Number()

Solution: Remove Duplicate Rows Using Self Join