Removing Duplicate Rows

Mistakes happen. Sometimes it may not even be a mistake. We’re left with the unfortunate requirement to remove exact duplicates from a table. We never would have been in this predicament if we were using a surrogate key or a primary key, but nevertheless, we’re here.

Relational algebra suggests duplicate rows (or tuples) shouldn’t exist in the first place. However, most RDBMSs still allow for duplicates.

Demo

I’ll create a table and insert four rows.

DROP TABLE IF EXISTS dbo.DuplicateThis;

CREATE TABLE dbo.DuplicateThis
(
    UserName varchar (16) NOT NULL,
    EmailAddress varchar (320) NOT NULL
);

INSERT INTO dbo.DuplicateThis (UserName, EmailAddress)
VALUES ('Tyler', 'tyler@example.com'),
       ('Tyler', 'tyler@example.com'),
       ('Mary', 'mary@example.com'),
       ('Brennan', 'brennan@example.com');

Our results from the table look like this:

UserNameEmailAddress
Tylertyler@example.com
Tylertyler@example.com
Marymary@example.com
Brennanbrennan@example.com
Yay, I’ve managed to duplicate myself.

How can we fix it?

Unfortunately, we can’t simply perform a DELETE statement like this:

DELETE FROM dbo.DuplicateThis
WHERE UserName = 'Tyler' 
    AND EmailAddress = 'tyler@example.com';

Executing that statement would remove both rows. We still want to keep one! Plus, if there were multiple rows that were duplicated, that would be a nightmare.

I’ve always personally preferred the CTE with ROW_NUMBER() method. That method would look something like this:

WITH CTE (UserName, EmailAddress, RowNum)
AS
(
    SELECT UserName,
           EmailAddress,
           ROW_NUMBER() OVER (PARTITION BY UserName, EmailAddress ORDER BY (SELECT 1 / 0))
    FROM dbo.DuplicateThis
)
DELETE FROM CTE
WHERE RowNum > 1;

How does that work?

Partitioning by the eligible columns in the table allows for the ROW_NUMBER() to be generated properly. The ORDER BY in the ROW_NUMBER() function is irrelevant here as we don’t really care about order. Outside of the CTE, we’re checking to remove only the duplicate rows. Performing the DELETE on the CTE removes the rows from our base table.

I’ve often had to perform this in scenarios where staging data went wrong. It’s important to fix these issues quickly if the duplicates truly don’t need to be there. Awhile back, I even created a GitHub repository for a stored procedure to attempt this dynamically.

Hopefully, this helps if you find yourself in this scenario! Thanks for reading!

Leave a Reply

Your email address will not be published. Required fields are marked *