Mistakes happen. Sometimes it may not even be a mistake. We’re left with the unfortunate requirement to remove exact duplicates from a table. We never would have been in this predicament if we were using a surrogate key or a primary key, but nevertheless, we’re here.
Relational algebra suggests duplicate rows (or tuples) shouldn’t exist in the first place. However, most RDBMSs still allow for duplicates.
I’ll create a table and insert four rows.
DROP TABLE IF EXISTS dbo.DuplicateThis; CREATE TABLE dbo.DuplicateThis ( UserName varchar (16) NOT NULL, EmailAddress varchar (320) NOT NULL ); INSERT INTO dbo.DuplicateThis (UserName, EmailAddress) VALUES ('Tyler', 'firstname.lastname@example.org'), ('Tyler', 'email@example.com'), ('Mary', 'firstname.lastname@example.org'), ('Brennan', 'email@example.com');
Our results from the table look like this:
How can we fix it?
Unfortunately, we can’t simply perform a DELETE statement like this:
DELETE FROM dbo.DuplicateThis WHERE UserName = 'Tyler' AND EmailAddress = 'firstname.lastname@example.org';
Executing that statement would remove both rows. We still want to keep one! Plus, if there were multiple rows that were duplicated, that would be a nightmare.
I’ve always personally preferred the CTE with
ROW_NUMBER() method. That method would look something like this:
WITH CTE (UserName, EmailAddress, RowNum) AS ( SELECT UserName, EmailAddress, ROW_NUMBER() OVER (PARTITION BY UserName, EmailAddress ORDER BY (SELECT 1 / 0)) FROM dbo.DuplicateThis ) DELETE FROM CTE WHERE RowNum > 1;
How does that work?
Partitioning by the eligible columns in the table allows for the
ROW_NUMBER() to be generated properly. The
ORDER BY in the
ROW_NUMBER() function is irrelevant here as we don’t really care about order. Outside of the CTE, we’re checking to remove only the duplicate rows. Performing the
DELETE on the CTE removes the rows from our base table.
I’ve often had to perform this in scenarios where staging data went wrong. It’s important to fix these issues quickly if the duplicates truly don’t need to be there. Awhile back, I even created a GitHub repository for a stored procedure to attempt this dynamically.
Hopefully, this helps if you find yourself in this scenario! Thanks for reading!