Quick and Easy Deduping
SQL Server 2000: Not Supported
SQL Server 2005: Tested
SQL Server 2008: Tested
SQL Server 2008R2: Tested
SQL Server 2012: Not Tested
Date: 16 Jun 2011
These are two techniques that utilise a CTE to allow identification and deletion of duplicate rows in one statement.
Using ROW_NUMBER() and PARTITION BY functions
Using ROW_NUMBER() and RANK functions
Prior to this, depduplication of data invariably required the creation and use of a temporary table along with some artificial way or applying an order or surrogate unique identifier to the duplicate data to allow the record to be retained to be easily identified.
In the past, with larger data sets, I have often needed to restore to recreating the table and copying the data via a table with a deduping index (a unique index with IGNORE_DUP_KEY attribute set). This was always challenging to achieve on a production system.
Alternatively, the temp table would be used to identify the records to be retained, and the deletions would then be carried out using RBAR logic (thanks to Jeff Moden for providing a concise acronym for such a painful process)
The new features of 2005 allow the identification of duplicates, application of surrogate unique identifier and deletion to be carried out in one concise command
The basic template for this is shown below, along with an example.
The one prerequisite is that there is a unique data key in the data to be deduped.
This template deals with a single table with a single column data key.
If determination of the uniqueness requires a join between tables, the original method may be required, as the CTE cannot process deletes if it is comprised of multiple base tables (similar to the restrictions on updating/deleting through views).
Using ROW_NUMBER() and PARTITION BY:
Using ROW_NUMBER() and RANK: