Quick and Easy Deduping
Applicability:
SQL Server 2000: Not Supported
SQL Server 2005: Tested
SQL Server 2008: Tested
SQL Server 2008R2: Tested
SQL Server 2012: Not Tested
Credits:
Author: Unknown
Date: 16 Jun 2011
Description
These are two techniques that utilise a CTE to allow identification and deletion of duplicate rows in one statement.
Using ROW_NUMBER() and PARTITION BY functions
Using ROW_NUMBER() and RANK functions
Prior to this, depduplication of data invariably required the creation and use of a temporary table along with some artificial way or applying an order or surrogate unique identifier to the duplicate data to allow the record to be retained to be easily identified.
In the past, with larger data sets, I have often needed to restore to recreating the table and copying the data via a table with a deduping index (a unique index with IGNORE_DUP_KEY attribute set). This was always challenging to achieve on a production system.
Alternatively, the temp table would be used to identify the records to be retained, and the deletions would then be carried out using RBAR logic (thanks to Jeff Moden for providing a concise acronym for such a painful process)
The new features of 2005 allow the identification of duplicates, application of surrogate unique identifier and deletion to be carried out in one concise command
The basic template for this is shown below, along with an example.
The one prerequisite is that there is a unique data key in the data to be deduped.
This template deals with a single table with a single column data key.
If determination of the uniqueness requires a join between tables, the original method may be required, as the CTE cannot process deletes if it is comprised of multiple base tables (similar to the restrictions on updating/deleting through views).
Code
Using ROW_NUMBER() and PARTITION BY:
WITH Duplicates AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY <Data Key> ORDER BY <Data Key>) AS RowNum
FROM <Table with Duplicates>
)
DELETE FROM Duplicates WHERE RowNum > 1;
Example:
CREATE TABLE #prod(
Product_Code varchar(10),
Product_Name varchar(100)
)
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('123','Product_1')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('234','Product_2')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('456','Product_4')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('567','Product_5')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('678','Product_6')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('789','Product_7')
SELECT *
FROM #prod;
WITH Dups AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Product_Code ORDER BY Product_Code) AS RowNum
FROM #prod
)
DELETE FROM Dups WHERE rownum > 1;
--Note duplicate record 345 Product_3 has been removed.
SELECT *
FROM #prod;
DROP TABLE #prod
Using ROW_NUMBER() and RANK:
;WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)
AS
(
SELECT
RANK() OVER (ORDER BY <PrimaryKey>) AS Rank
,ROW_NUMBER() OVER (ORDER BY <PrimaryKey>) AS RowNumber
,<PrimaryKey>
FROM <TableWithDuplicates>
)
SELECT <PrimaryKey>
FROM RankedandNumberedRows
WHERE Rank=RowNumber
Example:
CREATE TABLE #prod(
Product_Code varchar(10),
Product_Name varchar(100)
)
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('123','Product_1')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('234','Product_2')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('456','Product_4')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('567','Product_5')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('678','Product_6')
INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('789','Product_7')
SELECT *
FROM #prod;
;WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)
AS
(
SELECT
RANK() OVER (ORDER BY Product_Code) AS Rank
,ROW_NUMBER() OVER (ORDER BY Product_Code) AS RowNumber
,Product_Code
FROM #prod
)
DELETE FROM RankedandNumberedRows WHERE Rank <> RowNumber;
--Note duplicate record 345 Product_3 has been removed.
SELECT *
FROM #prod;
DROP TABLE #prod