Quick and Easy Deduping

Applicability:

SQL Server 2000: Not Supported

SQL Server 2005: Tested

SQL Server 2008: Tested

SQL Server 2008R2: Tested

SQL Server 2012: Not Tested

Credits:

Author: Unknown

Date: 16 Jun 2011

Description

These are two techniques that utilise a CTE to allow identification and deletion of duplicate rows in one statement.

- Using ROW_NUMBER() and PARTITION BY functions
- Using ROW_NUMBER() and RANK functions

Prior to this, depduplication of data invariably required the creation and use of a temporary table along with some artificial way or applying an order or surrogate unique identifier to the duplicate data to allow the record to be retained to be easily identified.

In the past, with larger data sets, I have often needed to restore to recreating the table and copying the data via a table with a deduping index (a unique index with IGNORE_DUP_KEY attribute set). This was always challenging to achieve on a production system.

Alternatively, the temp table would be used to identify the records to be retained, and the deletions would then be carried out using RBAR logic (thanks to Jeff Moden for providing a concise acronym for such a painful process)

The new features of 2005 allow the identification of duplicates, application of surrogate unique identifier and deletion to be carried out in one concise command

The basic template for this is shown below, along with an example.

The one prerequisite is that there is a unique data key in the data to be deduped.

This template deals with a single table with a single column data key.

If determination of the uniqueness requires a join between tables, the original method may be required, as the CTE cannot process deletes if it is comprised of multiple base tables (similar to the restrictions on updating/deleting through views).

Code

Using ROW_NUMBER() and PARTITION BY:

WITH Duplicates AS

(

SELECT

ROW_NUMBER() OVER (PARTITION BY <Data Key> ORDER BY <Data Key>) AS RowNum

FROM <Table with Duplicates>

)

DELETE FROM Duplicates WHERE RowNum > 1;

Example:

CREATE TABLE #prod(

Product_Code varchar(10),

Product_Name varchar(100)

)

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('123','Product_1')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('234','Product_2')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('456','Product_4')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('567','Product_5')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('678','Product_6')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('789','Product_7')

SELECT *

FROM #prod;

WITH Dups AS

(

SELECT *, ROW_NUMBER() OVER (PARTITION BY Product_Code ORDER BY Product_Code) AS RowNum

FROM #prod

)

DELETE FROM Dups WHERE rownum > 1;

--Note duplicate record 345 Product_3 has been removed.

SELECT *

FROM #prod;

DROP TABLE #prod

Using ROW_NUMBER() and RANK:

;WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)

(

SELECT

RANK() OVER (ORDER BY <PrimaryKey>) AS Rank

,ROW_NUMBER() OVER (ORDER BY <PrimaryKey>) AS RowNumber

,<PrimaryKey>

FROM <TableWithDuplicates>

)

SELECT <PrimaryKey>

FROM RankedandNumberedRows

WHERE Rank=RowNumber

Example:

CREATE TABLE #prod(

Product_Code varchar(10),

Product_Name varchar(100)

)

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('123','Product_1')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('234','Product_2')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('456','Product_4')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('567','Product_5')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('678','Product_6')

INSERT INTO #prod(Product_Code, Product_Name)

VALUES ('789','Product_7')

SELECT *

FROM #prod;

;WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)

(

SELECT

RANK() OVER (ORDER BY Product_Code) AS Rank

,ROW_NUMBER() OVER (ORDER BY Product_Code) AS RowNumber

,Product_Code

FROM #prod

)

DELETE FROM RankedandNumberedRows WHERE Rank <> RowNumber;

--Note duplicate record 345 Product_3 has been removed.

SELECT *

FROM #prod;

DROP TABLE #prod

Page updated

Google Sites

Report abuse