Cool Tips‎ > ‎Data Deduplication‎ > ‎

Data Deduplication - Quick and Easy Deduping

Applicability:

                 SQL Server 2000:        Not Supported
                 SQL Server 2005:        Tested
                 SQL Server 2008:        Tested
                 SQL Server 2008R2:    Tested
                 SQL Server 2012:        Not Tested        

Credits:

Author:    Unknown
Date:        16 Jun 2011

Description

These are two techniques that utilise a CTE to allow identification and deletion of duplicate rows in one statement.
  • Using ROW_NUMBER() and PARTITION BY functions
  • Using ROW_NUMBER() and RANK functions
 
Prior to this, depduplication of data invariably required the creation and use of a temporary table along with some artificial way or applying an order or surrogate unique identifier to the duplicate data to allow the record to be retained to be easily identified.
In the past, with larger data sets, I have often needed to restore to recreating the table and copying the data via a table with a deduping index (a unique index with IGNORE_DUP_KEY attribute set).  This was always challenging to achieve on a production system.
Alternatively, the temp table would be used to identify the records to be retained, and the deletions would then be carried out using RBAR logic  (thanks to Jeff Moden for providing a concise acronym for such a painful process)
 
The new features of 2005 allow the identification of duplicates, application of surrogate unique identifier and deletion to be carried out in one concise command
 
The basic template for this is shown below, along with an example. 
The one prerequisite is that there is a unique data key in the data to be deduped.
 
This template deals with a single table with a single column data key.
If determination of the uniqueness requires a join between tables, the original method may be required, as the CTE cannot process deletes if it is comprised of multiple base tables (similar to the restrictions on updating/deleting through views).
 

Code

Using ROW_NUMBER() and PARTITION BY:

WITH Duplicates AS
(
  
SELECT
      
*,
      
ROW_NUMBER() OVER (PARTITION BY <Data Key> ORDER BY <Data Key>) AS RowNum
  
FROM <Table with Duplicates>
)
DELETE FROM Duplicates WHERE RowNum > 1;

 
 

Example:

CREATE TABLE #prod(
Product_Code varchar(10),
Product_Name varchar(100)
)

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('123','Product_1')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('234','Product_2')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('456','Product_4')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('567','Product_5')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('678','Product_6')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('789','Product_7')

SELECT *
FROM #prod;


WITH Dups AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Product_Code ORDER BY Product_Code) AS RowNum
FROM #prod
)

DELETE FROM Dups WHERE rownum > 1;
  

--Note duplicate record 345 Product_3 has been removed.
SELECT *
FROM #prod;

DROP TABLE #prod

 
 

Using ROW_NUMBER() and RANK:

;WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)
AS
  
(
      
SELECT
            
RANK() OVER (ORDER BY <PrimaryKey>) AS Rank
          
,ROW_NUMBER() OVER (ORDER BY <PrimaryKey>) AS RowNumber
          
,<PrimaryKey>
      
FROM <TableWithDuplicates>
   )
SELECT <PrimaryKey>
FROM RankedandNumberedRows
WHERE Rank=RowNumber

 
 
Example:
 

CREATE TABLE #prod(
Product_Code varchar(10),
Product_Name varchar(100)
)

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('123','Product_1')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('234','Product_2')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('345','Product_3')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('456','Product_4')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('567','Product_5')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('678','Product_6')

INSERT INTO #prod(Product_Code, Product_Name)
VALUES ('789','Product_7')

SELECT *
FROM #prod;


;
WITH RankedandNumberedRows(Rank, RowNumber, AlphaKey)
AS
  
(
      
SELECT
            
RANK() OVER (ORDER BY Product_Code) AS Rank
          
,ROW_NUMBER() OVER (ORDER BY Product_Code) AS RowNumber
          
,Product_Code
      
FROM #prod
  
)
DELETE FROM RankedandNumberedRows WHERE Rank <> RowNumber;



--Note duplicate record 345 Product_3 has been removed.
SELECT *
FROM #prod;

DROP TABLE #prod

 
 

ċ
Deduping using a CTE 2.sql
(1k)
Andy Hughes,
Jun 23, 2012, 1:28 AM
ċ
Deduping using a CTE.sql
(1k)
Andy Hughes,
Jun 23, 2012, 1:28 AM
Comments