Você está na página 1de 16

databasestar.

com

How to Remove Duplicate Records in SQL

Do you need to use SQL to remove duplicates in your tables? Learn how to write an SQL query
to remove duplicate data, and see the performance of each way you can do it, in this article.

The Problem – Removing Duplicates in SQL


Let’s say you have a table with some data in it. You’ve found out that there is some duplicate
data in this table. And you want to get rid of the duplicates.

The way you define duplicate data could be dependent on your data.

 Is it a duplicate if all of the columns are the same?


 Is it a duplicate if all columns except for the primary key are the same?
 Is it a duplicate if only a few columns are the same?

In any case, identifying and removing duplicates is possible in Oracle SQL.

There are several ways to do it. I’ll explain each of these methods and show you how they
perform on a random sample of data.

The sample of data has 1,000,000 records in a single table, which looks like this:

CUSTOMER FIRST_ LAST_ MAX_ CREATED_ UPDATED_ UPDATED_


_ID NAME NAME CREDIT DATE DATE COUNT
1 Carolyn Hudson 6567 13/Jun/17 (null) 0
2 Judy Ward 2535 15/Dec/16 (null) 0
3 James Medina 2738 26/May/17 (null) 0
4 Peter Greene 270 03/Dec/16 (null) 0
5 Sean Ortiz 2452 17/May/17 (null) 0

Let’s say that a record is a duplicate if it contains the same first_name, last_name, and
created_date values.

Let’s take a look at the different ways to remove duplicates in SQL.

Get Your SQL Cheat Sheet


Three pages of SQL commands to save/print for easy reference
A Note on Query Times

In each of these examples I explain the code I am using, what it does, and show how long it takes
to run the DELETE statement.

However, the query times should only be taken as a guide, and may be different to the
performance you get:

 Your table structure, unique data business rules, and number of records will be different
 Your database version and specs will be different (I’m running Oracle 11g Express
Edition on my laptop)
 Several other factors may impact performance such as query cache or background
processes on my computer.

So, I guess my point is that you might not get the same performance as I do. But, if you try one
method and it has poor performance for your data, try one of the other methods.

Method 1 – ROW_NUMBER Analytic Function


The first method I’ll show you is using an analytic function called ROW_NUMBER. It’s been
recommended in several places such as StackOverflow questions and an AskTOM thread.

It involves several steps:

1. Using a subquery to find each ROWID (which is a unique number given to each row in an
Oracle table) and the ROW_NUMBER function to find a sequential number for that row,
grouped by the fields you specify as unique.
2. Find the ROWID values that are identified as duplicates.
3. Delete rows that match these ROWIDs.

The query looks like this:

DELETE FROM table a


WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY unique_columns ORDER BY ROWID) dup
FROM table)
WHERE dup > 1);

The ROW_NUMBER function here is used as an analytic function. It uses the PARTITION BY
to create partitions or groups based on the fields I’ve mentioned in the PARTITION BY (which
are first_name, last_name, and created_date). I’ve then ordered them by the ROWID.
This means that the first occurrence of those fields will get the number of 1. The second
occurrence of all of those fields will get a number of 2, and so on.

The ROWIDs are then returned to the DELETE statement at the top, which only deletes records
where the ROW_NUMBER function (which has an alias of “dup” in this example) are greater
than one. (The AskTOM thread uses “WHERE dup <> 1” but it achieves the same thing).

It’s a good idea to check the records you’re deleting first, by running a SELECT statement using
this criteria. For the purposes of these examples, I’ll check the COUNT of the records about to be
deleted, by replacing the DELETE with a SELECT COUNT(*).

SELECT COUNT(*)
FROM all_customer_dupes a
WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes)
WHERE dup > 1);

Result:

40,267

It shows there are 40,267 duplicate records.

If I run this command as a DELETE statement:

DELETE FROM all_customer_dupes a


WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes)
WHERE dup > 1);

40,267 records deleted.

This statement runs in 12.73 seconds.

The explain plan looks like this:

------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 0 | DELETE STATEMENT | | 1 | 150
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 7 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------

This is the method I would use if I needed to delete duplicate records from a table. It uses a
ROWID filter which is usually a fast way to access a table.

Method 2: Use a Subquery with ANY


The next method we’ll look at is using a subquery to identify and delete duplicate data.

I’ll show you the query first, then explain how it works.

DELETE FROM tablename a


WHERE a.rowid > ANY (
SELECT b.rowid
FROM tablename b
WHERE a.column1 = b.column1
);

Now, there are a few things to note about this query.

Both the subquery and the outer query should be selecting from the same table, but use different
table aliases. In this example I have used a and b.

The WHERE clause in the subquery is used to define how you identify a duplicate. You write
your WHERE clause that joins on every column that you consider when you identify duplicates.
This could be one field, two fields, or every field in the table.

The rowid pseudocolumn is used to identify rows in the subquery because this is guaranteed to be
unique.
The WHERE clause of the outer query uses a > ANY condition to check for duplicates. It will
delete any row that has a rowid greater than at least one other row. This ensures that all but one of
the rows that match your conditions is met, therefore removing all duplicates.

So, how does this method perform?

Using this query on our sample data looks like this:

DELETE
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);

If we run this as a SELECT COUNT(*), we can see the number of records impacted:

SELECT COUNT(*)
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);

Result:

This query has found 40,267 duplicate values. It’s the same as Example 1.

Now, let’s run it as a DELETE query.

DELETE
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);

40,267 rows deleted.

This query ran in 16.76 seconds.

The explain plan for the query looks like this:


------------------------------------------------------------------------------
------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
------------
| 0 | DELETE STATEMENT | | 1 | 263
| 1453 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS SEMI | | 1 | 263
| 1453 (1)| 00:00:18 |
| 3 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 138
| 1451 (1)| 00:00:18 |
|* 4 | INDEX RANGE SCAN | IDX_CSTDUP_FNLNCD | 1 | 125
| 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
------------

Method 3 – DENSE_RANK
This method is similar to method 1 but uses the DENSE_RANK method.

In fact, it’s almost exactly the same.

The syntax looks like this:

DELETE FROM table a


WHERE a.ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(PARTITION BY unique_columns ORDER BY ROWID) dup
FROM table
)
WHERE dup > 1
);

It uses the same concept as ROW_NUMBER, but uses the DENSE_RANK function. There are
some differences between using these functions, but in this case, they give the same output.

Let’s find the COUNT(*) of records first.

SELECT COUNT(*) FROM all_customer_dupes


WHERE ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(
PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes
)
WHERE dup > 1
);
Result:

40,267 rows.

This shows the same number of records as method 1 and 2.

Now, let’s try the DELETE statement.

DELETE FROM all_customer_dupes


WHERE ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(
PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes
)
WHERE dup > 1
);

Result:

40,267 rows deleted.

This deletes all of the duplicate records we found. It runs in 13.809 seconds.

In this method, you can also use a RANK function instead of DENSE_RANK. It should show the
same results.

The explain plan looks like this:

------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 0 | DELETE STATEMENT | | 1 | 150
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 7 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------

Method 4 – MIN or MAX Function


This method uses either the MIN or MAX function to find duplicates inside a subquery.

It’s similar to earlier examples, but it uses less subqueries.

This example here uses the MIN function, but you can use the MAX function to get the same
result.

The syntax looks like this:

DELETE FROM your_table


WHERE rowid not in (
SELECT MIN(rowid)
FROM your_table
GROUP BY column1, column2, column3
);

The columns in the GROUP BY statement are those that are checked for duplicates. In the
example I’ve been using, this is the first_name, last_name, and created_date.

If I run it as a SELECT COUNT(*) first, I can see the number of rows impacted.

SELECT COUNT(*)
FROM all_customer_dupes
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
);

Result:

40,267 rows.

Now, I can run the DELETE statement.

DELETE FROM all_customer_dupes


WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
);

Result:
40,267 rows deleted.

This query ran in 22.985 seconds. It’s a little slower than the other versions but it could just be
the data that I am using.

The explain plan looks like this:

------------------------------------------------------------------------------
--------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 0 | DELETE STATEMENT | | 1 | 150
| 2903 (1)| 00:00:35 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
|* 2 | HASH JOIN ANTI | | 1 | 150
| 2903 (1)| 00:00:35 |
| 3 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 138
| 1451 (1)| 00:00:18 |
| 4 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 5 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 6 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
------------------------------------------------------------------------------
--------------

Method 5 – Correlated Subquery with MIN or MAX


The next method I’ll share is similar to method 4, but uses a correlated subquery to match on
columns.

DELETE
FROM table a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM table b
WHERE a.col1 = b.col1
AND a.col2 = b.col2
AND a.col3 = b.col3
);

It’s similar to the earlier query, but instead of using a GROUP BY clause, we use a WHERE
clause. This WHERE clause joins the table inside the subquery to the table outside the subquery.
The tables are joined on the columns that match.

Let’s see how many records are found.

SELECT COUNT(*)
FROM all_customer_dupes a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);

Result:

40,267 rows.

Now, let’s run this as a DELETE query:

DELETE
FROM all_customer_dupes a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);

Result:

40,267 rows deleted.

This query took 18.141 seconds to run.

The explain plan looks like this:

------------------------------------------------------------------------------
--------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 0 | DELETE STATEMENT | | 1 | 263
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 263
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_SQ_1 | 1 | 125
| 1452 (1)| 00:00:18 |
| 4 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 5 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
|* 6 | INDEX RANGE SCAN | IDX_CSTDUP_FNLNCD | 1 | 138
| 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
--------------

Other Methods You Might Come Across


I’ve only found five different ways for deleting duplicate data in SQL. I’ve listed a few more
below, which don’t actually work as well as the others, if at all.

I’ve added these so you can see if specific methods work, for example if someone at work
mentions a method they have tried. Or, if you’ve found a method that works on SQL Server and
want to try it on an Oracle database.

So, here are the other methods I found to delete duplicate data.

Just to be clear, these don’t work, and they are just here for reference.

Method 6 – DELETE WHERE IN

This method uses a WHERE clause with specified columns which are checked against a
subquery.

It seems very similar to method 4, with the addition of the three columns instead of using
ROWID.

The method looks like this:

DELETE FROM table WHERE


(col1, col2, col3) IN (
SELECT MAX(col1), col2, col3
FROM table
GROUP BY col2, col3
);

I ran this query for a few minutes but then stopped it without getting a result. I imagine it runs
slowly because of the WHERE clause on the three fields, rather than the ROWID mentioned
earlier.

This query may work for you if one of the fields you’re looking at is an ID or some kind of
number (to get the MAX value of), and if you have indexes on these fields.

But, for me, this query did not work.

Method 7 – DELETE with LEFT OUTER JOIN

This is a method that may work in SQL Server, but does not work in Oracle.
It uses a LEFT OUTER JOIN from the table that has the records to delete, to a subquery that
contains the duplicate records.

DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;

Let’s check the COUNT first:

SELECT COUNT(*)
FROM all_customer_dupes
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON all_customer_dupes.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;

Now, let’s delete the records by running the DELETE command above.

DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL

Result:

SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.

This query did not work for me and showed this error. I suspect this method works on SQL
Server but not on Oracle.
Method 8 – DELETE with WITH Clause

This method is also based on a concept that works in SQL Server called CTE or Common Table
Expressions. They work very much the same as Subquery Refactoring, or the WITH clause, in
Oracle.

The query looks like this:

WITH cte AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY (customer_id)) AS RN
FROM all_customer_dupes)
DELETE FROM cte
SELECT COUNT(*) FROM cte
WHERE rn > 1;

I’ve used the same logic as earlier queries, and when I run this query I get this:

ORA-00928: missing SELECT keyword


00928. 00000 - "missing SELECT keyword"
*Cause:
*Action:

This is because, in Oracle, the WITH clause needs to have a SELECT query that follows it, and
not a DELETE query. It doesn’t work with DELETE statements.

Method 9 – Two Tables Joined

This method uses two tables joined to each other on the matching columns, and a ROWID check
to only find those that are unique. I’ve specified the table to delete after the DELETE keyword so
Oracle will know which table I want to delete from.

DELETE (
SELECT t1.rowid
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid
);

Let’s try getting a count first:

SELECT COUNT(t1.customer_id)
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid;

Records:

40,267 rows.

It shows the right amount of rows. Now, let’s try to delete them.

DELETE (
SELECT t1.rowid
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid
);

Result:

SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.

It shows a similar error as the earlier method.

This happens because Oracle does not support two tables in a DELETE statement. Even though
we have specified which one we want to delete from, it’s not allowed in Oracle.

Method 10 – Inline View and Analytic Function

This method is similar to earlier methods that work, but uses an inline view instead of an actual
table.

DELETE
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;

Instead of saying DELETE FROM all_customer_dupes, I am deleting from the subquery itself.
Let’s try to COUNT the number of records.

SELECT COUNT(*)
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;

Result:

40,267 rows.

The count is correct. Now, let’s try a delete.

DELETE
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;

Result:

Error report –

SQL Error: ORA-01732: data manipulation operation not legal on this view
01732. 00000 - "data manipulation operation not legal on this view"
*Cause:
*Action:

This error occurs because Oracle does not allow deletions on inline views like this. You must
delete from an actual table.

Method 11 – Use RANK and Primary Key

This method is another one I found online that does not work:

DELETE FROM table


WHERE ROWID IN (
SELECT "rowid"
FROM (
SELECT "rowid", rank_n
FROM (
SELECT RANK() OVER (
PARTITION BY unique_columns ORDER BY rowid) rank_n,
rowid as "rowid"
FROM table
WHERE primary_key IN (
SELECT primary_key
FROM table
GROUP BY all_columns
HAVING COUNT(*) > 1
)
)
)
WHERE rank_n > 1
);

This query looks like it uses a RANK function as an analytic function with a primary key like
earlier examples, but it adds on a subquery in the WHERE clause. This subquery causes an error,
because the GROUP BY cannot be used with columns that are not in the SELECT clause.

So, this function does not even run as it has syntax errors.

Conclusion
In conclusion, there are a few ways to delete duplicate records from your table with SQL:

Method Name Run Time (s) Cost


1 ROW_NUMBER Analytic Function 12.730 1454
2 Subquery with Any 27.397 1453
3 DENSE_RANK 13.809 1454
4 MIN or MAX Function 22.985 2903
5 Correlated Subquery with MIN or MAX 18.141 1454

Method 1 was the best performing method for me, followed by method 3.

Você também pode gostar