Checking for duplicate data in SQL Server

Question

Please don't ask me why, but the table contains a lot of duplicate rows in which every field is duplicated.

For example

alex, 1
alex, 1
liza, 32
hary, 34

I need to eliminate one of the alex, 1 rows from this table.

I know this algorithm will be very inefficient, but that does not matter. I just need to remove the duplicate data.

What is the best way to do this? Please keep in mind I do not have 2 fields, I actually have about 10 fields to check on.

Countless questions on SO already ask and answer this, but which of the duplicates do you wish to keep? What other columns are available to indicate which duplicate should be kept...
– OMG Ponies, Commented Jun 2, 2011 at 4:38
@naveen No, there is no non-duplicated key; that is the problem.
– Alex Gordon, Commented Jun 2, 2011 at 4:40
@omg It does not matter which one is eliminated, since they are the same.
– Alex Gordon, Commented Jun 2, 2011 at 4:40
Once you've mopped up, remember to add a unique constraint against these columns in this table, so you don't have to do this again.
– Damien_The_Unbeliever, Commented Jun 2, 2011 at 7:10
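
The unique constraint suggested in the last comment could look roughly like the following once the duplicates are gone. This is only a sketch: the table name `Source`, the columns `name` and `number`, and the constraint name `UQ_Source_name_number` are placeholders, since the question does not give the real schema.

```sql
-- Hypothetical names: Source, name, number, UQ_Source_name_number.
-- List every column that, taken together, must be unique.
-- The statement fails if duplicate rows are still present.
ALTER TABLE Source
    ADD CONSTRAINT UQ_Source_name_number UNIQUE (name, number);
```

SQL Server caps the combined size of index key columns, so a constraint over about 10 wide columns may not fit and might need a narrower key.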

m.s. · Accepted Answer · 2015-10-25 15:48:11Z

6

As you said, this will be very inefficient, but you can try something like this:

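-- Sample data matching the question
DECLARE @TestTable TABLE(
        Name VARCHAR(20),
        SomeVal INT
)
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'liza', 32
INSERT INTO @TestTable SELECT 'hary', 34

-- Before: four rows, including the duplicate ('alex', 1)
SELECT  *
FROM    @TestTable

-- Number the rows within each group of identical values.
-- ORDER BY (SELECT NULL) means no particular row is preferred as row 1.
;WITH DuplicateVals AS (
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) RowID
    FROM    @TestTable
)
-- Deleting through the CTE removes the underlying rows; row 1 of each group is kept
DELETE FROM DuplicateVals WHERE RowID > 1

-- After: three rows remain
SELECT *
FROM    @TestTable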

edited Oct 25, 2015 at 15:48

m.s.

answered Jun 2, 2011 at 4:41

Adriaan Stander

2 Comments

Alex Gordon Over a year ago

this is a very dangerous query. if you run this twice. it will delete ALL data

Adriaan Stander Over a year ago

How do you figure that? We are checking for RowID > 1, which should limit this, no?
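
One way to check the reasoning in this exchange is to run the delete twice and look at how many rows each pass removes. A minimal sketch against the same sample data (the `DeletedFirstPass` and `DeletedSecondPass` column aliases are only illustrative):

```sql
DECLARE @TestTable TABLE (Name VARCHAR(20), SomeVal INT);
INSERT INTO @TestTable VALUES ('alex', 1), ('alex', 1), ('liza', 32), ('hary', 34);

-- First pass: removes the extra ('alex', 1) row.
WITH DuplicateVals AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) AS RowID
    FROM @TestTable
)
DELETE FROM DuplicateVals WHERE RowID > 1;
SELECT @@ROWCOUNT AS DeletedFirstPass;   -- 1

-- Second pass: every group now holds exactly one row, so RowID is always 1
-- and no row satisfies RowID > 1.
WITH DuplicateVals AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) AS RowID
    FROM @TestTable
)
DELETE FROM DuplicateVals WHERE RowID > 1;
SELECT @@ROWCOUNT AS DeletedSecondPass;  -- 0
```

Because `ROW_NUMBER()` always assigns 1 to one row in each partition, that row is never deleted, however often the statement is rerun.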

Brian Webster · Accepted Answer · 2011-06-02 04:47:23Z

3

I understand this does not answer the specific question (eliminating dupes in SAME table), but I'm offering the solution because it is very fast and might work best for the author.

A speedy solution, if you don't mind creating a new table: create a table named NewTable with the same schema as the original.

Then execute this SQL:

 Insert into NewTable
 Select 
   name, 
   num 
 from
   OldTable
 group by
   name,
   num

Just include every field name in both the select and group by clauses.
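
When every column takes part in the comparison, `SELECT DISTINCT` returns the same rows as grouping by every column, and it avoids typing the column list twice. A sketch using the `NewTable` and `OldTable` names from above, assuming both tables have the same columns in the same order:

```sql
-- Assumes NewTable already exists with the same columns, in the same order, as OldTable.
INSERT INTO NewTable
SELECT DISTINCT *
FROM OldTable;
```

This is a different spelling of the same idea, not a faster algorithm, so the explicit `GROUP BY` form remains a reasonable choice when only some columns should be compared.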

edited Jun 2, 2011 at 4:47

answered Jun 2, 2011 at 4:41

Brian Webster

tofutim · Accepted Answer · 2011-06-02 05:24:10Z

Method A. You can get a deduped version of your data using

SELECT field1, field2, ...
INTO Deduped
FROM Source
GROUP BY field1, field2, ...

For example, for your sample data,

SELECT name, number
FROM Source
GROUP BY name, number

yields

alex    1
hary    34
liza    32

Then simply drop the old table and rename the new one. There are a number of fancier in-place solutions, but this is the clearest way to do it.
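
The swap in that last step might look like the following. It is only a sketch: `Source` and `Deduped` are the placeholder names used above, and `SELECT ... INTO` does not copy indexes, constraints, or triggers, so those would have to be recreated on the new table.

```sql
-- Remove the original table that still contains the duplicates.
DROP TABLE Source;

-- Give the deduplicated copy the original name.
EXEC sp_rename 'Deduped', 'Source';
```

It is worth comparing row counts between `Source` and `Deduped` before dropping anything, because the drop cannot be undone without a backup.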

Method B. An in-place method is to add an identity column that gives every row a unique key, and then delete duplicates by that key. For example, you can run

ALTER TABLE Source ADD sid INT IDENTITY(1,1);

which makes Source look like this

alex    1   1
alex    1   2
liza    32  3
hary    34  4

then you can use

DELETE FROM Source
WHERE  sid NOT IN
  (SELECT MIN(sid)
   FROM  Source
   GROUP BY name, number)

which will give the desired result. Of course, "NOT IN" is not exactly the most efficient, but it will do the job. Alternatively, you can LEFT JOIN the grouped table (maybe stored in a TEMP table), and do the DELETE that way.
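
The LEFT JOIN alternative mentioned above could be sketched as follows. The temp table name `#Keepers` and the column alias `keep_sid` are made up for illustration; `Source`, `sid`, `name`, and `number` come from the example.

```sql
-- Keep the lowest sid from each group of identical rows.
SELECT MIN(sid) AS keep_sid
INTO   #Keepers
FROM   Source
GROUP BY name, number;

-- Delete every row whose sid has no match in the keeper list.
DELETE s
FROM   Source AS s
LEFT JOIN #Keepers AS k ON k.keep_sid = s.sid
WHERE  k.keep_sid IS NULL;

DROP TABLE #Keepers;
```

Materializing the keepers first means the grouping is computed once, rather than being left to the optimizer inside a `NOT IN` subquery.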

Alex Aza · Accepted Answer · 2011-06-02 15:14:42Z

2

create table DuplicateTable(name varchar(10), number int)

insert DuplicateTable
values
    ('alex', 1),
    ('alex', 1),
    ('liza', 32),
    ('hary', 34);

with cte
as
(
    select *, row_number() over(partition by name, number order by name) RowNumber
    from DuplicateTable
)
delete cte
where RowNumber > 1

edited Jun 2, 2011 at 15:14

answered Jun 2, 2011 at 4:41

Alex Aza

2 Comments

Mikael Eriksson Over a year ago

and you also need something after order by in the over clause.

Alex Aza Over a year ago

@marc_s and @Mikael Eriksson - thank you for pointing out the typos. I shouldn't have edited the answer in the first place. It was correct before I added the 'setup' lines.

Alex Gordon · Accepted Answer · 2011-06-03 15:17:50Z

2

A slightly different solution, which requires a primary key (or a unique index). Suppose you have a table your_table(id, name, num), where id is the primary key:

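-- Delete through the alias so it is unambiguous which table reference is the target
DELETE t2
FROM your_table AS t2
WHERE
-- the row belongs to a group with more than one member ...
(SELECT COUNT(*) FROM your_table y
  WHERE t2.name = y.name AND t2.num = y.num) > 1
-- ... and it is not the row with the lowest id in that group
-- (MIN instead of TOP 1 without ORDER BY, so the kept row is deterministic)
AND t2.id !=
(SELECT MIN(z.id) FROM your_table z
 WHERE t2.name = z.name AND t2.num = z.num);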

I assumed that name and num are NOT NULL. If they can contain NULL values, you need to change the WHERE clauses in the subqueries.
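
If `name` or `num` can be NULL, a plain `=` never matches NULL to NULL, so fully duplicated rows that contain NULLs would be left alone. One possible adjustment (a sketch, not the answer author's code) spells out the NULL case for each column, still assuming the `your_table(id, name, num)` layout:

```sql
-- Delete every row except the one with the lowest id in its group,
-- treating NULL as equal to NULL when comparing name and num.
DELETE t2
FROM your_table AS t2
WHERE t2.id <> (
    SELECT MIN(z.id)
    FROM your_table AS z
    WHERE (t2.name = z.name OR (t2.name IS NULL AND z.name IS NULL))
      AND (t2.num  = z.num  OR (t2.num  IS NULL AND z.num  IS NULL))
);
```

Every row matches at least itself in the subquery, so `MIN(z.id)` is never NULL and rows without duplicates are untouched.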

edited Jun 3, 2011 at 15:17

Alex Gordon

answered Jun 2, 2011 at 5:10

a1ex07

3 Comments

Alex Aza Over a year ago

I don't know for sure why you got the downvote, but I assume it is because you are assuming that the table has a unique id, while the table structure in the question shows that there is no id. Also, your assumption about name and num does not sound legit either. Basically, from what I can tell, you made too many unnecessary assumptions, while there are already answers that don't have those limitations.

a1ex07 Over a year ago

@Alex Aza: I read all previous solutions and posted this one just to show a different approach that works. I also mentioned the limitations, and I believe that in the real world [almost] all tables have at least one unique index. Concerning nullability, it's not a real limitation; I just didn't wrap the comparisons in ISNULL. Anyway, I wouldn't complain if I got a downvote with the explanation you gave, or any other...

Alex Aza Over a year ago

I absolutely agree with you about downvotes without any comment. This is not nice, imho. Also, I understand that you could miss it but OP specifies in the comment that there is no unique key in this table. And yes, this is something that should have been rather specified in the question in the first place.
