Please don't ask me why, but my table contains a lot of duplicate rows in which every field is duplicated.

For example

alex, 1
alex, 1
liza, 32
hary, 34

I need to remove one of the two alex, 1 rows from this table.

I know this will be very inefficient, but that does not matter. I just need to remove the duplicate data.

What is the best way to do this? Please keep in mind that I do not have 2 fields; I actually have about 10 fields to check.

  • Countless questions on SO already ask & answer this, but which of the duplicates do you wish to keep? What other columns are available to indicate which duplicate should be kept? Commented Jun 2, 2011 at 4:38
  • is there a non duplicated primary key field at all? Commented Jun 2, 2011 at 4:39
  • @naveen No, there is no non-duplicated key; that is the problem. Commented Jun 2, 2011 at 4:40
  • @omg It does not matter which one is eliminated, since they are identical. Commented Jun 2, 2011 at 4:40
  • Once you've mopped up, remember to add a unique constraint against these columns in this table, so you don't have to do this again. Commented Jun 2, 2011 at 7:10
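
The advice in the last comment can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, not SQL Server, and the table name people, the column names, and the index name ux_people_name_num are made up for illustration. On SQL Server the counterpart would be a unique constraint or unique index over the same columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, assumed to be deduplicated already.
conn.execute("CREATE TABLE people (name TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("alex", 1), ("liza", 32), ("hary", 34)],
)

# Enforce uniqueness across the columns that define a duplicate.
conn.execute("CREATE UNIQUE INDEX ux_people_name_num ON people (name, num)")

# A second ('alex', 1) row is now rejected instead of silently stored.
try:
    conn.execute("INSERT INTO people VALUES ('alex', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(duplicate_rejected, row_count)
```

Creating the index before the duplicates are removed would fail, which is why the comment says to do it once the cleanup is finished.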

5 Answers

As you said, this will be inefficient, but you can try something like this:

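-- Sample data containing one fully duplicated row.
DECLARE @TestTable TABLE(
        Name VARCHAR(20),
        SomeVal INT
)
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'liza', 32
INSERT INTO @TestTable SELECT 'hary', 34

-- Before: four rows, including the duplicate.
SELECT  *
FROM    @TestTable

-- Number the rows within each group of identical values,
-- then delete every row after the first one in its group.
;WITH DuplicateVals AS (
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) RowID
    FROM    @TestTable
)
DELETE FROM DuplicateVals WHERE RowID > 1

-- After: one row per distinct (Name, SomeVal) pair.
SELECT *
FROM    @TestTable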

2 Comments

This is a very dangerous query. If you run it twice, it will delete ALL data.
How do you figure that? We are checking for RowID > 1, which should limit this, no?
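
The disagreement in these two comments can be checked empirically. The sketch below is an approximation in SQLite (through Python's sqlite3 module, assuming SQLite 3.25 or later for window functions), not the original T-SQL: SQLite cannot delete through a CTE, so the row-number filter is applied via the implicit rowid instead. The table name test_table is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (name TEXT, some_val INTEGER)")
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?)",
    [("alex", 1), ("alex", 1), ("liza", 32), ("hary", 34)],
)

# Same idea as the answer: number rows inside each group of identical
# values and delete everything numbered above 1.
DEDUPE = """
DELETE FROM test_table
WHERE rowid IN (
    SELECT rid
    FROM (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY name, some_val ORDER BY rowid
               ) AS rn
        FROM test_table
    )
    WHERE rn > 1
)
"""

# Run the delete twice and record how many rows each run removed.
deleted_per_run = [conn.execute(DEDUPE).rowcount for _ in range(2)]
remaining = sorted(conn.execute("SELECT name, some_val FROM test_table").fetchall())
print(deleted_per_run, remaining)
```

In this sketch the first run removes the single extra row and the second run removes nothing, because after the first run every group has exactly one row and so no row is numbered above 1.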

I understand this does not answer the specific question (eliminating dupes in SAME table), but I'm offering the solution because it is very fast and might work best for the author.

A speedy solution: if you don't mind creating a new table, create one named NewTable with the same schema.

Then execute this SQL:

 Insert into NewTable
 Select 
   name, 
   num 
 from
   OldTable
 group by
   name,
   num

Just include every field name in both the select and group by clauses.
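
With about ten columns, listing every field twice gets tedious. As a hedged aside, SELECT DISTINCT * should return the same rows as grouping by every column. The sketch below checks that on the sample data using SQLite through Python's sqlite3 module, as a stand-in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OldTable (name TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO OldTable VALUES (?, ?)",
    [("alex", 1), ("alex", 1), ("liza", 32), ("hary", 34)],
)

# Grouping by every column keeps one row per distinct combination...
grouped = sorted(
    conn.execute("SELECT name, num FROM OldTable GROUP BY name, num").fetchall()
)
# ...which is what DISTINCT over all columns does as well.
distinct = sorted(conn.execute("SELECT DISTINCT * FROM OldTable").fetchall())

print(grouped == distinct, distinct)
```

Either form can feed the INSERT INTO NewTable statement; the GROUP BY version has the advantage of making the compared columns explicit.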


Method A. You can get a deduped version of your data using

SELECT field1, field2, ...
INTO Deduped
FROM Source
GROUP BY field1, field2, ...

For example, for your sample data,

SELECT name, number
FROM Source
GROUP BY name, number

yields

alex    1
hary    34
liza    32

Then simply drop the old table and rename the new one. Of course, there are a number of fancy in-place solutions, but this is the clearest way to do it.
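
The whole of Method A, including the drop and rename, might look like the sketch below. It runs against SQLite through Python's sqlite3 module, so the syntax is SQLite's (CREATE TABLE ... AS and ALTER TABLE ... RENAME TO) rather than SQL Server's SELECT ... INTO and sp_rename. Wrapping the swap in a transaction is my own addition, not part of the answer.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Source (name TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO Source VALUES (?, ?)",
    [("alex", 1), ("alex", 1), ("liza", 32), ("hary", 34)],
)

conn.execute("BEGIN")
# 1. Copy one row per distinct combination into a new table.
conn.execute(
    "CREATE TABLE Deduped AS "
    "SELECT name, number FROM Source GROUP BY name, number"
)
# 2. Drop the old table and give the new one its name.
conn.execute("DROP TABLE Source")
conn.execute("ALTER TABLE Deduped RENAME TO Source")
conn.execute("COMMIT")

rows = sorted(conn.execute("SELECT name, number FROM Source").fetchall())
tables = [
    r[0]
    for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(rows, tables)
```

Indexes and constraints on the original table are not carried over by this kind of copy, so they would need to be recreated on the new table.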

Method B. An in-place method is to add a surrogate key column and delete duplicates through it. For example, you can

ALTER TABLE Source ADD sid INT IDENTITY(1,1);

which makes Source look like this

alex    1   1
alex    1   2
liza    32  3
hary    34  4

then you can use

DELETE FROM Source
WHERE  sid NOT IN
  (SELECT MIN(sid)
   FROM  Source
   GROUP BY name, number)

which will give the desired result. Of course, NOT IN is not the most efficient construct, but it will do the job. Alternatively, you can LEFT JOIN against the grouped result (perhaps stored in a temp table) and delete the rows that have no match.
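
The LEFT JOIN alternative could be sketched like this. Again the code uses SQLite through Python's sqlite3 module as a stand-in: an INTEGER PRIMARY KEY column plays the role of the IDENTITY column, the temp table name Keepers is invented for the example, and because SQLite has no DELETE ... JOIN syntax the anti-join sits in a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sid is filled in automatically (1, 2, 3, 4), like an identity column.
conn.execute(
    "CREATE TABLE Source (sid INTEGER PRIMARY KEY, name TEXT, number INTEGER)"
)
conn.executemany(
    "INSERT INTO Source (name, number) VALUES (?, ?)",
    [("alex", 1), ("alex", 1), ("liza", 32), ("hary", 34)],
)

# One surviving sid per group of identical rows.
conn.execute(
    "CREATE TEMP TABLE Keepers AS "
    "SELECT MIN(sid) AS sid FROM Source GROUP BY name, number"
)

# Anti-join: rows with no match in Keepers are the duplicates.
conn.execute(
    """
    DELETE FROM Source
    WHERE sid IN (
        SELECT s.sid
        FROM Source AS s
        LEFT JOIN Keepers AS k ON k.sid = s.sid
        WHERE k.sid IS NULL
    )
    """
)

rows = conn.execute("SELECT sid, name, number FROM Source ORDER BY sid").fetchall()
print(rows)
```

Only the second alex row (sid 2) has no partner in Keepers, so it is the only row removed.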

create table DuplicateTable(name varchar(10), number int)

insert DuplicateTable
values
    ('alex', 1),
    ('alex', 1),
    ('liza', 32),
    ('hary', 34);

with cte
as
(
    select *, row_number() over(partition by name, number order by name) RowNumber
    from DuplicateTable
)
delete cte
where RowNumber > 1

2 Comments

And you also need something after ORDER BY in the OVER clause.
@marc_s and @Mikael Eriksson - thank you for pointing out the typos. I shouldn't have edited the answer in the first place. It was correct before I added the 'setup' lines.

A slightly different solution, which requires a primary key (or unique index). Suppose you have a table your_table(id, name, num), where id is the primary key:

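-- Delete through the alias so the target row is unambiguously t2.
DELETE t2
FROM your_table AS t2
WHERE
-- The row belongs to a group with more than one member...
(SELECT COUNT(*) FROM your_table y
  WHERE t2.name = y.name AND t2.num = y.num) > 1
-- ...and is not the one row kept for that group. MIN(id) is used
-- because TOP 1 without ORDER BY does not guarantee which row is returned.
AND t2.id <>
(SELECT MIN(z.id) FROM your_table z
 WHERE t2.name = z.name AND t2.num = z.num);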

I assumed that name and num are NOT NULL. If they can contain NULL values, you need to change the WHERE clauses in the subqueries.
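
To see why NULLs matter here, the sketch below counts the rows that a MIN(id)-based variant of this delete would remove, once with plain = comparisons and once with an explicit NULL-safe comparison. It uses SQLite through Python's sqlite3 module as a stand-in, and the sample rows are invented. The (a = b OR (a IS NULL AND b IS NULL)) form is just one way to write the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, name TEXT, num INTEGER)"
)
conn.executemany(
    "INSERT INTO your_table (name, num) VALUES (?, ?)",
    [("alex", None), ("alex", None), ("liza", 32), ("liza", 32)],
)

# Plain equality: NULL = NULL is not true, so NULL duplicates never match.
PLAIN = """
SELECT COUNT(*) FROM your_table AS t
WHERE t.id <> (SELECT MIN(z.id) FROM your_table AS z
               WHERE z.name = t.name AND z.num = t.num)
"""

# NULL-safe equality: two NULLs are treated as the same value.
NULL_SAFE = """
SELECT COUNT(*) FROM your_table AS t
WHERE t.id <> (SELECT MIN(z.id) FROM your_table AS z
               WHERE (z.name = t.name OR (z.name IS NULL AND t.name IS NULL))
                 AND (z.num = t.num OR (z.num IS NULL AND t.num IS NULL)))
"""

plain_count = conn.execute(PLAIN).fetchone()[0]
null_safe_count = conn.execute(NULL_SAFE).fetchone()[0]
print(plain_count, null_safe_count)
```

With plain equality only the liza duplicate is found; the NULL-safe version also catches the alex duplicate. The GROUP BY and PARTITION BY approaches in the other answers do not need this change, since grouping already places NULLs together.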

3 Comments

I don't know for sure why you got the downvote, but I assume it is because you are assuming that the table has a unique id, while the table structure in the question shows that there is no id. Also, your assumption about name and num does not sound legitimate either. Basically, as far as I can tell, you made too many unnecessary assumptions, while there are already answers that don't have those limitations.
@Alex Aza: I read all the previous solutions and posted this one just to show a different approach that works. I also mentioned its limitations, and I believe that in the real world [almost] all tables have at least one unique index. Concerning nullability, it's not a real limitation; I just didn't wrap the comparisons in ISNULL. Anyway, I wouldn't complain if I got a downvote with the explanation you gave, or any other...
I absolutely agree with you about downvotes without any comment. That is not nice, imho. Also, I understand that you could have missed it, but the OP specifies in a comment that there is no unique key in this table. And yes, this is something that should have been specified in the question in the first place.
