Duplicate records are a common issue that can compromise data integrity and database performance. Removing these duplicates is essential for maintaining data accuracy, optimizing storage, and improving query performance. In this article, we will explore techniques for removing duplicate rows in SQL, tailored to different use cases and database management systems.
As we get started, I highly recommend taking DataCamp’s Introduction to SQL and Learn SQL courses to build foundational knowledge of extracting and analyzing data with SQL. I also find the downloadable SQL Basics Cheat Sheet a helpful reference because it covers the most common SQL functions.
TL;DR
- Use SELECT DISTINCT or GROUP BY to retrieve unique rows without modifying the table
- Use ROW_NUMBER() with a CTE and DELETE for precise control over which duplicates to remove permanently
- Use DELETE with a subquery (NOT IN / MIN()) for a straightforward deduplication approach in any DBMS
- For large datasets, use temporary tables to batch-process duplicate removal safely
- Prevent duplicates proactively with primary keys, unique constraints, and proper database normalization
Understanding Duplicate Rows in SQL
Duplicate rows in SQL refer to records within a table that contain identical values across all or selected columns. The common causes of duplicate rows in SQL include the following:
- Missing Primary Keys: When tables lack a defined primary key or unique constraint, there is no mechanism to prevent the insertion of duplicate data. This can happen when a table is not normalized and/or there are transitive dependency issues.
- Data Integration Issues: When merging datasets from different sources, improper joins or inconsistencies in data formats can accidentally introduce duplicates.
- Manual Data Entry Errors: Human error, such as entering the same record multiple times, is another common cause of duplicate rows.
In the rest of the article, we will look at how to remove duplicates in SQL, dividing the material into two parts. In the first, we will cover how to remove duplicates from the data you retrieve for a report or dashboard; in the second, we will look at how to remove duplicates from the database itself.
How to Identify Duplicate Rows
Before removing duplicates, identify which rows are duplicated. Use GROUP BY with HAVING COUNT(*) > 1 to find rows that appear more than once:
SELECT Name, COUNT(*) AS duplicate_count
FROM customers
GROUP BY Name
HAVING COUNT(*) > 1;
This query returns each Name that appears more than once, along with the number of occurrences. You can extend this to multiple columns by adding them to both the SELECT and GROUP BY clauses.
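To make the multi-column case concrete, here is a minimal runnable sketch using SQLite through Python's sqlite3 module (the table, columns, and sample data are hypothetical):

```python
import sqlite3

# Hypothetical customers table with some duplicate (Name, Email) pairs
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "ana@example.com"),
     (2, "Ana", "ana@example.com"),   # duplicate of row 1 on both columns
     (3, "Ana", "ana@other.com"),     # same Name, different Email: not a duplicate here
     (4, "Bob", "bob@example.com")],
)

# Rows count as duplicates only when BOTH Name and Email match
rows = conn.execute("""
    SELECT Name, Email, COUNT(*) AS duplicate_count
    FROM customers
    GROUP BY Name, Email
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('Ana', 'ana@example.com', 2)]
```

Note that row 3 is not flagged: grouping on both columns means a repeated Name alone no longer counts as a duplicate.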
To see all rows with a rank indicating their position within each duplicate group, use ROW_NUMBER():
SELECT ID, Name,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID) AS row_num
FROM customers;
Rows where row_num > 1 are duplicates. Once identified, choose the appropriate removal method from the sections below.
Methods for Removing Duplicates in the Data You Retrieve
There are different methods of removing duplicates while retrieving records in SQL, and availability varies slightly across DBMSs such as SQL Server, MySQL, and PostgreSQL. In this section, we will look at these methods while highlighting any special considerations for each database. Keep in mind that these methods filter the data and return unique records; they do not modify the underlying table.
Using DISTINCT keyword
The DISTINCT keyword is used in a SELECT statement to retrieve unique rows. The DISTINCT keyword syntax for removing duplicates is similar for MySQL, PostgreSQL, and SQL Server databases. The query below will retrieve unique customer names from the customers table.
SELECT DISTINCT Name
FROM customers;
Using GROUP BY with aggregate functions
The GROUP BY clause, combined with aggregate functions like MAX(), MIN(), or COUNT(), collapses duplicate rows into one row per group, letting you choose which value to report for each group.
Suppose you want one row per customer name, keeping the highest ID. You group by Name and use the MAX() function, as shown below.
-- Return one row per Name, reporting the highest ID in each group
SELECT Name, MAX(ID) AS MaxID
FROM customers
GROUP BY Name;
MySQL, PostgreSQL, and SQL Server all support this syntax.
Using ROW_NUMBER() with Common Table Expressions (CTE)
With the ROW_NUMBER() function combined with a Common Table Expression (CTE), you can filter out duplicates based on your criteria. The ROW_NUMBER function, when used with PARTITION BY and ORDER BY clauses, assigns a unique sequential number to each row. This method allows for filtering out the rows that do not meet the required criteria.
The following query identifies duplicates and filters out all but the first occurrence of each name in the results.
-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
FROM customers
)
-- Select only the unique records where RowNum = 1
SELECT ID, Name
FROM CTE
WHERE RowNum = 1;
This method works well for modern versions of SQL Server, MySQL, and PostgreSQL. It is useful for larger datasets or more complex conditions, as it allows you to specify exactly which duplicate to keep.
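For instance, ordering the partition by a timestamp instead of ID keeps the most recent row per group. A minimal runnable sketch of this, using SQLite (which supports window functions from version 3.25) through Python's sqlite3 module; the UpdatedAt column and sample data are hypothetical:

```python
import sqlite3  # window functions need SQLite 3.25+, bundled with modern Python

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT, UpdatedAt TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "2024-01-01"),
     (2, "Ana", "2024-03-01"),   # newer Ana row -- this one should win
     (3, "Bob", "2024-02-01")],
)

# Ordering the partition by UpdatedAt DESC keeps the most recent row per Name
rows = conn.execute("""
    WITH ranked AS (
        SELECT ID, Name,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY UpdatedAt DESC) AS rn
        FROM customers
    )
    SELECT ID, Name FROM ranked WHERE rn = 1 ORDER BY Name
""").fetchall()
print(rows)  # [(2, 'Ana'), (3, 'Bob')]
```

Changing only the ORDER BY inside the OVER() clause changes which duplicate survives, which is exactly the control this method gives you.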
Removing duplicates using self-JOIN
A self-join allows you to compare a table to itself, making it helpful in identifying and removing duplicate rows by comparing records based on specific criteria. The following example uses the self-join to delete the row with the higher ID, keeping only the first occurrence of each name.
-- Delete duplicate rows using self-join
DELETE c1
FROM customers c1
JOIN customers c2
ON c1.Name = c2.Name AND c1.ID > c2.ID;
The above method works in major databases, including SQL Server, MySQL, and PostgreSQL. Check out our Intermediate SQL course to learn more about using aggregate functions and joins to filter data.
Methods for Removing Duplicates in the Database
While you can remove duplicate records using queries, you can also permanently delete them from the database. This approach is important for maintaining data quality. The following methods are used to remove duplicates from the database.
Using ROW_NUMBER() and DELETE
The ROW_NUMBER() function assigns a sequential number to rows within a defined partition. When combined with a DELETE statement, it identifies duplicates by ranking rows on specific columns so the unwanted copies can be removed. This method applies to modern versions of MySQL (8.0+), PostgreSQL, and SQL Server. One caveat: MySQL restricts a DELETE from selecting the target table in a subquery, so in MySQL you may need to delete by joining the CTE to the table (DELETE c FROM customers c JOIN CTE ON c.ID = CTE.ID WHERE CTE.RowNum > 1) instead of using IN.
Suppose you want to remove duplicate customer records based on the Name column, keeping only the first occurrence (smallest ID):
-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
FROM customers
)
-- Delete rows from the 'customers' table where the row number is greater than 1
DELETE FROM customers
WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1);
Using DELETE with subquery
Sometimes, a simple DELETE operation using a subquery can remove duplicates from the database. This method is suitable for older versions of MySQL or PostgreSQL where ROW_NUMBER() might not be available. Be aware that MySQL raises error 1093 when a DELETE subquery selects from the table being modified; the usual workaround is to wrap the subquery in a derived table.
The query below deletes rows from the customers table where the ID is not the minimum for each Name, keeping only the row with the smallest ID for each unique Name.
-- Delete rows from the 'customers' table
DELETE FROM customers
WHERE ID NOT IN (
-- Subquery to find the minimum ID for each unique Name
SELECT MIN(ID)
FROM customers
GROUP BY Name
);
Using GROUP BY with HAVING clause
When you need to check for duplicate values in specific columns, the GROUP BY clause combined with the HAVING clause can be used to identify duplicates. This method allows you to delete specific rows based on the given criteria. This method is compatible with SQL Server, MySQL, and PostgreSQL.
The following queries first identify which names appear more than once, then delete the duplicates while keeping the row with the smallest ID for each Name.
-- Step 1: Identify which Names have duplicates
SELECT Name, COUNT(*) AS duplicate_count
FROM customers
GROUP BY Name
HAVING COUNT(*) > 1;
-- Step 2: Delete duplicate rows, keeping the smallest ID for each Name
DELETE FROM customers
WHERE ID NOT IN (
SELECT MIN(ID)
FROM customers
GROUP BY Name
);
Using temporary tables for batch processing
Temporary tables are efficient for batch processing and removing duplicates in large datasets. This method is useful where a single large DELETE could cause performance issues. The following queries create a temporary table holding the minimum ID for each Name, then delete every row in customers whose ID is not in temp_customers.
-- Create a temporary table with unique records
CREATE TEMPORARY TABLE temp_customers AS
SELECT MIN(ID) AS KeepID, Name
FROM customers
GROUP BY Name;
-- Delete duplicates not in the temporary table
DELETE FROM customers
WHERE ID NOT IN (SELECT KeepID FROM temp_customers);
-- Clean up
DROP TABLE temp_customers;
The above syntax using CREATE TEMPORARY TABLE is only supported in MySQL and PostgreSQL databases.
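As a sanity check, the same temp-table flow can be run end to end in SQLite (which also supports CREATE TEMPORARY TABLE ... AS) through Python's sqlite3 module; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ana"), (2, "Ana"), (3, "Bob"), (4, "Bob"), (5, "Cy")],
)

# Stage the IDs to keep in a temp table, delete everything else, then clean up
conn.executescript("""
    CREATE TEMPORARY TABLE temp_customers AS
        SELECT MIN(ID) AS KeepID, Name FROM customers GROUP BY Name;
    DELETE FROM customers
        WHERE ID NOT IN (SELECT KeepID FROM temp_customers);
    DROP TABLE temp_customers;
""")

survivors = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
print(survivors)  # [(1, 'Ana'), (3, 'Bob'), (5, 'Cy')]
```

Because the DELETE subquery reads from temp_customers rather than from customers itself, this pattern also sidesteps MySQL's restriction on subqueries that reference the delete target.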
Remove Duplicates in SQL Server
SQL Server offers different methods of removing duplicate records from the database. These methods include using DISTINCT with INTO, ROW_NUMBER(), and temporary tables.
Using DISTINCT with INTO
You can use the DISTINCT keyword with SELECT ... INTO to create a new table containing only unique records. Once you verify the new table holds the expected data, you can drop the old table and rename the new one. The following example creates the unique_customers table with unique records from the customers table.
-- Select distinct rows from 'customers' and create a new table 'unique_customers'
SELECT DISTINCT *
INTO unique_customers
FROM customers;
-- Drop the original 'customers' table to remove it from the database
DROP TABLE customers;
-- Rename the 'unique_customers' table to 'customers' to replace the original table
EXEC sp_rename 'unique_customers', 'customers';
Using ROW_NUMBER()
You can also use the ROW_NUMBER() function to remove duplicate records in SQL Server. Assume you have a Customers table with duplicate rows based on the CustomerName column, and you want to delete all but the first occurrence in each duplicate group.
-- Common Table Expression (CTE) to assign a row number to each customer
WITH CTE AS (
SELECT CustomerID, CustomerName, ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum
FROM Customers
)
-- Delete all but the first row in each duplicate group (SQL Server allows DELETE against a CTE)
DELETE FROM CTE
WHERE RowNum > 1;
Using temporary table
Since SQL Server does not support the CREATE TEMPORARY TABLE statement, you use SELECT ... INTO instead. Temporary tables in SQL Server use # as a prefix for the table name.
-- Create a temporary table
SELECT MIN(CustomerID) AS ID, CustomerName
INTO #temp_customers
FROM customers
GROUP BY CustomerName;
-- Delete rows from the 'customers' table where the ID is not in the temporary table
DELETE FROM customers
WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);
-- Optionally drop the temporary table after use
DROP TABLE #temp_customers;
I suggest trying our SQL Server Fundamentals skill track to improve your joining tables and data analysis skills. The SQL Server Developer career track will equip you with the skills to write, troubleshoot, and optimize your queries using SQL Server.
Quick Reference: SQL Deduplication Methods
The table below summarizes all the deduplication methods covered in this article, so you can quickly pick the right approach for your situation.
| Method | Use Case | Modifies Data? | Database Support |
|---|---|---|---|
| SELECT DISTINCT | Retrieve unique rows from query results | No | All DBMS |
| GROUP BY + aggregates | Retrieve unique rows with aggregate values | No | All DBMS |
| ROW_NUMBER() + CTE (SELECT) | Flexible duplicate filtering in queries | No | SQL Server, MySQL 8.0+, PostgreSQL |
| ROW_NUMBER() + CTE (DELETE) | Permanently remove duplicates with fine control | Yes | SQL Server, MySQL 8.0+, PostgreSQL |
| DELETE with subquery | Remove duplicates using NOT IN / MIN() | Yes | All DBMS |
| Self-join | Filter or remove duplicates by comparing rows pairwise | No (SELECT) / Yes (DELETE) | All DBMS |
| Temporary table approach | Batch processing for large datasets | Yes | MySQL, PostgreSQL (#temp for SQL Server) |
| SELECT DISTINCT INTO | Create a clean copy of the table | Yes (replaces table) | SQL Server |
Best Practices
Duplicate rows are a common problem affecting data quality and database performance. Consider the following best practices to prevent duplicate records from being inserted in your database.
- Use Primary Keys: The primary key column ensures that each record contains unique information, preventing duplicate values from entering the table.
- Implement Unique Constraints: Applying unique constraints to any column ensures no duplicates exist across non-primary key columns, such as email addresses or phone numbers.
- Proper Database Design and Normalization: Effective schema design and database normalization reduce redundancy and duplicate data by ensuring each fact is stored in exactly one place.
- Use Unique Indexes: Use unique indexes to ensure that certain column combinations are unique without requiring full table-level constraints across the entire dataset.
- Regular Data Audits: Perform regular data audits by running queries to identify potential duplicates based on your business rules.
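To illustrate the prevention side, here is a small sketch (SQLite via Python's sqlite3; the schema is hypothetical) showing a composite UNIQUE constraint rejecting a duplicate insert at write time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite UNIQUE constraint: the same (Name, Email) pair can only be stored once
conn.execute("""
    CREATE TABLE customers (
        ID    INTEGER PRIMARY KEY,
        Name  TEXT NOT NULL,
        Email TEXT NOT NULL,
        UNIQUE (Name, Email)
    )
""")
conn.execute("INSERT INTO customers (Name, Email) VALUES ('Ana', 'ana@example.com')")

# A second insert with the same pair is rejected before it ever reaches the table
try:
    conn.execute("INSERT INTO customers (Name, Email) VALUES ('Ana', 'ana@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

With constraints like this in place, the removal techniques above become a cleanup tool rather than a recurring chore.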
Conclusion
Identifying and removing duplicate rows is important to maintaining database efficiency and data accuracy. It is always a best practice to back up your data before making modifications to ensure no accidental data loss occurs.
If you are interested in becoming a proficient data analyst, check out our Associate Data Analyst in SQL career track to learn the necessary skills. The Reporting in SQL course is also appropriate if you want to learn how to build professional dashboards using SQL. Finally, I recommend obtaining the SQL Associate Certification to demonstrate your mastery of using SQL for data analysis and stand out among other data professionals.
Frequently Asked SQL Questions
What causes duplicate rows in SQL databases?
Duplicate rows can occur due to several factors, including improper database design, missing primary keys, data integration from multiple sources, manual data entry errors, or data migration issues where validation isn’t properly enforced.
Can I prevent duplicates based on multiple columns?
Yes, you can enforce uniqueness across multiple columns using composite keys or unique constraints. This ensures that combinations of values across those columns remain unique.
How does the DISTINCT keyword remove duplicate rows?
Using the DISTINCT keyword only removes duplicates in the query results and does not alter the underlying data.
Which method can you use to permanently delete duplicate records from the database?
You can use ROW_NUMBER() with DELETE, DELETE with subquery, GROUP BY with HAVING clause, and temporary tables for batch processing to permanently delete duplicate rows from the database.
Can duplicates affect the performance of my database?
Yes, duplicates can negatively impact performance by increasing storage costs, slowing queries, and complicating data analysis.
How do I find duplicate rows in SQL?
Use GROUP BY with HAVING COUNT(*) > 1 to find duplicates. For example: SELECT Name, COUNT(*) FROM customers GROUP BY Name HAVING COUNT(*) > 1; returns all names that appear more than once. You can also use ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID) to assign a rank to each row within duplicate groups—rows with a rank greater than 1 are duplicates.
What is the fastest way to remove duplicates from a large SQL table?
For large tables, use a temporary table approach: insert unique rows into a temp table using SELECT DISTINCT or GROUP BY, truncate the original table, then insert the clean data back. This avoids row-by-row deletion, which can be slow on millions of records. Alternatively, ROW_NUMBER() with a CTE is efficient when you need fine-grained control over which duplicate to keep. Always back up your data and test on a staging environment first.
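A hedged sketch of that stage-empty-reload flow, using SQLite through Python's sqlite3 module (SQLite has no TRUNCATE statement, so a bare DELETE empties the table; all names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ana"), (2, "Ana"), (3, "Bob")],
)

# 1) stage unique rows, 2) empty the original table, 3) reload the clean data
conn.executescript("""
    CREATE TEMPORARY TABLE clean AS
        SELECT MIN(ID) AS ID, Name FROM customers GROUP BY Name;
    DELETE FROM customers;  -- SQLite has no TRUNCATE; a bare DELETE empties the table
    INSERT INTO customers SELECT ID, Name FROM clean;
    DROP TABLE clean;
""")

result = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
print(result)  # [(1, 'Ana'), (3, 'Bob')]
```

On a production DBMS you would wrap these steps in a transaction (and use TRUNCATE where available) so readers never see the table in its emptied intermediate state.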
