
I am scraping the web with Python and writing the data to a .csv file that looks like the one below. If I append to the file, I might end up with repeated/duplicate rows. What can I use to avoid that? I am not sure about pandas, i.e. whether I should open the file in pandas and then drop duplicates. I tried some approaches of my own but couldn't come up with a solution, so I was thinking of using pandas as a last resort.

Date,Time,Status,School,GPA,GRE,GMAT,Round,Location,Post-MBA Career,via,on,Details,Note

2021-05-18,13:59:00,Accepted from Waitlist,Yale SOM,3.8,No data provided,740,Round 2 ,NYC,Non Profit / Social Impact,phone,2021-05-18,GPA: 3.8 GMAT: 740 Round: Round 2 | NYC,Interviewed and was waitlisted in R2. Just received the call this afternoon. Good luck everyone!

2021-05-18,13:51:00,Accepted from Waitlist,Yale SOM,3.8,323,No data provided,Round 2 ,Austin,Marketing,phone,2021-05-18,GPA: 3.8 GRE: 323 Round: Round 2 | Austin,Keep your head up! It all works out how it is supposed to.
  • Do the duplicates correspond to exactly identical lines? And are those duplicates consecutive in the file? Commented May 20, 2021 at 12:33
  • Yes, and no, they're scattered
    – Aytida
    Commented May 20, 2021 at 12:37
  • Then pandas and drop_duplicates is probably your best option if you intend to later use pandas on the data. If you do not, and if the file can fit in memory, then using a set of lines should do the job. Commented May 20, 2021 at 12:47

2 Answers


If you want to do it with pandas:

import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")

# 2(a). For complete-row duplicates
df.drop_duplicates(inplace=True)

# 2(b). For partial duplicates, compare only the listed fields
df.drop_duplicates(subset=['Date', 'Time'], inplace=True)  # add other fields as needed

# 3. Save back
df.to_csv("data.csv", index=False)
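
Since the question is about appending scraped data without duplicates, here is a minimal sketch of the full append-then-deduplicate round trip; all sample rows and values below are made up for illustration:

import pandas as pd

# Existing data on disk (made-up rows mimicking the question's CSV).
existing = pd.DataFrame({
    "Date": ["2021-05-18", "2021-05-18"],
    "Time": ["13:59:00", "13:51:00"],
    "School": ["Yale SOM", "Yale SOM"],
})
existing.to_csv("data.csv", index=False)

# Hypothetical freshly scraped batch: one duplicate row, one new row.
new_rows = pd.DataFrame({
    "Date": ["2021-05-18", "2021-05-19"],
    "Time": ["13:59:00", "09:00:00"],
    "School": ["Yale SOM", "Wharton"],
})

# Append, drop exact-row duplicates, and write back: 3 rows remain.
combined = pd.concat([pd.read_csv("data.csv"), new_rows], ignore_index=True)
combined.drop_duplicates(inplace=True)
combined.to_csv("data.csv", index=False)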
  • Thanks, when you say complete row duplicate, you mean it checks every field for duplicate values, right? And for partials, it checks only the given 'n' fields?
    – Aytida
    Commented May 20, 2021 at 12:44
  • Yes, for complete it should be an exact row (all fields), and for partial it will check just the n fields, e.g. for subset=['Date', 'Time'] it removes all duplicated rows with just Date and Time the same (i.e. 4 rows with the same Date and Time are reduced to 1); see the sketch after these comments.
    – Pawan Jain
    Commented May 20, 2021 at 12:48
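
To make that concrete, here is a small sketch of subset-based deduplication; the values are made up:

import pandas as pd

# Four rows share the same Date and Time but differ in School.
df = pd.DataFrame({
    "Date": ["2021-05-18"] * 4,
    "Time": ["13:59:00"] * 4,
    "School": ["Yale SOM", "Wharton", "Kellogg", "Booth"],
})

# Only Date and Time are compared, so the four rows collapse to one;
# by default the first occurrence is kept.
deduped = df.drop_duplicates(subset=["Date", "Time"])
print(deduped)  # single row: 2021-05-18, 13:59:00, Yale SOM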

Maybe read through the lines one at a time, store them in a set (so there are no duplicates), and then write them back?

lines = set()
file = 'foo.txt'
# Collect the unique lines
with open(file) as fd:
    for line in fd:
        lines.add(line)
# Overwrite the file with the de-duplicated lines
with open(file, 'w') as fd:
    fd.write(''.join(lines))
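
One caveat: a set does not preserve order, so the CSV header row may not end up back on the first line. If order matters, a list plus a "seen" set keeps the first occurrence of each line in place; a sketch under the same foo.txt assumption:

seen = set()
ordered = []
file = 'foo.txt'
with open(file) as fd:
    for line in fd:
        # Keep only the first occurrence of each line, in original order.
        if line not in seen:
            seen.add(line)
            ordered.append(line)
with open(file, 'w') as fd:
    fd.writelines(ordered)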
  • The file has about 60,000 entries. Do you think that will be feasible?
    – Aytida
    Commented May 20, 2021 at 12:25
  • Should be fine - only one way to find out ;) Commented May 20, 2021 at 12:28
  • This makes sense, but I have a question: how do you write back? That is, how do I add more data?
    – Aytida
    Commented May 21, 2021 at 6:56
  • Can you show an example? Just add extra elements to lines, and make sure they end with \n (see the sketch below). Commented May 21, 2021 at 9:40
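
A minimal sketch of that suggestion; the new row below is made up:

lines = set()
file = 'foo.txt'
with open(file) as fd:
    lines.update(fd)  # iterating a file yields lines, newlines included

# New scraped rows must end with '\n'; duplicates are ignored by the set.
lines.add("2021-05-19,09:00:00,Accepted,Wharton,3.7,325,No data provided\n")

with open(file, 'w') as fd:
    fd.write(''.join(lines))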
