
I am scraping the web with Python and writing the data to a .csv file that looks like the one below. If I append to the file, I might end up with repeated/duplicate rows. What can I use to avoid that? I am not sure about pandas, i.e. whether I should open the file in pandas and then drop duplicates. I tried some approaches of my own but couldn't come up with a solution, so I was thinking of using pandas as a last resort.

Date,Time,Status,School,GPA,GRE,GMAT,Round,Location,Post-MBA Career,via,on,Details,Note

2021-05-18,13:59:00,Accepted from Waitlist,Yale SOM,3.8,No data provided,740,Round 2 ,NYC,Non Profit / Social Impact,phone,2021-05-18,GPA: 3.8 GMAT: 740 Round: Round 2 | NYC,Interviewed and was waitlisted in R2. Just received the call this afternoon. Good luck everyone!

2021-05-18,13:51:00,Accepted from Waitlist,Yale SOM,3.8,323,No data provided,Round 2 ,Austin,Marketing,phone,2021-05-18,GPA: 3.8 GRE: 323 Round: Round 2 | Austin,Keep your head up! It all works out how it is supposed to.
  • Do the duplicates correspond to exactly identical lines? And are those duplicates consecutive in the file? Commented May 20, 2021 at 12:33
  • Yes, and no, they're scattered
    – Aytida
    Commented May 20, 2021 at 12:37
  • Then pandas and drop_duplicates is probably your best option if you intend to later use pandas on the data. If you do not, and if the file can fit in memory, then using a set of lines should do the job. Commented May 20, 2021 at 12:47

2 Answers


If you want to do it with pandas:

import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")

# 2(a). For complete-row duplicates
df.drop_duplicates(inplace=True)

# 2(b). For partial duplicates, compare only the listed fields
df.drop_duplicates(subset=['Date', 'Time'], inplace=True)  # add other fields as needed

# 3. Save back
df.to_csv("data.csv", index=False)
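
Since the question is about appending scraped data without duplicates, here is a minimal sketch of the full append-then-deduplicate round trip; all sample rows and values below are made up for illustration:

import pandas as pd

# Existing data on disk (made-up rows mimicking the question's CSV).
existing = pd.DataFrame({
    "Date": ["2021-05-18", "2021-05-18"],
    "Time": ["13:59:00", "13:51:00"],
    "School": ["Yale SOM", "Yale SOM"],
})
existing.to_csv("data.csv", index=False)

# Hypothetical freshly scraped batch: one duplicate row, one new row.
new_rows = pd.DataFrame({
    "Date": ["2021-05-18", "2021-05-19"],
    "Time": ["13:59:00", "09:00:00"],
    "School": ["Yale SOM", "Wharton"],
})

# Append, drop exact-row duplicates, and write back: 3 rows remain.
combined = pd.concat([pd.read_csv("data.csv"), new_rows], ignore_index=True)
combined.drop_duplicates(inplace=True)
combined.to_csv("data.csv", index=False)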
  • Thanks, when you say complete row duplicate, you mean it checks every field for duplicate values, right? And for partials, it checks only the given 'n' fields?
    – Aytida
    Commented May 20, 2021 at 12:44
  • Yes, for complete it should be an exact row (all fields), and for partial it will check just the n fields, e.g. for subset=['Date', 'Time'] it removes all duplicated rows with just Date and Time the same (i.e. 4 rows with the same Date and Time are reduced to 1); see the sketch after these comments.
    – Pawan Jain
    Commented May 20, 2021 at 12:48
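
To make that concrete, here is a small sketch of subset-based deduplication; the values are made up:

import pandas as pd

# Four rows share the same Date and Time but differ in School.
df = pd.DataFrame({
    "Date": ["2021-05-18"] * 4,
    "Time": ["13:59:00"] * 4,
    "School": ["Yale SOM", "Wharton", "Kellogg", "Booth"],
})

# Only Date and Time are compared, so the four rows collapse to one;
# by default the first occurrence is kept.
deduped = df.drop_duplicates(subset=["Date", "Time"])
print(deduped)  # single row: 2021-05-18, 13:59:00, Yale SOM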

Maybe read through the lines one at a time, store them in a set (so there are no duplicates), and then write them back?

lines = set()
file = 'foo.txt'
# Collect the unique lines
with open(file) as fd:
    for line in fd:
        lines.add(line)
# Overwrite the file with the de-duplicated lines
with open(file, 'w') as fd:
    fd.write(''.join(lines))
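
One caveat: a set does not preserve order, so the CSV header row may not end up back on the first line. If order matters, a list plus a "seen" set keeps the first occurrence of each line in place; a sketch under the same foo.txt assumption:

seen = set()
ordered = []
file = 'foo.txt'
with open(file) as fd:
    for line in fd:
        # Keep only the first occurrence of each line, in original order.
        if line not in seen:
            seen.add(line)
            ordered.append(line)
with open(file, 'w') as fd:
    fd.writelines(ordered)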
  • The file has about 60,000 entries. Do you think that will be feasible?
    – Aytida
    Commented May 20, 2021 at 12:25
  • Should be fine - only one way to find out ;) Commented May 20, 2021 at 12:28
  • This makes sense, but I have a question: how do you write back? That is, how do I add more data?
    – Aytida
    Commented May 21, 2021 at 6:56
  • Can you show an example? Just add extra elements to lines, and make sure they end with \n (see the sketch below). Commented May 21, 2021 at 9:40
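
A minimal sketch of that suggestion; the new row below is made up:

lines = set()
file = 'foo.txt'
with open(file) as fd:
    lines.update(fd)  # iterating a file yields lines, newlines included

# New scraped rows must end with '\n'; duplicates are ignored by the set.
lines.add("2021-05-19,09:00:00,Accepted,Wharton,3.7,325,No data provided\n")

with open(file, 'w') as fd:
    fd.write(''.join(lines))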
