Delete duplicate data from a CSV and train on it (Keras, Python, pandas)

Question

Let's say we have a dataset in a CSV file, with data roughly as follows. Assume the CSV that contains this data has 1000 columns and 1000 rows.

Let's say we use columns A and B for regression and prediction with the Keras library. I want to delete the duplicate values in A and keep only one of each. For example, if the value 1 appears 5 times, only 1 occurrence will remain. At the same time, I want the rows of the 4 deleted duplicates to be removed from column B, or from any other column X, as well.

If we think of it as two different scenarios:

In the first scenario, whatever is deleted as a duplicate in column A is likewise deleted from column B or any other column.

In the other scenario, repeated values are deleted in each column independently of the other columns.

The regression then needs to be performed with Keras on the data that remains.

Can you help with this?

Could you provide an output example, to show what you would expect? – kodkirurg, Commented Aug 10, 2021 at 18:00
@kodkirurg Scenario 1, A 1 2 3 4 5, B 2 4 5 1 3 6 8, C 1 6 3 4, D 2 6 9 0 1 3, E 8 6 1 2 3 5 7 – Dreko, Commented Aug 10, 2021 at 18:12
@kodkirurg Scenario 2, A 1 2 3 4 5, B 2 4 5 6 8, C 1 6 3 1 3, D 2 6 9 6 3, E 8 6 1 5 7. The main goal is removing duplicated data from the dataset before applying regression with Keras. – Dreko, Commented Aug 10, 2021 at 18:14
What you're saying is that each column should only contain unique values, and if a non-unique value does exist we drop the whole row? – kodkirurg, Commented Aug 10, 2021 at 18:18
pandas.pydata.org/pandas-docs/stable/reference/api/… is probably what you're looking for. If I can understand what you're trying to do, I can probably help you with code. – kodkirurg, Commented Aug 10, 2021 at 18:21

kodkirurg · Accepted Answer · 2021-08-10 19:09:32Z

2

This checks column A for duplicates. When it finds one, it drops that whole row and keeps only the first occurrence of each value.

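import pandas as pd

d = {'A': [1, 2, 3, 2, 1, 4, 5],
     'B': [2, 4, 5, 1, 3, 6, 8],
     'C': [1, 6, 3, 4, 6, 1, 3],
     'D': [2, 6, 9, 0, 1, 6, 3],
     'E': [8, 6, 1, 2, 3, 5, 7]
    }

df = pd.DataFrame(data=d)
# drop_duplicates returns a new DataFrame, so assign the result;
# the first row for each value in A is kept
df1 = df.drop_duplicates(subset='A')
print(df1)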

output:

   A  B  C  D  E
0  1  2  1  2  8
1  2  4  6  6  6
2  3  5  3  9  1
5  4  6  1  6  5
6  5  8  3  3  7
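
The answer above covers the scenario where the whole row is dropped. For the other scenario in the question, where each column is deduplicated independently, a minimal sketch might look like the following. It reuses the sample data from the answer; the name unique_per_column is hypothetical, and storing the result as a dict of lists is an assumption made here because the columns no longer have the same length afterwards.

```python
import pandas as pd

# Same sample data as in the answer.
d = {'A': [1, 2, 3, 2, 1, 4, 5],
     'B': [2, 4, 5, 1, 3, 6, 8],
     'C': [1, 6, 3, 4, 6, 1, 3],
     'D': [2, 6, 9, 0, 1, 6, 3],
     'E': [8, 6, 1, 2, 3, 5, 7]}
df = pd.DataFrame(data=d)

# Deduplicate every column on its own, keeping first occurrences in order.
# The columns end up with different lengths, so they are kept in a dict
# of lists (hypothetical name) instead of a single DataFrame.
unique_per_column = {col: df[col].drop_duplicates().tolist() for col in df.columns}

for col, values in unique_per_column.items():
    print(col, values)
```

Because the columns differ in length after this step, rows no longer line up between A and B, so this form is probably only useful for inspecting the values, not for fitting a regression on paired data.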

edited Aug 10, 2021 at 19:09

answered Aug 10, 2021 at 18:27


Not exactly that way, I guess I explained it wrong. The data corresponding to the data deleted from column A is likewise removed from column B. Data in A may be unique, but there is no such requirement for B. After deletion, the number of rows in A and B must be equal.
– Dreko
Commented Aug 10, 2021 at 18:32
This is correct then, B does not have to be unique, let me update the output so it's easier to see.
– kodkirurg
Commented Aug 10, 2021 at 18:36
Thank you, that's what I was looking for. What if I run this from a CSV file? import pandas as pd; d = pd.read_csv('data.csv'); df = pd.DataFrame(data=d); df.drop_duplicates(subset='A'); print(d). How can I print the data after deletion? I couldn't get my code to run, can you help?
– Dreko
Commented Aug 10, 2021 at 19:35
I think you forgot to assign the result: df1 = df.drop_duplicates(subset='A'), then print(df1). Or you could write df.drop_duplicates(subset='A', inplace=True) and then print(df). inplace=True means the DataFrame itself is updated.
– kodkirurg
Commented Aug 10, 2021 at 20:55
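
Putting the comments above together, a minimal sketch of the CSV workflow could look like this. The CSV contents are an assumption (columns A and B from the answer's sample data), and io.StringIO stands in for 'data.csv' so the snippet runs on its own; the names csv_text, deduped, x and y are hypothetical.

```python
import io
import pandas as pd

# Stand-in for 'data.csv' (assumed contents, columns A and B of the sample data).
csv_text = "A,B\n1,2\n2,4\n3,5\n2,1\n1,3\n4,6\n5,8\n"

# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) call is needed.
# With a real file this would be: df = pd.read_csv('data.csv')
df = pd.read_csv(io.StringIO(csv_text))

# drop_duplicates returns a new DataFrame, so keep the result.
deduped = df.drop_duplicates(subset='A').reset_index(drop=True)
print(deduped)

# Columns A and B now have the same number of rows and can be
# extracted as NumPy arrays for a regression model.
x = deduped['A'].to_numpy()
y = deduped['B'].to_numpy()
print(len(x), len(y))
```

The arrays x and y have equal length, so they could then be passed as inputs and targets to a Keras model's fit method.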
