
Let's say we have a dataset in a CSV file, with representative data as follows. Assume the CSV contains 1000 columns and 1000 rows.

Let's say we use columns A and B for regression and prediction with the Keras library. I want to delete the duplicate values in A and keep only one of each. For example, if the value 1 appears 5 times, only one should remain. At the same time, I want the rows of those 4 deleted duplicates to be deleted from column B or any other column X as well.

Think of it as 2 different scenarios:

In the first, rows with duplicate data in column A are likewise deleted from column B or any other column.

In the other, repeated data is deleted from each column independently of the others.

The regression then needs to be performed with the Keras module on the data that remains.

Can you help with this?

[Image: sample of the data]

  • Could you provide an output example, to show what you would expect?
    – kodkirurg
    Commented Aug 10, 2021 at 18:00
  • @kodkirurg Scenario 1, A 1 2 3 4 5, B 2 4 5 1 3 6 8, C 1 6 3 4, D 2 6 9 0 1 3, E 8 6 1 2 3 5 7
    – Dreko
    Commented Aug 10, 2021 at 18:12
  • @kodkirurg Scenario 2: A 1 2 3 4 5, B 2 4 5 6 8, C 1 6 3 1 3, D 2 6 9 6 3, E 8 6 1 5 7. The main goal is to remove duplicated data from the dataset before applying regression with Keras.
    – Dreko
    Commented Aug 10, 2021 at 18:14
  • What you're saying is that each column should only contain unique values and if a non-unique value does exist we drop the whole row?
    – kodkirurg
    Commented Aug 10, 2021 at 18:18
  • pandas.pydata.org/pandas-docs/stable/reference/api/… is probably what you're looking for. If I can understand what you're trying to do, I can probably help you with code.
    – kodkirurg
    Commented Aug 10, 2021 at 18:21

1 Answer


This checks column A for duplicates. When it finds one, it drops that whole row and keeps only the first row for each value in A.

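import pandas as pd

# Sample data; column A contains the repeated values 1 and 2
d = {'A': [1, 2, 3, 2, 1, 4, 5],
     'B': [2, 4, 5, 1, 3, 6, 8],
     'C': [1, 6, 3, 4, 6, 1, 3],
     'D': [2, 6, 9, 0, 1, 6, 3],
     'E': [8, 6, 1, 2, 3, 5, 7]
    }

df = pd.DataFrame(data=d)

# drop_duplicates returns a new DataFrame, so assign the result before printing
result = df.drop_duplicates(subset='A')
print(result)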

output:

   A  B  C  D  E
0  1  2  1  2  8
1  2  4  6  6  6
2  3  5  3  9  1
5  4  6  1  6  5
6  5  8  3  3  7
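
For the other scenario in the question, where each column is deduplicated independently, here is a minimal sketch using the same sample data. The name unique_per_column is only illustrative. Because the columns end up with different lengths, the result is stored as a dict of lists rather than a DataFrame, and the rows no longer line up across columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 2, 1, 4, 5],
    'B': [2, 4, 5, 1, 3, 6, 8],
    'C': [1, 6, 3, 4, 6, 1, 3],
    'D': [2, 6, 9, 0, 1, 6, 3],
    'E': [8, 6, 1, 2, 3, 5, 7],
})

# Deduplicate every column on its own, keeping the first occurrence of each value.
# unique_per_column is an illustrative name, not part of pandas.
unique_per_column = {col: df[col].drop_duplicates().tolist() for col in df.columns}

for col, values in unique_per_column.items():
    print(col, values)
```

Since the columns no longer have the same number of entries, this form is probably not suitable as direct input for a regression on A and B. The row-wise drop_duplicates(subset='A') above keeps the columns aligned.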

  • Not exactly that way, I guess I explained it wrong. The data corresponding to the data deleted from column A is likewise removed from column B. Data in A may be unique, but there is no such requirement for B. After deletion, the number of rows in A and B must be equal.
    – Dreko
    Commented Aug 10, 2021 at 18:32
  • This is correct then, B does not have to be unique, let me update the output so it's easier to see.
    – kodkirurg
    Commented Aug 10, 2021 at 18:36
  • Thank you, that's what I was looking for. What if I run this from a csv file? import pandas as pd; d = pd.read_csv('data.csv'); df = pd.DataFrame(data=d); df.drop_duplicates(subset='A'); print(d). How can I print the data after deletion? I couldn't run my code, can you help?
    – Dreko
    Commented Aug 10, 2021 at 19:35
  • I think you forgot to assign the result: df1 = df.drop_duplicates(subset='A'), then print(df1). Or you could write df.drop_duplicates(subset='A', inplace=True) and then print(df). inplace=True means the call updates the dataframe itself.
    – kodkirurg
    Commented Aug 10, 2021 at 20:55
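
The two options from the last comment could look roughly like this when the data comes from a CSV. This is only a sketch: the CSV contents are made up for illustration, and an in-memory string stands in for data.csv so the snippet runs on its own. With a real file, pass the path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for 'data.csv'; these values are invented for the example.
csv_text = "A,B\n1,2\n2,4\n3,5\n2,1\n1,3\n4,6\n5,8\n"

# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) call is needed
df = pd.read_csv(io.StringIO(csv_text))

# Option 1: assign the result; df itself is left unchanged
deduped = df.drop_duplicates(subset='A')
print(deduped)

# Option 2: modify df in place (the call itself then returns None)
df.drop_duplicates(subset='A', inplace=True)
print(df)
```

Both options print the same rows. Assigning to a new name keeps the original data around, which can be handy for comparing row counts before and after.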
