Remove duplicates based on criteria from one column while merging data from a different column

Question

My source dataframe:

Name	Source	Description	Value
John	A	Text1	1
John	B	Longer text	4
Bob	B	Text2	2
Alice	Z	Longer text	5
Alice	Y	The Longest text	3
Alice	X	Text3	6

I want to drop duplicates from column Name with the following criteria:

Keep the row where Description is the longest
Have a column that merges all the Source values, sorted alphabetically

Output I want to achieve:

Name	Source	Description	Value
John	A, B	Longer text	4
Bob	B	Text2	2
Alice	X, Y, Z	The Longest text	3

Here's what I have so far:

# Create new column with Description length
df['Description_Length'] = df['Description'].str.len().fillna(0) 

# Drop duplicates from Name based on Description Length
df = df.sort_values('Description_Length',ascending=False).drop_duplicates('Name')

What I'm missing is how to join the Source data before dropping the duplicates? Thanks!

Maybe first use .groupby('Name') and then run code on each group, for example something like max(group['Description'], key=len). — furas, Commented 20 hours ago
Does this work for you? df = df.sort_values(by='Description', key=lambda col: col.str.len(), ascending=False).groupby('Name', as_index=False).agg({ 'Source': lambda x: ', '.join(sorted(x)), 'Description': 'first', 'Value': 'first' }) — Milos Stojanovic, Commented 19 hours ago
Is your source dataframe a csv file by any chance? If so, this is dead simple via the csv package. — JonSG, Commented 19 hours ago
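
A minimal sketch of one way to join Source before dropping duplicates, staying close to the code in the question. The grouped transform writes the sorted, joined sources onto every row of a Name, so whichever row survives drop_duplicates already carries them. The helper column name desc_len and the stable sort are assumptions made for illustration, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6],
})

result = (
    df.assign(
        # Every row of a Name gets the same sorted, comma-joined Source string
        Source=df.groupby('Name')['Source'].transform(lambda s: ', '.join(sorted(s))),
        # Hypothetical helper column: description length, missing values count as 0
        desc_len=df['Description'].str.len().fillna(0),
    )
    # Longest description first; a stable sort keeps ties in their original order
    .sort_values('desc_len', ascending=False, kind='stable')
    # Keep the first (longest-description) row for each Name
    .drop_duplicates('Name')
    .drop(columns='desc_len')
    .reset_index(drop=True)
)

print(result)
```

The rows come out ordered by description length rather than by Name; add a final sort_values('Name') if that matters.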

Milos Stojanovic · Accepted Answer · 2025-04-30 15:11:38Z

Something like this might work for you (the chain is wrapped in parentheses so it can span several lines):

df = (
    df.sort_values(by='Description',
                   key=lambda col: col.str.len(),  # sort by description length
                   ascending=False)
      .groupby('Name', as_index=False)
      .agg({
          'Source': lambda x: ', '.join(sorted(x)),  # all sources, alphabetical
          'Description': 'first',                    # longest description per name
          'Value': 'first'
      })
)

Output:

    Name   Source       Description  Value
0  Alice  X, Y, Z  The Longest text      3
1    Bob        B             Text2      2
2   John     A, B       Longer text      4

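A small note: passing the string 'first' as the aggregation function works because agg accepts the names of groupby methods (here GroupBy.first) in place of actual functions. Be aware that GroupBy.first returns the first non-null value of each column, so if a column can contain missing values, lambda x: x.iloc[0] is the safer choice for keeping the values of the actual first row. The first method deprecated since pandas 2.1 is DataFrame.first, a different time-series method; GroupBy.first is not deprecated.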
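
To illustrate how 'first' and lambda x: x.iloc[0] can differ, here is a small sketch on made-up data (the demo frame and its values are hypothetical, not from the question). GroupBy.first skips missing values per column, so it can combine values from different rows, while x.iloc[0] always takes the first row of the group as-is.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row for 'Ann' has a missing Value
demo = pd.DataFrame({
    'Name': ['Ann', 'Ann'],
    'Description': ['Longest text', 'Short'],
    'Value': [np.nan, 7.0],
})

# 'first' takes the first non-null entry of each column independently
by_first = demo.groupby('Name', as_index=False).agg(
    {'Description': 'first', 'Value': 'first'}
)

# iloc[0] takes whatever is in the first row of the group, missing or not
by_iloc = demo.groupby('Name', as_index=False).agg(
    {'Description': lambda x: x.iloc[0], 'Value': lambda x: x.iloc[0]}
)

print(by_first)
print(by_iloc)
```

With 'first', the Description comes from the first row but the Value comes from the second; with iloc[0], the Value stays missing.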

Cameron Riddell · Accepted Answer · 2025-04-30 14:21:44Z

You can use a groupby aggregation to gather the sorted sources and the location (indexes) of the longest description. From there you can do a self join along those indexes to carry the values & descriptions forward.

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
})

print(
    df
    .assign(
        desc_length=lambda df: df['Description'].str.len().fillna(0)
    )
    .groupby('Name', as_index=False).agg(
        Source=('Source', sorted),
        length_indexes=('desc_length', 'idxmax'),
    )
    .merge(df.drop(columns=['Name', 'Source']), left_on='length_indexes', right_index=True)
    .drop(columns=['length_indexes'])
)

#     Name     Source       Description  Value
# 0  Alice  [X, Y, Z]  The Longest text      3
# 1    Bob        [B]             Text2      2
# 2   John     [A, B]       Longer text      4
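
The output above holds Source as lists. If a comma-separated string is wanted, as in the question, one possible variation is to join inside the aggregation. This is a sketch of that tweak, not the answer's original code; only the Source aggregation changes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6],
})

result = (
    df
    .assign(desc_length=lambda d: d['Description'].str.len().fillna(0))
    .groupby('Name', as_index=False)
    .agg(
        # Join the sorted sources into one string instead of keeping a list
        Source=('Source', lambda s: ', '.join(sorted(s))),
        # Index label of the row with the longest description in each group
        length_indexes=('desc_length', 'idxmax'),
    )
    # Self join on those index labels to pull in Description and Value
    .merge(df.drop(columns=['Name', 'Source']), left_on='length_indexes', right_index=True)
    .drop(columns=['length_indexes'])
)

print(result)
```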

Subir Chowdhury · Accepted Answer · 2025-04-30 15:43:19Z

Here is the full code. It is written with performance in mind; the measured runtime was 483.8704 ms (TotalMilliseconds).

import pandas as pd

data = {
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
}
df = pd.DataFrame(data)

df['desc_len'] = df['Description'].str.len()
max_len_idx = df.groupby('Name')['desc_len'].idxmax()
longest_rows = df.loc[max_len_idx].copy()

sorted_sources = (
    df.sort_values(['Name', 'Source'])
    .groupby('Name')['Source']
    .agg(list)
    .str.join(', ')
)

result = (
    longest_rows
    .merge(sorted_sources.rename('Source_agg'), 
           left_on='Name', 
           right_index=True)
    .drop(columns=['Source', 'desc_len'])
    .rename(columns={'Source_agg': 'Source'})
    [['Name', 'Source', 'Description', 'Value']]
    .sort_values('Name')
    .reset_index(drop=True)
)

print(result)

Output:

    Name   Source       Description  Value
0  Alice  X, Y, Z  The Longest text      3
1    Bob        B             Text2      2
2   John     A, B       Longer text      4
