My source dataframe:

Name   Source  Description       Value
John   A       Text1             1
John   B       Longer text       4
Bob    B       Text2             2
Alice  Z       Longer text       5
Alice  Y       The Longest text  3
Alice  X       Text3             6

I want to drop duplicates from column Name with the following criteria:

  1. Keep the row where Description is the longest
  2. Have a column that merges all the Source values, sorted alphabetically

Output I want to achieve:

Name   Source   Description       Value
John   A, B     Longer text       4
Bob    B        Text2             2
Alice  X, Y, Z  The Longest text  3

Here's what I have so far:

# Create new column with Description length
df['Description_Length'] = df['Description'].str.len().fillna(0) 

# Drop duplicates from Name based on Description Length
df = df.sort_values('Description_Length',ascending=False).drop_duplicates('Name')

What I'm missing is how to join the Source values for each Name before dropping the duplicates. Thanks!

  • maybe first you should use .groupby('Name') and later run code on every group. For example something like ['Description'].max(key=len)
    – furas
    Commented 20 hours ago
  • Does this work for you? df = df.sort_values(by='Description', key=lambda col: col.str.len(), ascending=False).groupby('Name', as_index=False).agg({ 'Source': lambda x: ', '.join(sorted(x)), 'Description': 'first', 'Value': 'first' })
    Commented 19 hours ago
  • Is your source dataframe a CSV file, by chance? If so, this is dead simple with the csv package.
    – JonSG
    Commented 19 hours ago

3 Answers

1

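Something like this might work for you (the chained calls are wrapped in parentheses so that the line breaks are valid Python):

df = (
    df.sort_values(by='Description',
                   key=lambda col: col.str.len(),
                   ascending=False)
      .groupby('Name', as_index=False)
      .agg({
          'Source': lambda x: ', '.join(sorted(x)),  # merge sources alphabetically
          'Description': 'first',                     # longest one, thanks to the sort
          'Value': 'first'                            # value from that same row
      })
)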

Output:

    Name   Source       Description  Value
0  Alice  X, Y, Z  The Longest text      3
1    Bob        B             Text2      2
2   John     A, B       Longer text      4

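A small note: passing the string 'first' as the aggregation function works because agg accepts the name of a groupby method in place of a callable, and here it resolves to the groupby first method. You can replace 'first' with lambda x: x.iloc[0] if that reads more clearly, but the two differ on missing data: 'first' returns the first non-null value in each group, while x.iloc[0] returns the value from the first row even if it is NaN. The deprecation introduced in pandas 2.1 applies to DataFrame.first (the date-offset method), not to the groupby aggregation used here.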
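To make the difference between the two aggregations concrete, here is a minimal sketch on a made-up two-row frame (the data and variable names are illustrative assumptions, not part of the question). It relies on the groupby first method skipping missing values by default, which is where it can diverge from taking the first row positionally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row of the group has a missing Value.
toy = pd.DataFrame({
    "Name": ["Alice", "Alice"],
    "Value": [np.nan, 2.0],
})

grouped = toy.groupby("Name")["Value"]

# 'first' resolves to the groupby method, which skips missing values.
first_non_null = grouped.agg("first")

# iloc[0] takes whatever sits in the first row of the group, NaN included.
first_row = grouped.agg(lambda x: x.iloc[0])

print(first_non_null)
print(first_row)
```

With the data in the question there are no missing values, so both forms should give the same rows; the distinction only matters once NaN can appear in Description or Value.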

0

You can use a groupby aggregation to gather the sorted sources and the location (indexes) of the longest description. From there you can do a self join along those indexes to carry the values & descriptions forward.

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
})

print(
    df
    .assign(
        desc_length=lambda df: df['Description'].str.len().fillna(0)
    )
    .groupby('Name', as_index=False).agg(
        Source=('Source', sorted),
        length_indexes=('desc_length', 'idxmax'),
    )
    .merge(df.drop(columns=['Name', 'Source']), left_on='length_indexes', right_index=True)
    .drop(columns=['length_indexes'])
)

#     Name     Source       Description  Value
# 0  Alice  [X, Y, Z]  The Longest text      3
# 1    Bob        [B]             Text2      2
# 2   John     [A, B]       Longer text      4
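Note that this leaves Source as a Python list per row, while the question asks for a comma-separated string. One possible follow-up step, sketched here on a hand-built copy of the output above (the frame name `result` is an assumption), is to join each list with `Series.str.join`:

```python
import pandas as pd

# Hand-built stand-in for the output shown above, with list-valued Source.
result = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Source": [["X", "Y", "Z"], ["B"], ["A", "B"]],
    "Description": ["The Longest text", "Text2", "Longer text"],
    "Value": [3, 2, 4],
})

# Series.str.join concatenates the elements of each list with the separator.
result["Source"] = result["Source"].str.join(", ")

print(result)
```

Alternatively, the aggregation itself could build the string, for example with a lambda such as `lambda s: ', '.join(sorted(s))` in place of `sorted`.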

0

Here is the full code:

Runtime: TotalMilliseconds : 483.8704. Performance optimized.

import pandas as pd

data = {
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text', 'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6]
}
df = pd.DataFrame(data)

df['desc_len'] = df['Description'].str.len()
max_len_idx = df.groupby('Name')['desc_len'].idxmax()
longest_rows = df.loc[max_len_idx].copy()

sorted_sources = (
    df.sort_values(['Name', 'Source'])
    .groupby('Name')['Source']
    .agg(list)
    .str.join(', ')
)

result = (
    longest_rows
    .merge(sorted_sources.rename('Source_agg'), 
           left_on='Name', 
           right_index=True)
    .drop(columns=['Source', 'desc_len'])
    .rename(columns={'Source_agg': 'Source'})
    [['Name', 'Source', 'Description', 'Value']]
    .sort_values('Name')
    .reset_index(drop=True)
)

print(result)

Output:

    Name   Source       Description  Value
0  Alice  X, Y, Z  The Longest text      3
1    Bob        B             Text2      2
2   John     A, B       Longer text      4
