My source dataframe:
Name | Source | Description | Value |
---|---|---|---|
John | A | Text1 | 1 |
John | B | Longer text | 4 |
Bob | B | Text2 | 2 |
Alice | Z | Longer text | 5 |
Alice | Y | The Longest text | 3 |
Alice | X | Text3 | 6 |
I want to drop duplicates from column `Name` with the following criteria:

- Keep the row where `Description` is the longest
- Have a column that merges all the `Source` values, sorted alphabetically
Output I want to achieve:
Name | Source | Description | Value |
---|---|---|---|
John | A, B | Longer text | 4 |
Bob | B | Text2 | 2 |
Alice | X, Y, Z | The Longest text | 3 |
Here's what I have so far:
```python
# Create a new column with the Description length
df['Description_Length'] = df['Description'].str.len().fillna(0)

# Keep the row with the longest Description per Name
df = df.sort_values('Description_Length', ascending=False).drop_duplicates('Name')
```
What I'm missing is how to join the `Source` values before dropping the duplicates. Thanks!
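If you want to keep your current `drop_duplicates` approach, one option (a sketch, assuming the frame is named `df` as in your code) is to merge the `Source` values per `Name` with a grouped `transform` while all the rows still exist, and only then drop duplicates:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text',
                    'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6],
})

# Merge all Source values per Name (sorted alphabetically);
# transform broadcasts the joined string back onto every row of the group
df['Source'] = df.groupby('Name')['Source'].transform(lambda s: ', '.join(sorted(s)))

# Then keep the row with the longest Description per Name, as before
df['Description_Length'] = df['Description'].str.len().fillna(0)
df = (df.sort_values('Description_Length', ascending=False)
        .drop_duplicates('Name')
        .drop(columns='Description_Length'))
```

Because `transform` runs before `drop_duplicates`, the surviving row already carries the merged `Source` string.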
You can `.groupby('Name')` and run an aggregation on every group. Sort by `Description` length first, so that the `'first'` aggregation picks the row with the longest description in each group:

```python
df = (
    df.sort_values(by='Description', key=lambda col: col.str.len(), ascending=False)
      .groupby('Name', as_index=False)
      .agg({
          'Source': lambda x: ', '.join(sorted(x)),
          'Description': 'first',
          'Value': 'first',
      })
)
```
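Putting it together on the sample data (a runnable sketch; `df` and the column names are taken from the question):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['John', 'John', 'Bob', 'Alice', 'Alice', 'Alice'],
    'Source': ['A', 'B', 'B', 'Z', 'Y', 'X'],
    'Description': ['Text1', 'Longer text', 'Text2', 'Longer text',
                    'The Longest text', 'Text3'],
    'Value': [1, 4, 2, 5, 3, 6],
})

# Sort so the longest Description comes first within each Name,
# then aggregate: join the sorted Source values, and take the
# Description and Value from the first (longest-Description) row
result = (
    df.sort_values(by='Description', key=lambda col: col.str.len(), ascending=False)
      .groupby('Name', as_index=False)
      .agg({
          'Source': lambda x: ', '.join(sorted(x)),
          'Description': 'first',
          'Value': 'first',
      })
)
print(result)
```

Note that `groupby` sorts the group keys alphabetically by default, so the rows come out as Alice, Bob, John; pass `sort=False` if you need to preserve a different order.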