Here is my dataframe:

import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '',  'P'],
     "ia1": ['',  'X', 'X', '',  'X'],
     "ia2": ['X', '',  '',  'X', 'X']},
    index=[1, 2, 3, 4, 5])

I want to select the unique combinations of values in the first two columns. I do:

df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl'])

I get, as expected:

  mat ppl
4   A    
1   A   P
5   B   P

What I want now is for df3 to be:

 mat ppl ia1 ia2
   A           X
   A   P   X   X
   B   P   X   X

That is, in df3 the row for A+P has an X in column ia1 because at least one of the A+P rows in df has an X in column ia1.

1 Answer

A solution using groupby with agg and unique; if a group has several unique values, they are joined with a comma:

df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index()
print (df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

Explanation:

agg passes each column of each group to the aggregation function as a Series and expects a scalar back. The custom function first filters out the empty strings by boolean indexing (x[x != '']), then takes the unique values. join turns the result into a scalar: it still works when the filtered Series is empty (all values were empty strings), in which case it returns an empty string, and when there are several unique values it combines them into one string separated by commas.
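To make the explanation above concrete, here is a minimal sketch of the three steps applied by hand to one Series. The sample values (including the extra 'Y') and the variable names are made up for illustration; they are not taken from the question's data.

```python
import pandas as pd

# One column of one group, roughly as agg would hand it to the function
# (hypothetical values chosen to show the multi-value case)
x = pd.Series(['X', '', 'X', 'Y'])

non_empty = x[x != '']        # boolean indexing drops the empty strings
uniques = non_empty.unique()  # distinct values, in order of appearance
joined = ','.join(uniques)    # collapse them into one scalar string

# A group that contains only empty strings still gives a valid scalar
all_empty = pd.Series(['', ''])
empty_joined = ','.join(all_empty[all_empty != ''].unique())

print(repr(joined), repr(empty_joined))
```

Here joined should be 'X,Y' and empty_joined should be an empty string, which is why the all-empty group A+'' ends up with a blank ia1 cell.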

For testing, you can use a named function that does the same as the lambda function:

def f(x):
    # drop empty strings, keep the distinct values, join them into one string
    a = ','.join(x[x != ''].unique().tolist())
    return a

df = df.groupby(['mat','ppl']).agg(f).reset_index()
print(df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

As the OP mentioned in a comment:

Instead of using lambda x: ','.join(x[x != ''].unique()), I used lambda x: ','.join(set(x)-set([''])). I went from 13min 5s to 43.2 s
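A sketch of the set-based variant from that comment, run on the question's data. The names join_unique and df3 are mine, and the sorted() call is an addition that is not in the comment: a set has no guaranteed order, so without it the order of joined values could vary when a group has more than one distinct value.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '',  'P'],
     "ia1": ['',  'X', 'X', '',  'X'],
     "ia2": ['X', '',  '',  'X', 'X']},
    index=[1, 2, 3, 4, 5])

def join_unique(x):
    # set difference removes the empty string;
    # sorted() (my addition) makes the joined order deterministic
    return ','.join(sorted(set(x) - {''}))

df3 = df.groupby(['mat', 'ppl']).agg(join_unique).reset_index()
print(df3)
```

The timings quoted in the comment are the OP's own measurements on their data; this sketch only reproduces the logic, not the benchmark.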

  • Can you please explain the lambda x: ','.join(x[x != ''].unique()) ?
    – thdox
    Commented Apr 7, 2017 at 13:01
  • Please check answer.
    – jezrael
    Commented Apr 7, 2017 at 13:06
  • What I had not understood is that x represents all the columns to aggregate.
    – thdox
    Commented Apr 7, 2017 at 13:16
  • Hmm, I think that if no column is specified (as it is in df = df.groupby(['mat','ppl']).agg({'ia1':f}).reset_index() or df = df.groupby(['mat','ppl'])['ia1'].agg(f).reset_index()), then agg uses all the remaining columns and applies the aggregate function to each of them. Btw, thank you.
    – jezrael
    Commented Apr 7, 2017 at 13:19
  • Well, this is hugely slow on a dataframe with 100K rows and groupby on 10 columns + 4 columns to aggregate.
    – thdox
    Commented Apr 7, 2017 at 13:20
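
The column-selection point raised in the comments can be checked with a short sketch on the question's data. The variable names (g, only_ia1_dict, only_ia1_sel, all_cols) are placeholders of mine; f is the same function as in the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '',  'P'],
     "ia1": ['',  'X', 'X', '',  'X'],
     "ia2": ['X', '',  '',  'X', 'X']},
    index=[1, 2, 3, 4, 5])

def f(x):
    # x is one column of one group, passed in as a Series
    return ','.join(x[x != ''].unique())

g = df.groupby(['mat', 'ppl'])

# Restrict the aggregation to ia1, either with a dict or by selecting the column
only_ia1_dict = g.agg({'ia1': f}).reset_index()
only_ia1_sel = g['ia1'].agg(f).reset_index()

# With no column specified, every non-key column (ia1 and ia2) is aggregated
all_cols = g.agg(f).reset_index()

print(list(only_ia1_dict.columns), list(all_cols.columns))
```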
