Pandas dataframe: how to summarize columns containing value

Question

Here is my dataframe:

import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '',  'P'],
     "ia1": ['',  'X', 'X', '',  'X'],
     "ia2": ['X', '',  '',  'X', 'X']},
    index=[1, 2, 3, 4, 5])

I want to select the unique combinations of values in the first two columns. I do:

df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl'])

I get, as expected:

  mat ppl
4   A    
1   A   P
5   B   P

What I want now is for df3 to be:

 mat ppl ia1 ia2
   A           X
   A   P   X   X
   B   P   X   X

That is, in df3 the row A+P has an X in column ia1 because at least one of the A+P rows of df has an X in column ia1.

Actually, very close to question stackoverflow.com/questions/14246817/… — thdox, Commented Apr 7, 2017 at 13:15

jezrael · Accepted Answer · 2017-04-26 14:41:56Z

One solution uses groupby with agg and unique; if a group has several distinct non-empty values, they are joined with a comma:

df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index()
print (df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

Explanation:

agg passes each column of each group to the aggregation function as a Series and expects a scalar back. The custom function first filters out empty strings with boolean indexing (x[x != '']), then takes the unique values. join turns the result into a scalar: it returns an empty string when the filtered Series is empty (all values were empty strings), and it combines multiple unique values into one comma-separated string.
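The steps in the explanation above can be traced on a single Series. This is a minimal sketch; the sample values stand in for one group's ia1 column and are chosen only for illustration.

```python
import pandas as pd

# One group's column, similar to what agg passes to the function
# (sample values assumed for illustration)
x = pd.Series(['', 'X', 'X'], name='ia1')

non_empty = x[x != '']        # boolean indexing drops the empty strings
uniques = non_empty.unique()  # distinct remaining values
result = ','.join(uniques)    # scalar string for this group

# A group holding only empty strings yields '' instead of raising
all_empty = pd.Series(['', ''])
empty_result = ','.join(all_empty[all_empty != ''].unique())

print(repr(result), repr(empty_result))
```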

For testing, you can use a named function that is equivalent to the lambda:

def f(x):
    # same logic as the lambda: drop empty strings, keep unique values, join with a comma
    a = ','.join(x[x != ''].unique().tolist())
    return a

df = df.groupby(['mat','ppl']).agg(f).reset_index()
print (df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

As the OP mentioned in a comment:

Instead of using lambda x: ','.join(x[x != ''].unique()), I used lambda x: ','.join(set(x)-set([''])). I went from 13min 5s to 43.2 s
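The set-based variant from that comment can be sketched as below. The name join_unique and the sorted() call are additions made here, not part of the original comment: a set has no guaranteed order, so sorting keeps the joined string deterministic when a group has several distinct values. No timing is claimed for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '', 'P'],
     "ia1": ['', 'X', 'X', '', 'X'],
     "ia2": ['X', '', '', 'X', 'X']},
    index=[1, 2, 3, 4, 5])

def join_unique(x):
    # set difference removes the empty string; sorted() is an assumption
    # added here so the output order does not depend on set iteration
    return ','.join(sorted(set(x) - {''}))

out = df.groupby(['mat', 'ppl']).agg(join_unique).reset_index()
print(out)
```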

edited Apr 26, 2017 at 14:41

answered Apr 7, 2017 at 12:21


Can you please explain the lambda x: ','.join(x[x != ''].unique()) ?
– thdox
Commented Apr 7, 2017 at 13:01
Please check answer.
– jezrael
Commented Apr 7, 2017 at 13:06
What I had not understood is that x represents the columns to aggregate.
– thdox
Commented Apr 7, 2017 at 13:16
Hmmm, I think if no column is specified (unlike df = df.groupby(['mat','ppl']).agg({'ia1':f}).reset_index() or df = df.groupby(['mat','ppl'])['ia1'].agg(f).reset_index()), then agg uses all remaining columns and applies the aggregate function to each of them. Btw, thank you.
– jezrael
Commented Apr 7, 2017 at 13:19
Well, this is hugely slow on a dataframe with 100K rows and groupby on 10 columns + 4 columns to aggregate.
– thdox
Commented Apr 7, 2017 at 13:20
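
The two column-restricted forms mentioned in the comments can be compared side by side. This is a small sketch reusing the question's data; the names by_dict and by_col are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '', 'P'],
     "ia1": ['', 'X', 'X', '', 'X'],
     "ia2": ['X', '', '', 'X', 'X']},
    index=[1, 2, 3, 4, 5])

def f(x):
    # drop empty strings, keep unique values, join with a comma
    return ','.join(x[x != ''].unique())

# Restrict the aggregation to ia1 with a dict ...
by_dict = df.groupby(['mat', 'ppl']).agg({'ia1': f}).reset_index()
# ... or by selecting the column before calling agg
by_col = df.groupby(['mat', 'ppl'])['ia1'].agg(f).reset_index()

print(by_dict)
print(by_col)
```

In both cases only ia1 is aggregated, so ia2 does not appear in the result.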
