1

Consider the following dataframe example:

id  date        hrz tenor   1       2       3       4
AAA 16/03/2010  2   6m      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   6m      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   6m      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   6m      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   6m      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   6m      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   6m      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   6m      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   6m      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   6m      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   6m      0.13    0.32    0.39    0.95
AAA 16/03/2010  2   1y      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   1y      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   1y      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   1y      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   1y      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   1y      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   1y      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   1y      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   1y      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   1y      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   1y      0.13    0.32    0.39    0.95

How can I grouby the variables id, hrz and tenor and apply the following custom functions across the dates?

 def ks_test(x):
    return scipy.stats.kstest(np.sort(x), 'uniform')[0]

 def cvm_test(x):
    n = len(x)
    i = np.arange(1, n + 1)
    x = np.sort(x)
    w2 = (1 / (12 * n)) + np.sum((x - ((2 * i - 1) / (2 * n))) ** 2)
    return w2

The desired output is the following dataframe (figure results are just examples):

id   hrz    tenor   test        1       2       3       4
AAA  2      6m      ks_test     0.04    0.06    0.02    0.03
AAA  2      6m      cvm_test    0.09    0.17    0.03    0.05
AAA  2      1y      ks_test     0.04    0.06    0.02    0.03
AAA  2      1y      cvm_test    0.09    0.17    0.03    0.05

2 Answers 2

3

You can build a group, then apply your functions with groupby.agg and concat the outputs:

group = ['id', 'hrz', 'tenor']
cols = df.columns.difference(group+['date'])

g = df.groupby(group)[cols]

out = (pd.concat({'ks_test': g.agg(ks_test),
                  'cvm_test': g.agg(cvm_test),
                 }, names=['test'])
       .sort_index(level=group, kind='stable', sort_remaining=False)
       .reset_index()
      )

Output:

       test   id  hrz tenor         1         2         3         4
0   ks_test  AAA    2    1y  0.278182  0.166364  0.254545  0.224545
1  cvm_test  AAA    2    1y  0.220803  0.044730  0.158839  0.118321
2   ks_test  AAA    2    6m  0.278182  0.166364  0.254545  0.224545
3  cvm_test  AAA    2    6m  0.220803  0.044730  0.158839  0.118321

Alternatively, pass your functions and stack (in this case the name is defined by the name of the function) :

group = ['id', 'hrz', 'tenor']
cols = df.columns.difference(group+['date'])

out = (
    df.groupby(group)[cols]
    .agg([ks_test, cvm_test])
    .rename_axis([None, 'test'], axis=1)
    .stack()
    .reset_index()
)

Output:

    id  hrz tenor      test         1         2         3         4
0  AAA    2    1y   ks_test  0.278182  0.166364  0.254545  0.224545
1  AAA    2    1y  cvm_test  0.220803  0.044730  0.158839  0.118321
2  AAA    2    6m   ks_test  0.278182  0.166364  0.254545  0.224545
3  AAA    2    6m  cvm_test  0.220803  0.044730  0.158839  0.118321
Sign up to request clarification or add additional context in comments.

1 Comment

@mowzay and chance pls email me about this - email in profile
2

Use GroupBy.agg with DataFrame.stack for reshape last level of MultiIndex in columns:

cols = ['id','hrz', 'tenor']
out = (df.groupby(cols)[df.columns.difference(cols + ['date'], sort=False)]
        .agg([ks_test, cvm_test])
        .rename_axis([None, 'test'], axis=1)
        .stack(future_stack=True)
        .reset_index())

print (out)
    id  hrz tenor      test         1         2         3         4
0  AAA    2    1y   ks_test  0.278182  0.166364  0.254545  0.224545
1  AAA    2    1y  cvm_test  0.220803  0.044730  0.158839  0.118321
2  AAA    2    6m   ks_test  0.278182  0.166364  0.254545  0.224545
3  AAA    2    6m  cvm_test  0.220803  0.044730  0.158839  0.118321

How it working:

print (df.groupby(cols)[df.columns.difference(cols +['date'], sort=False)]
        .agg([ks_test, cvm_test]))

                      1                   2                  3            \
                ks_test  cvm_test   ks_test cvm_test   ks_test  cvm_test   
id  hrz tenor                                                              
AAA 2   1y     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   
        6m     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   

                      4            
                ks_test  cvm_test  
id  hrz tenor                      
AAA 2   1y     0.224545  0.118321  
        6m     0.224545  0.118321  

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.