How to aggregate values in a column between values of another column in pandas

Question

I have two dataframes that I want to combine. They look as follows:

df_1
unit   start_time   stop_time
A        0.0          1.2
B        1.3          4.1
A        4.2          4.5
B        4.6          7.2
A        7.3          8.0

df_2
time    other_data
0.2       .0122
0.4       .0128
0.6       .0101
0.8       .0091
1.0       .2122
1.2       .1542
1.4       .1546
1.6       .1522
1.8       .2542
2.0       .1557
2.2       .2542
2.4       .1543
2.6       .0121
2.8       .0111
3.0       .0412
3.2       .0214
3.4       .0155
3.6       .0159
3.8       .0154
4.0       .0155
4.2       .0211
4.4       .0265
4.6       .0146
4.8       .0112
5.0       .0166
5.2       .0101
5.4       .0132
5.6       .0112
5.8       .0121
6.0       .0142
6.2       .0124
6.4       .0111
6.6       .0123
6.8       .0111
6.0       .0119
6.2       .0112
6.4       .0131
6.6       .0117
6.8       .0172
7.0       .0123
7.2       .0127
7.4       .0121
7.6       .0110
7.8       .0120
8.0       .0121

I want to merge these data frames using the following criteria:

Step 1

I want to group all values from df_2.other_data where df_2.time is between df_1.start_time and df_1.stop_time. For example, for the first row of df_1, the following data from df_2 would be grouped:

time    other_data
0.2       .0122
0.4       .0128
0.6       .0101
0.8       .0091
1.0       .2122
1.2       .1542

Step 2

Within this group, I want to count the total number of observations where df_2.other_data is above a threshold, which in this case will be set to .0120. The total number of observations in this group above this threshold is 4. This is the value I want to merge onto df_1. The result should look like this:

unit   start_time   stop_time   other_data_above_threshold
A        0.0          1.2             4

The final dataframe should look like this:

unit   start_time   stop_time   other_data_above_threshold
A        0.0          1.2              4
B        1.3          4.1              13
A        4.2          4.5              3
B        4.6          7.2              11
A        7.3          8.0              4

moys · Accepted Answer · 2019-10-08 02:48:27Z

1

IIUC, this is what you need.

df['other_data_at'] = df.apply(lambda x: df2.loc[(df2['time']>= x['start_time']) & (df2['time']<= x['stop_time'])].loc[df2['other_data']>=0.012].count()[0], axis=1)

Output

   unit start_time  stop_time   other_data_at
0   A   0.0              1.2    4
1   B   1.3              4.1    13
2   A   4.2              4.5    2 #you expected output shows 3 but it should be 2
3   B   4.6              7.2    11
4   A   7.3              8.0    3

edited Oct 8, 2019 at 2:48

answered Oct 8, 2019 at 2:41

moys

8,0853 gold badges17 silver badges50 bronze badges

Add a comment |

DavidE · Accepted Answer · 2019-10-08 02:20:33Z

0

Hi I would try to loop over your df1 and use it's values for df2

That would look a little like this:

def my_counting(df1, df2, threshold):
  count_list = ()
  for index,unit in enumerate(df['unit']):
    df = df2[(df2['time'] >= df1['start_time'][index]) & (df2['time'] < df1['stop_time'][index])]
    count_list.append(df[df['other_data'] <= threshold].shape[0])

  df1['other_data_above_threshold'] = count_list
  return df1

print(my_counting(df1, df2, 0.012)

answered Oct 8, 2019 at 2:20

DavidE

914 bronze badges

Add a comment |

Andy L. · Accepted Answer · 2019-10-08 03:33:54Z

You may try use pd.cut

a = df_1.start_time.to_list() + [np.inf]
s = pd.cut(df_2.time, bins=a, labels=df_1.index, right=False)
df_1['other_data_above_threshold'] = df_2.other_data.gt(0.012).groupby(s).sum()

Out[213]:
  unit  start_time  stop_time  other_data_above_threshold
0    A         0.0        1.2                         4.0
1    B         1.3        4.1                        13.0
2    A         4.2        4.5                         2.0
3    B         4.6        7.2                        11.0
4    A         7.3        8.0                         2.0

Collectives™ on Stack Overflow

How to aggregate values in a column between values of another column in pandas

Step 1

Step 2

3 Answers 3

Hot Network Questions

Collectives™ on Stack Overflow

Step 1

Step 2

3 Answers 3

Related