
I'm relatively new to pandas and I don't know the best approach to my problem. I have a df with an index, the data in a column called 'Data', and an empty column called 'Sum'.

I need help creating a function that writes the sum of each variable-length group of rows from the 'Data' column into the 'Sum' column. The grouping criterion is that a group contains no empty rows.

Here is an example:

index  Data Sum
0       1   
1       1   2
2       
3       
4       1   
5       1   
6       1   3
7       
8       1   
9       1   2
10      
11      1   
12      1   
13      1   
14      1   
15      1   5   
16  
17      1   1
18  
19      1   1
20  

As you can see, the length of each group of data in 'Data' is variable: it could be a single row or any number of rows. The sum must always go at the last row of the group. For example, the sum of the group of rows 4, 5 and 6 of the 'Data' column should be placed at row 6 in the 'Sum' column.

Any insight will be appreciated.

UPDATE

The problem was solved by implementing Method 3 suggested by ansev. However, due to a change in the main program, the sum of each block now needs to be at the beginning of each one (when the block has more than one row). I therefore used the df = df.iloc[::-1] instruction twice, to reverse the column and then restore the original order. Thank you very much!

df = df.iloc[::-1]
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['Sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
df = df.iloc[::-1]

print(df)

Data  Sum
0    1.0  2.0
1    1.0  NaN
2    NaN  NaN
3    NaN  NaN
4    1.0  3.0
5    1.0  NaN
6    1.0  NaN
7    NaN  NaN
8    1.0  2.0
9    1.0  NaN
10   NaN  NaN
11   1.0  5.0
12   1.0  NaN
13   1.0  NaN
14   1.0  NaN
15   1.0  NaN
16   NaN  NaN
17   1.0  1.0
18   NaN  NaN
19   1.0  1.0
20   NaN  NaN
  • 1
    Have you made any inroads into the code?
    – greenPlant
    Commented Jun 27, 2020 at 21:43
  • You'd want to start by making a column that explicitly states which group each row is in. Then you can use df.groupby('group').sum() to add up the Data in each group and then join it back into the dataframe using df = df.join( sum, on='group'). Commented Jun 27, 2020 at 21:49
  • You mean NaN columns, right? There are no "empty" columns in pandas?
    – DYZ
    Commented Jun 27, 2020 at 22:26

3 Answers


We can use GroupBy.cumsum:

# if you need replace blanks
#df = df.replace(r'^\s*$', np.nan, regex=True)
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
    index  Data  sum
0       0   1.0  NaN
1       1   1.0  2.0
2       2   NaN  NaN
3       3   NaN  NaN
4       4   1.0  NaN
5       5   1.0  NaN
6       6   1.0  3.0
7       7   NaN  NaN
8       8   1.0  NaN
9       9   1.0  2.0
10     10   NaN  NaN
11     11   1.0  NaN
12     12   1.0  NaN
13     13   1.0  NaN
14     14   1.0  NaN
15     15   1.0  5.0
16     16   NaN  NaN
17     17   1.0  1.0
18     18   NaN  NaN
19     19   1.0  1.0
20     20   NaN  NaN

Method 2

#df = df.drop(columns='index') # if necessary
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))

Method 3

Series.duplicated and Series.mask

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)

As you can see, the methods differ only in how they mask the values we don't need in the sum column.

We can also use .transform('sum') instead of .cumsum().
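For instance, Method 3 with .transform('sum') would look like this (a sketch on a shortened version of the sample data; as ansev notes in the comments, a plain 'sum' would turn all-NaN blocks into 0, so min_count=1 is used via a lambda):

```python
import numpy as np
import pandas as pd

# shortened sample: NaN marks the blank rows
df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan]})

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
# transform broadcasts each block's total to every row of the block;
# the mask then keeps it only on the block's last row, as with cumsum()
df['sum'] = (df.groupby(blocks)['Data']
               .transform(lambda x: x.sum(min_count=1))
               .mask(m))
print(df)
```

Here the totals 2.0 and 3.0 land on rows 1 and 6, the last rows of the two blocks, and the all-NaN groups stay NaN instead of becoming 0.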

Performance with the sample dataframe:

%%timeit
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
4.52 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
8.52 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
3.02 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Hi, what's going on this part . .where((~s) & (s.shift(-1)))? What is it accomplishing? Commented Jun 27, 2020 at 22:36
  • 1
    This sets to NaN the values of the Sum column that are not in the last cell of the group. You can check pandas.pydata.org/pandas-docs/stable/reference/api/… and pandas.pydata.org/pandas-docs/stable/reference/api/…
    – ansev
    Commented Jun 27, 2020 at 22:38
  • @ansev Would there be a way to use df.groupby.last() to somehow answer this question after calculating cumsums. I am attempting it, but my Pandas skills are weak.
    – Moondra
    Commented Jun 27, 2020 at 23:00
  • Nice! Which is fastest? Commented Jun 27, 2020 at 23:03
  • 1
    you're right! It is not necessary. If we used .transform('sum'), that would be necessary unless we specified min_count=1: .transform(lambda x: x.sum(min_count=1)). On the other hand, I have noticed that this method fails when there are no missing values! I would use Method 3!
    – ansev
    Commented Jun 27, 2020 at 23:53
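To make the masking in Method 1 concrete (a small sketch, not part of the original answer): s flags the blank rows, and (~s) & (s.shift(-1)) is True only on a non-blank row whose next row is blank, i.e. on the last row of each block:

```python
import numpy as np
import pandas as pd

# tiny frame: NaN marks the blank rows
df = pd.DataFrame({'Data': [1, 1, np.nan, 1, np.nan]})

s = df['Data'].isnull()
# ~s: this row is not blank; s.shift(-1): the next row is blank
# -> the condition is True exactly on the last row of each block
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
```

The running sums 2.0 and 1.0 survive only on rows 1 and 3; everywhere else .where replaces them with NaN.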

Code Used for replication

import numpy as np
import pandas as pd

data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}

df = pd.DataFrame(data)

Iterative Approach Solution

count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:            # still inside a block of data
        count += 1
    elif i != 0 and count != 0:       # blank row: the block just ended
        df.at[i - 1, 'Sum'] = count   # write the sum at the block's last row
        count = 0
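One caveat worth noting (my observation, not part of the original answer): the loop only writes a sum when it reaches a blank row, so a final block that runs to the end of the frame is never recorded. A sketch that flushes it after the loop:

```python
import numpy as np
import pandas as pd

# assumption: the frame ends inside a block (no trailing blank row)
df = pd.DataFrame({'Data': [1, 1, np.nan, 1, 1, 1]})
df['Sum'] = np.nan

count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:
        count += 1
    elif i != 0 and count != 0:
        df.at[i - 1, 'Sum'] = count
        count = 0
# flush the last block if the frame does not end with a blank row
if count != 0:
    df.at[df.shape[0] - 1, 'Sum'] = count
print(df)
```

Without the final check, the sum 3 for rows 3-5 would be silently dropped.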
  • df.shape[0] is always better than len(df).
    – DYZ
    Commented Jun 27, 2020 at 22:28
  • df.shape[0] is always better than len(df) - is this because of the speed or something else?
    – sammywemmy
    Commented Jun 27, 2020 at 23:28

Create a new column that equals the index at the data gaps and is undefined otherwise:

df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)

Fill the column backward so each row carries the label of the gap that ends its span, then count the 'Data' values in each identically labeled span (the count must be taken on the 'Data' column, so a Series rather than a DataFrame is assigned back):

df.Sum = df.groupby(df.Sum.bfill())['Data'].count()

Align the new column with the original data:

df.Sum = df.Sum.shift(-1)

Eliminate 0-length spans:

df.loc[df.Sum == 0, 'Sum'] = np.nan
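Put together, the steps above run end to end like this (a sketch on a shortened frame, assuming NaN marks the blank rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan]})

# index at the gaps, NaN elsewhere
df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
# backward-fill so each row carries the index of the gap that ends its span,
# then count the 'Data' values in each span
df.Sum = df.groupby(df.Sum.bfill())['Data'].count()
# align: the count for the span ending at gap k belongs at row k - 1
df.Sum = df.Sum.shift(-1)
# a gap directly after another gap yields a 0-length span
df.loc[df.Sum == 0, 'Sum'] = np.nan
print(df)
```

On this frame the counts 2.0 and 3.0 end up at rows 1 and 6, the last rows of the two blocks, and the 0 produced by the back-to-back gaps at rows 2-3 is cleared to NaN.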
