
I'm relatively new to pandas and I don't know the best approach to my problem. I have a df with an index, the data in a column called 'Data', and an empty column called 'Sum'.

I need help creating a function that writes the sum of each variable-length group of rows from the 'Data' column into the 'Sum' column. The grouping criterion is that a group contains no empty rows.

Here is an example:

index  Data Sum
0       1   
1       1   2
2       
3       
4       1   
5       1   
6       1   3
7       
8       1   
9       1   2
10      
11      1   
12      1   
13      1   
14      1   
15      1   5   
16  
17      1   1
18  
19      1   1
20  

As you can see, the length of each group of data in 'Data' is variable: it could be a single row or any number of rows. The sum must always go at the last row of the group. For example, the sum of the group of rows 4, 5 and 6 of the 'Data' column should be placed at row 6 in the 'Sum' column.

Any insight will be appreciated.

UPDATE

The problem was solved by implementing Method 3 suggested by ansev. However, due to a change in the main program, the sum of each block now needs to be at the beginning of each one (when the block has more than one row). I therefore used the df = df.iloc[::-1] instruction twice, to reverse the column and then restore the original order. Thank you very much!

df = df.iloc[::-1]
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['Sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
df = df.iloc[::-1]

print(df)

Data  Sum
0    1.0  2.0
1    1.0  NaN
2    NaN  NaN
3    NaN  NaN
4    1.0  3.0
5    1.0  NaN
6    1.0  NaN
7    NaN  NaN
8    1.0  2.0
9    1.0  NaN
10   NaN  NaN
11   1.0  5.0
12   1.0  NaN
13   1.0  NaN
14   1.0  NaN
15   1.0  NaN
16   NaN  NaN
17   1.0  1.0
18   NaN  NaN
19   1.0  1.0
20   NaN  NaN
  • 1
    Have you made any inroads into the code?
    – greenPlant
    Commented Jun 27, 2020 at 21:43
  • You'd want to start by making a column that explicitly states which group each row is in. Then you can use df.groupby('group').sum() to add up the Data in each group and then join it back into the dataframe using df = df.join( sum, on='group'). Commented Jun 27, 2020 at 21:49
  • You mean NaN columns, right? There are no "empty" columns in pandas?
    – DYZ
    Commented Jun 27, 2020 at 22:26

3 Answers


We can use GroupBy.cumsum:

# if you need replace blanks
#df = df.replace(r'^\s*$', np.nan, regex=True)
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
    index  Data  sum
0       0   1.0  NaN
1       1   1.0  2.0
2       2   NaN  NaN
3       3   NaN  NaN
4       4   1.0  NaN
5       5   1.0  NaN
6       6   1.0  3.0
7       7   NaN  NaN
8       8   1.0  NaN
9       9   1.0  2.0
10     10   NaN  NaN
11     11   1.0  NaN
12     12   1.0  NaN
13     13   1.0  NaN
14     14   1.0  NaN
15     15   1.0  5.0
16     16   NaN  NaN
17     17   1.0  1.0
18     18   NaN  NaN
19     19   1.0  1.0
20     20   NaN  NaN

Method 2

#df = df.drop(columns='index') # if necessary
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))

Method 3

Series.duplicated and Series.mask

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)

As you can see, the methods differ only in how they mask the values we don't need in the sum column.

We can also use .transform('sum') instead of .cumsum().
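For instance, Method 3 with .transform('sum') would look like this (a sketch on a shortened version of the sample data; as ansev notes in the comments, a plain 'sum' would turn all-NaN blocks into 0, so min_count=1 is used via a lambda):

```python
import numpy as np
import pandas as pd

# shortened sample: NaN marks the blank rows
df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan]})

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
# transform broadcasts each block's total to every row of the block;
# the mask then keeps it only on the block's last row, as with cumsum()
df['sum'] = (df.groupby(blocks)['Data']
               .transform(lambda x: x.sum(min_count=1))
               .mask(m))
print(df)
```

Here the totals 2.0 and 3.0 land on rows 1 and 6, the last rows of the two blocks, and the all-NaN groups stay NaN instead of becoming 0.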

Performance with the sample dataframe:

%%timeit
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
4.52 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
8.52 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
3.02 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Hi, what's going on this part . .where((~s) & (s.shift(-1)))? What is it accomplishing? Commented Jun 27, 2020 at 22:36
  • 1
    This sets to NaN the values of the Sum column that are not in the last cell of the group. You can check pandas.pydata.org/pandas-docs/stable/reference/api/… and pandas.pydata.org/pandas-docs/stable/reference/api/…
    – ansev
    Commented Jun 27, 2020 at 22:38
  • @ansev Would there be a way to use df.groupby.last() to somehow answer this question after calculating cumsums. I am attempting it, but my Pandas skills are weak.
    – Moondra
    Commented Jun 27, 2020 at 23:00
  • Nice! Which is fastest? Commented Jun 27, 2020 at 23:03
  • 1
    you're right! It is not necessary. If we used .transform('sum'), that would be necessary unless we specified min_count=1: .transform(lambda x: x.sum(min_count=1)). On the other hand, I have noticed that this method fails when there are no missing values! I would use Method 3!
    – ansev
    Commented Jun 27, 2020 at 23:53
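To make the masking in Method 1 concrete (a small sketch, not part of the original answer): s flags the blank rows, and (~s) & (s.shift(-1)) is True only on a non-blank row whose next row is blank, i.e. on the last row of each block:

```python
import numpy as np
import pandas as pd

# tiny frame: NaN marks the blank rows
df = pd.DataFrame({'Data': [1, 1, np.nan, 1, np.nan]})

s = df['Data'].isnull()
# ~s: this row is not blank; s.shift(-1): the next row is blank
# -> the condition is True exactly on the last row of each block
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
```

The running sums 2.0 and 1.0 survive only on rows 1 and 3; everywhere else .where replaces them with NaN.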

Code Used for replication

import numpy as np
import pandas as pd

data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}

df = pd.DataFrame(data)

Iterative Approach Solution

count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:            # still inside a block of data
        count += 1
    elif i != 0 and count != 0:       # blank row: the block just ended
        df.at[i - 1, 'Sum'] = count   # write the sum at the block's last row
        count = 0
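One caveat worth noting (my observation, not part of the original answer): the loop only writes a sum when it reaches a blank row, so a final block that runs to the end of the frame is never recorded. A sketch that flushes it after the loop:

```python
import numpy as np
import pandas as pd

# assumption: the frame ends inside a block (no trailing blank row)
df = pd.DataFrame({'Data': [1, 1, np.nan, 1, 1, 1]})
df['Sum'] = np.nan

count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:
        count += 1
    elif i != 0 and count != 0:
        df.at[i - 1, 'Sum'] = count
        count = 0
# flush the last block if the frame does not end with a blank row
if count != 0:
    df.at[df.shape[0] - 1, 'Sum'] = count
print(df)
```

Without the final check, the sum 3 for rows 3-5 would be silently dropped.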
  • df.shape[0] is always better than len(df).
    – DYZ
    Commented Jun 27, 2020 at 22:28
  • df.shape[0] is always better than len(df) - is this because of the speed or something else?
    – sammywemmy
    Commented Jun 27, 2020 at 23:28

Create a new column that equals the index at the data gaps and is undefined otherwise:

df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)

Fill the column backward so each row carries the label of the gap that ends its span, then count the 'Data' values in each identically labeled span (the count must be taken on the 'Data' column, so a Series rather than a DataFrame is assigned back):

df.Sum = df.groupby(df.Sum.bfill())['Data'].count()

Align the new column with the original data:

df.Sum = df.Sum.shift(-1)

Eliminate 0-length spans:

df.loc[df.Sum == 0, 'Sum'] = np.nan
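Put together, the steps above run end to end like this (a sketch on a shortened frame, assuming NaN marks the blank rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan]})

# index at the gaps, NaN elsewhere
df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
# backward-fill so each row carries the index of the gap that ends its span,
# then count the 'Data' values in each span
df.Sum = df.groupby(df.Sum.bfill())['Data'].count()
# align: the count for the span ending at gap k belongs at row k - 1
df.Sum = df.Sum.shift(-1)
# a gap directly after another gap yields a 0-length span
df.loc[df.Sum == 0, 'Sum'] = np.nan
print(df)
```

On this frame the counts 2.0 and 3.0 end up at rows 1 and 6, the last rows of the two blocks, and the 0 produced by the back-to-back gaps at rows 2-3 is cleared to NaN.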
