
I have the following dataframe:

Index_Date     A    B    C    D
================================
2015-01-31    10   10  NaN   10
2015-02-01     2    3  NaN   22
2015-02-02    10   60  NaN  280
2015-02-03    10  100  NaN  250

Required:

Index_Date    A   B    C     D
================================
2015-01-31    10   10    10   10
2015-02-01     2    3    23   22
2015-02-02    10   60   290  280
2015-02-03    10  100  3000  250

Column C for 2015-01-31 is derived by taking the value of D for that date.

For each subsequent row, C is the previous row's C multiplied by that row's A, plus that row's B; that is, C[i] = C[i-1] * A[i] + B[i], with C[0] = D[0]. For example, C for 2015-02-01 is 10 * 2 + 3 = 23.

I have attempted an apply and a shift inside an if/else, but this gives a KeyError.

  • This is a good question. I have a similar need for a vectorized solution. It would be nice if pandas provided a version of apply() where the user's function could access one or more values from the previous row as part of its calculation, or at least return a value that is then passed 'to itself' on the next iteration. Wouldn't this allow some efficiency gains compared to a for loop?
    – Bill
    Commented Oct 22, 2018 at 19:41
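
In the meantime, this kind of stateful scan can be written with itertools.accumulate, which threads the running value through the rows much as the comment describes. It is still a Python-level loop under the hood, not true vectorization. A minimal sketch, assuming the question's df with columns A, B and D (the initial keyword needs Python 3.8+):

from itertools import accumulate, islice

# C[0] = D[0]; thereafter C[i] = C[i-1] * A[i] + B[i]
rest = islice(zip(df['A'], df['B']), 1, None)   # rows 1..n-1; row 0 is seeded from D
df['C'] = list(accumulate(rest,
                          lambda prev, ab: prev * ab[0] + ab[1],
                          initial=df['D'].iloc[0]))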

7 Answers


First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill in the calculated values (note the .loc lookups assume a default RangeIndex; reset the index first if Index_Date is the index):

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A    B     C    D
0 2015-01-31  10   10    10   10
1 2015-02-01   2    3    23   22
2 2015-02-02  10   60   290  280
3 2015-02-03  10  100  3000  250
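
If the loop itself is too slow, the same idea with positional .iat access usually cuts the per-lookup overhead. A sketch, assuming a default RangeIndex and that column C already exists (e.g. as NaN):

# look up the column positions once, outside the loop
a, b, c, d = (df.columns.get_loc(col) for col in ['A', 'B', 'C', 'D'])

df.iat[0, c] = df.iat[0, d]  # seed C[0] from D[0]
for i in range(1, len(df)):
    df.iat[i, c] = df.iat[i-1, c] * df.iat[i, a] + df.iat[i, b]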
  • Is there a function in pandas to do this without the loop? Commented Jan 18, 2016 at 15:03
  • The iterative nature of the calculation, where the inputs depend on the results of previous steps, complicates vectorization. You could perhaps use apply with a function that does the same calculation as the loop, but behind the scenes this would also be a loop. pandas.pydata.org/pandas-docs/version/0.17.1/generated/…
    – Stefan
    Commented Jan 18, 2016 at 15:13
  • If I use this loop and calculate on a merged dataframe, and it finds a NaN, it works, but only up to the row with the NaN. No errors are thrown. If I try a fillna I get AttributeError: 'numpy.float64' object has no attribute 'fillna'. Is there any way to skip the row with NaN or set the values to zero? Commented Jan 18, 2016 at 16:04
  • Do you mean missing values in columns other than C?
    – Stefan
    Commented Jan 18, 2016 at 16:08
  • Yes, your solution is fine. I just make sure I fill the NaNs in the dataframe before the loop. Commented Jan 18, 2016 at 16:53
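
For reference, a one-liner along those lines, assuming zeros are an acceptable fill for the input columns:

# replace NaNs in the input columns before running the loop
df[['A', 'B', 'D']] = df[['A', 'B', 'D']].fillna(0)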

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

You can fill the missing value with the fill_value parameter:

df['Change'] = df.A - df.A.shift(1, fill_value=df.A[0])  # fills the gap left by the shift with df.A[0], i.e. 100
df

    A   Change
0   100 0.0
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0
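
As an aside, pandas also provides diff(), which computes the same shifted difference in one call:

df['Change'] = df['A'].diff()            # equivalent to df.A - df.A.shift(1); first row is NaN
df['Change'] = df['A'].diff().fillna(0)  # or fill the first row with 0 afterwards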
  • This won't help in this situation because the value from the previous row is not known at the beginning. It has to be computed on each iteration and then used in the next iteration.
    – Bill
    Commented Oct 22, 2018 at 19:27
  • I am still grateful for this answer because I stumbled across it while looking for a case where I do know the value from the previous row. So thanks @kztd Commented Apr 5, 2020 at 22:33
  • Exactly what I was looking for. This also works faster because it uses array operations instead of looping, as suggested in other answers.
    – Dimanjan
    Commented Feb 2, 2022 at 18:23
  • shift is definitely the way to go. Use the fill_value parameter to provide a default value for that first row.
    – maccaroo
    Commented Aug 2, 2022 at 0:32

numba

For recursive calculations which are not vectorisable, numba, which JIT-compiles the code and works with lower-level objects, often yields large performance improvements. You need only define a regular for loop and apply the decorator @njit or (for older versions) @jit(nopython=True).

For a reasonably sized dataframe, this gives a ~30x performance improvement over a regular for loop:

import numpy as np
from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    # res[0] = d[0]; thereafter res[i] = res[i-1] * a[i] + b[i]
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop
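
For reference, the same function with the @njit shortcut mentioned above (available in newer numba versions); the body is unchanged:

import numpy as np
from numba import njit

@njit  # equivalent to @jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res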
  • It is wonderful! I have accelerated my function, which computes values from previous values. Thanks! Commented Apr 13, 2020 at 16:28
  • How can I use @jit(nopython=True) in a Jupyter notebook?
    – sergzemsk
    Commented Jan 6, 2021 at 14:06
  • @sergzemsk, just as you've written it (and in my answer); it's called a decorator. Note that later versions of numba support the shortcut @njit.
    – jpp
    Commented Jan 6, 2021 at 14:10
  • @jpp I have an if condition, so this improvement failed. I got the error "TypingError: Failed in nopython mode pipeline (step: nopython frontend)".
    – sergzemsk
    Commented Jan 6, 2021 at 14:29
  • @sergzemsk, I suggest you ask a new question; it's not clear to me where the if statement sits or why it's not being compiled by numba.
    – jpp
    Commented Jan 6, 2021 at 14:30

Applying the recursive function on numpy arrays will be faster than the current answer.

import numpy as np
import pandas as pd

# note: arange(1, 6) and reshape(5, 3), so the code matches the output below
df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3),
                  columns=['A', 'B', 'D'])

new = [df.D.values[0]]                      # C[0] = D[0]
for i in range(1, len(df.index)):
    new.append(new[i-1] * df.A.values[i] + df.B.values[i])
df['C'] = new

Output

   A  B  D    C
0  1  1  1    1
1  2  2  2    4
2  3  3  3   15
3  4  4  4   64
4  5  5  5  325
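
A small refinement on the same approach: accessing .values inside the loop pays the Series-to-array overhead on every iteration, so binding the arrays once keeps the logic identical but cheaper. A sketch under the same setup:

# bind the underlying numpy arrays once instead of touching .values per iteration
a, b = df.A.values, df.B.values
new = [df.D.values[0]]
for i in range(1, len(df)):
    new.append(new[-1] * a[i] + b[i])
df['C'] = new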
  • This answer works perfectly for me with a similar calculation. I tried using a combination of cumsum and shift but this solution works much better. Thanks.
    – Simon
    Commented Apr 16, 2017 at 20:06

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    # keep the running result in a global so apply() can see the previous row's C
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']   # seed: C[0] = D[0]
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use apply from pandas together with a global variable that keeps track of the previously calculated value.


Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the apply version runs in about 0.57x the time of the for loop, i.e. it is roughly 1.75x faster on average.


It's an old question, but the solution below (without a for loop) might be helpful:

def new_fun(df):
    # assumes C in the first row has already been seeded (e.g. from D)
    prev_value = df.iloc[0]["C"]

    def func2(row):
        # nonlocal ==> reuses prev_value from the enclosing new_fun scope
        nonlocal prev_value
        new_value = prev_value * row['A'] + row['B']
        prev_value = new_value            # carry the result to the next row
        return new_value

    # positional column assignment avoids the SettingWithCopyWarning that
    # chained indexing (df.iloc[1:]["C"] = ...) would raise
    df.iloc[1:, df.columns.get_loc("C")] = df.iloc[1:].apply(func2, axis=1)
    return df

df = new_fun(df)
  • This makes some assumptions about .apply that may not be true: if .apply is parallelized, or called in anything other than the order you expect, the results will not be as expected.
    – feetwet
    Commented Feb 14, 2021 at 1:11
  • I agree with your concerns. The assumptions in this answer are based on the question of this thread. Also, apply isn't parallelized by default ...
    – Wazaa
    Commented Feb 15, 2021 at 9:04

In general, the key to avoiding an explicit loop would be to join (merge) two instances of the dataframe on rowindex-1 == rowindex.

Then you would have a wide dataframe containing the values of rows r and r-1, on which you could run a df.apply() function, as sketched below.

However, the overhead of creating the large dataset may offset the benefits of parallel processing...
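
A minimal sketch of that idea using shift, which plays the role of the merge on rowindex-1 == rowindex. Note it only helps when the formula depends on the previous row's inputs; a recursively defined column like C in the question still needs one of the loop-based approaches above:

# pair each row with the previous row's values via a shifted copy
prev = df.shift(1).add_suffix('_prev')
wide = pd.concat([df, prev], axis=1)

# non-recursive row-wise formulas can now see both r and r-1, e.g.:
wide['A_change'] = wide['A'] - wide['A_prev']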
