Speeding up filtering function in Pandas

Question

I have a CSV file with 400 000 rows and the following headers:

header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER']

The aim of my function is for every row to take all the previous, filter them by items in the current row and return some statistic.

This is my script so far:

import pandas as pd

filepath = 'data.csv'
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] # Add appropriate headers
df = pd.read_csv(filepath, sep=',', na_values=['', '-'], parse_dates=True, header=None, names=header_names, skiprows=1, nrows=1000)

def mid_func(x):
    global mid
    mid += 1
    return mid

mid = -1
df.insert(0, 'MID', df.apply(mid_func, axis=1))

new_df = df.copy()

def home_1_simple_filter(x):
    mid_stop = x[0] - 1
    home = x[4]
    odd_1 = x[5]
    start = time.time()
    filtered = df[(df['HOME'] == home) & (df['1'] == odd_1)].ix[:mid_stop]['FTR']
    stop = time.time() - start
    print round(stop*1000.,2), 'ms', home, odd_1, mid_stop
    return filtered

start = time.time()
new_df['HOME_1'] = df.apply(home_1_simple_filter, axis=1)
stop = time.time() - start
print stop

The mid_func is to help me take the previous row. The whole process takes 3 seconds for the first 1000, and 0.002 seconds on average.

0.002 seconds per row doesn't seem like much. Do you have some target time in mind? Have you profiled the code to see where the time goes (I would guess that reading in the CSV will be a big chunk of it, which you can't speed up by altering your filter)? — jonrsharpe
– jonrsharpe, Commented Aug 16, 2014 at 10:41
There will be 60 filter operations like the one I mentioned, for each one of the 400 000 rows, so the overall time needed would be in hours. — IordanouGiannis
– IordanouGiannis, Commented Aug 16, 2014 at 10:47

Reinderien · Accepted Answer · 2025-01-12 19:25:36Z

Well, the code doesn't run, and you haven't shown any example input/output. Lest this be lost to obscurity, I will review what I can and make some wild guesses.

You're fully ignoring the original CSV's header names and overwriting with your own. That's almost always a bad idea; use .rename() instead.

mid_func needs to go away. If you needed the effect of a new integral index, then use an Index from Pandas or an np.arange. But the way it's used betrays a deep misunderstanding about this operation and how it should actually be done with the Pandas API: a call to .expanding().

Don't x[4], etc. That's a row Series for which you should be using column name strings.

.ix() is deprecated and needs to be replaced one way or the other.

I think much of this can be replaced by

df.groupby(['HOME', '1'])['FTR'].expanding()

but again, without data samples, it's impossible to say for sure.

Stack Exchange Network

Speeding up filtering function in Pandas

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Speeding up filtering function in Pandas

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions