I have a CSV file with 400 000 rows and the following headers:
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER']
The aim of my function is, for every row, to take all the previous rows, filter them by values in the current row, and return some statistic.
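To make the aim concrete, here is a minimal sketch on made-up toy data. The frame `toy` and the helper `previous_matches` are hypothetical names used only for illustration, and only the columns needed by the filter are included. For each row it collects the `FTR` values of earlier rows that share the same `HOME` and `1` values.

```python
import pandas as pd

# Made-up toy data (an assumption for illustration, not taken from data.csv).
toy = pd.DataFrame({
    'HOME': ['A', 'B', 'A', 'A', 'B'],
    '1':    [1.5, 2.0, 1.5, 1.8, 2.0],
    'FTR':  ['H', 'D', 'A', 'H', 'H'],
})


def previous_matches(frame, pos):
    """Return FTR of the rows before position `pos` that share HOME and '1' with it."""
    current = frame.iloc[pos]
    earlier = frame.iloc[:pos]  # strictly the rows before the current one
    mask = (earlier['HOME'] == current['HOME']) & (earlier['1'] == current['1'])
    return earlier.loc[mask, 'FTR']


# One list of earlier results per row; a statistic could be computed from each.
history = [previous_matches(toy, i).tolist() for i in range(len(toy))]
print(history)
```

Rows with no earlier match give an empty list, so any statistic computed from the result has to handle the empty case.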
This is my script so far:
import time

import pandas as pd

filepath = 'data.csv'
header_names = ['LEAGUE', 'YEAR', 'DATE', 'HOME', '1', 'X', '2', 'AWAY', 'SCORE', 'SCORE_1', 'SCORE_2', 'FTR', 'FAVORITE', 'UNDER-OVER'] # Add appropriate headers
df = pd.read_csv(filepath, sep=',', na_values=['', '-'], parse_dates=True, header=None, names=header_names, skiprows=1, nrows=1000)

def mid_func(x):
    # Running row counter, stored in the MID column
    global mid
    mid += 1
    return mid

mid = -1
df.insert(0, 'MID', df.apply(mid_func, axis=1))
new_df = df.copy()

def home_1_simple_filter(x):
    mid_stop = x['MID'] - 1  # last row to include: the one just before the current row
    home = x['HOME']
    odd_1 = x['1']
    start = time.time()
    # Rows with the same HOME and '1' values, restricted to rows up to mid_stop
    filtered = df[(df['HOME'] == home) & (df['1'] == odd_1)].loc[:mid_stop]['FTR']
    stop = time.time() - start
    print(round(stop * 1000., 2), 'ms', home, odd_1, mid_stop)
    return filtered

start = time.time()
new_df['HOME_1'] = df.apply(home_1_simple_filter, axis=1)
stop = time.time() - start
print(stop)
The mid_func gives every row a running number (MID) so that I can select the rows before the current one. The whole process takes 3 seconds for the first 1000 rows, with each row's filter taking 0.002 seconds on average.
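Since MID is just a running row number, the same column could presumably be built without a global counter. A minimal sketch, assuming a made-up toy frame (`toy` and `current_mid` are hypothetical names for illustration):

```python
import pandas as pd

# Made-up toy data (an assumption for illustration, not taken from data.csv).
toy = pd.DataFrame({'HOME': ['A', 'B', 'A'], 'FTR': ['H', 'D', 'A']})

# Insert a 0..n-1 row counter as the first column, without a global variable.
toy.insert(0, 'MID', list(range(len(toy))))

# "All previous rows" for a given row are then the rows with a smaller MID.
current_mid = 2
previous_rows = toy[toy['MID'] < current_mid]
print(previous_rows)
```

This only covers how the counter is built; the per-row filtering itself is unchanged.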