Following up on my previous R post, All permutations of a linear model given all predictors, I have attempted a Python version. However, one issue with my version is that I cannot select multiple predictors unless I use 'all'.
For example, if I wanted:
perModel(savings, response='sr', predictors=['pop15', 'pop75'])
It throws an error, so I still need to work on supporting lists. However, the following calls work:
perModel(savings, response='sr', predictors='all')
perModel(savings, response='sr', predictors='pop15')
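A minimal sketch of one way to accept either a single name or a list of names. The helpers `as_list` and `build_formula` are hypothetical and not part of my function; they only illustrate normalizing the argument before the formula is built.

```python
def as_list(predictors):
    """Return predictors as a list, whether given one name or several."""
    if isinstance(predictors, str):
        return [predictors]
    return list(predictors)


def build_formula(response, predictors):
    """Join the response and predictor names into a formula string."""
    return response + " ~ " + " + ".join(as_list(predictors))


print(build_formula("sr", "pop15"))             # sr ~ pop15
print(build_formula("sr", ["pop15", "pop75"]))  # sr ~ pop15 + pop75
```

With the argument normalized this way, the rest of the function would only ever have to deal with a list.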
Update:
- I have fixed the issue with log transformations. I had to use np.log on the variable.
- The issue with multiple predictors has also been fixed.
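For a transformed term such as np.log(pop15), the bare column name has to be recovered so it can be compared against the data frame's columns. The sketch below checks the lookbehind pattern from my code in isolation; the helper name `bare_name` is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as in the code: the text after "(" up to the first "^" or ")".
INNER = re.compile(r"(?<=\()[^\^\)]+")


def bare_name(term):
    """Return the name inside a wrapped term, or the term itself if unwrapped."""
    match = INNER.search(term)
    return match.group(0) if match else term


print(bare_name("np.log(pop15)"))  # pop15
print(bare_name("I(pop15^2)"))     # pop15
print(bare_name("pop15"))          # pop15
```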
Here is my code:
import re
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import faraway.datasets.savings
from scipy import linalg  # needed for linalg.circulant below

savings = faraway.datasets.savings.load()


def reformulate(predictors, response):
    # Build a formula string such as "sr~pop15+pop75"
    form = response + "~" + '+'.join(predictors)
    return form


def perModel(data, response, predictors):
    # Collect every column name except the response. If the response is
    # wrapped in a function call, e.g. np.log(sr), extract the bare name first.
    if '(' in response:
        response_val = re.search('(?<=\\()[^\\^\\)]+', response).group(0)
        nm = data.columns.values
        nm = [x for x in nm if response_val != x]
        nm = np.array(nm)
    else:
        nm = data.columns.values
        nm = [x for x in nm if response != x]
        nm = np.array(nm)
    n = len(nm)

    # Each column of the circulant matrix is a cyclic shift of the names
    md_arrange = linalg.circulant(nm)
    #md_arrange = [nm[x:] + nm[:x] for x in range(1, len(nm)+1)]
    df = pd.DataFrame(md_arrange)
    # The first i+1 rows give groups of i+1 predictors per column
    ls_comb = [df.loc[0:i] for i in range(0, len(df))]

    if 'all' in predictors:
        predictors_wanted = ls_comb
    else:
        test_val = ['(' in i for i in [predictors]]
        extract_text = []
        clean_text = ''
        for bl, pred in zip(test_val, [predictors]):
            if bl:
                # Keep the transformed term and its bare column name
                clean_text = pred
                extract_text = re.search('(?<=\\()[^\\^\\)]+', pred).group(0)
            else:
                extract_text = pred
        if clean_text != '':
            # Substitute the transformed term for the bare column name
            repl_predictors = [i.apply(lambda x: x.str.replace(extract_text, clean_text)) for i in ls_comb]
        else:
            repl_predictors = ls_comb
        # Keep only the columns that mention a requested predictor
        predictors_wanted = [df.loc[:, df.columns[df.apply(lambda col: col.str.contains("|".join(extract_text))).any()]]
                             for df in repl_predictors]

    # One formula per column; the full-size group is identical in every
    # column, so only one column is used for it.
    formula_predictors = pd.concat(
        pd.concat([x]) for x in [
            i.apply(lambda x: reformulate(x, response)).reset_index().drop(['index'], axis=1)
            if len(i) < n
            else i[[1]].apply(lambda x: reformulate(x, response)).reset_index().drop(['index'], axis=1)
            for i in predictors_wanted
        ]
    ).rename(columns={0: 'formula'})

    # Fit an OLS model for each formula and collect the summaries
    models = [smf.ols(i, data).fit().summary() for i in formula_predictors['formula']]
    #formula_predictors.to_csv("test.csv")
    return models
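For comparison, here is a minimal sketch of enumerating every non-empty subset of predictors with itertools.combinations from the standard library. This is a different technique from the circulant matrix used above, not a description of it. The function name `all_formulas` is hypothetical, and the column list is an assumption that the savings data has the predictors pop15, pop75, dpi and ddpi.

```python
from itertools import combinations


def all_formulas(response, predictors):
    """Return one formula string per non-empty subset of predictors."""
    formulas = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            formulas.append(response + " ~ " + " + ".join(subset))
    return formulas


# Assumed predictor columns, used here only for illustration
cols = ["pop15", "pop75", "dpi", "ddpi"]
formulas = all_formulas("sr", cols)

print(len(formulas))  # 15
print(formulas[0])    # sr ~ pop15
print(formulas[-1])   # sr ~ pop15 + pop75 + dpi + ddpi
```

Each string could then be passed to smf.ols(formula, data).fit() in the same way as in perModel. As far as I can tell, building groups from cyclic shifts only produces cyclically adjacent names, so a non-adjacent pair such as pop15 and dpi may not appear there, whereas combinations covers every subset.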