2
\$\begingroup\$

Similar to my previous post in R, "All permutations of a linear model given all predictors", I have attempted a Python version. However, one issue with my version is that I cannot select multiple predictors except by passing 'all'.

For example, if I wanted:

perModel(savings, response='sr', predictors=['pop15', 'pop75'])

It throws an error, so I still need to work on supporting lists. However, the following works:

perModel(savings, response='sr', predictors='all')
perModel(savings, response='sr', predictors='pop15')

Update:

  1. I have fixed the issue with log transformations. I had to use np.log on the variable.
  2. The issue with multiple predictors has also been fixed.

Here is my code:

import re
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import faraway.datasets.savings
from scipy import linalg  # needed for linalg.circulant in perModel

savings = faraway.datasets.savings.load()

def reformulate(predictors, response):
    form = response + "~" + '+'.join(predictors)
    return(form)
    
def perModel(data, response, predictors):
    response = response
    if '(' in response:
        response_val = re.search('(?<=\\()[^\\^\\)]+', response).group(0)
        nm = data.columns.values
        nm = [x for x in nm if response_val != x]
        nm = np.array(nm)
        
    else:
        nm = data.columns.values
        nm = [x for x in nm if response != x]
        nm = np.array(nm)

    n = len(nm)
    # circulant from scipy is a faster alternative to the list comprehension commented out below
    md_arrange = linalg.circulant(nm)
    #md_arrange = [nm[x:] + nm[:x] for x in range(1, len(nm)+1)]    
    df = pd.DataFrame(md_arrange)
    ls_comb = [df.loc[0:i] for i in range(0, len(df))]

    if 'all' in predictors:
        predictors_wanted = ls_comb
    else:
        test_val = ['(' in i for i in [predictors]]
        extract_text = []
        clean_text = ''

        for bl, pred in zip(test_val, [predictors]):
            if bl:
                clean_text=(pred)
                extract_text=(re.search('(?<=\\()[^\\^\\)]+', pred).group(0))
            else:
                extract_text=(pred)
        
        if clean_text != '': 
            repl_predictors = [i.apply(lambda x: x.str.replace(extract_text, clean_text)) for i in ls_comb]
        else:
            repl_predictors = ls_comb

        predictors_wanted = [df.loc[:, df.columns[df.apply(lambda col: col.str.contains("|".join(extract_text))).any()]]
                                for df in repl_predictors]
    formula_predictors = pd.concat(pd.concat([x]) for x in [i.apply(
    lambda x: reformulate(x, response)
                    ).reset_index().drop(['index'], axis=1
                                        ) if len(i) < n else i[[1]].apply(
                        lambda x: reformulate(x, response)
                    ).reset_index().drop(['index'], axis=1
                    ) for i in predictors_wanted ]).rename(columns = {0:'formula'})

    models =  [smf.ols(i, data).fit().summary() for i in formula_predictors['formula']]
    #formula_predictors.to_csv("test.csv")
    return(models)
        
\$\endgroup\$

1 Answer

3
\$\begingroup\$

There's some magic string processing in here: special cases for 'all' and for '(', regexes and so on. You haven't described it very much, and it isn't documented in the code. If you want features like "use all predictors", this isn't the way to go about it; add convenience wrapper functions instead. For the purposes of this answer I'm ignoring all of the string operations other than reformulate.
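As a rough sketch of what such convenience wrappers could look like, the snippet below keeps the special cases out of the main function. The names `all_predictors` and `as_predictor_list` are hypothetical and not part of the original code, and the column list is assumed from the savings data used in the question.

```python
from typing import Iterable, Sequence, Union


def all_predictors(columns: Iterable[str], response: str) -> list[str]:
    """Hypothetical helper: every column except the response, in column order."""
    return [name for name in columns if name != response]


def as_predictor_list(predictors: Union[str, Sequence[str]]) -> list[str]:
    """Hypothetical helper: accept one column name or several, always return a list."""
    if isinstance(predictors, str):
        return [predictors]
    return list(predictors)


# Column names assumed from the savings data in the question.
columns = ["sr", "pop15", "pop75", "dpi", "ddpi"]

print(all_predictors(columns, "sr"))
print(as_predictor_list("pop15"))
print(as_predictor_list(["pop15", "pop75"]))
```

With helpers like these, the caller decides explicitly whether to pass every predictor or a chosen subset, and the main function only ever sees a list of column names.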

Let's jump down to one of your intermediate results, where a response variable sr and predictors pop15, pop75, dpi, ddpi produce a formula_predictors of

                   formula
0                 sr~pop15
1                  sr~ddpi
2                   sr~dpi
3                 sr~pop75
0           sr~pop15+pop75
1            sr~ddpi+pop15
2              sr~dpi+ddpi
3             sr~pop75+dpi
0       sr~pop15+pop75+dpi
1      sr~ddpi+pop15+pop75
2        sr~dpi+ddpi+pop15
3        sr~pop75+dpi+ddpi
0  sr~ddpi+pop15+pop75+dpi
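For reference, the rows above appear to follow a cyclic-window pattern, which comes from slicing the circulant matrix in the question. The following is a minimal pandas-free sketch that reproduces the same strings; `cyclic_formulas` is a hypothetical name, and the predictor order is inferred from the table rather than checked against the dataset.

```python
def cyclic_formulas(response: str, predictors: list[str]) -> list[str]:
    """Rebuild the formulas in the table from cyclic windows over the predictors."""
    n = len(predictors)
    formulas = []
    # Sizes 1..n-1: one window per column j of the circulant matrix.
    for size in range(1, n):
        for j in range(n):
            window = [predictors[(i - j) % n] for i in range(size)]
            formulas.append(f"{response}~{'+'.join(window)}")
    # Full model: the question's code keeps only column 1 of the matrix.
    full = [predictors[(i - 1) % n] for i in range(n)]
    formulas.append(f"{response}~{'+'.join(full)}")
    return formulas


# Predictor order inferred from the table above.
result = cyclic_formulas("sr", ["pop15", "pop75", "dpi", "ddpi"])
print(len(result))  # 13
print(result[4:8])
```

Because only adjacent (cyclic) windows are taken, this pattern yields 13 formulas for four predictors and never pairs pop15 with dpi or pop75 with ddpi, whereas there are 15 non-empty subsets in total.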

Without wanting to install statsmodels (your smf), I will assume that the formula format above needs to stay. Let's delete most of your code and call itertools instead:

import itertools
import typing

import pandas as pd


def reformulate(predictors: typing.Iterable[str], response: str) -> str:
    return f'{response}~{"+".join(predictors)}'


def per_model(
    data: pd.DataFrame, response: str, predictors: typing.Sequence[str],
) -> typing.Iterator[tuple[str, pd.DataFrame]]:
    for n in range(1, 1 + len(predictors)):
        for combo in itertools.combinations(predictors, r=n):
            yield reformulate(combo, response), data[list(combo)]

This can be passed to ols as necessary. I don't think the combination step is worth vectorising.
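To get a feel for what the generator produces without needing pandas, here is a stripped-down sketch that yields only the formula strings. `formulas` is a hypothetical name, and the predictor list is assumed from the savings data in the question.

```python
import itertools
from typing import Iterator, Sequence


def formulas(response: str, predictors: Sequence[str]) -> Iterator[str]:
    """Yield one formula string per non-empty subset of the predictors."""
    for n in range(1, len(predictors) + 1):
        for combo in itertools.combinations(predictors, n):
            yield f"{response}~{'+'.join(combo)}"


# Predictor names assumed from the savings data in the question.
predictors = ["pop15", "pop75", "dpi", "ddpi"]
result = list(formulas("sr", predictors))

print(len(result))  # 2**4 - 1 = 15
print(result[:5])
```

Each string can then be fitted the same way as in the question, for example with smf.ols(formula, data).fit(). Since the formula names the response column, it is probably the full data frame, rather than the predictor-only slice, that should be handed to ols.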

\$\endgroup\$
