
In order to fit a linear regression model to some given training data X and labels y, I want to augment my training data X with nonlinear transformations of the given features. Say we have the features x1, x2 and x3, and we want to use the additional transformed features:

x4 = x1², x5 = x2² and x6 = x3²

x7 = exp(x1), x8 = exp(x2) and x9 = exp(x3)

x10 = cos(x1), x11 = cos(x2) and x12 = cos(x3)

I tried the following approach, which however led to a model that performed very poorly in terms of Root Mean Squared Error (RMSE) as the evaluation criterion:

import pandas as pd
import numpy as np
from sklearn import linear_model

# Import the training data and extract the features and labels from it
DATAPATH = 'train.csv'
data = pd.read_csv(DATAPATH)
features = data.drop(['Id', 'y'], axis=1)
labels = data[['y']]

# Squared features
features['x6'] = features['x1']**2
features['x7'] = features['x2']**2
features['x8'] = features['x3']**2

# Exponential features
features['x9'] = np.exp(features['x1'])
features['x10'] = np.exp(features['x2'])
features['x11'] = np.exp(features['x3'])

# Cosine features
features['x12'] = np.cos(features['x1'])
features['x13'] = np.cos(features['x2'])
features['x14'] = np.cos(features['x3'])

# Fit an ordinary least squares model on the augmented features
regr = linear_model.LinearRegression()
regr.fit(features, labels)

I'm quite new to ML, and I'm sure there is a better way to do these nonlinear feature transformations, so I'd be very happy for your help.

Cheers, Lukas

  • My intuition is that the np.exp terms are much, much larger than everything else in your dataset, so your regression fits only them. You can avoid that by normalising your data before training the model. Check out this post
    – warped
    Commented Mar 8, 2020 at 10:39

1 Answer


As an initial remark, I think there is a better way to transform all columns at once. One option would be something like:

# Define the list of transformations (identity, square, exp, cos)
trans = [lambda a: a, np.square, np.exp, np.cos]

# Apply each transformation and concatenate the results column-wise
features = pd.concat([t(features) for t in trans], axis=1)

# Rename the columns to x1, x2, ..., xn
features.columns = [f'x{i}' for i in range(1, features.shape[1] + 1)]
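
For example, assuming the three original columns x1, x2 and x3 from the question, the concatenation yields twelve columns in the order identity, square, exp, cos:

# Quick sanity check (assumes three original columns x1, x2, x3)
# identity -> x1..x3, squares -> x4..x6, exp -> x7..x9, cos -> x10..x12
print(features.shape)          # (n_samples, 12)
print(list(features.columns))  # ['x1', 'x2', ..., 'x12']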

Regarding the performance of the model: as @warped said in the comment, it is usual practice to scale all your data. Depending on your data distribution, you can use different types of scaler (see this discussion of standard vs. min-max scaling).
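
To see why scaling matters here, note how quickly the exponential terms dominate the other features in magnitude; a minimal sketch with made-up values:

import numpy as np

# Made-up values, purely to illustrate the scale mismatch
x = np.array([1.0, 5.0, 10.0])

print(x**2)       # [  1.  25. 100.]
print(np.cos(x))  # values in [-1, 1]
print(np.exp(x))  # ~[2.7, 148.4, 22026.5]

With raw columns like these, the least-squares fit is driven almost entirely by the exp features.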

Since you are using nonlinear transformations, even if your initial data is normally distributed, it will lose that property after the transformations. Therefore it may be better to use MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the augmented features and rescale them to [0, 1]
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features.to_numpy())

Now each column of scaled_features will range from 0 to 1.
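
Putting this together with your original goal, you can then fit the regression on the scaled features and compute the RMSE (a sketch, reusing the features and labels from your code above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit on the scaled features instead of the raw ones
regr = LinearRegression()
regr.fit(scaled_features, labels)

# Training RMSE; for an honest estimate, evaluate on held-out data instead
preds = regr.predict(scaled_features)
rmse = np.sqrt(mean_squared_error(labels, preds))
print(f'Training RMSE: {rmse:.4f}')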

Note that if you fit the scaler on all of the data before using something like train_test_split, information from the test set leaks into the training procedure, which makes your evaluation overly optimistic.
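
To avoid that, split first and fit the scaler on the training portion only (a sketch; wrapping the scaler and model in a sklearn Pipeline achieves the same thing):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split first, then fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on train
X_test_scaled = scaler.transform(X_test)        # transform only on test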
