
I am using LogisticRegression in Python to train a model. I found that with the dataset below, the model cannot learn a proper decision boundary.

But when I magnify the data 100 times, the result is correct. I want to know why.

Thank you for your time!

DataSet "chip_test.csv": test1 and test2 are used to predict whether a student passes the exam; 0 means fail, 1 means pass.

test1,test2,pass
0.051267,0.69956,0
-0.09274,0.68494,0
-0.21371,0.69225,0
-0.375,0.50219,0
0.18376,0.93348,1
0.22408,0.77997,1
0.29896,0.61915,1
-0.51325,0.46564,0
-0.52477,0.2098,0

Code

from sklearn import linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

model = linear_model.LogisticRegression()

data = pd.read_csv("chip_test.csv")
test1 = data['test1']
test2 = data['test2']
X = data.drop('pass',axis=1)
y = data.loc[:,'pass']

model.fit(X, y)
predict = model.predict(X)
print("pre:", predict)
print("accuracy", accuracy_score(predict, y))


coef_ = model.coef_
intercept_ = model.intercept_
print('coef_', coef_)
print('intercept_', intercept_)

mask = y == 1 # true/false
plt.scatter(test1[mask],test2[mask],color='green')
plt.scatter(test1[~mask],test2[~mask],color='red')

# decision boundary: theta_1 * x_1 + theta_2 * x_2 + b = 0
theta_1 = coef_[0][0]
theta_2 = coef_[0][1]
t2_boundary = -(theta_1 * test1.values + intercept_) / theta_2

plt.plot(test1,t2_boundary)

plt.xlabel('test1')
plt.ylabel('test2')
plt.title('chip_test')
plt.show()

The output of the code is

pre: [0 0 0 0 0 0 0 0 0]
accuracy 0.6666666666666666
coef_ [[0.8464159  0.36835743]]
intercept_ [-0.84822465]

The image: (scatter plot of the data with the fitted decision boundary)

We can see the prediction is wrong: every sample is classified as 0.

Then I magnify the data 100 times. DataSet "chip_test.csv":

test1,test2,pass
5.1267,69.956,0
-9.274,68.494,0
18.376,93.348,1
22.408,77.997,1
29.896,61.915,1
-21.371,69.225,0
-37.5,50.219,0
-51.325,46.564,0
-52.477,20.98,0

The output of the code is

pre: [0 0 1 1 1 0 0 0 0]
accuracy 1.0
coef_ [[0.40853988 0.15604305]]
intercept_ [-16.80311472]

The image: (scatter plot of the magnified data with the fitted decision boundary)

The answer is right.

  • There looks to be a difference in scale between test1 and test2, so multiplying both by 100 accidentally helped, but it would be better to rescale both scores to a comparable range (e.g. standardize them, or map them to [0, 1]). There may be more subtle convergence issues with this small data set, but that's not a software development question. Commented Jun 23 at 16:00
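The comment's rescaling suggestion can be sketched as follows (a hypothetical fix, not from the original post, using the nine rows from the question; StandardScaler rescales each feature to zero mean and unit variance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The nine rows from the question, at the original (small) scale
X = np.array([
    [0.051267, 0.69956], [-0.09274, 0.68494], [-0.21371, 0.69225],
    [-0.375,   0.50219], [ 0.18376, 0.93348], [ 0.22408, 0.77997],
    [0.29896,  0.61915], [-0.51325, 0.46564], [-0.52477, 0.2098],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0])

# Putting the scaler and the model in one pipeline keeps them fitted together,
# so new data is automatically rescaled the same way before prediction.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(pipe.predict(X))
```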

1 Answer


Like many machine learning models, sklearn.linear_model.LogisticRegression uses regularization by default. It's generally considered bad to fit the training data as well as possible, because this can produce overly complicated models that don't generalize. Regularization adds a term to the loss function that represents the complexity of the model. In this case, the regularization term is equal to

(coef_[0][0]**2 + coef_[0][1]**2)/2/C

where C=1.0 by default. When you multiply the data by 100, the coefficients needed to fit the data get divided by 100, so the regularization term gets divided by 100**2 and has very little effect. This is one of the reasons why it helps to normalize the data (as paisanco suggested).
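A quick numeric check of this argument (a sketch reusing the coef_ and intercept_ reported for the magnified fit): scaling the data by s = 100 lets coefficients s times smaller produce the identical decision function, so the L2 term for that boundary shrinks by a factor of s**2:

```python
import numpy as np

# coef_ and intercept_ from the magnified (x100) fit in the question
w = np.array([0.40853988, 0.15604305])
b = -16.80311472
s = 100.0
C = 1.0

x_small = np.array([0.051267, 0.69956])   # one sample at the original scale
# Same decision value whether we scale the data or the coefficients:
assert np.isclose((s * x_small) @ w + b, x_small @ (s * w) + b)

# But the regularization term ||w||**2 / (2*C) differs by a factor of s**2:
print(np.sum(w ** 2) / (2 * C))           # penalty paid on the magnified data
print(np.sum((s * w) ** 2) / (2 * C))     # cost of the same boundary on raw data
```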

If you don't want regularization, you can turn it off with penalty=None (penalty='none' in scikit-learn versions before 1.2). Another option is to weaken it by using a larger value of C.

