I am using scikit-learn's LogisticRegression in Python to train a model. I found that with the dataset below, the model fails to learn a proper decision boundary (it predicts 0 for every sample).
But when I magnify the data 100 times, the answer is right. I want to know why.
Thank you for your time!
DataSet "chip_test.csv" : use test1 and test2 to predict if the student pass the exam,0 means doesn't pass, 1 means pass
test1,test2,pass
0.051267,0.69956,0
-0.09274,0.68494,0
-0.21371,0.69225,0
-0.375,0.50219,0
0.18376,0.93348,1
0.22408,0.77997,1
0.29896,0.61915,1
-0.51325,0.46564,0
-0.52477,0.2098,0
Code
from sklearn import linear_model
from sklearn.metrics import accuracy_score
import pandas as pd
import matplotlib.pyplot as plt

# load the data and split it into features and label
data = pd.read_csv("chip_test.csv")
test1 = data['test1']
test2 = data['test2']
X = data.drop('pass', axis=1)
y = data.loc[:, 'pass']

# fit the model and evaluate it on the training data
model = linear_model.LogisticRegression()
model.fit(X, y)
predict = model.predict(X)
print("pre:", predict)
print("accuracy", accuracy_score(y, predict))

coef_ = model.coef_
intercept_ = model.intercept_
print('coef_', coef_)
print('intercept_', intercept_)

# scatter plot: green = pass (1), red = fail (0)
mask = y == 1  # boolean mask
plt.scatter(test1[mask], test2[mask], color='green')
plt.scatter(test1[~mask], test2[~mask], color='red')

# decision boundary: theta_1 * x_1 + theta_2 * x_2 + b = 0
theta_1 = coef_[0][0]
theta_2 = coef_[0][1]
t2_boundary = -(theta_1 * test1 + intercept_) / theta_2
plt.plot(test1, t2_boundary)
plt.xlabel('test1')
plt.ylabel('test2')
plt.title('chip_test')
plt.show()
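For clarity, the line plotted at the end is the model's decision boundary. This is just the comment in the code written out: setting the linear score to zero and solving for the second feature, with theta_1, theta_2 the two coefficients and b the intercept:

$$\theta_1 x_1 + \theta_2 x_2 + b = 0 \quad\Rightarrow\quad x_2 = -\frac{\theta_1 x_1 + b}{\theta_2}$$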
The output of the code is
pre: [0 0 0 0 0 0 0 0 0]
accuracy 0.6666666666666666
coef_ [[0.8464159 0.36835743]]
intercept_ [-0.84822465]
The image:
[plot "chip_test": scatter of test1 vs. test2 with the fitted decision boundary]
We can see the answer is wrong.
Then I magnify the data 100 times. Dataset "chip_test.csv":
test1,test2,pass
5.1267,69.956,0
-9.274,68.494,0
18.376,93.348,1
22.408,77.997,1
29.896,61.915,1
-21.371,69.225,0
-37.5,50.219,0
-51.325,46.564,0
-52.477,20.98,0
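(The magnified dataset above comes from multiplying the two feature columns by 100. The snippet below is just a minimal sketch of one way to produce such a file, assuming the original is still named chip_test.csv; the output name chip_test_x100.csv is my own choice, not part of the original setup.)

import pandas as pd

# scale only the two feature columns by 100; the label column stays unchanged
data = pd.read_csv("chip_test.csv")
data[['test1', 'test2']] = data[['test1', 'test2']] * 100
data.to_csv("chip_test_x100.csv", index=False)  # hypothetical output filename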
The output of the code is
pre: [0 0 1 1 1 0 0 0 0]
accuracy 1.0
coef_ [[0.40853988 0.15604305]]
intercept_ [-16.80311472]
The image:
[plot "chip_test": scatter of test1 vs. test2 with the fitted decision boundary, on the magnified data]
The answer is right.