I am working on a dataset for a housing price prediction problem. While trying to estimate the relationship between the features and the dependent variable (price), I found a significant difference between the train R^2 and the test R^2, even though I have tried to avoid data leakage. What might be the reason for this?
Link for data: Data_Link
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Load the data and drop rows with a zero price
data_1 = pd.read_csv('Data/data.csv')
data_new = data_1[data_1['price'] != 0].copy()  # copy to avoid SettingWithCopyWarning

# Log-transform the skewed lot size and the target
data_new['log_sqft_lot'] = np.log(data_new['sqft_lot'])
data_new['log_price'] = np.log(data_new['price'])

# One-hot encode the city column
data_city = pd.get_dummies(data_new['city'], drop_first=True, dtype=int)
data_new = pd.concat([data_new, data_city], axis=1)

# Drop the raw/unused columns
data2 = data_new.drop(['sqft_lot', 'city', 'date', 'statezip', 'country', 'street'], axis=1)
# Features and target (log price)
x = data2.drop(['price', 'log_price'], axis=1)
y = data2['log_price']

# 80/20 train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scale using statistics from the training set only
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)
# Fit a plain linear regression on the scaled training data
lm = LinearRegression()
lm.fit(x_train_scaled, y_train)

# Inspect the fitted coefficients and intercept
lm.coef_
lm.intercept_

# Train vs test R^2: this is where I see the large gap
print('Train R^2:', lm.score(x_train_scaled, y_train))
print('Test R^2:', lm.score(x_test_scaled, y_test))
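For reference, here is a minimal sketch (not code I have run yet) of how the same comparison could be checked with 5-fold cross-validation, using a scikit-learn Pipeline so the scaler is refit on each training fold; x and y are the feature matrix and log-price target defined above:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Scaler + linear regression as one estimator, so scaling is fit per training fold
pipe = make_pipeline(StandardScaler(), LinearRegression())

# R^2 on each of 5 held-out folds
cv_scores = cross_val_score(pipe, x, y, cv=5, scoring='r2')
print('CV R^2 per fold:', cv_scores)
print('Mean CV R^2:', cv_scores.mean())

If the held-out fold scores stay well below the training R^2, the gap would not be specific to this particular 80/20 split.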