I'm attempting to build a classification model for electric vehicle charging event data. I want to predict whether the charging station will be available at a given point in time. I have the following code working:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
raw_data = pd.read_csv('C:/temp/sample_dataset.csv')
raw_test = pd.read_csv('C:/temp/sample_dataset_test.csv')
print ('raw data shape: ', raw_test.shape)
#choose which columns to dummify
X_vars = ['station_id', 'day_of_week', 'epoch', 'station_city',
'station_county', 'station_zip', 'port_level', 'perc_local_occupancy',
'ports_at_station', 'avg_charge_duration']
y_var = ['target_variable']
categorical_vars = ['station_id','station_city','station_county']
#split X and y in training and test
X_train = raw_data.loc[:,X_vars]
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]
#make dummy variables
X_train = pd.get_dummies(X_train, columns = categorical_vars )
X_test = pd.get_dummies(X_test, columns=categorical_vars)
print('train size', X_train.shape, '\ntest size', X_test.shape)
# Train uncalibrated random forest classifier on whole train and evaluate on test data
clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train.values.ravel())
print ('RF accuracy: TRAINING', clf.score(X_train,y_train))
print ('RF accuracy: TESTING', clf.score(X_test,y_test))
Results
raw data shape: (1000000, 15)
train size (1000000, 125)
test size (1000000, 125)
RF accuracy: TRAINING 0.831456
RF accuracy: TESTING 0.831456
My question is: why are the training and testing accuracies EXACTLY the same? I've run this many, many times, and they are always exactly the same. Any ideas? (I've checked to make sure the original data IS different.)
raw_data. Do you expect to have exactly the same number of observations in both your raw_train and raw_test sets? X_train == X_test is True. Use scikit-learn's train_test_split function or some form of cross-validation iterator.
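
The suggestion above can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual pipeline: the data frame is synthetic (the real CSV files are not available here), only three of the original columns are used, and the 25% test fraction, the `random_state` values, and the `same_data` check are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the charging data (hypothetical values, not the real CSV).
rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({
    "station_id": rng.choice(["a", "b", "c"], size=n),
    "day_of_week": rng.integers(0, 7, size=n),
    "perc_local_occupancy": rng.random(n),
})
noise = 0.2 * rng.random(n)
data["target_variable"] = (data["perc_local_occupancy"] + noise > 0.6).astype(int)

# Dummify once on the full frame so train and test share the same columns.
X = pd.get_dummies(
    data.drop(columns="target_variable"), columns=["station_id"], dtype=float
)
y = data["target_variable"]

# Hold out 25% of the rows as a test set (fraction chosen for illustration).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sanity check: the two feature sets should not be the same data.
same_data = X_train.shape == X_test.shape and X_train.equals(X_test)
print("train and test identical:", same_data)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print("RF accuracy: TRAINING", clf.score(X_train, y_train))
print("RF accuracy: TESTING", clf.score(X_test, y_test))

# Alternative to a single split: 5-fold cross-validation on the full data.
cv_scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy per fold:", cv_scores)
```

With the original two-file setup, the same `equals` check on `X_train` and `X_test` (or on `raw_data` and `raw_test`) is a quick way to confirm whether both files actually contain different rows.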