sklearn Random Forest accuracy score identical for training and test data

Question

I'm attempting to build a classification model for electric vehicle charging event data. I want to predict whether the charging station will be available at a given point in time. I have the following code working:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

raw_data = pd.read_csv('C:/temp/sample_dataset.csv')
raw_test = pd.read_csv('C:/temp/sample_dataset_test.csv')
print ('raw data shape: ', raw_test.shape)

#choose which columns to dummify
X_vars = ['station_id', 'day_of_week', 'epoch', 'station_city',
 'station_county', 'station_zip', 'port_level', 'perc_local_occupancy',
 'ports_at_station', 'avg_charge_duration']
y_var = ['target_variable']
categorical_vars = ['station_id','station_city','station_county']

#split X and y in training and test
X_train = raw_data.loc[:,X_vars]
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]

#make dummy variables
X_train = pd.get_dummies(X_train, columns = categorical_vars )
X_test = pd.get_dummies(X_test, columns=categorical_vars)

print('train size', X_train.shape, '\ntest size', X_test.shape)

# Train uncalibrated random forest classifier on whole train and evaluate on test data
clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train.values.ravel())

print ('RF accuracy: TRAINING', clf.score(X_train,y_train))
print ('RF accuracy: TESTING', clf.score(X_test,y_test))

Results

raw data shape:  (1000000, 15)
train size (1000000, 125) 
test size (1000000, 125)
RF accuracy: TRAINING 0.831456
RF accuracy: TESTING 0.831456

My question is: why are the training and testing accuracy EXACTLY the same? I've run this many, many times, and the two scores are always identical. Any ideas? (I've checked to make sure the original data IS different.)

There's no information on the size of raw_data. Do you expect to have exactly the same number of observations in both your raw_train and raw_test sets?
– andrew_reece, Commented May 14, 2017 at 18:40
You are testing and training on the same data. X_train == X_test is True. Use scikit-learn's train_test_split function or some form of cross-validation iterator.
– Arya McCarthy, Commented May 15, 2017 at 3:31

user4280261 · Accepted Answer · 2017-05-14 21:28:51Z

2

Well, there is simply a typo in your code, because each time you select all rows:

#split X and y in training and test
X_train = raw_data.loc[:,X_vars] 
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]

You should slice the rows of each set separately by some index, for example: X_train = raw_data.loc[:idx, X_vars]
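A minimal sketch of the row-based split suggested above, assuming all observations live in a single DataFrame. The toy data and the 80% cut-off stored in idx are illustrative assumptions, not values from the question. Because .loc[:idx] slices by label and includes idx itself, the sketch uses .iloc so the two parts cannot overlap.

```python
import pandas as pd

# Toy stand-in for raw_data; the values here are hypothetical.
raw_data = pd.DataFrame({
    "epoch": range(10),
    "ports_at_station": [2, 4, 2, 6, 4, 2, 8, 4, 2, 6],
    "target_variable": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})
X_vars = ["epoch", "ports_at_station"]
y_var = ["target_variable"]

# Hypothetical cut-off: first 80% of rows for training, the rest for testing.
idx = int(len(raw_data) * 0.8)

# Positional slicing: rows [0, idx) go to train, rows [idx, end) go to test.
X_train = raw_data.iloc[:idx][X_vars]
y_train = raw_data.iloc[:idx][y_var]
X_test = raw_data.iloc[idx:][X_vars]
y_test = raw_data.iloc[idx:][y_var]

print("train size", X_train.shape, "\ntest size", X_test.shape)
```

A plain positional cut like this only makes sense if the row order carries no pattern (or if a time-ordered hold-out is what you want); otherwise shuffle first or use a random split.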

lukess · Accepted Answer · 2017-05-15 02:16:28Z

0

Is it possible that you are using the same set of data in train and test files?
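One way to check this, sketched under the assumption that both files are already loaded into DataFrames. The helper names same_data and shared_rows and the toy frames are hypothetical; DataFrame.equals and DataFrame.merge are standard pandas methods.

```python
import pandas as pd

def same_data(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True if both frames have the same shape and elements."""
    return bool(a.equals(b))

def shared_rows(a: pd.DataFrame, b: pd.DataFrame) -> int:
    """Count distinct rows that appear in both frames (joined on all common columns)."""
    return len(a.drop_duplicates().merge(b.drop_duplicates(), how="inner"))

# Toy frames standing in for the two CSV files (hypothetical values).
train = pd.DataFrame({"station_id": [1, 2, 3], "target_variable": [0, 1, 0]})
test_copy = train.copy()
test_other = pd.DataFrame({"station_id": [4, 5, 6], "target_variable": [1, 1, 0]})

print("identical to copy:", same_data(train, test_copy))
print("identical to other:", same_data(train, test_other))
print("rows shared with other:", shared_rows(train, test_other))
```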

If it's the same data, then it might be better to split it into train and test sets using the train_test_split function.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
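A rough sketch of how train_test_split could be used here. The synthetic data, the 25% test size, and the fixed random_state are assumptions made for illustration; only the classifier settings mirror the question. The categorical column is encoded once before splitting, a design choice that keeps the dummy columns identical in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the charging data; columns and values are made up.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "station_city": rng.choice(["A", "B", "C"], size=n),
    "ports_at_station": rng.integers(1, 9, size=n),
    "perc_local_occupancy": rng.random(n),
})
data["target_variable"] = (data["perc_local_occupancy"] > 0.5).astype(int)

# Encode categoricals once, before splitting, so both parts share columns.
X = pd.get_dummies(data.drop(columns="target_variable"), columns=["station_city"])
y = data["target_variable"]

# Hold out 25% of the rows at random, keeping the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print("RF accuracy: TRAINING", clf.score(X_train, y_train))
print("RF accuracy: TESTING", clf.score(X_test, y_test))
```

With a genuinely held-out test set, the two scores will normally differ at least slightly; exactly equal scores on a large dataset are a hint that the same rows are being scored twice.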

edited May 15, 2017 at 2:16

lukess

answered May 14, 2017 at 18:45

coffeebytes
