I'm attempting to build a classification model for electric vehicle charging event data. I want to predict whether the charging station will be available at a given point in time. I have the following code working:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

raw_data = pd.read_csv('C:/temp/sample_dataset.csv')
raw_test = pd.read_csv('C:/temp/sample_dataset_test.csv')
print ('raw data shape: ', raw_test.shape)

#choose which columns to dummify
X_vars = ['station_id', 'day_of_week', 'epoch', 'station_city',
 'station_county', 'station_zip', 'port_level', 'perc_local_occupancy',
 'ports_at_station', 'avg_charge_duration']
y_var = ['target_variable']
categorical_vars = ['station_id','station_city','station_county']

#split X and y in training and test
X_train = raw_data.loc[:,X_vars]
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]

#make dummy variables
X_train = pd.get_dummies(X_train, columns = categorical_vars )
X_test = pd.get_dummies(X_test, columns=categorical_vars)

print('train size', X_train.shape, '\ntest size', X_test.shape)

# Train uncalibrated random forest classifier on whole train and evaluate on test data
clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train.values.ravel())

print ('RF accuracy: TRAINING', clf.score(X_train,y_train))
print ('RF accuracy: TESTING', clf.score(X_test,y_test))

Results

raw data shape:  (1000000, 15)
train size (1000000, 125) 
test size (1000000, 125)
RF accuracy: TRAINING 0.831456
RF accuracy: TESTING 0.831456

My question: why are the training and testing accuracy EXACTLY the same? I've run this many times and the two numbers always match. Any ideas? (I've checked to make sure the original data IS different.)

  • There's no information on the size of raw_data. Do you expect to have exactly the same number of observations in both your raw_train and raw_test sets? Commented May 14, 2017 at 18:40
  • You are testing and training on the same data: X_train == X_test is True. Use scikit-learn's train_test_split function or some form of cross-validation iterator. Commented May 15, 2017 at 3:31
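
The cross-validation option mentioned in the comment above could look roughly like the sketch below. The small DataFrame is a hypothetical stand-in for the CSV in the question, and the choice of 5 folds is an assumption, not something the question specifies.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for raw_data; the real frame comes from the CSV in the question.
raw_data = pd.DataFrame({
    'day_of_week': [i % 7 for i in range(200)],
    'ports_at_station': [2 + (i % 3) for i in range(200)],
    'target_variable': [int(i % 7 < 3) for i in range(200)],
})
X = raw_data[['day_of_week', 'ports_at_station']]
y = raw_data['target_variable']

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# cv=5 is an assumed fold count: each fold is held out once while the
# model is fit on the remaining four, so no row is scored by a model that saw it.
scores = cross_val_score(clf, X, y, cv=5)
print('accuracy per fold:', scores)
print('mean accuracy:', scores.mean())
```

Every score here comes from rows the model did not see during fitting, which is the property the single train/test comparison in the question is meant to have.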

2 Answers

Well, there is simply a typo in your code, because each time you select all rows:

#split X and y in training and test
X_train = raw_data.loc[:,X_vars] 
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]

You should split the rows separately at some index instead, for example: X_train = raw_data.loc[:idx,X_vars]
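
A minimal sketch of such an index-based split, assuming a single DataFrame and an 80/20 cut (both are assumptions; the toy frame below is a hypothetical stand-in for raw_data). Note that .loc slices by label and includes the end point, so .loc[:idx] and .loc[idx:] would share row idx; the sketch uses .iloc, which slices by position and excludes the end point.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; the real frame comes from the CSV in the question.
raw_data = pd.DataFrame({
    'day_of_week': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2],
    'ports_at_station': [2, 2, 4, 4, 2, 2, 4, 4, 2, 2],
    'target_variable': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X_vars = ['day_of_week', 'ports_at_station']
y_var = ['target_variable']

# Assumed 80/20 cut point
idx = int(len(raw_data) * 0.8)

# Positional slices: rows [0, idx) for training, rows [idx, end) for testing
X_train = raw_data.iloc[:idx][X_vars]
y_train = raw_data.iloc[:idx][y_var]
X_test = raw_data.iloc[idx:][X_vars]
y_test = raw_data.iloc[idx:][y_var]

print('train size', X_train.shape, '\ntest size', X_test.shape)
```

A positional cut like this keeps the original row order, so if the file is sorted (for example by time or by station) the two parts may have different distributions; shuffling first is one way around that.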

Is it possible that you are using the same set of data in train and test files?
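
One way to check this is sketched below. The helper same_data and the small frames are hypothetical, made up for illustration; in practice the two frames would be the ones loaded from the train and test CSV files.

```python
import pandas as pd

def same_data(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames hold identical values in the same row order."""
    return a.reset_index(drop=True).equals(b.reset_index(drop=True))

# Hypothetical stand-ins for the frames loaded from the two CSV files
raw_data = pd.DataFrame({'station_id': [1, 2, 3], 'target_variable': [0, 1, 0]})
raw_test = raw_data.copy()
raw_other = pd.DataFrame({'station_id': [4, 5, 6], 'target_variable': [1, 1, 0]})

print('train vs test identical: ', same_data(raw_data, raw_test))
print('train vs other identical:', same_data(raw_data, raw_other))

# Count test rows that also occur in the training frame.
# With no `on` argument, merge joins on all columns the frames share.
overlap = raw_test.merge(raw_data.drop_duplicates(), how='inner')
print('test rows also present in train:', len(overlap))
```

If the first check is True, or the overlap count is close to the size of the test frame, identical training and testing accuracy is what you would expect.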

If it is the same data, it would be better to split one dataset into train and test sets using the train_test_split function.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
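
A rough sketch of how that could look with the classifier from the question. The toy DataFrame is a hypothetical stand-in for the CSV, and test_size=0.25, random_state=0, and stratify=y are assumed settings chosen for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for raw_data; the real frame comes from the CSV in the question.
raw_data = pd.DataFrame({
    'day_of_week': [i % 7 for i in range(200)],
    'ports_at_station': [2 + (i % 3) for i in range(200)],
    'target_variable': [int(i % 7 < 3) for i in range(200)],
})
X = raw_data[['day_of_week', 'ports_at_station']]
y = raw_data['target_variable']

# Shuffle and split once; stratify=y keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print('train size', X_train.shape, '\ntest size', X_test.shape)
print('RF accuracy: TRAINING', clf.score(X_train, y_train))
print('RF accuracy: TESTING', clf.score(X_test, y_test))
```

Because both parts come from one frame and share no rows, the two accuracy figures are computed on different data.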
