Troubleshooting a random forest classifier in scikit-learn

Question

I am trying to run the random forest classifier from scikit-learn and getting suspiciously bad output: less than 1% of predictions are correct. The model is performing much worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my concern is that I am missing something fundamental rather than needing to fine-tune the parameters. What I'm hoping for is more veteran eyes to look through the code and see if something is wrong with the setup.

I'm trying to predict classes for rows in a spreadsheet based on word occurrences, so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0 ... 1]. I am using scikit-learn's CountVectorizer to do this processing: I feed it strings containing the words in each row, and it outputs the word occurrence array(s). If this input isn't suitable for some reason, that is probably where things are going awry, but I haven't found anything online or in the documentation suggesting that's the case.
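For reference, here is a minimal sketch of what that vectorizing step produces. The three strings in `rows` are made-up stand-ins for spreadsheet rows, not the real data, and the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the spreadsheet: one string per row.
rows = ["red apple red", "green apple", "green green pear"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(rows)  # sparse matrix, one row per input string

# Column order follows the index stored for each word in vocabulary_.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())  # dense view of the per-row word counts
```

Each row of the printed array lines up with one input string, and each column with one word of the vocabulary.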

Right now, the forest is answering correctly about 0.5% of the time. Using the exact same inputs with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing I'm doing are fine and that the problem is specific to the RF classifier. My first reaction was to look for overfitting, but even when I run the model on the training data, it still gets almost everything wrong.
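One way to make that comparison concrete is to fit both models on the same matrix and score each on the training data itself. This is only a sketch: the documents and labels below are invented toy data, and the fixed `random_state` values are there just to make the run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Invented toy data: each class has its own distinctive words.
docs = ["cat purr meow", "meow cat", "dog bark woof",
        "woof dog", "cat meow purr", "bark dog"] * 5
labels = ["cat", "cat", "dog", "dog", "cat", "dog"] * 5

# Same dense input for both models, mirroring the setup in the question.
X = CountVectorizer().fit_transform(docs).toarray()

scores = {}
for name, model in [
    ("SGD", SGDClassifier(random_state=0)),
    ("Random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X, labels)
    # Accuracy on the same data the model was trained on.
    scores[name] = model.score(X, labels)
    print(name, scores[name])
```

On easily separable data like this, both models should score very high on their own training set, so a forest scoring near zero there points at the inputs or labels rather than at tuning.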

I've played around with the number of trees and the amount of training data, but that hasn't changed much. I'm trying to show only the relevant code but can post more if that's helpful. First SO post, so all thoughts and feedback are appreciated.

import csv

# build a word-occurrence vector for each row
from sklearn.feature_extraction.text import CountVectorizer
# decode_error is the current name of the old charset_error parameter
vectorizer = CountVectorizer(min_df=1, decode_error='ignore')
X_train = vectorizer.fit_transform(train_file)
# convert to a dense array (older releases of the random forest classifier required dense input)
X_train = X_train.toarray()

# train the random forest classifier on the data
from sklearn.ensemble import RandomForestClassifier
# compute_importances no longer exists; feature_importances_ is available after fit
clf = RandomForestClassifier(n_estimators=100)
clf = clf.fit(X_train, train_targets)

# transform the test data into the same vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.toarray()

# export one prediction per line
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])

Are you feeding your spreadsheet as train_file to the vectorizer in this line: X_train = vectorizer.fit_transform(train_file)? If that's the case, then your vectorizer treats your file as a single text file and counts the words inside that file. If your input is a spreadsheet, then you shouldn't use CountVectorizer, because you already have your word counts; you should read that spreadsheet directly into the matrix.
– Shahram, Commented Feb 23, 2014 at 4:16
@Shahram, X_train is essentially a list of strings, each of which contains all of the words in a particular row in the spreadsheet. String 1 contains the words in row 1, and so on. So I think the vectorizer is necessary to create the word occurrences. I've checked to see what the output from CountVectorizer looks like and I think that part is correct. I will clarify that though.
– DanT, Commented Feb 23, 2014 at 21:52
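
A quick way to settle the point raised in these comments is to check that the vectorizer produced one row per spreadsheet row and that those rows line up with the targets. The sketch below uses made-up stand-ins for `train_file` and `train_targets`; the real values are not shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the asker's train_file and train_targets.
train_file = ["red apple red", "green apple", "green green pear"]
train_targets = ["class_a", "class_b", "class_c"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_file)

n_rows, n_words = X_train.shape
print("rows:", n_rows, "distinct words:", n_words)

# One feature row per input string, and one target per feature row.
assert n_rows == len(train_file)
assert n_rows == len(train_targets)

# Spot-check: row 0 should decode back to the words of the first string.
first_row_words = set(vectorizer.inverse_transform(X_train[:1])[0])
print(first_row_words)
```

If the whole spreadsheet had been treated as a single document, the row count would be 1 and the first check would fail.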

desertnaut · Accepted Answer · 2025-02-05 11:09:51Z

If a random forest (RF) does this badly on the training set X_train itself, then something is definitely wrong, because you should get a very high accuracy there, above 90%. Try the following (code snippet first):

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn.svm import SVC

# Each "clf = ..." below is an alternative; keep the one you want to try.
# Number of distinct classes (the original snippet passed len(train_targets)).
n_classes = len(set(train_targets))

print("K-means")
clf = KMeans(n_clusters=n_classes, n_init=1000)

# GaussianMixture and BayesianGaussianMixture replace the older GMM and VBGMM classes.
print("Gaussian Mixtures: full covariance")
covar_type = 'full'    # 'spherical', 'diag', 'tied', 'full'
clf = GaussianMixture(n_components=n_classes, covariance_type=covar_type, max_iter=10000)

print("Variational Bayesian GMM: full covariance")
clf = BayesianGaussianMixture(n_components=n_classes, covariance_type=covar_type, tol=0.01, max_iter=1000000)

print("Random Forest")
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=2)

print("Multinomial Logistic Regression")
clf = LogisticRegression(penalty='l2', dual=False, C=1.0, fit_intercept=True, intercept_scaling=1, tol=0.0001)

print("SVM: Gaussian kernel, no iteration limit")
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma=3.0, coef0=1.0, shrinking=True,
          probability=False, tol=0.001, cache_size=200, class_weight=None,
          verbose=False, max_iter=-1, random_state=None)

Try different classifiers and see how they behave; the interface in scikit-learn is basically always the same, and maybe RF is not really the best choice here. See the code above.
Try feeding some randomly generated datasets to the RF classifier. I strongly suspect something goes wrong in the mapping process that produces the vectorizer output, so start by building your X_train by hand and see what happens.
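
The second suggestion can be sketched as follows. Everything here is an illustrative assumption: the sizes, the Poisson noise, and the trick of giving each class one "signature" column are simply a way to get count-like data with labels that are clearly learnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic word-count-like data: 200 rows, 20 "words", 4 classes.
n_samples, n_features, n_classes = 200, 20, 4
y = rng.randint(n_classes, size=n_samples)
X = rng.poisson(lam=0.3, size=(n_samples, n_features))

# Give every class one signature column so the labels are learnable.
X[np.arange(n_samples), y] += 3

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Score on the training data itself, as in the question.
train_accuracy = clf.score(X, y)
print("training accuracy:", train_accuracy)
```

If the forest scores very high here but not on the real training data, the classifier itself is behaving normally and the problem more likely sits in how X_train or train_targets are built.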
