I am trying to run the random forest classifier from scikit-learn and getting suspiciously bad output: fewer than 1% of predictions are correct, which is much worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my concern is that I am missing something fundamental rather than just needing to tune the parameters. I'm hoping more veteran eyes can look through the code and spot anything wrong with the setup.
I'm trying to predict classes for rows in a spreadsheet based on word occurrences, so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0 ... 1]. I am using scikit-learn's CountVectorizer to do this processing: I feed it strings containing the words in each row, and it outputs the word occurrence array(s). If this input isn't suitable for some reason, that is probably where things are going awry, but I haven't found anything online or in the documentation suggesting that's the case.
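To make the input format concrete, here is a minimal sketch of what I mean by word occurrence arrays. The three `rows` strings are made up for illustration and are not my real data; only `CountVectorizer` itself is what I actually use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows standing in for the strings from my spreadsheet
rows = ["red apple red", "green apple", "green pear pear"]

vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(rows)  # sparse matrix, one row per input string

# Column order follows the learned vocabulary (word -> column index)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()  # plain NumPy array of word counts

print(vocab)  # ['apple', 'green', 'pear', 'red']
print(dense)  # first row is [1 0 0 2]: one "apple", two "red"
```

Each row of `dense` is the kind of array I am passing to the classifier.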
Right now, the forest is answering correctly about 0.5% of the time. Using the exact same inputs with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing are fine and that the problem is specific to the RF classifier. My first reaction was to look for overfitting, but even when I run the model on the training data, it still gets almost everything wrong.
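The comparison I'm describing looks roughly like the sketch below. The `docs` and `labels` lists are made-up stand-ins for my real `train_file` and `train_targets`, and the fixed `random_state` is only there to make the sketch repeatable. It fits both classifiers on identical features and scores them on the training data itself.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for train_file and train_targets
docs = [
    "apple pear apple", "pear plum", "apple plum plum", "plum pear apple",
    "hammer saw", "saw drill saw", "drill hammer", "hammer drill saw",
]
labels = ["fruit"] * 4 + ["tool"] * 4

# Same dense word-count features for both models
X = CountVectorizer().fit_transform(docs).toarray()

models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("sgd", SGDClassifier(random_state=0)),
]

scores = {}
for name, model in models:
    model.fit(X, labels)
    # Accuracy on the training data: a sanity check, not a real evaluation
    scores[name] = accuracy_score(labels, model.predict(X))

print(scores)
```

On my real data the two training-set numbers are wildly different, which is why I suspect the setup rather than the tuning.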
I've played around with number of trees and amount of training data but that hasn't seemed to change much for me. I'm trying to only show the relevant code but can post more if that's helpful. First SO post so all thoughts and feedback appreciated.
#pull in package to create word occurrence vectors for each line
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1,charset_error='ignore')
X_train = vectorizer.fit_transform(train_file)
#convert to dense array, the required input type for random forest classifier
X_train = X_train.todense()
#pull in random forest classifier and train on data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100, compute_importances=True)
clf = clf.fit(X_train, train_targets)
#transform the test data into the vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.todense()
#export predictions, one per row (Python 2: csv files are opened in binary mode)
import csv
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])