
I am training a RandomForestClassifier from sklearn.ensemble with the following code:

    import anndata as ad
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.sparse import issparse
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import classification_report, confusion_matrix

        adata = ad.read_h5ad(f'{data_dir}{ct}_clean_log1p_normalized.h5ad')
        adata = adata[:, adata.var.highly_variable]
        print(f'AnnData for {ct}: {adata}')
    
        # Extract feature matrix (X) and target vector (y)
        X = adata.X
        y = adata.obs['clinical_dx']
        
        # Convert sparse matrix to dense so X is a plain NumPy array
        if issparse(X):
            X = X.toarray()
        
        # Encode the string class labels as integers for the classifier
        le = LabelEncoder()
        y_encoded = le.fit_transform(y)
        
        X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
        
            
        # Initialize the classifier
        rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        
        # Train the classifier
        rf_classifier.fit(X_train, y_train)
    
        # Validate on the test set
        y_pred_rf = rf_classifier.predict(X_test)
        
        # View validation report
        validation_report = classification_report(y_test, y_pred_rf, target_names=le.classes_)
        print(validation_report)
    
        with open(f'{rfc_dir}validation_report.txt', "w") as report_file:
            report_file.write(validation_report)
    
        # Generate the confusion matrix
        cm = confusion_matrix(y_test, y_pred_rf, labels=rf_classifier.classes_)
        
        # Normalize each row so cells show percentages per true label
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
        plt.figure(figsize=(10, 9))
        sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Blues', 
                    xticklabels=le.inverse_transform(rf_classifier.classes_), 
                    yticklabels=le.inverse_transform(rf_classifier.classes_))
        
        plt.xlabel('Predicted Label')
        plt.ylabel('True Label')
        plt.title('Confusion Matrix (Random Forest)')
        
        plt.savefig(f"{rfc_dir}confusion_matrix.png", bbox_inches='tight')
        plt.close()
    
        # Get feature importances
        feature_importances_rf = rf_classifier.feature_importances_

        number_of_features = 200
        
        # Create a DataFrame for better visualization
        feature_importance_rf_df = pd.DataFrame({
            'Ensembl': adata.var_names,
            'Importance': feature_importances_rf
        })

        top_features = feature_importance_rf_df.sort_values(
            by='Importance', ascending=False
        ).head(number_of_features)
        
        top_features.to_csv(f'{rfc_dir}markers.csv', index=False)


Unfortunately, I cannot share the data since it is HIPAA-protected.

For some reason, regardless of which cell type I train a classifier on (there is a separate AnnData for each of 8 cell types), the importance scores in the CSV file are all incredibly low. The highest importance score in any case is around 0.01. Is this a red flag, or just something to do with my datasets? Has anyone experienced this before? Thanks
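For context on the scale I mean: since `feature_importances_` is normalized to sum to 1, a quick check on purely synthetic data (via `make_classification`, not my real data; the sizes here are made up) produces similarly small per-feature values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset: 500 samples, 2000 features (roughly the scale of
# a highly-variable-gene matrix), only 20 of them informative
X, y = make_classification(n_samples=500, n_features=2000,
                           n_informative=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances always sum to 1, so with 2000 features the mean is 1/2000
print(rf.feature_importances_.sum())   # 1.0
print(rf.feature_importances_.mean())  # 0.0005
print(rf.feature_importances_.max())
```

So if my matrices have thousands of highly variable genes, is a maximum around 0.01 actually small, or expected?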
