All Questions
2,843 questions
2
votes
1
answer
38
views
How to fit scaler for different subsets of rows depending on group variable and include it in a Pipeline?
I have a data set like the following and want to scale the data using any of the scalers in sklearn.preprocessing.
Is there an easy way to fit this scaler not over the whole data set, but per group? ...
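For the per-group scaling question above, one possible approach is a small custom transformer that fits a separate scaler for each group. This is only a sketch: the class name `GroupwiseScaler`, the column names `group` and `value`, and the choice of `StandardScaler` as the default are all assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class GroupwiseScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fits one scaler per value of `group_col`."""

    def __init__(self, group_col, scaler=None):
        self.group_col = group_col
        self.scaler = scaler

    def fit(self, X, y=None):
        base = self.scaler if self.scaler is not None else StandardScaler()
        # Every column except the group column gets scaled.
        self.value_cols_ = [c for c in X.columns if c != self.group_col]
        # Fit an independent clone of the scaler on each group's rows.
        self.scalers_ = {
            key: clone(base).fit(part[self.value_cols_])
            for key, part in X.groupby(self.group_col)
        }
        return self

    def transform(self, X):
        out = X[self.value_cols_].astype(float).copy()
        for key, part in X.groupby(self.group_col):
            # A group not seen during fit raises KeyError here.
            out.loc[part.index, self.value_cols_] = self.scalers_[key].transform(
                part[self.value_cols_]
            )
        return out


# Made-up example data.
df = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 30.0]})
scaled = GroupwiseScaler(group_col="group").fit_transform(df)
```

Because the class follows the usual fit/transform interface, it could be used as a step in a `Pipeline`. Note that it drops the group column from its output and assumes a unique DataFrame index.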
1
vote
1
answer
56
views
How to apply different model on different rows of a pandas dataframe?
I have a pandas dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({'id': [1,2], 'var1': [5,6], 'var2': [20,60], 'var3': [8, -2], 'model_version': ['model_a', 'model_b']})
I have 2 ...
-1
votes
1
answer
48
views
Error in Pipeline code in scikit-learn using Python
In the pipeline code below, even though I have encoded the sex column, I am getting a string-to-float error.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from ...
2
votes
0
answers
387
views
Model Training for Segmentation [duplicate]
I want to train and evaluate models to find the best model for each of my segments, but sklearn reports a problem with the tags and the estimators, and I can't figure out the issue. There might be ...
1
vote
1
answer
204
views
Ignore NaN to calculate mean_absolute_error
I'm trying to calculate MAE (Mean absolute error).
In my original DataFrame, I have 1826 rows and 3 columns. I'm using columns 2 and 3 to calculate MAE.
But, in column 2, I have some NaN values.
When ...
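For the NaN question above, one common approach is to drop the rows where either column is missing before computing the metric. The sketch below assumes made-up column names (`observed`, `predicted`) and made-up data, since the original DataFrame is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical stand-in for the two columns used to compute MAE.
df = pd.DataFrame({
    "observed": [1.0, np.nan, 3.0, 4.0],
    "predicted": [1.5, 2.0, 2.0, 4.0],
})

# Keep only the rows where both columns have a value.
valid = df[["observed", "predicted"]].dropna()

# MAE over the remaining rows.
mae = mean_absolute_error(valid["observed"], valid["predicted"])
print(mae)
```

Dropping rows changes the number of samples the error is averaged over, so the result describes only the rows with complete data.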
-2
votes
1
answer
87
views
Cannot convert dataframe column to an int64 data type
I have a problem.
In my Pandas DataFrame, I have a column called 'job'. I've created a simple custom transformer that maps the values in that column to the corresponding type of job. The ...
0
votes
1
answer
207
views
How to create a scaler applying log transformation and MinMaxScaler in sklearn
I want to apply log() to my DataFrame and MinMaxScaler() together.
I want the output to be a pandas DataFrame() with indexes and columns from the original data.
I want to use the parameters used to ...
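For the log-plus-MinMaxScaler question above, one option is to chain a `FunctionTransformer` and a `MinMaxScaler` in a pipeline and wrap the result back into a DataFrame. This is a minimal sketch under assumptions: the data is made up and strictly positive (so that `np.log` is defined), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Made-up, strictly positive data with a custom index.
df = pd.DataFrame(
    {"a": [1.0, 10.0, 100.0], "b": [2.0, 4.0, 8.0]},
    index=["x", "y", "z"],
)

# Step 1 applies log; step 2 rescales each column to [0, 1].
log_minmax = make_pipeline(
    FunctionTransformer(np.log, inverse_func=np.exp),
    MinMaxScaler(),
)

# fit_transform returns a NumPy array, so restore index and columns by hand.
scaled = pd.DataFrame(
    log_minmax.fit_transform(df), index=df.index, columns=df.columns
)

# The fitted pipeline keeps its parameters, so the transform can be reversed.
restored = pd.DataFrame(
    log_minmax.inverse_transform(scaled), index=df.index, columns=df.columns
)
```

The fitted `log_minmax` object can also be reused with `transform` on other data that has the same columns.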
3
votes
2
answers
128
views
How to preserve data types when working with pandas and sklearn transformers?
While working with a large sklearn Pipeline (fit using a DataFrame) I ran into an error that led back to a wrong data type in my input. The problem occurred on a single observation coming from an ...
-1
votes
1
answer
79
views
How can I achieve accurate imputation of missing values in a dataset?
I'm working with a dataset containing details about used cars, and I've encountered several missing values in the Fuel_Type column. The possible values include 'Gasoline', 'E85 Flex Fuel', 'Hybrid', '...
0
votes
1
answer
254
views
How do I convert string data to numerical data using Label Encoder?
I was trying to convert string data into numerical data in a CSV excel sheet. It kept giving me an error about previously unseen labels, so I looked it up and found that we can use Label Encoder to ...
-1
votes
1
answer
123
views
How to Optimize Memory Usage for Cross-Validation of Large Datasets
I have a very large DataFrame (~200GB) of features, and I want to cross-validate a random forest model on these features.
The features are from a huggingface model in the form of a .arrow file....
0
votes
1
answer
42
views
Error with get_feature_names_out when getting back the feature names
I want to know the feature importances for my data, so I use permutation_importance. When I get the result, it seems the features are already decoded, and I want to get the names of my features using ...
1
vote
2
answers
202
views
Convert Pandas dataframe of objects to a dataframe of vectors
I have a Pandas dataframe (over 1k rows). There are numbers, objects, strings, and Boolean values in my dataframe. I want to convert each 'cell' of the dataframe to a vector, and work with the ...
3
votes
2
answers
85
views
How can I link the records in the training dataset to the corresponding model predictions?
Using scikit-learn, I've set up a regression model to predict customers' maximum spend per transaction. The dataset I'm using looks a bit like this; the target column is maximum spend per transaction ...
1
vote
1
answer
58
views
How to save single Random Forest model with cross validation?
I am using 10 fold cross validation, trying to predict binary labels (Y) based on the embedding inputs (X).
I want to save one of the models (perhaps the one with the highest ROC AUC). I'm not sure ...