Questions tagged [feature-engineering]
Feature engineering is the process of using domain knowledge of the data to create features for machine learning models. This tag is meant for both theoretical and practical questions regarding feature engineering, excluding questions that ask for code, which would be off-topic on Cross Validated.
769 questions
0
votes
0
answers
41
views
Is it methodologically sound to apply WOE/IV binning before correlation and VIF-based feature selection?
In credit scoring / logistic regression models, it’s common to apply WOE (Weight of Evidence) binning to continuous and categorical variables before modeling.
However, WOE binning discretizes ...
2
votes
1
answer
28
views
How should distribution shift in docking-derived energy features be handled when ligand size changes?
I’m using docking-derived binding energy values as input features in a machine-learning model.
All of the original data was generated from molecules of similar size, but our new dataset contains much ...
2
votes
0
answers
54
views
How to choose features for Gamma regression vs. linear regression
I'm new to using GLMs which are not Linear Regression, and am working on a project where I am using Gamma regression with a log-link. I'm having problems with the feature engineering step.
With linear ...
1
vote
0
answers
44
views
Designing a demand forecasting model with a dynamic daily update and a final horizon prediction — best practices to avoid leakage?
I am working on a demand forecasting problem for ferry vehicle capacity.
For each voyage, I have daily snapshots of the cumulative reservations from the opening date until departure day.
So each ...
1
vote
0
answers
70
views
Impact of Full Probability Distribution in GP Regression on Optimisation
In the context of an engineering design project that requires determining optimal design configurations (e.g., finding optimal design configurations of a nozzle that maximise thrust ratio and discharge ...
4
votes
1
answer
174
views
Is using contemporaneous components to forecast an aggregate a valid method or a form of data leakage?
I am in the middle of a deep methodological debate regarding a time series forecasting problem and would appreciate the community's expert opinion.
The Context
I am trying to forecast an aggregate ...
0
votes
0
answers
78
views
How to handle short runtimes and class imbalance for ML?
I’m revisiting a paper that used CPU performance counters (PMUs/HPCs) to detect malware with machine learning. I have two questions for the ML community:
Unequal runtime lengths
For each sample, I ...
2
votes
0
answers
72
views
Preventing data leakage when using street-level aggregated features in classification
I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
2
votes
0
answers
48
views
How to improve few-shot generalization to unseen families in tabular regression (XGBoost vs neural nets, feature encoding)?
This is my first post in this forum.
I'm working on a regression problem where $y = f(X_1, X_2, N_1, N_2)$. $X_1$, $X_2$ are continuous features; $N_1$, $N_2$ are integer "family/group" IDs. ...
0
votes
0
answers
52
views
Handling Missing Values in the dataset
I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...
0
votes
0
answers
50
views
Extract information from a feature that has different distributions for different classes
I'm working with the Kaggle Dataset - 'Give Me Some Credit', and I am trying to improve the model AUC score through feature engineering. I already created several categorical features, and I noticed ...
1
vote
1
answer
78
views
Is the arithmetic mean appropriate when feature scaling rates?
Certain machine learning algorithms perform better when the features of the dataset have been scaled. In particular, feature standardization (subtracting the mean and dividing by the standard ...
0
votes
1
answer
78
views
Is normalization necessary when the predictions are made per group?
As a beginner in ML, I am watching a video on YouTube about designing a model for song recommendations on Spotify. For each user, predictions are to be made about which songs to recommend.
One of the ...
0
votes
1
answer
107
views
Conflicting feature importance from different models
I am trying to derive feature importance (solubility) for a dataset with correlated features using different models to predict (physicochemical) properties for small molecules:
I am using ridge, lasso ...
0
votes
0
answers
56
views
Predictive modeling on biased features
Some features I want to use for modeling have distributions like below:
There are high values of the features occurring frequently in my data. I can identify a subset of my data points that cause ...