Questions tagged [feature-engineering]

Feature engineering is the process of using domain knowledge of the data to create features for machine learning models. This tag is meant for both theoretical and practical questions regarding feature engineering, excluding questions asking for code, which would be off-topic on Cross Validated.

769 questions

0 votes

0 answers

41 views

Is it methodologically sound to apply WOE/IV binning before correlation and VIF-based feature selection?

In credit scoring / logistic regression models, it’s common to apply WOE (Weight of Evidence) binning to continuous and categorical variables before modeling. However, WOE binning discretizes ...

Ibrahim Rustamov

asked Feb 12 at 15:23

2 votes

1 answer

28 views

How should distribution shift in docking-derived energy features be handled when ligand size changes?

I’m using docking-derived binding energy values as input features in a machine-learning model. All of the original data was generated from molecules of similar size, but our new dataset contains much ...

CCC

asked Dec 6, 2025 at 21:08

2 votes

0 answers

54 views

How to choose features for a Gamma regression, vs. Linear Regression

I'm new to using GLMs which are not Linear Regression, and am working on a project where I am using Gamma regression with a log-link. I'm having problems with the feature engineering step. With linear ...

michael james

asked Nov 10, 2025 at 20:08

1 vote

0 answers

44 views

Designing a demand forecasting model with a dynamic daily update and a final horizon prediction — best practices to avoid leakage?

I am working on a demand forecasting problem for ferry vehicle capacity. For each voyage, I have daily snapshots of the cumulative reservations from the opening date until departure day. So each ...

Analivia Valery

asked Nov 7, 2025 at 15:53

1 vote

0 answers

70 views

Impact of Full Probability Distribution in GP Regression on Optimisation

In the context of an engineering design project that requires determining optimal design configurations (e.g., finding optimal design configurations of a nozzle that maximise thrust ratio and discharge ...

xminx

asked Sep 19, 2025 at 13:30

4 votes

1 answer

174 views

Is using contemporaneous components to forecast an aggregate a valid method or a form of data leakage?

I am in the middle of a deep methodological debate regarding a time series forecasting problem and would appreciate the community's expert opinion. The Context: I am trying to forecast an aggregate ...

PSE

asked Sep 17, 2025 at 5:35

0 votes

0 answers

78 views

How to handle short runtimes and class imbalance for ML?

I’m revisiting a paper that used CPU performance counters (PMUs/HPCs) to detect malware with machine learning. I have two questions for the ML community: (1) Unequal runtime lengths: for each sample, I ...

WhiteForce

asked Sep 4, 2025 at 20:54

2 votes

0 answers

72 views

Preventing data leakage when using street-level aggregated features in classification

I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...

setty

asked Aug 14, 2025 at 19:26

2 votes

0 answers

48 views

How to improve few-shot generalization to unseen families in tabular regression (XGBoost vs neural nets, feature encoding)?

This is my first post in this forum. I'm working on a regression problem where $y = f(X_1, X_2, N_1, N_2)$. $X_1$, $X_2$ are continuous features; $N_1$, $N_2$ are integer "family/group" IDs. ...

cconsta1

asked Jun 8, 2025 at 21:01

0 votes

0 answers

52 views

Handling Missing Values in the dataset

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...

Anirudh

asked Apr 2, 2025 at 7:34

0 votes

0 answers

50 views

Extract Information from feature having different distribution for different classes

I'm working with the Kaggle Dataset - 'Give Me Some Credit', and I am trying to improve the model AUC score through feature engineering. I already created several categorical features, and I noticed ...

hadamardgate

asked Mar 25, 2025 at 16:11

1 vote

1 answer

78 views

Is the arithmetic mean appropriate when feature scaling rates?

Certain machine learning algorithms perform better when the features of the dataset have been scaled. In particular, feature standardization (subtracting the mean and dividing by the standard ...

steeps

asked Mar 10, 2025 at 11:19

0 votes

1 answer

78 views

Is normalization necessary when the predictions are made per group?

As a beginner in ML, I am watching a YouTube video about designing a song recommendation model for Spotify. For each user, predictions are to be made about which songs to recommend. One of the ...

bilanush

asked Mar 7, 2025 at 23:14

0 votes

1 answer

107 views

Conflicting feature importance from different models

I am trying to derive feature importance (solubility) for a dataset with correlated features using different models to predict (physicochemical) properties for small molecules: I am using ridge, lasso ...

limmi

asked Feb 27, 2025 at 13:41

0 votes

0 answers

56 views

Predictive modeling on biased features

Some features I want to use for modeling have distributions like below: There are high values of the features occurring frequently in my data. I can identify a subset of my data points that cause ...

Jakub Małecki

asked Feb 11, 2025 at 15:19
