
Questions tagged [feature-engineering]

Feature engineering is the process of using domain knowledge of the data to create features for machine learning models. This tag is meant for both theoretical and practical questions regarding feature engineering, excluding questions asking for code, which would be off-topic on CrossValidated.

0 votes
0 answers
41 views

In credit scoring / logistic regression models, it’s common to apply WOE (Weight of Evidence) binning to continuous and categorical variables before modeling. However, WOE binning discretizes ...
Ibrahim Rustamov
2 votes
1 answer
28 views

I’m using docking-derived binding energy values as input features in a machine-learning model. All of the original data was generated from molecules of similar size, but our new dataset contains much ...
CCC
  • 21
2 votes
0 answers
54 views

I'm new to GLMs other than linear regression, and am working on a project where I am using Gamma regression with a log link. I'm having problems with the feature engineering step. With linear ...
michael james
1 vote
0 answers
44 views

I am working on a demand forecasting problem for ferry vehicle capacity. For each voyage, I have daily snapshots of the cumulative reservations from the opening date until departure day. So each ...
Analivia Valery
1 vote
0 answers
70 views

In the context of an engineering design project that requires determining optimal design configurations (e.g., finding optimal design configurations of a nozzle that maximise thrust ratio and discharge ...
xminx
  • 11
4 votes
1 answer
174 views

I am in the middle of a deep methodological debate regarding a time series forecasting problem and would appreciate the community's expert opinion. The Context: I am trying to forecast an aggregate ...
PSE
  • 318
0 votes
0 answers
78 views

I’m revisiting a paper that used CPU performance counters (PMUs/HPCs) to detect malware with machine learning. I have two questions for the ML community: Unequal runtime lengths: For each sample, I ...
WhiteForce
2 votes
0 answers
72 views

I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
setty
  • 161
2 votes
0 answers
48 views

This is my first post in this forum. I'm working on a regression problem where $y = f(X_1, X_2, N_1, N_2)$. $X_1$, $X_2$ are continuous features; $N_1$, $N_2$ are integer "family/group" IDs. ...
cconsta1
  • 121
0 votes
0 answers
52 views

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...
Anirudh
0 votes
0 answers
50 views

I'm working with the Kaggle Dataset - 'Give Me Some Credit', and I am trying to improve the model AUC score through feature engineering. I already created several categorical features, and I noticed ...
hadamardgate
1 vote
1 answer
78 views

Certain machine learning algorithms perform better when the features of the dataset have been scaled. In particular, feature standardization (subtracting the mean and dividing by the standard ...
steeps
  • 11
0 votes
1 answer
78 views

As a beginner in ML, I am watching a video on YouTube about designing a model for song recommendations on Spotify. For each user, the model is to predict which songs should be recommended. One of the ...
bilanush
  • 119
0 votes
1 answer
107 views

I am trying to derive feature importance (solubility) for a dataset with correlated features using different models to predict (physicochemical) properties for small molecules: I am using ridge, lasso ...
limmi
  • 11
0 votes
0 answers
56 views

Some features I want to use for modeling have distributions like below: There are high values of the features occurring frequently in my data. I can identify a subset of my data points that cause ...
Jakub Małecki
