How to make model not too dependent on one variable?

Question

Let's suppose I have a generic model:

Variable A | Variable B | Variable C | Variable D

Variable D is a categorical variable (for example, car models, where the dataset on which I trained my model only contains models up to the year 2020).

I know for sure that Variable A | Variable B | Variable C are always present; however, Variable D can be missing (for example, if I am using car models from 2021).

My questions are:

If I cannot use data from 2021, how safe is it to use Variable D in my predictions?
Could I just randomly assign a value to Variable D when it is missing?
Is it possible that the model may become too reliant on Variable D and by randomly assigning values I might introduce bias?
Should I just drop Variable D, or just the rows without an associated category in the data on which my model has been trained?

Thank you for your time.

Akshay · Accepted Answer · 2021-11-14 14:45:26Z

The answers to all your questions really depend on what Variable D is. Based on your description, it does seem that your model would be too dependent on Variable D and would not generalize.

I'll be using the car model example you have mentioned to explain my answer.

Let us consider a model which predicts Car Price based on car features. The dataset would be as follows:

Here you should not use Car Model as a feature because:

Car Model is a direct indicator of price. The model will just learn the mapping Car Model -> Price and will not learn anything from the other features.
For future cases (car models released after the training data was collected), Car Model does not help prediction.

Consider a new car for which your model has to find the price. You'll have the following data:

Since your model has never seen audi 100ls during training, it would likely make a very bad prediction.
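To make the failure mode concrete, here is a minimal sketch of a "model" that only memorizes the mapping Car Model -> Price. The car models other than audi 100ls, all prices, and the helper names `fit_lookup` and `predict` are made up for illustration and are not taken from the original dataset.

```python
from statistics import mean

# Hypothetical toy training data (made up for illustration): (car_model, price)
train = [
    ("toyota corolla", 7000),
    ("toyota corolla", 7400),
    ("bmw x3", 30000),
    ("bmw x3", 31000),
]

def fit_lookup(rows):
    """Memorize the average price per car model (a pure Car Model -> Price mapping)."""
    prices = {}
    for car_model, price in rows:
        prices.setdefault(car_model, []).append(price)
    return {car_model: mean(values) for car_model, values in prices.items()}

def predict(lookup, car_model, fallback):
    """Return the memorized price, or the fallback when the car model is unseen."""
    return lookup.get(car_model, fallback)

lookup = fit_lookup(train)
global_mean = mean(price for _, price in train)

seen = predict(lookup, "bmw x3", global_mean)        # memorized from training
unseen = predict(lookup, "audi 100ls", global_mean)  # never seen: only the global mean is left
```

For the unseen car model the lookup has no information at all, so the prediction collapses to a generic fallback. A model that relies mostly on Car Model behaves similarly, which is why features such as fueltype or mileage matter for new cars.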

You need to ask the following questions to help you decide what to do:

Will the variable be available during real-time prediction? If not, then do not use it during training.
If it is available, does it help prediction? E.g., a new car model's name does not help you determine the car's price, but other features like fueltype, doornumber, mileage, etc. do help.
If the variable is both available and helpful for real-time prediction, you can try imputing missing values.
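One possible way to impute a missing or unseen category is to map it to a dedicated placeholder instead of assigning a random known value. The sketch below assumes such a placeholder is acceptable for your model; the `UNKNOWN` label, the `min_count` threshold, and the helper names are hypothetical choices, not something the answer prescribes. Rare training categories are folded into the same placeholder so that the model actually sees it during training.

```python
from collections import Counter

UNKNOWN = "unknown"  # hypothetical placeholder label (an assumption for this sketch)

def fit_vocabulary(train_values, min_count=2):
    """Keep categories seen at least `min_count` times; all others share UNKNOWN."""
    counts = Counter(train_values)
    return {value for value, n in counts.items() if n >= min_count}

def encode(value, vocabulary):
    """Map missing or unseen categories to UNKNOWN instead of a random known value."""
    if value is None or value not in vocabulary:
        return UNKNOWN
    return value

# Made-up example values for Variable D (car model)
train_values = ["toyota corolla", "toyota corolla", "bmw x3", "bmw x3", "mazda rx3"]
vocabulary = fit_vocabulary(train_values)

# The rare training category becomes UNKNOWN, so the model sees that bucket in training
train_encoded = [encode(v, vocabulary) for v in train_values]

# At prediction time, a 2021 car model or a missing value lands in the same bucket
new_encoded = [encode(v, vocabulary) for v in ["bmw x3", "audi 100ls", None]]
```

Whether this helps still depends on the second question above: if the placeholder bucket carries little signal, the predictions for new cars will rest on the remaining features.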
