How do you deal with missing values in a test set?

Contents

1 How do you deal with missing values in a test set?
2 Should we impute test data?
3 When to impute missing values in training set?
4 How is imputing missing values on testing set validated?

How do you deal with missing values in a test set?

There are several ways to deal with missing values:

Replace them with the mean (for numeric features) or the mode (for categorical features).
Replace them with a constant, say -1.
Use a predictive model to estimate them. I can't speak for SAS, but R provides several options for missing value imputation, such as kNN-based imputation and the Amelia package.
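The first two options can be sketched in a few lines of plain Python. This is only an illustration: the `ages` column and the `fill` helper are made up for the example, and missing entries are assumed to be marked as `None`. Model-based imputation is not shown here.

```python
from statistics import mean, mode

# Toy column with missing entries marked as None (hypothetical data).
ages = [22, None, 35, 22, None, 41]
observed = [v for v in ages if v is not None]


def fill(values, replacement):
    """Return a copy of values with every None replaced by replacement."""
    return [replacement if v is None else v for v in values]


mean_filled = fill(ages, mean(observed))   # mean of the observed values
mode_filled = fill(ages, mode(observed))   # most frequent observed value
const_filled = fill(ages, -1)              # sentinel constant

print(mean_filled)
print(mode_filled)
print(const_filled)
```

A constant such as -1 only works as a sentinel if it cannot occur as a real value in that column.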

Should we impute test data?

Do not impute your test set unless you know the imputed information would actually be available in real-life use. In most real-life settings, an imputation you could not reproduce at prediction time makes no sense.

When to impute missing values in training set?

Think of the training set as the past and the test set as the future. Anything you do to pre-process or process your data, such as imputing missing values, should be worked out on the training set alone. Record what you did to the training set so that, if the test set also needs pre-processing or imputing, you can apply exactly the same steps to both sets.
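One way to "remember" what was done to the training set is to keep the learned fill value in an object and reuse it. The sketch below assumes missing values are marked as `None`; the `MeanImputer` class is a hypothetical helper written for this example, not part of any library.

```python
class MeanImputer:
    """Hypothetical helper: learns a fill value from training data only."""

    def fit(self, values):
        # Compute the mean over the observed (non-missing) training values.
        observed = [v for v in values if v is not None]
        self.fill_value = sum(observed) / len(observed)
        return self

    def transform(self, values):
        # Reuse the remembered training mean; nothing is recomputed here.
        return [self.fill_value if v is None else v for v in values]


train = [10.0, None, 14.0, 12.0]
test = [None, 20.0]

imputer = MeanImputer().fit(train)       # statistic learned from train only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)    # same remembered value reused

print(imputer.fill_value)
print(train_filled)
print(test_filled)
```

Because `transform` never looks at the statistics of the data it receives, the test set cannot influence the fill value.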

Is it OK to impute missing values with the mean?

Yes, it is fine to perform mean imputation. However, make sure to calculate the mean (or any other statistic) only on the training data, to avoid leaking information from your test set.
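In Python, a minimal sketch of this with scikit-learn's `SimpleImputer` could look as follows. The arrays are toy data invented for the example, and missing values are assumed to be encoded as `np.nan`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices (hypothetical data); np.nan marks a missing value.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0], [5.0, np.nan]])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training data only.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses those training means; the test data is never fitted.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)   # per-column means learned from X_train
print(X_test_filled)
```

The key point is that `fit` (or `fit_transform`) is called on the training data only, and the test data only ever passes through `transform`.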

What happens when you impute both training and testing?

If you fit the imputation on both the training and the testing data, every new testing dataset forces you to re-impute all of the data. It also leaks information into the model, because statistics from the testing dataset are used in training. As a result, your evaluation looks better than it should, and the model may not predict well on genuinely new data.
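A small numeric illustration of the leak, using made-up values: when the test set comes from a different range than the training set, a mean computed on the pooled data pulls test information into the training rows.

```python
# Toy data (hypothetical); None marks a missing value.
train = [1.0, 2.0, None, 3.0]
test = [None, 100.0, 102.0]


def observed_mean(values):
    """Mean over the non-missing entries."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


train_only_mean = observed_mean(train)      # uses training values only
pooled_mean = observed_mean(train + test)   # also uses test values (leaky)

# Filling the training gap with the pooled mean injects test information.
train_filled_leaky = [pooled_mean if v is None else v for v in train]
train_filled_clean = [train_only_mean if v is None else v for v in train]

print(train_only_mean, pooled_mean)
print(train_filled_leaky)
print(train_filled_clean)
```

The leaky version fills the gap in the training data with a value far outside anything the training set contains, which the model could only have "known" by looking at the test set.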

How is imputing missing values on testing set validated?

I have been comparing several data pre-processing approaches that combine various filtering steps:

Removing mean-based outliers with mean replacement, and additionally replacing NAs with the mean.
Removing median absolute deviation (MAD) outliers with mean replacement, and additionally replacing NAs with the mean.
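The MAD-based variant could be sketched as below, with the thresholds and fill value learned on the training data and then applied unchanged to the test data. Several details are assumptions made for this example and are not given in the question: the cutoff of 3 scaled MADs, the 1.4826 scaling constant, the use of the mean of the training inliers as the replacement value, and the helper names `fit_mad_cleaner` and `apply_mad_cleaner`.

```python
import numpy as np


def fit_mad_cleaner(train, cutoff=3.0):
    """Learn outlier bounds and a fill value from training data only.

    Assumptions: outliers lie more than `cutoff` scaled MADs from the
    median; the fill value is the mean of the training inliers.
    """
    median = np.nanmedian(train)
    mad = 1.4826 * np.nanmedian(np.abs(train - median))
    # NaN comparisons are False, so missing values are excluded here too.
    inliers = train[np.abs(train - median) <= cutoff * mad]
    return {"median": median, "mad": mad, "cutoff": cutoff,
            "fill": inliers.mean()}


def apply_mad_cleaner(values, params):
    """Replace outliers and NaNs using the parameters learned on train."""
    values = values.astype(float).copy()
    limit = params["cutoff"] * params["mad"]
    is_outlier = np.abs(values - params["median"]) > limit
    values[is_outlier | np.isnan(values)] = params["fill"]
    return values


# Toy data (hypothetical); np.nan marks a missing value.
train = np.array([10.0, 11.0, 9.0, np.nan, 10.0, 50.0])
test = np.array([np.nan, 12.0, 80.0])

params = fit_mad_cleaner(train)
train_clean = apply_mad_cleaner(train, params)
test_clean = apply_mad_cleaner(test, params)

print(train_clean)
print(test_clean)
```

Each pre-processing variant can then be compared by fitting it on the training data and scoring the downstream model on the test data cleaned with those same training-derived parameters.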
