What considerations have been used for model selection?

Four commonly used model selection measures are:

  • Akaike Information Criterion (AIC).
  • Bayesian Information Criterion (BIC).
  • Minimum Description Length (MDL).
  • Structural Risk Minimization (SRM).
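As a rough illustration of the first two measures, the sketch below scores a constant model and a straight-line model on synthetic data using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The data, the two candidate models, and all function names are assumptions made for this example; MDL and SRM are not shown.

```python
import math
import random

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with the noise variance set to its MLE."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

def residuals_constant(xs, ys):
    """Residuals of the best constant fit (the sample mean)."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]

def residuals_line(xs, ys):
    """Residuals of the ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Synthetic data (an assumption for illustration): a noisy straight line.
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5) for x in xs]
n = len(xs)

# k counts every fitted parameter, including the noise variance.
candidates = [("constant", residuals_constant, 2), ("line", residuals_line, 3)]
scores = {}
for name, fit, k in candidates:
    log_lik = gaussian_log_likelihood(fit(xs, ys))
    scores[name] = {"aic": aic(log_lik, k), "bic": bic(log_lik, k, n)}

best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
print(scores, best_by_bic)
```

For both criteria a lower score is better. BIC penalizes each extra parameter by ln n rather than 2, so it tends to prefer simpler models as the sample grows.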

What is the main difference between regularization and model selection approaches?

One advantage of the model selection approach over regularization is that the number of clusters can be chosen from the data, whereas the regularization approach requires the user to know or specify it in advance. Model selection also makes it possible to choose among a range of models for the covariance structure.
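To make the idea of choosing the number of clusters from the data concrete, here is a minimal sketch that fits one-dimensional Gaussian mixtures with 1, 2, and 3 components by EM and compares them with BIC. The synthetic data, the quantile-based initialization, the variance floor, and every function name are assumptions for illustration. Different covariance structures could be compared in the same way by changing the parameter count and the M-step.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_log_likelihood(data, weights, means, variances):
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(max(density, 1e-300))  # guard against underflow
    return total

def fit_gmm_1d(data, k, n_iter=100, var_floor=1e-3):
    """Fit a k-component 1D Gaussian mixture by EM; return its log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # avoid collapsing components
    return gmm_log_likelihood(data, weights, means, variances)

# Synthetic data (an assumption): two well-separated clusters.
random.seed(1)
data = ([random.gauss(0, 1) for _ in range(100)]
        + [random.gauss(10, 1) for _ in range(100)])
n = len(data)

bics = {}
for k in (1, 2, 3):
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    bics[k] = n_params * math.log(n) - 2 * fit_gmm_1d(data, k)

best_k = min(bics, key=bics.get)
print(bics, best_k)
```

With clusters this well separated, the single-component model scores far worse than the two-component one. Whether a third component ever wins depends on the data and on the EM starting point.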

Which is the best strategy for model selection?

The recommended strategy for model selection depends on the amount of data available. If plenty of data is available, we can split it into several parts, each serving a specific purpose. For hyperparameter tuning, for instance, we may split the data into three sets: train, validation, and test. When data is scarce, cross-validation is the usual alternative, since it lets every observation be used for both fitting and validation.
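A three-way split can be sketched in a few lines. The function name, the 60/20/20 proportions, and the fixed seed below are assumptions chosen for illustration, not a recommendation from the text.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The test part is set aside first and is not touched again until the final model has been chosen.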

When do you need nested cross validation for model selection?

Whenever model selection is part of fitting the model, it must be repeated independently within each fold of the cross-validation procedure used to estimate performance. If the selection itself is done by cross-validation, the result is nested cross-validation: an inner loop selects the model and an outer loop estimates the performance of the whole procedure.
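The nesting can be sketched with a deliberately small model. The example below uses a one-dimensional k-nearest-neighbours regressor on synthetic data, with the number of neighbours as the quantity being selected. The model, the data, the fold counts, and all names are assumptions for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, held_out, k):
    return sum((knn_predict(train, x, k) - y) ** 2
               for x, y in held_out) / len(held_out)

def k_folds(data, n_folds):
    """Yield (train, held_out) pairs; fold i holds out items i, i + n_folds, ..."""
    for i in range(n_folds):
        held_out = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, held_out

def select_k(data, candidates, n_folds=3):
    """Inner loop: pick the k with the lowest mean cross-validated MSE."""
    def cv_mse(k):
        return sum(mse(tr, ho, k) for tr, ho in k_folds(data, n_folds)) / n_folds
    return min(candidates, key=cv_mse)

# Synthetic data (an assumption): a noisy sine curve, already in random order.
random.seed(0)
data = []
for _ in range(120):
    x = random.uniform(0, 6)
    data.append((x, math.sin(x) + random.gauss(0, 0.3)))

candidates = [1, 3, 5, 9, 15]
outer_scores, chosen = [], []
for outer_train, outer_test in k_folds(data, 5):
    # Selection sees only outer_train, never the outer test fold.
    best_k = select_k(outer_train, candidates)
    chosen.append(best_k)
    outer_scores.append(mse(outer_train, outer_test, best_k))

generalization_estimate = sum(outer_scores) / len(outer_scores)
print(chosen, generalization_estimate)
```

The outer folds may pick different values of k. The averaged outer score estimates how well the whole procedure (selection plus fitting) generalizes, not how well any single k does.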

How is the training set used in model selection?

The training set is used to train as many models as there are different combinations of model hyperparameters. These models are then evaluated on the validation set, and the model with the best performance on this validation set is selected as the winning model.
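A minimal sketch of this procedure follows, assuming a one-dimensional k-nearest-neighbours regressor with two hyperparameters (the number of neighbours and the weighting scheme). For this model "training" only means storing the training points. The data, the grid, and all names are assumptions for illustration.

```python
import itertools
import math
import random

def knn_predict(train, x, k, weighting):
    """Predict y at x from the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighting == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # "distance": closer neighbours get proportionally more weight.
    w = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)

def mse(train, data, k, weighting):
    return sum((knn_predict(train, x, k, weighting) - y) ** 2
               for x, y in data) / len(data)

# Synthetic data (an assumption): a noisy sine curve.
random.seed(0)
points = []
for _ in range(150):
    x = random.uniform(0, 6)
    points.append((x, math.sin(x) + random.gauss(0, 0.3)))
train, val, test = points[:90], points[90:120], points[120:]

# One candidate model per combination of hyperparameters.
grid = list(itertools.product([1, 3, 5, 9, 15], ["uniform", "distance"]))
val_scores = {params: mse(train, val, *params) for params in grid}

# The winner is the combination with the lowest validation error.
best_params = min(val_scores, key=val_scores.get)

# The test set is used once, only to report the winner's error.
test_mse = mse(train, test, *best_params)
print(best_params, test_mse)
```

Because the winning combination was chosen to minimize validation error, its validation score is an optimistic estimate. That is why the final error is reported on the separate test set.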

Do you need independent data for model selection?

Yes. An error estimate computed on data that was also used to fit or select the model is optimistically biased. To avoid this, we need completely independent data for estimating the generalization error of a model, which is also the motivation for cross-validation.