How to perform feature selection for categorical variables?

How to perform feature selection for categorical variables?

If i have a dataset with 50 Categorical and 50 numerical variables then how can i perform Feature selection for my Categorical variables. I believe that we can convert those 50 Categorical variables into continuous using One Hot Encoding or Feature Hashing and apply SelectKBest or RFECV or PCA..

How is feature selection used in regression modeling?

Feature selection is the process of identifying and selecting a subset of input variables that are most relevant to the target variable. Perhaps the simplest case of feature selection is the case where there are numerical input variables and a numerical target for regression predictive modeling.

What are the different types of feature selection?

There are two main types of feature selection techniques: supervised and unsupervised, and supervised methods may be divided into wrapper, filter and intrinsic. Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.

How to perform feature selection with numerical input data?

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem. A naive model can achieve an accuracy of about 65 percent on this dataset. A good score is about 77 percent +/- 5 percent.

How to detect multicollinearity in a categorical variable?

For categorical variables, multicollinearity can be detected with Spearman rank correlation coefficient (ordinal variables) and chi-square test (nominal variables).

How are statistical measures used in feature selection?

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.