For a dichotomous categorical variable and a continuous variable you can calculate a Pearson correlation if the categorical variable has a 0/1-coding for the categories. This correlation is then also known as a point-biserial correlation coefficient.
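As a small illustration of this equivalence, the sketch below computes the Pearson correlation on a 0/1-coded variable and compares it with the textbook point-biserial formula. The data and the variable names (`group`, `score`) are made up for the example.

```python
import numpy as np

# Hypothetical data: `group` is a 0/1-coded category, `score` is continuous.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score = np.array([2.0, 3.0, 4.0, 3.0, 6.0, 7.0, 5.0, 6.0])

# Pearson correlation computed directly on the 0/1 codes.
r_pearson = np.corrcoef(group, score)[0, 1]

# Point-biserial formula: (M1 - M0) / s * sqrt(p * q),
# where s is the population standard deviation (ddof=0) of `score`.
m1 = score[group == 1].mean()
m0 = score[group == 0].mean()
p = group.mean()   # proportion of observations coded 1
q = 1.0 - p
r_pb = (m1 - m0) / score.std(ddof=0) * np.sqrt(p * q)

print(r_pearson, r_pb)  # the two values agree up to floating-point error
```

If SciPy is available, `scipy.stats.pointbiserialr` should return the same coefficient together with a p-value.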
How do you remove multicollinearity from a categorical variable?
The dummy columns created by pd.get_dummies are highly correlated with one another. To avoid or remove multicollinearity after one-hot encoding with pd.get_dummies, you can drop one of the categories, which removes the collinearity between the dummy features. pandas (not Sklearn) provides this through the drop_first=True argument of pd.get_dummies.
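A minimal sketch of the difference, using a made-up `color` column: with the full encoding every row of dummies sums to 1, so any one column is determined by the others; `drop_first=True` removes that redundancy.

```python
import pandas as pd

# Hypothetical single-column dataset.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df, columns=["color"], dtype=int)

# Reduced encoding: the first category (alphabetically "blue") is dropped
# and becomes the implicit reference level.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# Each row of the full encoding sums to 1, which is the exact linear
# dependence that drop_first=True removes.
print(full.sum(axis=1).tolist())
```

The dropped category is not lost: a row with zeros in all remaining dummy columns belongs to the reference level.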
How to investigate multicollinearity for categorical variables?
For pairs of categorical variables, association can be checked with the Spearman rank correlation coefficient (ordinal variables) or the chi-square test (nominal variables). For a categorical and a continuous variable, it can be checked with a t-test (if the categorical variable has 2 categories) or ANOVA (more than 2 categories).
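The four checks mentioned above can be sketched with SciPy as follows. All data here are invented for illustration, and these are pairwise association tests, so they only hint at multicollinearity rather than measure it across a whole model.

```python
import numpy as np
from scipy import stats

# Nominal vs nominal: chi-square test on a contingency table of counts.
table = np.array([[30, 10],
                  [10, 30]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Ordinal vs ordinal: Spearman rank correlation.
x = [1, 2, 2, 3, 4, 5]
y = [1, 1, 2, 3, 3, 5]
rho, p_rho = stats.spearmanr(x, y)

# Two-category variable vs continuous variable: independent-samples t-test
# on the continuous values of each category.
a = [2.0, 3.0, 4.0, 3.0]
b = [6.0, 7.0, 5.0, 6.0]
t_stat, p_t = stats.ttest_ind(a, b)

# Three or more categories vs continuous variable: one-way ANOVA.
c = [9.0, 10.0, 8.0, 9.0]
f_stat, p_f = stats.f_oneway(a, b, c)

print(p_chi2, rho, p_t, p_f)
```

A small p-value (or a large |rho|) suggests the two variables carry overlapping information and may be worth a closer look before both enter the same model.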
When is collinearity a special case?
Collinearity is the special case in which two or more variables are exactly correlated. Unfortunately, multicollinearity makes it harder to understand what is going on in the model.
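One way to see exact collinearity is through the rank of the design matrix. In the sketch below, the simulated column `x3` is constructed as an exact linear combination of `x1` and `x2`, so the matrix loses one rank. The names and coefficients are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 2.0 * x1 - x2   # exact linear combination of x1 and x2

# Design matrix with an intercept column and the three predictors.
X = np.column_stack([np.ones(50), x1, x2, x3])

rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # 4 columns, rank 3
```

When the rank is lower than the number of columns, the least-squares coefficients are not uniquely determined, which is why individual effects cannot be separated.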
What happens if multicollinearity is not present in a model?
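If multicollinearity does not affect the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If the multicollinearity is confined to the control variables, the coefficients of the experimental variables can still be interpreted without problems.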
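A rough way to check which variables are affected is the variance inflation factor (VIF), computed here by hand with NumPy. The helper `vif` and the simulated variables (`treatment`, `control_a`, `control_b`) are hypothetical names for this sketch, not part of any library.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
n = 500
treatment = rng.normal(size=n)                     # variable of interest
control_a = rng.normal(size=n)                     # control variable
control_b = control_a + 0.1 * rng.normal(size=n)   # nearly a copy of control_a

X = np.column_stack([treatment, control_a, control_b])
vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

In this simulation the two controls should show very large VIFs while `treatment` stays close to 1, which matches the situation described above. The `statsmodels` package offers a ready-made `variance_inflation_factor` function if you prefer not to write your own.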
How is collinearity related to the theory of connected designs?
See "On the theory of connected designs: Characterization and optimality," The Annals of Statistics, 2(6), pp. 1238-1255. The upshot is that when categorical variables are collinear, the dataset splits into disconnected parts, and a reference level is needed in each component.
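One possible reading of this for two categorical factors: treat every level as a node and every observation as an edge between the two levels it combines, then count the connected components. The sketch below does this with a small union-find and compares the count with the rank of the full dummy matrix. The observations and the labels `f1`/`f2` are invented for the example.

```python
import numpy as np

# Hypothetical observations of two categorical factors.
obs = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z"), ("d", "z")]

def find(parent, node):
    """Return the root of `node`, compressing the path along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

# One node per level of each factor; each observation links two nodes.
nodes = {("f1", u) for u, _ in obs} | {("f2", v) for _, v in obs}
parent = {n: n for n in nodes}
for u, v in obs:
    ru, rv = find(parent, ("f1", u)), find(parent, ("f2", v))
    if ru != rv:
        parent[ru] = rv

n_components = len({find(parent, n) for n in nodes})

# Full dummy matrix: one column per level of either factor, no level dropped.
levels = sorted(nodes)
col = {n: i for i, n in enumerate(levels)}
X = np.zeros((len(obs), len(levels)))
for i, (u, v) in enumerate(obs):
    X[i, col[("f1", u)]] = 1.0
    X[i, col[("f2", v)]] = 1.0

rank = np.linalg.matrix_rank(X)
print(n_components, len(levels), rank)  # 2 components, 7 levels, rank 5
```

Here the rank falls short of the number of dummy columns by exactly the number of components, so one reference level has to be dropped per component to make the coefficients identifiable.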