The potential solutions include the following:
- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
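The third option can be sketched with a minimal numpy example: run a principal components analysis on two highly correlated columns and check how much variance the first component captures. The data here is synthetic, generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two highly correlated predictors: x2 is x1 plus a little noise.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2])

# Principal components analysis via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# The first component captures almost all the variance, so the two
# correlated columns can be replaced by a single uncorrelated score.
scores = Xc @ Vt[0]
print(f"variance explained by PC1: {explained[0]:.3f}")
```

Because the first component carries nearly all the information, the model can be fit on that single score instead of the two collinear originals.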
What does highly correlated data mean?
Correlation simply means a mutual relationship between two or more variables. The correlation coefficient varies between -1 and 1; if it is greater than 0, the variables are positively correlated. The most important point to note is that correlation measures only the association between two variables; it does not measure causation.
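The coefficient and its range can be illustrated with `np.corrcoef` on synthetic data (the variables `x`, `y`, and `z` below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # positively related to x
z = rng.normal(size=500)                        # unrelated to x

# Pearson correlation coefficient: always between -1 and 1.
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
print(f"corr(x, y) = {r_xy:.2f}")   # strongly positive
print(f"corr(x, z) = {r_xz:.2f}")   # near zero
```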
What does the VIF tell you?
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
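One common way to compute the ratio is VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing variable i on the other predictors. A minimal numpy sketch of that calculation (the helper name `vif` and the toy data are my own, not from the source):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on the remaining columns (with an intercept).
    """
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[i] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly collinear with a
c = rng.normal(size=300)                  # independent of a and b
vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and b get large VIFs; c stays near 1
```

The collinear pair gets a large VIF while the independent column stays near 1, which is exactly the pattern a high VIF is flagging.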
What does it mean when columns have high correlations?
The high correlations suggest that many of the columns contain redundant information, i.e., the information in one column is also contained in other columns. My first question was: how do I quantify this level of redundancy?
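One simple way to quantify it, sketched below on made-up data, is to compute the column correlation matrix and count the pairs whose absolute correlation exceeds a threshold (0.9 here is an arbitrary choice, not a prescribed cutoff):

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=400)
# Five columns, three of which are noisy copies of the same signal.
X = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=400),
    base + rng.normal(scale=0.1, size=400),
    rng.normal(size=400),
    rng.normal(size=400),
])

# Count column pairs whose absolute correlation exceeds the threshold.
corr = np.corrcoef(X, rowvar=False)
upper = np.triu_indices_from(corr, k=1)
redundant_pairs = int(np.sum(np.abs(corr[upper]) > 0.9))
print(f"{redundant_pairs} of {len(upper[0])} column pairs are redundant")
```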
Are there any algorithms that benefit from correlation?
Some algorithms, like Naive Bayes, directly benefit from positively correlated features, and others, like random forest, may benefit from them indirectly. Imagine having three features A, B, and C: A and B are highly correlated with the target and with each other, while C is not correlated with anything. A random forest can split on either A or B at any given node, so the redundant pair still carries the predictive signal between them, while C is largely ignored.
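The A, B, C scenario can be demonstrated with a short sketch, assuming scikit-learn is available; the data is synthetic and the importance values are illustrative, not exact:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 1000
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)   # highly correlated with a
c = rng.normal(size=n)                  # unrelated to the target
y = (a + rng.normal(scale=0.5, size=n) > 0).astype(int)

X = np.column_stack([a, b, c])
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = forest.feature_importances_
print(dict(zip("ABC", imp.round(3))))
```

Typically the importance is shared between A and B, while C receives very little, showing that the correlated pair is still useful to the forest.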
What happens when predictor variables are highly correlated?
Think about the system you are studying and all of the extraneous variables that could influence it. When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
For the model to be stable, this variance should be low. If the variance of the weights is high, the model is very sensitive to the data: the estimated weights change substantially when the training data changes, and such a model may not perform well on test data.
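This instability can be measured directly: fit the same regression on many resampled training sets and compare the variance of a coefficient when the predictors are collinear versus when they are not. A minimal numpy simulation (all names and noise levels below are my own choices for the demo):

```python
import numpy as np

def fitted_slopes(noise, trials=500, n=100, seed=5):
    """OLS slope estimates for x1 over repeated training samples.

    `noise` controls how collinear x2 is with x1: smaller noise
    means stronger correlation between the two predictors.
    """
    rng = np.random.default_rng(seed)
    slopes = np.empty(trials)
    for t in range(trials):
        x1 = rng.normal(size=n)
        x2 = x1 + rng.normal(scale=noise, size=n)
        y = x1 + x2 + rng.normal(size=n)
        A = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        slopes[t] = coef[1]
    return slopes

var_correlated = fitted_slopes(noise=0.1).var()
var_independent = fitted_slopes(noise=10.0).var()
print(f"weight variance, collinear predictors:   {var_correlated:.3f}")
print(f"weight variance, independent predictors: {var_independent:.4f}")
```

The coefficient varies far more across training sets when the predictors are collinear, which is the instability described above.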