Are there too many categorical variables in machine learning?

I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I’m not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.

Can a categorical variable have 100 different levels?

If we have a huge amount of data (say 1 billion data points), a categorical variable with 100 different levels may not be a problem, since it is very likely that we have sufficient “training examples” for each level. However, that situation is less likely to happen in real data.

How to use categorical lgbm in machine learning?

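With categorical LGBM encoding, a split can be made on the variable being one specific level or any subset of levels. For a variable with N levels, that gives on the order of 2^N possible splits, instead of the N one-level-versus-rest splits that one-hot encoding offers (4 for a four-level variable). Each of these splits counts as a candidate.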
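To make the counting concrete, here is a minimal sketch that enumerates both kinds of split for a four-level variable. It is pure counting in standard-library Python, not LightGBM's actual search procedure, and the function names `subset_splits` and `one_hot_splits` are made up for this illustration. Note that when a split and its mirror image are treated as the same, the exact count is 2^(N-1) - 1 rather than 2^N; the growth is exponential either way.

```python
from itertools import combinations


def subset_splits(levels):
    """List every distinct two-way split of the levels (left subset vs. the rest)."""
    levels = sorted(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Pin the first level to the left side so each partition is listed only once.
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            left = {first, *combo}
            right = set(levels) - left
            if right:  # skip the trivial split where one side is empty
                splits.append((left, right))
    return splits


def one_hot_splits(levels):
    """With one-hot columns, a split can only isolate a single level from the rest."""
    levels = set(levels)
    return [({lv}, levels - {lv}) for lv in sorted(levels)]


levels = ["A", "B", "C", "D"]
print(len(one_hot_splits(levels)))  # 4
print(len(subset_splits(levels)))   # 7, i.e. 2**(4 - 1) - 1
```

For a variable with 100 levels the subset count is far too large to enumerate, which is why the example stays at four levels.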

Why are there so many levels in machine learning?

Most real-world data follows an “80-20” rule: a few levels occur very often, and most levels do not have sufficient data.
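One way to check whether a column behaves like this is to measure how much of the data the most frequent levels cover, and which levels fall below some minimum count. The sketch below uses only the standard library. The data, the helper names `level_coverage` and `rare_levels`, and the threshold of 5 are all made up for illustration.

```python
from collections import Counter


def level_coverage(values, top_k):
    """Fraction of rows that belong to the top_k most frequent levels."""
    counts = Counter(values)
    covered = sum(count for _, count in counts.most_common(top_k))
    return covered / len(values)


def rare_levels(values, min_count):
    """Levels that appear fewer than min_count times."""
    counts = Counter(values)
    return sorted(level for level, count in counts.items() if count < min_count)


# Hypothetical column with 100 rows and 7 levels, heavily skewed.
values = (["a"] * 50 + ["b"] * 30 + ["c"] * 8 + ["d"] * 6
          + ["e"] * 3 + ["f"] * 2 + ["g"] * 1)

print(level_coverage(values, top_k=2))   # 0.8
print(rare_levels(values, min_count=5))  # ['e', 'f', 'g']
```

Here two of the seven levels account for 80% of the rows, while three levels have fewer than five examples each.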

How is one dummy variable related to another?

When we use one-hot encoding to handle categorical data, any one dummy variable (attribute) can be predicted from the other dummy variables. Hence, each dummy variable is highly correlated with the others.
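A small standard-library sketch shows the dependence directly: every one-hot row sums to 1, so the last column always equals 1 minus the sum of the others. The helpers `one_hot` and `drop_first` and the colour data are hypothetical, written only for this example. Dropping one column is a common remedy, shown here as an illustration rather than as the only option.

```python
def one_hot(values):
    """Return the sorted levels and one 0/1 row per value."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


def drop_first(rows):
    """Remove the first dummy column; the remaining ones still identify every level."""
    return [row[1:] for row in rows]


values = ["red", "green", "blue", "green", "red"]
levels, rows = one_hot(values)
print(levels)  # ['blue', 'green', 'red']

# The last dummy is fully determined by the other dummies in every row.
for row in rows:
    assert row[-1] == 1 - sum(row[:-1])

print(drop_first(rows)[0])  # [0, 1]: 'red' encoded without the 'blue' column
```

After dropping the first column, a row of all zeros stands for the dropped level, so no information is lost.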

What do you need to know about the dummy variable trap?

Before learning about the dummy variable trap, let’s first understand what a dummy variable actually is. In statistics, especially in regression models, we deal with various kinds of data. The data may be quantitative (numerical) or qualitative (categorical).

How to handle large number of features in machine learning?

Of course, I could turn to models that can handle non-standardized features with NAs, but on the one hand this limits the number of possible models, and on the other hand it seems very unprofessional to me. What is the way to get over this issue? I know of 4 ways in Python.