How to improve the accuracy of a naive Bayes classifier?

I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I’ve gotten about 30% test accuracy and 45% training accuracy. This is significantly better than random, but I want it to be better.
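
For concreteness, here is a minimal sketch of that kind of feature selection with scikit-learn: drop English stop words, then keep only the most class-dependent features via a chi-squared test before fitting the Naive Bayes model. The corpus, labels, and choice of chi-squared scoring are illustrative assumptions, not the asker's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data: 1 = spam, 0 = ham.
texts = ["win money now", "meeting at noon", "cheap money offer", "lunch at noon"]
labels = [1, 0, 1, 0]

model = make_pipeline(
    CountVectorizer(stop_words="english"),  # filter common "useless" words
    SelectKBest(chi2, k=3),                 # keep the 3 most class-dependent features
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["free money"]))  # likely [1] (spam) on this toy data
```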

What is the Balanced Accuracy of Bayes in Python?

The sensitivity was 0.52 and 0.65 for logistic regression and Naive Bayes, respectively, and is now 0.73. The balanced accuracy has improved as well: it was 0.76 and 0.82, and is now 0.87.
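
As a sketch, the same quantities can be computed in Python with scikit-learn; the labels below are made up for illustration.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Made-up labels for illustration: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

sensitivity = recall_score(y_true, y_pred)          # recall of the positive class: 0.75
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls: ~0.79
print(f"sensitivity={sensitivity:.2f}, balanced accuracy={balanced:.2f}")
```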

How to calculate accuracy in an imbalanced dataset?

Accuracy is generally calculated as (TP+TN)/(TP+TN+FP+FN). However, for imbalanced datasets, balanced accuracy, given by (sensitivity + specificity)/2, where sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP), is preferable. Balanced accuracy will not reach very high values simply because of class imbalance, and is the better metric here.
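
A small sketch of both metrics from raw confusion-matrix counts; the counts below are made up to show how plain accuracy is inflated by the majority class.

```python
# Hypothetical counts: 10 positives vs. 990 negatives (heavily imbalanced).
TP, FN = 8, 2
TN, FP = 985, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 0.993 -- inflated by the majority class
sensitivity = TP / (TP + FN)                         # 0.80
specificity = TN / (TN + FP)                         # ~0.995
balanced_accuracy = (sensitivity + specificity) / 2  # ~0.90
print(accuracy, balanced_accuracy)
```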

How long does it take to initialize a Bayes classifier?

Initialization of the class takes around 10 seconds because we need to make a pass through all spam and ham emails and compute all the word-class-dependent probabilities. The is_spam function does all the inference work, and is responsible for deciding whether an email is spam based on the tokens that appear in its text.
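
The source for that class isn't shown, so here is a hypothetical sketch of the shape it describes: __init__ makes one pass over the labeled emails to estimate the word-class-dependent probabilities (with Laplace smoothing, an assumption on my part), and is_spam does the inference. All names besides is_spam are illustrative.

```python
import math
from collections import Counter

class SpamClassifier:  # hypothetical name
    def __init__(self, spam_emails, ham_emails):
        # One pass over all emails: count word occurrences per class.
        self.spam_counts = Counter(w for e in spam_emails for w in e.split())
        self.ham_counts = Counter(w for e in ham_emails for w in e.split())
        self.spam_total = sum(self.spam_counts.values())
        self.ham_total = sum(self.ham_counts.values())
        self.vocab = set(self.spam_counts) | set(self.ham_counts)
        self.p_spam = len(spam_emails) / (len(spam_emails) + len(ham_emails))

    def _word_log_prob(self, word, counts, total):
        # Laplace smoothing so unseen words don't zero out the product.
        return math.log((counts[word] + 1) / (total + len(self.vocab)))

    def is_spam(self, text):
        # Compare log P(spam) * prod P(word|spam) against the ham analogue.
        log_spam = math.log(self.p_spam)
        log_ham = math.log(1 - self.p_spam)
        for word in text.split():
            log_spam += self._word_log_prob(word, self.spam_counts, self.spam_total)
            log_ham += self._word_log_prob(word, self.ham_counts, self.ham_total)
        return log_spam > log_ham

clf = SpamClassifier(["win cheap money"], ["meeting at noon"])
print(clf.is_spam("cheap money offer"))  # True on this toy corpus
```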

How to improve the accuracy of text classification?

Improve your model by adding bigrams and trigrams as features. Try topic modelling such as latent Dirichlet allocation (LDA) or probabilistic latent semantic analysis (PLSA) on the corpus, with a specified number of topics, say 20. You would then get a vector of 20 probabilities, corresponding to the 20 topics, for each document.
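
A sketch of both ideas with scikit-learn, using a made-up corpus: n-gram features up to trigrams, plus a 20-topic LDA whose per-document topic distribution gives the 20-probability vector mentioned above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus for illustration.
docs = ["cheap money offer today", "team meeting moved to noon",
        "win a cheap money prize", "lunch at noon with the team"]

# Unigrams, bigrams, and trigrams as features.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)

# 20 topics; each row of doc_topics is a 20-dim probability vector
# that can be appended to the feature set.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.shape)  # (4, 20); each row sums to 1
```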

How is the Bayes formula used in NLP?

As we all know, Bayes’ formula looks like this: P(A|B) = P(B|A)·P(A)/P(B). In our NLP setting we would like to calculate P(spam|text) and P(not spam|text). Following the commonly used terminology, I will refer to “not spam” as “ham”. The text of any email usually consists of words.
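
As a toy numeric example of the formula, with made-up probabilities for a single word “money”:

```python
# Hypothetical probabilities, just to show how P(spam|word) falls out of Bayes' rule.
p_spam = 0.4                  # prior: fraction of emails that are spam
p_word_given_spam = 0.10      # "money" appears in 10% of spam emails
p_word_given_ham = 0.01      # ...and in 1% of ham emails

# Total probability of seeing the word, then Bayes' rule.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.87
```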