2.2.6 Naive Bayes
Summary
- Naive Bayes assumes conditional independence between features and combines prior probabilities with likelihoods via Bayes’ rule.
- Training and inference are extremely fast, making it a strong baseline for high-dimensional sparse data such as text or spam filtering.
- Laplace smoothing and TF-IDF features mitigate issues with unseen words and frequency imbalance.
- When the independence assumption is too strong, consider feature selection or ensembling Naive Bayes with other models.
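The smoothing point above can be sketched with a few hypothetical word counts (the numbers are illustrative, not from the text): without smoothing, a word that never occurred in a class's training data gets zero probability and vetoes that class entirely.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary;
# the last word never appears in this class's training documents.
counts = np.array([3, 2, 1, 0])

# Without smoothing, the unseen word gets probability 0, which zeroes
# out the whole product P(y) * prod_j P(x_j | y) for any document
# that contains it.
unsmoothed = counts / counts.sum()

# Laplace (add-one) smoothing: add 1 to every count before normalising.
alpha = 1.0
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(unsmoothed)  # [0.5, 0.333..., 0.167..., 0.0]
print(smoothed)    # [0.4, 0.3, 0.2, 0.1]
```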
Intuition #
Naive Bayes treats each feature as an independent piece of evidence: every feature contributes a likelihood under each class, and these contributions multiply (or add, in log space) together with the class prior. The independence assumption is almost always violated in practice (words in a document are clearly correlated), yet the classifier often performs well, because predicting the right class only requires its posterior to be the largest, not that the probability estimates be well calibrated.
Detailed Explanation #
Mathematical formulation #
For class \(y\) and features \(\mathbf{x} = (x_1, \ldots, x_d)\),
$$ P(y \mid \mathbf{x}) \propto P(y) \prod_{j=1}^{d} P(x_j \mid y). $$

Different likelihood models suit different data types: the multinomial model for word counts, the Bernoulli model for binary presence/absence, and Gaussian Naive Bayes for continuous values.
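A minimal sketch of this formula for the multinomial case, computed in log space with Laplace smoothing; the two-class, three-word counts below are made up for illustration.

```python
import numpy as np

# Toy multinomial Naive Bayes (hypothetical per-word training counts
# for two classes over a 3-word vocabulary).
word_counts = np.array([[4, 1, 1],   # class 0
                        [1, 1, 4]])  # class 1
class_priors = np.array([0.5, 0.5])
alpha = 1.0  # Laplace smoothing

# Smoothed likelihoods P(x_j | y), one row per class.
likelihoods = (word_counts + alpha) / (
    word_counts.sum(axis=1, keepdims=True) + alpha * word_counts.shape[1])

def log_posterior(x):
    """Unnormalised log P(y | x) = log P(y) + sum_j x_j * log P(x_j | y)."""
    return np.log(class_priors) + x @ np.log(likelihoods).T

x = np.array([0, 1, 3])  # a document's word-count vector
scores = log_posterior(x)
print(scores.argmax())  # -> 1: the document leans on the third word
```

Working in log space avoids underflow when the product runs over thousands of vocabulary terms.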
Experiments with Python #
The snippet below trains a multinomial Naive Bayes classifier on a subset of the 20 Newsgroups data set, using TF-IDF features. Even with thousands of features the model trains quickly, and the classification report summarises performance.
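Since the original snippet is not preserved here, the following is a sketch of such an experiment; the category list, vectorizer settings, and smoothing parameter are assumptions, but the pipeline (TF-IDF features into a multinomial Naive Bayes classifier) follows the description above.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# A small subset of 20 Newsgroups; this category choice is illustrative.
categories = ["sci.space", "rec.autos", "comp.graphics"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=categories,
                          remove=("headers", "footers", "quotes"))

# TF-IDF features; MultinomialNB consumes the sparse matrix directly.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing term
clf.fit(X_train, train.target)

print(classification_report(test.target, clf.predict(X_test),
                            target_names=train.target_names))
```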