Naive Bayes

Last updated 2020-04-08
Summary
  • Naive Bayes assumes conditional independence between features and combines prior probabilities with likelihoods via Bayes’ rule.
  • Training and inference are extremely fast, making it a strong baseline for high-dimensional sparse data such as text or spam filtering.
  • Laplace smoothing and TF-IDF features mitigate issues with unseen words and frequency imbalance.
  • When the independence assumption is too strong, consider feature selection or ensembling Naive Bayes with other models.

Intuition #

Naive Bayes treats each feature as an independent piece of evidence for a class: multiply the class prior by the per-feature likelihoods and pick the class with the highest posterior. The independence assumption is rarely true, but it reduces a joint distribution over \(d\) features to \(d\) one-dimensional estimates, which is why the model trains quickly and tolerates small data sets.

Detailed Explanation #

Mathematical formulation #

For class \(y\) and features \(\mathbf{x} = (x_1, \ldots, x_d)\),

$$ P(y \mid \mathbf{x}) \propto P(y) \prod_{j=1}^{d} P(x_j \mid y). $$

Different likelihood models suit different data types: the multinomial model for word counts, the Bernoulli model for binary presence/absence, and Gaussian Naive Bayes for continuous values.
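A minimal sketch of matching each variant to its data type, using small synthetic samples (all values here are illustrative, not from the article):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)

# Continuous features -> GaussianNB
X_cont = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
gauss = GaussianNB().fit(X_cont, y)

# Non-negative count features (e.g. word counts) -> MultinomialNB
X_counts = np.vstack([rng.poisson([5, 1], (20, 2)), rng.poisson([1, 5], (20, 2))])
multi = MultinomialNB().fit(X_counts, y)

# Binary presence/absence features -> BernoulliNB
X_bin = (X_counts > 2).astype(int)
bern = BernoulliNB().fit(X_bin, y)

print(gauss.score(X_cont, y), multi.score(X_counts, y), bern.score(X_bin, y))
```

All three fit in milliseconds; the point is only that the likelihood model should match how the features were generated.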

Experiments with Python #

The snippet below trains a Gaussian Naive Bayes classifier on a synthetic two-feature data set generated with make_classification and plots the decision regions. Even with overlapping classes the model fits almost instantly, and the returned accuracy and confusion matrix summarise training performance.

from __future__ import annotations

import japanize_matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB

def run_naive_bayes_demo(
    n_samples: int = 600,
    n_classes: int = 3,
    random_state: int = 0,
    title: str = "Decision regions of Gaussian Naive Bayes",
    xlabel: str = "feature 1",
    ylabel: str = "feature 2",
) -> dict[str, object]:
    """Train Gaussian Naive Bayes on synthetic data and plot decision regions."""
    japanize_matplotlib.japanize()
    X, y = make_classification(
        n_samples=n_samples,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=1,
        n_classes=n_classes,
        random_state=random_state,
    )

    clf = GaussianNB()
    clf.fit(X, y)

    y_pred = clf.predict(X)
    accuracy = float(accuracy_score(y, y_pred))
    conf = confusion_matrix(y, y_pred)

    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    grid_x, grid_y = np.meshgrid(np.linspace(x_min, x_max, 400), np.linspace(y_min, y_max, 400))
    grid = np.c_[grid_x.ravel(), grid_y.ravel()]
    preds = clf.predict(grid).reshape(grid_x.shape)

    fig, ax = plt.subplots(figsize=(7, 6))
    ax.contourf(grid_x, grid_y, preds, alpha=0.25, cmap="coolwarm", levels=np.arange(-0.5, n_classes + 0.5, 1))
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=25)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.tight_layout()
    plt.show()

    return {"accuracy": accuracy, "confusion": conf}

metrics = run_naive_bayes_demo(
    title="Decision regions of Gaussian Naive Bayes",
    xlabel="feature 1",
    ylabel="feature 2",
)
print(f"Training accuracy: {metrics['accuracy']:.3f}")
print("Confusion matrix:")
print(metrics['confusion'])

FAQ #

What is Naive Bayes? #

Naive Bayes is a family of probabilistic classifiers based on Bayes’ theorem with a naive assumption: all features are conditionally independent given the class label. Despite this simplification, Naive Bayes often performs surprisingly well in practice, especially for text classification and spam filtering.

The prediction rule is:

$$ \hat{y} = \arg\max_y \; P(y) \prod_{j=1}^{d} P(x_j \mid y) $$

The model is fast to train (O(nd)), requires little data, and handles high-dimensional sparse features well.
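The argmax rule can be applied by hand. In this hypothetical two-class, two-word example (all priors and likelihoods are made up for illustration), logarithms are used so the product of small probabilities does not underflow:

```python
import math

# Toy priors P(y) and per-class word likelihoods P(x_j | y)
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.02, "meeting": 0.20},
}

def predict(words: list[str]) -> str:
    """Return argmax_y log P(y) + sum_j log P(x_j | y)."""
    scores = {}
    for label in priors:
        log_score = math.log(priors[label])
        for w in words:
            log_score += math.log(likelihood[label][w])
        scores[label] = log_score
    return max(scores, key=scores.get)

print(predict(["free"]))     # spam: 0.4 * 0.30 beats ham: 0.6 * 0.02
print(predict(["meeting"]))  # ham: 0.6 * 0.20 beats spam: 0.4 * 0.05
```

Real implementations work the same way, just with likelihoods estimated from training counts rather than fixed by hand.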

What are the different types of Naive Bayes? #

| Variant | Likelihood model | Best for |
| --- | --- | --- |
| GaussianNB | Gaussian distribution | Continuous numeric features |
| MultinomialNB | Multinomial distribution | Word counts, TF-IDF vectors |
| BernoulliNB | Bernoulli distribution | Binary feature presence/absence |
| ComplementNB | Complement of multinomial | Imbalanced text datasets |

When should I use Naive Bayes? #

Naive Bayes works well when:

  • Features are high-dimensional and sparse (e.g., text with thousands of vocabulary terms).
  • You need a fast, interpretable baseline before investing in complex models.
  • Training data is limited — the independence assumption reduces variance.
  • Real-time prediction is needed (inference is very fast).

Avoid Naive Bayes when feature correlations are strong (e.g., word bigrams, structured tabular data), or when calibrated probability estimates are critical.

What is Laplace smoothing and why does it matter? #

Without smoothing, if a word never appears in the training data for class \(y\), then \(P(x_j \mid y) = 0\), which zeroes out the entire product regardless of other evidence. Laplace smoothing (additive smoothing) adds a pseudo-count \(\alpha\) to every feature count:

$$ P(x_j \mid y) = \frac{\text{count}(x_j, y) + \alpha}{\text{count}(y) + \alpha \cdot d} $$

In scikit-learn, set alpha=1.0 (default) for standard Laplace smoothing, or tune alpha as a hyperparameter.
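A short sketch of the zero-count problem on toy count vectors (the data is made up; the third column stands in for a word that never appears in class 0's training rows):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Column 2 ("rare word") has zero count in every class-0 training row.
X = np.array([[3, 1, 0], [4, 2, 0], [1, 3, 2], [0, 4, 3]])
y = np.array([0, 0, 1, 1])

smoothed = MultinomialNB(alpha=1.0).fit(X, y)

# With alpha=1.0 the unseen word gets a small but non-zero likelihood,
# so a document containing it is not forced away from class 0.
print(smoothed.feature_log_prob_)        # all entries finite
print(smoothed.predict([[5, 0, 1]]))     # class-0-heavy doc with one rare word
```

Without smoothing, \(\log P(x_2 \mid y = 0) = \log 0 = -\infty\) and the single rare word would veto class 0 regardless of the other evidence.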

How does Naive Bayes compare to logistic regression? #

| | Naive Bayes | Logistic Regression |
| --- | --- | --- |
| Assumption | Features conditionally independent | No structural assumption |
| Training speed | O(nd) | O(nd · iterations) |
| Convergence | Immediate (closed-form counts) | Requires iterative optimisation |
| With small data | Often better (less variance) | May overfit |
| With large data | May underfit (bias from assumption) | Usually better |
| Probability calibration | Often over-confident | Well-calibrated |
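The comparison can be checked empirically. The sketch below fits both models on one synthetic data set (parameters chosen for illustration; results will vary with the data regime, which is the table's point):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem with informative features.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)                      # closed-form counts, no iterations
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # iterative optimisation

print(f"NB test accuracy: {nb.score(X_te, y_te):.3f}")
print(f"LR test accuracy: {lr.score(X_te, y_te):.3f}")
```

With plenty of data and correlated features, logistic regression typically edges ahead; shrink the training set and Naive Bayes often closes the gap.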

References #

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.