Stratified Sampling

Last updated: 2026-03-03. Reading time: 1 min.

Summary

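Stratified sampling divides the population into groups (strata) and draws a sample from each stratum in proportion to its size.
Compared with simple random sampling, it keeps small groups represented and tends to improve estimation precision, especially when units within a stratum resemble each other.
It can be implemented with pandas groupby + sample, or with scikit-learn's train_test_split(stratify=).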

Intuition

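If you pick 1,000 people at random from all voters nationwide, a small rural prefecture may end up with nobody in the sample. Stratified sampling instead allocates the 1,000 across prefectures in proportion to population, then draws at random within each prefecture. In machine learning, the typical application is splitting data into training and test sets while preserving the class ratio.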
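The allocation step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the stratum names and sizes are made-up numbers, and `proportional_allocation` is a hypothetical helper that uses largest-remainder rounding so the per-stratum counts always add up to the requested total.

```python
# Hypothetical stratum sizes (illustrative numbers, not real census data)
population = {"A": 5000, "B": 3000, "C": 1500, "D": 500}
total_sample = 100


def proportional_allocation(sizes, n):
    """Split n across strata in proportion to their size."""
    total = sum(sizes.values())
    exact = {k: n * v / total for k, v in sizes.items()}
    alloc = {k: int(q) for k, q in exact.items()}  # round down first
    # Hand out the units lost to rounding, largest fractional part first
    leftover = n - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


allocation = proportional_allocation(population, total_sample)
print(allocation)  # {'A': 50, 'B': 30, 'C': 15, 'D': 5}
```

Even the smallest stratum (D, 5% of the population) is guaranteed 5 of the 100 slots, which is the property simple random sampling cannot promise.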

Detailed explanation

Basics with scikit-learn

from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced binary data: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Stratified split: stratify=y keeps the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train:", Counter(y_train))
print("test:", Counter(y_test))
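To see what `stratify=y` buys, it may help to run the same split with and without it. The sketch below reuses the synthetic data from the example above; `minority_share` is a hypothetical helper introduced only for this comparison, and the seed `0` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)


def minority_share(labels):
    """Fraction of samples that belong to class 1."""
    return float(np.mean(labels == 1))


# Same seed and test size; the only difference is the stratify argument
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

share_all = minority_share(y)
share_plain = minority_share(y_test_plain)
share_strat = minority_share(y_test_strat)

print(f"full data     : {share_all:.3f}")
print(f"plain split   : {share_plain:.3f}")
print(f"stratified    : {share_strat:.3f}")
```

The stratified test set should match the overall minority share up to rounding, while the plain split is free to drift by chance; how far it drifts depends on the seed.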

Stratified sampling with pandas

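import pandas as pd

df = pd.DataFrame({
    "region": ["A"] * 500 + ["B"] * 300 + ["C"] * 200,
    "value": range(1000),
})

# Draw 10% from each region. GroupBy.sample keeps the "region" column and
# avoids groupby().apply, which newer pandas versions warn about when the
# applied function touches the grouping column.
sampled = df.groupby("region").sample(frac=0.1, random_state=42)
print(sampled["region"].value_counts())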
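The claim that stratification improves precision can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the population is synthetic, the three region means (10, 20, 40) are invented so that regions differ strongly, and 200 repetitions is an arbitrary choice. With proportional allocation the plain mean of the stratified sample estimates the population mean, so the two estimators can be compared directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population: region means differ a lot, spread within a region is small
df = pd.DataFrame({
    "region": ["A"] * 500 + ["B"] * 300 + ["C"] * 200,
    "value": np.concatenate([
        rng.normal(10, 1, 500),
        rng.normal(20, 1, 300),
        rng.normal(40, 1, 200),
    ]),
})

srs_means, strat_means = [], []
for seed in range(200):
    # Simple random sample of 10% of all rows
    srs = df.sample(frac=0.1, random_state=seed)
    # Stratified sample: 10% of each region
    strat = df.groupby("region").sample(frac=0.1, random_state=seed)
    srs_means.append(srs["value"].mean())
    strat_means.append(strat["value"].mean())

srs_spread = float(np.std(srs_means))
strat_spread = float(np.std(strat_means))

print(f"population mean      : {df['value'].mean():.2f}")
print(f"spread, simple random: {srs_spread:.3f}")
print(f"spread, stratified   : {strat_spread:.3f}")
```

The gain comes from removing between-region variation from the estimate. If the regions had similar means, the two spreads would be much closer.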

Comparison of sampling methods

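Method	Representativeness	Ease of implementation	When to use
Simple random	Fair	Good	Population is homogeneous
Stratified	Good	Good	Group proportions must be preserved
Cluster	Fair	Good	Population is geographically dispersed
Systematic	Fair	Good	Data has a natural order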
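For a concrete feel of the four methods in the table, each can be sketched in pandas on the same toy data. The `cluster` column is a hypothetical grouping (for example, a district id) added only for this illustration, and the step `k = 10` and the choice of 2 clusters are arbitrary so that every method returns 100 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

df = pd.DataFrame({
    "region": ["A"] * 500 + ["B"] * 300 + ["C"] * 200,
    "value": range(1000),
})
# Hypothetical cluster id: 20 clusters of 50 consecutive rows each
df["cluster"] = df.index // 50

# Simple random: every row has the same chance of selection
simple = df.sample(n=100, random_state=42)

# Stratified: 10% from each region
stratified = df.groupby("region").sample(frac=0.1, random_state=42)

# Systematic: every k-th row after a random start
k = 10
start = int(rng.integers(k))
systematic = df.iloc[start::k]

# Cluster: pick whole clusters at random and keep all of their rows
chosen = rng.choice(df["cluster"].unique(), size=2, replace=False)
cluster = df[df["cluster"].isin(chosen)]

samples = {
    "simple": simple,
    "stratified": stratified,
    "systematic": systematic,
    "cluster": cluster,
}
for name, s in samples.items():
    print(name, len(s), s["region"].value_counts().to_dict())
```

Only the stratified sample fixes the region counts by construction; the others reach similar proportions only by chance or because of how the rows happen to be ordered.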

Relation to stratified K-fold

In machine learning, stratified sampling is also applied to the splits used in cross-validation.
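As a brief sketch of that idea, scikit-learn's `StratifiedKFold` produces folds that each keep roughly the overall class ratio. The data below is the same synthetic imbalanced set used earlier; `n_splits=5` and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_shares = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Fraction of class 1 in this validation fold
    share = float(np.mean(y[val_idx] == 1))
    fold_shares.append(share)
    print(f"fold {fold}: n={len(val_idx)}, minority share={share:.3f}")

print(f"overall minority share: {float(np.mean(y == 1)):.3f}")
```

Each validation fold should show nearly the same minority share, so no fold ends up with too few minority samples to evaluate on.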

Stratified K-fold cross-validation: CV that preserves class ratios
SMOTE: oversampling for imbalanced data
Cross-validation: the basics of splitting methods