title: “Categorical Features | Encoding and Handling” weight: 4 created: 2019-03-26T23:45:36+09:00 lastmod: 2024-06-01T00:00:00+09:00 chapter: true not_use_colab: true not_use_twitter: true pre: “3.2 ” header_image: “/images/bg/germany2.jpg” #
Section 3.2 #
Categorical Features #
Practical patterns for encoding categories, handling rare levels, dealing with high cardinality, and combining with numerical features in pipelines.
Common encodings #
- One‑hot / Dummy: safe default for low cardinality; watch dimensionality.
- Ordinal: when a natural order exists (e.g., low < medium < high).
- Target / Mean: powerful, but the encoding must be computed out‑of‑fold (cross‑validated) to avoid target leakage.
- Hashing: scalable for high cardinality; collisions are possible and the mapping is not invertible.
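The four encodings above can be sketched with scikit-learn and pandas. This is a minimal illustration, not a reference implementation: the toy data, the column names, and the helper `oof_target_encode` are assumptions made for this example, and the out-of-fold target encoder is hand-rolled here instead of using any particular library's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": ["low", "high", "medium", "low", "medium", "high"],
    "city": ["tokyo", "osaka", "tokyo", "kyoto", "osaka", "tokyo"],
    "y": [1, 0, 1, 0, 0, 1],
})

# One-hot: one column per level, so 3 colors give 3 columns.
onehot = OneHotEncoder(handle_unknown="ignore")
X_onehot = onehot.fit_transform(df[["color"]]).toarray()

# Ordinal: state the order explicitly so low < medium < high maps to 0 < 1 < 2.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
X_ordinal = ordinal.fit_transform(df[["size"]])

# Hashing: output width is fixed (8 here) no matter how many levels exist.
hasher = FeatureHasher(n_features=8, input_type="string")
X_hashed = hasher.transform([[f"city={c}"] for c in df["city"]]).toarray()


def oof_target_encode(cat, y, n_splits=3, seed=0):
    """Hypothetical helper: encode each row with the target mean of the other folds."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        # Per-level means come only from rows outside the fold being encoded.
        means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        prior = y.iloc[fit_idx].mean()  # fallback for levels unseen in fit_idx
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(means).fillna(prior).to_numpy()
    return encoded


city_te = oof_target_encode(df["city"], df["y"])
```

Because each row is encoded from folds that exclude it, a level that occurs only once (such as `kyoto` here) falls back to the prior of the other folds and never sees its own target value. At prediction time the encoder would normally be refit on the full training set; that step is omitted from this sketch.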
Tips #
- Consolidate rare levels (e.g., frequency threshold) to stabilize models.
- Keep encoding inside a `Pipeline` to avoid train/test leakage.
- For trees/GBMs, one‑hot is often sufficient; linear models benefit from careful target/ordinal choices.
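A sketch of how the first two tips might fit together, assuming scikit-learn. `RareLevelGrouper`, its `min_count` and `other_label` parameters, and the toy data are hypothetical names introduced for this example. The point is that the rare-level threshold and the one-hot vocabulary are both learned inside `fit`, so they only ever see training rows.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class RareLevelGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: map levels seen fewer than `min_count` times to one label."""

    def __init__(self, min_count=2, other_label="__other__"):
        self.min_count = min_count
        self.other_label = other_label

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.frequent_ = {}
        for col in X.columns:
            counts = X[col].value_counts()
            self.frequent_[col] = set(counts[counts >= self.min_count].index)
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, keep in self.frequent_.items():
            # Rare and never-seen levels both collapse into other_label.
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X


# Toy data for illustration only.
train = pd.DataFrame({
    "city": ["tokyo", "tokyo", "osaka", "osaka", "kyoto", "tokyo", "osaka", "nara"],
    "age": [23, 35, 41, 29, 52, 33, 47, 38],
})
y_train = [1, 0, 1, 0, 1, 1, 0, 0]
test = pd.DataFrame({"city": ["tokyo", "sapporo"], "age": [30, 44]})

categorical = Pipeline([
    ("rare", RareLevelGrouper(min_count=2)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["city"]),
    ("num", StandardScaler(), ["age"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

model.fit(train, y_train)   # thresholds, vocabulary, and scaling come from train only
pred = model.predict(test)  # "sapporo" was never seen and is treated as a rare level
```

Wrapping the whole model this way also means cross-validation utilities refit the encoders on each training fold, which is what keeps the evaluation leakage-free.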