---
title: "Categorical Features | Encoding and Handling"
weight: 4
created: 2019-03-26T23:45:36+09:00
lastmod: 2024-06-01T00:00:00+09:00
chapter: true
not_use_colab: true
not_use_twitter: true
pre: "3.2 "
header_image: "/images/bg/germany2.jpg"
---

Section 3.2

Categorical Features

This section covers practical patterns for encoding categorical features, handling rare levels, dealing with high cardinality, and combining categorical with numerical features in pipelines.

Common encodings

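  • One-hot / dummy: a safe default for low-cardinality features; the number of columns grows with the number of levels, so watch dimensionality.
  • Ordinal: use when the levels have a natural order (e.g., low < medium < high).
  • Target / mean: replaces each level with a statistic of the target, such as its mean; powerful, but the encoding must be computed out-of-fold (within cross-validation) to avoid target leakage.
  • Hashing: scales to high cardinality with a fixed output width; the mapping is not invertible, and distinct levels can collide in the same bucket.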
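
The two less obvious encodings in the list above, leakage-safe target encoding and hashing, can be sketched in plain Python. This is a minimal illustration and not a production implementation: the function names `oof_target_encode` and `hash_bucket`, the toy data, the interleaved fold assignment (`i % n_folds`), and the bucket count are all assumptions made for the example. Real data would normally be shuffled or stratified before folding.

```python
import hashlib
from collections import defaultdict


def hash_bucket(level, n_buckets=8):
    """Map a category level to a stable bucket index (the hashing trick)."""
    # hashlib is used instead of built-in hash() so results are reproducible
    # across Python processes.
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def oof_target_encode(levels, targets, n_folds=3):
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    n = len(levels)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        fit_rows = [i for i in range(n) if i % n_folds != fold]

        # Per-level target sums and counts, computed without the held-out rows.
        sums, counts = defaultdict(float), defaultdict(int)
        for i in fit_rows:
            sums[levels[i]] += targets[i]
            counts[levels[i]] += 1

        # Fallback for levels that do not appear in the fitting folds.
        prior = sum(targets[i] for i in fit_rows) / len(fit_rows)

        for i in held_out:
            lvl = levels[i]
            encoded[i] = sums[lvl] / counts[lvl] if counts[lvl] else prior
    return encoded


# Hypothetical toy data.
levels = ["a", "a", "b", "b", "a", "c"]
targets = [1, 0, 1, 1, 0, 1]

encoded = oof_target_encode(levels, targets)
buckets = [hash_bucket(lvl) for lvl in levels]
```

Because each row's encoding never sees that row's own target, the first row (level `a`, target 1) is encoded from the other folds only. Level `c` occurs once, so it falls back to the mean of the fitting folds. Identical levels always share a bucket, while different levels may collide, and the original level cannot be recovered from the bucket index.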

Tips

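  • Consolidate rare levels (e.g., pool levels below a frequency threshold into a single "other" level) to stabilize models.
  • Keep encoding inside a Pipeline so that it is fit on the training data only, which avoids train/test leakage.
  • For tree models and GBMs, one-hot encoding is often sufficient; linear models tend to benefit from a more careful choice between target and ordinal encodings.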