Isolation Forest | Isolate Outliers with Random Splits

Isolation Forest builds many randomized decision trees and assigns each sample an anomaly score based on how few random splits are needed to isolate it. Because every tree is grown on a small random subsample, it runs fast even on high-dimensional data.


1. How it works

  • Randomly subsample the data.
  • Build Isolation Trees by splitting on randomly chosen features and thresholds.
  • Samples with shorter average path length are easier to isolate, so they are more likely to be anomalies.
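
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: the helper names `isolation_depth` and `average_depth` are hypothetical, the scored point is appended to each subsample so that it can always be isolated, and the full tree is never stored, only the branch containing the point.

```python
import numpy as np

def isolation_depth(sample, x, rng, max_depth=50):
    """Count random axis-aligned splits until x is alone in its partition.

    Simplified sketch: x must be one of the rows of `sample`.
    """
    depth = 0
    while len(sample) > 1 and depth < max_depth:
        feature = rng.integers(sample.shape[1])           # random feature
        lo, hi = sample[:, feature].min(), sample[:, feature].max()
        threshold = rng.uniform(lo, hi)                   # random threshold
        # Keep only the side of the split that contains x.
        if x[feature] < threshold:
            sample = sample[sample[:, feature] < threshold]
        else:
            sample = sample[sample[:, feature] >= threshold]
        depth += 1
    return depth

def average_depth(X, x, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth of x over many random subsamples of X."""
    rng = np.random.default_rng(seed)
    depths = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        sample = np.vstack([X[idx], x])                   # subsample plus x itself
        depths.append(isolation_depth(sample, x, rng))
    return float(np.mean(depths))

data_rng = np.random.default_rng(0)
inliers = 0.3 * data_rng.normal(size=(200, 2))
far_point = np.array([4.0, 4.0])      # far from the cluster
central_point = np.array([0.0, 0.0])  # in the middle of the cluster

far_depth = average_depth(inliers, far_point)
central_depth = average_depth(inliers, central_point)
print(f"far point: {far_depth:.2f} splits, central point: {central_depth:.2f} splits")
```

The far point should need clearly fewer splits on average than the central one, which is the signal the anomaly score is built on.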

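The anomaly score normalizes the observed average path length \(E[h(x)]\) by \(c(n)\), the average path length of an unsuccessful search in a binary search tree built on \(n\) samples: \(s(x, n) = 2^{-E[h(x)] / c(n)}\). Scores close to 1 indicate anomalies, and scores well below 0.5 indicate normal samples.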
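
As a rough sketch of that normalization, the functions below follow the formulas from the original Isolation Forest paper, \(c(n) = 2H(n-1) - 2(n-1)/n\) with \(H(i) \approx \ln i + 0.5772\) and \(s(x, n) = 2^{-E[h(x)]/c(n)}\). The function names are hypothetical and chosen for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search on n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    return 2.0 ** (-avg_path_length / average_path_length(n))

n = 256
print(f"c({n}) = {average_path_length(n):.3f}")
for depth in (2.0, average_path_length(n), 20.0):
    print(f"average depth {depth:6.2f} -> score {anomaly_score(depth, n):.3f}")
```

A sample whose average depth equals \(c(n)\) scores exactly 0.5; shorter paths push the score toward 1. In scikit-learn, `score_samples` returns the opposite of this score, which is why the example in the next section negates it.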


2. Python example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

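# Synthetic data: a tight Gaussian cluster of inliers plus uniformly scattered anomalies
rng = np.random.default_rng(0)
X_inliers = 0.3 * rng.normal(size=(200, 2))
X_anom = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_inliers, X_anom])

# contamination=0.1 makes predict() flag roughly the 10% most anomalous samples
model = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
model.fit(X)
scores = -model.score_samples(X)  # negate so that higher = more anomalous
labels = model.predict(X)  # -1 = anomaly, 1 = inlier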

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=scores, cmap="magma", s=30)
plt.colorbar(label="anomaly score")
plt.title("Isolation Forest scores")
plt.tight_layout()
plt.show()

print("Detected anomalies:", np.sum(labels == -1))

[Figure: scatter plot of the samples, colored by anomaly score]


3. Hyperparameters

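  • n_estimators: Number of trees (default 100). More trees give more stable scores at the cost of longer training and scoring.
  • max_samples: Number of samples drawn to build each tree. The default "auto" uses min(256, n_samples).
  • contamination: Expected fraction of anomalies in the data; it sets the threshold used by predict. The default is "auto".
  • max_features: Number (or fraction) of features drawn to train each tree, not each split. The default 1.0 uses all features.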
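
To make the role of contamination concrete, here is a small sketch on the same kind of synthetic data as the example above. The specific values (0.05, 0.10, 0.20) and max_samples=128 are arbitrary choices for illustration. It fits the same forest with different contamination values and compares how many samples predict flags, and whether the raw scores change.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    0.3 * rng.normal(size=(200, 2)),             # inliers
    rng.uniform(low=-4, high=4, size=(20, 2)),   # scattered anomalies
])

results = {}
for contamination in (0.05, 0.10, 0.20):
    model = IsolationForest(
        n_estimators=200,
        max_samples=128,
        contamination=contamination,
        random_state=0,
    ).fit(X)
    results[contamination] = {
        "n_flagged": int(np.sum(model.predict(X) == -1)),  # thresholded labels
        "scores": model.score_samples(X),                  # raw scores
    }

for contamination, res in results.items():
    print(f"contamination={contamination:.2f}: {res['n_flagged']} samples flagged")
```

With a fixed random_state, the raw scores from score_samples should be identical across the three runs; only the threshold, and therefore the number of flagged samples, moves. If the true anomaly rate is unknown, ranking by score and reviewing the top candidates avoids committing to a contamination value.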

4. Pros and cons

Pros | Cons
Relatively fast even in high dimensions | Results can vary with the random seed
Feature scaling is not required (random axis-aligned splits are unaffected by per-feature rescaling) | Small local anomalies may be missed
Simple training and inference | contamination can be hard to set
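
The seed sensitivity can be checked empirically. The sketch below uses a hypothetical helper, `seed_spread`, that refits the model under several seeds and measures how much each sample's score moves; the tree counts (10 and 200) and the five seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    0.3 * rng.normal(size=(200, 2)),             # inliers
    rng.uniform(low=-4, high=4, size=(20, 2)),   # scattered anomalies
])

def seed_spread(n_estimators, seeds=range(5)):
    """Mean per-sample standard deviation of scores across random seeds."""
    scores = np.array([
        IsolationForest(n_estimators=n_estimators, random_state=seed)
        .fit(X)
        .score_samples(X)
        for seed in seeds
    ])
    return float(scores.std(axis=0).mean())

spread_few = seed_spread(10)
spread_many = seed_spread(200)
print(f"10 trees:  mean score std across seeds = {spread_few:.4f}")
print(f"200 trees: mean score std across seeds = {spread_many:.4f}")
```

The spread should shrink as trees are added, so raising n_estimators and fixing random_state are the usual ways to get reproducible rankings.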

5. Summary

  • Isolation Forest is a tree-based anomaly detector that uses short isolation paths as the anomaly signal.
  • It is easy to use in scikit-learn; the main settings to tune are the number of trees (n_estimators) and the subsample size (max_samples).
  • It is a good fit when you need fast candidate filtering for logs or sensor data.