Isolation Forest builds many randomized decision trees and assigns an anomaly score based on how few random splits it takes to isolate a sample. It stays fast even on high-dimensional data.
1. How it works #
- Randomly subsample the data.
- Build Isolation Trees by splitting on randomly chosen features and thresholds.
- Samples with shorter average path length are easier to isolate, so they are more likely to be anomalies.
The anomaly score is normalized using the expected path length of a random binary search tree, \(c(n)\), and the observed average path length.
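Concretely, in the standard formulation (Liu et al.), a point \(x\) in a sample of size \(n\) receives the score

\[
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\]

where \(h(x)\) is the path length of \(x\) in a single tree, \(E[h(x)]\) is its average over all trees, and \(H(i) \approx \ln i + 0.5772\) is the harmonic number. Scores close to 1 indicate likely anomalies, while scores well below 0.5 indicate normal points.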
2. Python example #
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib
from sklearn.ensemble import IsolationForest
rng = np.random.default_rng(0)
X_inliers = 0.3 * rng.normal(size=(200, 2))          # tight cluster of normal points
X_anom = rng.uniform(low=-4, high=4, size=(20, 2))   # scattered anomalous points
X = np.vstack([X_inliers, X_anom])
model = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
model.fit(X)
scores = -model.score_samples(X)  # flip sign so that higher = more anomalous
labels = model.predict(X)  # -1 = anomaly, 1 = inlier
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=scores, cmap="magma", s=30)
plt.colorbar(label="anomaly score")
plt.title("Isolation Forest scores")
plt.tight_layout()
plt.show()
print("Detected anomalies:", np.sum(labels == -1))
3. Hyperparameters #
- n_estimators: Number of trees. More trees give more stable scores at the cost of training time.
- max_samples: Number of samples drawn to build each tree. The default ("auto") is min(256, n_samples).
- contamination: Expected fraction of anomalies; used to set the decision threshold.
- max_features: Number (or fraction) of features drawn to build each tree.
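As a rough sketch of how these map onto scikit-learn's IsolationForest constructor (the values here are illustrative, not recommendations):

# Illustrative hyperparameter settings; tune them for your data.
model = IsolationForest(
    n_estimators=300,     # more trees -> more stable scores, slower training
    max_samples=256,      # sub-sample size per tree ("auto" = min(256, n_samples))
    contamination=0.05,   # expected anomaly fraction; sets the decision threshold
    max_features=1.0,     # fraction of features drawn to build each tree
    random_state=0,
)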
4. Pros and cons #
| Pros | Cons |
|---|---|
| Relatively fast even in high dimensions | Results can vary with the random seed |
| Feature scaling is not required (splits use per-feature thresholds) | Small local anomalies may be missed |
| Simple training and inference | contamination can be hard to set |
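When contamination is hard to estimate, one common workaround is to ignore predict and threshold the raw scores yourself. A minimal sketch, assuming the scores array from the example above and an illustrative 95th-percentile cutoff:

# Flag the top 5% highest-scoring points instead of relying on contamination.
threshold = np.percentile(scores, 95)                 # illustrative cutoff
manual_labels = np.where(scores >= threshold, -1, 1)  # -1 = anomaly, 1 = inlier
print("Manually flagged anomalies:", np.sum(manual_labels == -1))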
5. Summary #
- Isolation Forest is a tree-based anomaly detector that uses short isolation paths as the anomaly signal.
- It is easy to use via scikit-learn's IsolationForest; the main knobs are the number of trees and the sub-sample size per tree.
- It is a good fit when you need fast candidate filtering for logs or sensor data.