Isolation Forest builds many randomized decision trees and assigns an anomaly score based on how few random splits it takes to isolate a sample. It stays fast even on high-dimensional data.
1. How it works #
- Randomly subsample the data.
- Build Isolation Trees by splitting on randomly chosen features and thresholds.
- Samples with shorter average path length are easier to isolate, so they are more likely to be anomalies.
The anomaly score is normalized using the expected path length of a random binary search tree, \(c(n)\), and the observed average path length.
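Concretely, in the standard formulation (Liu et al.), a point \(x\) in a sample of size \(n\) receives the score

\[
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\]

where \(h(x)\) is the path length of \(x\) in a single tree, \(E[h(x)]\) is its average over all trees, and \(H(i) \approx \ln i + 0.5772\) is the harmonic number. Scores close to 1 indicate likely anomalies, while scores well below 0.5 indicate normal points.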
2. Python example #
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib
from sklearn.ensemble import IsolationForest
rng = np.random.default_rng(0)
X_inliers = 0.3 * rng.normal(size=(200, 2))          # tight cluster of normal points
X_anom = rng.uniform(low=-4, high=4, size=(20, 2))   # scattered anomalous points
X = np.vstack([X_inliers, X_anom])
model = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
model.fit(X)
scores = -model.score_samples(X)  # flip sign so that higher = more anomalous
labels = model.predict(X)  # -1 = anomaly, 1 = inlier
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=scores, cmap="magma", s=30)
plt.colorbar(label="anomaly score")
plt.title("Isolation Forest scores")
plt.tight_layout()
plt.show()
print("Detected anomalies:", np.sum(labels == -1))
3. Hyperparameters #
- n_estimators: Number of trees. More trees give more stable scores at the cost of training time.
- max_samples: Number of samples drawn to build each tree. The default ("auto") is min(256, n_samples).
- contamination: Expected fraction of anomalies; used to set the decision threshold.
- max_features: Number (or fraction) of features drawn to build each tree.
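As a rough sketch of how these map onto scikit-learn's IsolationForest constructor (the values here are illustrative, not recommendations):

# Illustrative hyperparameter settings; tune them for your data.
model = IsolationForest(
    n_estimators=300,     # more trees -> more stable scores, slower training
    max_samples=256,      # sub-sample size per tree ("auto" = min(256, n_samples))
    contamination=0.05,   # expected anomaly fraction; sets the decision threshold
    max_features=1.0,     # fraction of features drawn to build each tree
    random_state=0,
)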
4. Pros and cons #
| Pros | Cons |
|---|---|
| Relatively fast even in high dimensions | Results can vary with the random seed |
| Feature scaling is not required (splits use per-feature thresholds) | Small local anomalies may be missed |
| Simple training and inference | contamination can be hard to set |
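When contamination is hard to estimate, one common workaround is to ignore predict and threshold the raw scores yourself. A minimal sketch, assuming the scores array from the example above and an illustrative 95th-percentile cutoff:

# Flag the top 5% highest-scoring points instead of relying on contamination.
threshold = np.percentile(scores, 95)                 # illustrative cutoff
manual_labels = np.where(scores >= threshold, -1, 1)  # -1 = anomaly, 1 = inlier
print("Manually flagged anomalies:", np.sum(manual_labels == -1))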
5. Summary #
- Isolation Forest is a tree-based anomaly detector that uses short isolation paths as the anomaly signal.
- It is easy to use via scikit-learn's IsolationForest; the main knobs are the number of trees and the sub-sample size per tree.
- It is a good fit when you need fast candidate filtering for logs or sensor data.