isopro.varpro.Rd
Use isolation forests to identify rare/anomalous data.
object: VarPro object obtained from a previous call.
method: Isolation forest method. Choices are "unsupv" (unsupervised analysis, the default), "rnd" (pure random splitting) and "auto" (auto-encoder, a type of multivariate forest).
sampsize: Function specifying the sample size used for constructing a tree, where sampling is without replacement. Can also be specified as a number (see the sketch following this list).
ntree: Number of trees.
nodesize: Minimum terminal node size.
formula: Formula used when a supervised isolation forest is requested. Ignored if object is provided.
data: Data frame used for fitting the isolation forest. Ignored if object is provided.
...: Additional options passed to rfsrc.
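A minimal sketch of how these arguments can be supplied (x is a placeholder data frame; the function form of sampsize shown here is an illustrative choice, not necessarily the package default):
## sampsize as a fixed number
o1 <- isopro(data = x, method = "rnd", sampsize = 32)
## sampsize as a function of the sample size n (illustrative choice)
o2 <- isopro(data = x, method = "unsupv",
             sampsize = function(n) min(32, ceiling(.632 * n)),
             ntree = 250, nodesize = 1)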
Isolation Forest (Liu et al., 2008) is a random forest procedure for detecting anomalous data. In the original method, random trees are constructed using pure random splitting, with each tree built from a subsample of the data. Typically the subsampling rate is much smaller than the 0.632 rate used with random forests. The idea is that under random splitting, and because of extreme subsampling, rare or anomalous data will tend to split early and become isolated very quickly. Because of this, the depth of an observation, defined as the number of splits needed for it to reach a terminal node, can be used to find anomalous observations: anomalous data points are those with relatively small depth values.
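The isolation principle can be sketched in a few lines of plain R. The toy code below is illustrative only and is not the isopro implementation: it grows pure-random trees on small subsamples and records the depth at which each point isolates, so anomalies receive small average depths.
## toy isolation tree: depth at which point x isolates within subsample X
iso.depth <- function(x, X, depth = 0, max.depth = 10) {
  if (nrow(X) <= 1 || depth >= max.depth) return(depth)
  j <- sample(ncol(X), 1)                  ## random split variable
  s <- runif(1, min(X[, j]), max(X[, j]))  ## random split point
  X <- X[if (x[j] <= s) X[, j] <= s else X[, j] > s, , drop = FALSE]
  iso.depth(x, X, depth + 1, max.depth)
}
set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))  ## row 101 is an outlier
score <- sapply(1:nrow(X), function(i) {
  mean(replicate(100,
    iso.depth(X[i, ], X[sample(nrow(X), 32), , drop = FALSE])))
})
which.min(score)  ## the outlier has the smallest average depth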
There are several ways to run the code. The default is to provide a formula and a data set via the options formula and data. For an unsupervised analysis (the usual application of isolation forest), the user only needs to pass in a data set using data; in this setting the option method applies (see above). When both a formula and a data set are provided, a supervised analysis is run and the option method no longer applies. This is the less conventional approach.
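Schematically, the first two invocation modes look as follows, where dta is a placeholder data frame and y a placeholder outcome:
## 1. unsupervised: data only; the option method applies
o1 <- isopro(data = dta, method = "unsupv")
## 2. supervised: formula plus data; the option method is ignored
o2 <- isopro(formula = y ~ ., data = dta)
The third mode, passing a VarPro object, is described next.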
Another approach (also unconventional) is to pass a VarPro object obtained from a previous call (typically an unsupervised analysis using method "unsupv", though this is not the only choice). An unsupervised isolation forest is then applied to the dimension-reduced x matrix from the VarPro call. This results in an analysis similar to using only the option data; the difference, and the advantage, is that dimension reduction is achieved via VarPro.
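A minimal sketch of this route, anticipating the Boston housing example below:
o <- varpro(medv ~ ., BostonHousing)  ## supervised VarPro: dimension reduction
o.iso <- isopro(o)                    ## unsupervised isolation forest on the reduced x
head(sort(o.iso$howbad))              ## smallest percentiles flag the most anomalous cases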
Users should experiment with the type of isolation forest method used: while the original isolation forest performs well, there are many examples where it can be improved upon. See the examples below, which illustrate that the pure random splitting method "rnd" may not always be best.
Computationally, "rnd" is the fastest method, followed by "unsupv". The slowest is "auto", which should be reserved for low-dimensional problems.
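As a rough, illustrative way to compare runtimes on a given data frame x:
sapply(c("rnd", "unsupv", "auto"), function(m) {
  system.time(isopro(data = x, method = m, sampsize = 32))["elapsed"]
})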
Liu F.T., Ting K.M. and Zhou Z.H. (2008). Isolation forest. Eighth IEEE International Conference on Data Mining, IEEE.
Ishwaran H. (2025). Multivariate Statistics: Classical Foundations and Modern Machine Learning, CRC (Chapman and Hall), in press.
## ------------------------------------------------------------
##
## satellite data: convert some of the classes to "outliers"
## unsupervised isopro analysis
##
## ------------------------------------------------------------
## load data, make three of the classes into outliers
data(Satellite, package = "mlbench")
is.outlier <- is.element(Satellite$classes,
    c("damp grey soil", "cotton crop", "vegetation stubble"))
## remove class labels, make unsupervised data
x <- Satellite[, names(Satellite)[names(Satellite) != "classes"]]
## isopro calls
i.rnd <- isopro(data = x, method = "rnd", sampsize = 32)
i.uns <- isopro(data = x, method = "unsupv", sampsize = 32)
i.aut <- isopro(data = x, method = "auto", sampsize = 32)
## AUC and precision recall (computed using true class label information)
perf <- cbind(get.iso.performance(is.outlier, i.rnd$howbad),
              get.iso.performance(is.outlier, i.uns$howbad),
              get.iso.performance(is.outlier, i.aut$howbad))
colnames(perf) <- c("rnd", "unsupv", "auto")
print(perf)
# \donttest{
## ------------------------------------------------------------
##
## boston housing analysis
## isopro analysis using a previous VarPro (supervised) object
##
## ------------------------------------------------------------
data(BostonHousing, package = "mlbench")
## call varpro first and then isopro
o <- varpro(medv~., BostonHousing)
o.iso <- isopro(o)
## identify data with extreme percentiles
print(BostonHousing[o.iso$howbad <= quantile(o.iso$howbad, .01),])
## ------------------------------------------------------------
##
## boston housing analysis
## supervised isopro analysis - direct call using formula/data
##
## ------------------------------------------------------------
data(BostonHousing, package = "mlbench")
## direct approach uses formula and data options
o.iso <- isopro(formula=medv~., data=BostonHousing)
## identify data with extreme percentiles
print(BostonHousing[o.iso$howbad <= quantile(o.iso$howbad, .01),])
## ------------------------------------------------------------
##
## monte carlo experiment to study different methods
## unsupervised isopro analysis
##
## ------------------------------------------------------------
## monte carlo parameters
nrep <- 25
n <- 1000
## simulation function
twodimsim <- function(n = 1000) {
  cluster1 <- data.frame(
    x = rnorm(n, -1, .4),
    y = rnorm(n, -1, .2)
  )
  cluster2 <- data.frame(
    x = rnorm(n, +1, .2),
    y = rnorm(n, +1, .4)
  )
  outlier <- data.frame(
    x = -1,
    y = 1
  )
  x <- data.frame(rbind(cluster1, cluster2, outlier))
  is.outlier <- c(rep(FALSE, 2 * n), TRUE)
  list(x = x, is.outlier = is.outlier)
}
## monte carlo loop
hbad <- do.call(rbind, lapply(1:nrep, function(b) {
  cat("iteration:", b, "\n")
  ## draw the data
  simO <- twodimsim(n)
  x <- simO$x
  is.outlier <- simO$is.outlier
  ## isopro calls
  i.rnd <- isopro(data = x, method = "rnd")
  i.uns <- isopro(data = x, method = "unsupv")
  i.aut <- isopro(data = x, method = "auto")
  ## save results
  c(tail(i.rnd$howbad, 1),
    tail(i.uns$howbad, 1),
    tail(i.aut$howbad, 1))
}))
## compare performance
colnames(hbad) <- c("rnd", "unsupv", "auto")
print(summary(hbad))
boxplot(hbad, col = "blue", ylab = "outlier percentile value")
# }