outpro.Rd
outpro computes an out-of-distribution (OOD) score for new inputs using
a fitted model, integrating variable prioritization and local
neighborhoods derived from the model. The procedure is model-aware and
subspace-aware: it scores departures in the coordinates that the model
has learned to rely on, rather than relying on a global distance in the
full feature space. It is applicable across all outcome types.
outpro(object,
newdata,
neighbor = NULL,
distancef = "prod",
reduce = TRUE,
cutoff = NULL,
max.rules.tree = 150,
max.tree = 150)
outpro.null(object,
nulldata = NULL,
neighbor = NULL,
distancef = "prod",
reduce = TRUE,
cutoff = .79,
max.rules.tree = 150,
max.tree = 150)
Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ in ways that matter for prediction. The approach here embeds variable prioritization directly in the detection step: it constructs localized, task-relevant neighborhoods from the fitted model and aggregates coordinate-wise deviations within the selected subspace to obtain a distance value for each input.
For a varpro object, variable prioritization is obtained from the model
and controlled by cutoff. For an rfsrc object, all predictors are used
unless a reduction is supplied. Distances are computed after
standardizing the selected variables with training means and scales.
Variables with zero standard deviation in the training data are removed
automatically before scoring.
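The standardization step described above can be sketched in base R. This is a minimal illustration on toy data, not the package's internal code; the object names mirror the returned components (means, sds, xorg.scale, xnew.scale), but the data and the exact handling are assumptions.

```r
## Toy training and new data; x3 is constant in training (zero sd).
x.train <- data.frame(x1 = c(1, 2, 3, 4),
                      x2 = c(10, 20, 30, 40),
                      x3 = c(5, 5, 5, 5))
x.new   <- data.frame(x1 = c(2.5, 10),
                      x2 = c(25, 25),
                      x3 = c(5, 7))

## Training means and scales.
means <- colMeans(x.train)
sds   <- apply(x.train, 2, sd)

## Drop variables with zero standard deviation in training.
keep    <- names(sds)[sds > 0]
dropped <- setdiff(names(sds), keep)

## Standardize both data sets with the training statistics only.
xorg.scale <- scale(x.train[, keep], center = means[keep], scale = sds[keep])
xnew.scale <- scale(x.new[, keep],   center = means[keep], scale = sds[keep])
```

Using training statistics for the new data keeps both sets on the same scale, so a new case far from the training mean shows up as a large standardized value rather than being re-centered away.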
The multiplicative "prod" metric uses a small \(\epsilon\) to avoid zero
multiplicands. Since differences are measured on a standardized scale,
\(\epsilon\) is set automatically by default as a small fraction of the
median absolute coordinate difference across variables and neighbors;
users can keep the default or pass a custom value via out.distance if
calling it directly.
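The role of \(\epsilon\) can be seen in a small base R sketch. The fraction (eps.fraction), the per-neighbor product, and the averaging over neighbors are all assumptions made for illustration; the text above does not state the exact formula, so this should not be read as the package's implementation.

```r
## Hypothetical absolute standardized differences for one scored case:
## rows are neighbors, columns are variables.
d <- matrix(c(0.0, 0.2, 0.1,
              0.3, 0.0, 0.4,
              0.1, 0.5, 0.2), nrow = 3, byrow = TRUE)

## Assumed default: epsilon is a small fraction of the median absolute
## coordinate difference (the fraction here is hypothetical).
eps.fraction <- 0.01
epsilon <- eps.fraction * median(d)

## Without epsilon, a single zero difference zeroes the whole product.
prod.raw <- apply(d, 1, prod)

## With epsilon, every multiplicand is strictly positive.
prod.eps <- apply(d + epsilon, 1, prod)

## Aggregate over neighbors (the mean is an assumption of this sketch).
score <- mean(prod.eps)
```

In this toy case the first two neighbors each match the scored case exactly on one coordinate, so their raw products collapse to zero even though the other coordinates differ; adding \(\epsilon\) preserves that information.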
The Mahalanobis option uses absolute differences by design and the covariance of standardized training features. A small ridge is added to the covariance for numerical stability.
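A minimal base R sketch of a ridge-stabilized Mahalanobis distance on absolute differences follows. The ridge size, the stand-in neighbors, and the final aggregation over neighbors are assumptions for illustration only; stats::mahalanobis is used here as a convenient way to evaluate the quadratic form.

```r
set.seed(1)

## Toy training data; the third column duplicates the first, so the
## covariance of the standardized features is singular.
x.train <- matrix(rnorm(200), ncol = 2)
x.train <- cbind(x.train, x.train[, 1])
z.train <- scale(x.train)

## Covariance of standardized training features plus a small ridge
## (the ridge value is hypothetical).
S       <- cov(z.train)
ridge   <- 1e-6
S.ridge <- S + ridge * diag(ncol(S))

## Absolute coordinate differences between a standardized new case and
## a few stand-in neighbors (rows are neighbors, columns are variables).
z.new <- c(3, 0, 3)
z.nbr <- z.train[1:5, ]
d     <- abs(sweep(z.nbr, 2, z.new))

## Squared Mahalanobis form for each neighbor, then one summary value
## (root mean square over neighbors is an assumption of this sketch).
m2        <- mahalanobis(d, center = rep(0, ncol(d)), cov = S.ridge)
dist.maha <- sqrt(mean(m2))
```

The ridge makes the covariance strictly positive definite, so it can be inverted even when selected variables are collinear.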
object: A fitted varpro object or an rfsrc object with classes c("rfsrc","grow").
newdata: New data to score. If omitted, the training design matrix is used. For varpro objects, encodings are aligned to training with get.hotencode.test.
neighbor: Number of training neighbors per case, as determined by the model structure. If NULL, a default of min(n/10, 5000) is used, where n is the number of training rows.
distancef: Distance function for aggregation. One of "prod", "euclidean", "mahalanobis", "manhattan", "minkowski", "kernel". The default is "prod".
reduce: Controls variable selection. If TRUE with a varpro object, uses model-based prioritization with threshold cutoff. A character vector selects variables by name. A named numeric vector supplies variable weights. Otherwise all predictors are used with unit weights.
cutoff: Threshold used with varpro variable importance z. If NULL, a default based on the number of predictors is used: .79 when the number of predictors is not large, else 0.
max.rules.tree: Maximum number of rules per tree for neighbor extraction.
max.tree: Maximum number of trees to use for neighbor extraction.
nulldata: For outpro.null, optional data representing an in-distribution reference. If omitted, the training design matrix is used.
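As a quick check of the default neighbor count min(n/10, 5000) described in the arguments, the arithmetic can be written out directly. The helper name default.neighbor is hypothetical, and how a non-integer value is rounded is not stated here, so the sketch leaves it unrounded.

```r
## Hypothetical helper reproducing the documented default only.
default.neighbor <- function(n) min(n / 10, 5000)

default.neighbor(379)    # 37.9: small training sets use about a tenth of the rows
default.neighbor(1e5)    # 5000: large training sets are capped
```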
outpro returns a list with components:
distance: numeric vector of length nrow(newdata) with one score per case.
distance.object: ingredients used for distance computation, including
  score: neighbor frames returned by varpro.strength.
  neighbor: neighbor count per case.
  xvar.names: selected variable names after zero sd removal.
  xvar.wt: variable weights used after normalization.
  dist.xvar: list of absolute coordinate difference matrices (neighbors by cases) in standardized units.
  xorg.scale, xnew.scale: standardized training and test matrices for the selected variables.
  means, sds: training means and scales for the selected variables.
  dropped.zero.sd.variables: variables removed due to zero standard deviation in training.
  distance.args: list of metric arguments actually used, including distancef, weights.used, normalize.weights, p, and epsilon.used.
score: the neighbor information returned by varpro.strength.
neighbor: neighbor setting used.
cutoff: cutoff used for variable prioritization.
oob.bits: indicator of whether scoring was done on training rows or new data.
selected.variables: the variables used in scoring after all filters.
selected.weights: the normalized squared weights for the selected variables.
means, sds: duplicates for convenience.
call: the matched call.
outpro.null returns the same list with two additional components:
cdf: the empirical distribution function of distance.
quantile: the empirical cumulative probability for each scored case.
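The relationship between cdf and quantile can be illustrated with stats::ecdf. This is a sketch on simulated stand-in distances; whether outpro.null builds its cdf component with ecdf is an assumption, and the 0.95 threshold is a hypothetical choice, not a package default.

```r
set.seed(1)

## Stand-in for distances scored on in-distribution (null) data.
null.distance <- rexp(200)

## Stand-in for distances of cases to be assessed.
new.distance <- c(0.1, 1, 8)

## Empirical distribution function of the null distances.
cdf <- ecdf(null.distance)

## Empirical cumulative probability of each assessed distance.
quantile.new <- cdf(new.distance)

## Flag cases in the upper tail of the null distribution
## (hypothetical threshold).
flagged <- quantile.new > 0.95
```

A value near 1 means the case is farther from its model-derived neighbors than almost all reference cases, which is the pattern expected of an out-of-distribution input.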
The method follows a model-centered view of out-of-distribution (OOD) detection that is both model-aware and subspace-aware. Variable prioritization is embedded directly in the detection process to focus on coordinates that matter for prediction and to discount nuisance directions. Scoring does not rely on global feature density estimation. The implementation uses a random forest engine whose rule-based structure provides localized neighborhoods reflecting the learned predictive mapping.
## ------------------------------------------------
## fit a varPro model
library(varPro)
data(BostonHousing, package = "mlbench")
set.seed(1)  ## reproducible train/test split
smp <- sample(seq_len(nrow(BostonHousing)), size = floor(0.75 * nrow(BostonHousing)))
train.data <- BostonHousing[smp, ]
test.data <- BostonHousing[-smp, ]
vp <- varpro(medv ~ ., data = train.data)
## Score new data with default multiplicative metric
op <- outpro(vp, newdata = test.data)
head(op$distance)
## Calibrate a null distribution using training data
op.null <- outpro.null(vp)
head(op.null$quantile)