outpro computes an out-of-distribution (OOD) score for new inputs using a fitted model, integrating variable prioritization and local neighborhoods derived from the model. The procedure is model aware and subspace aware: it scores departures in the coordinates that the model has learned to rely on, rather than using a global distance in the full feature space. It is applicable across all outcome types.

outpro(object,
       newdata,
       neighbor = NULL,
       distancef = "prod",
       reduce = TRUE,
       cutoff = NULL,
       max.rules.tree = 150,
       max.tree = 150)

outpro.null(object,
            nulldata = NULL,
            neighbor = NULL,
            distancef = "prod",
            reduce = TRUE,
            cutoff = .79,
            max.rules.tree = 150,
            max.tree = 150)

Details

Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ in ways that matter for prediction. The approach here embeds variable prioritization directly in the detection step, constructing localized, task relevant neighborhoods from the fitted model and aggregating coordinate wise deviations within the selected subspace to obtain a distance value for an input.

For a varpro object, variable prioritization is obtained from the model and controlled by cutoff. For an rfsrc object, all predictors are used unless a reduction is supplied. Distances are computed after standardizing the selected variables with training means and scales. Variables with zero standard deviation in the training data are removed automatically before scoring.
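
The standardization step described above can be sketched in base R. This is a minimal illustration on made-up data, not the package's internal code; the object names x.train and x.new are hypothetical, while means, sds, xorg.scale and xnew.scale simply mirror the component names listed under Value.

```r
## Toy training and new data (hypothetical values, for illustration only)
x.train <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5), c = c(10, 20, 30, 40))
x.new   <- data.frame(a = c(2.5, 10), b = c(5, 6), c = c(25, 0))

## Training means and scales
means <- colMeans(x.train)
sds   <- apply(x.train, 2, sd)

## Drop variables with zero standard deviation in the training data
keep    <- names(sds)[sds > 0]
dropped <- setdiff(names(sds), keep)

## Standardize both matrices with the TRAINING statistics
xorg.scale <- scale(as.matrix(x.train[, keep, drop = FALSE]),
                    center = means[keep], scale = sds[keep])
xnew.scale <- scale(as.matrix(x.new[, keep, drop = FALSE]),
                    center = means[keep], scale = sds[keep])
```

Using the training statistics for both matrices keeps new cases on the same scale as the reference data, so a large standardized value in xnew.scale reflects a departure from the training distribution rather than a property of the new batch.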

The multiplicative "prod" metric uses a small \(\epsilon\) to avoid zero multiplicands. Since differences are measured on a standardized scale, \(\epsilon\) is set automatically by default as a small fraction of the median absolute coordinate difference across variables and neighbors; users can keep the default or pass a custom value to out.distance when calling that function directly.
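
To see why \(\epsilon\) matters, consider the following base R sketch. The exact aggregation used by the "prod" metric is not spelled out here, so the code assumes one plausible form (a per-neighbor product of \(\epsilon\) plus the absolute difference, averaged over neighbors) and an arbitrary fraction of 1 percent for \(\epsilon\); both are assumptions made for illustration only.

```r
## Hypothetical absolute standardized differences for ONE case:
## rows = neighbors, columns = selected variables
d <- matrix(c(0.0, 0.4, 1.2,
              0.2, 0.0, 0.9,
              0.1, 0.3, 1.5), nrow = 3, byrow = TRUE)

## ASSUMPTION: epsilon is 1% of the median absolute difference
eps.fraction <- 0.01
epsilon <- eps.fraction * median(d)

## ASSUMPTION: per-neighbor product of (epsilon + difference),
## then averaged over neighbors
prod.per.neighbor <- apply(d + epsilon, 1, prod)
score <- mean(prod.per.neighbor)

## Without epsilon, a single zero coordinate wipes out the whole product
prod.no.eps <- apply(d, 1, prod)
```

In this toy matrix two of the three neighbors agree exactly with the case on one coordinate, so their unprotected products collapse to zero regardless of how far apart the other coordinates are; adding \(\epsilon\) keeps those neighbors informative.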

The Mahalanobis option uses absolute differences by design and the covariance of standardized training features. A small ridge is added to the covariance for numerical stability.
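
The role of the ridge can be illustrated with a short base R sketch. The data are simulated, the ridge size 1e-6 is an arbitrary illustrative choice, and the code is not the package's implementation; it only shows how a ridge keeps the quadratic form computable when the covariance is singular or nearly so.

```r
set.seed(1)

## Hypothetical standardized training features (rows = cases)
x <- scale(matrix(rnorm(200), ncol = 2))
x <- cbind(x, x[, 1])   ## duplicated column makes the covariance (near) singular

## Covariance of the standardized features plus a small ridge
S <- cov(x)
ridge <- 1e-6           ## ASSUMPTION: illustrative ridge size
S.ridge <- S + ridge * diag(ncol(S))

## Absolute coordinate differences between one new point and one neighbor
d <- abs(c(0.5, -1.0, 0.5) - x[1, ])

## Mahalanobis-type distance computed on the absolute differences
dist.mahal <- sqrt(drop(t(d) %*% solve(S.ridge, d)))
```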

Arguments

object

A fitted varpro object or an rfsrc object with classes c("rfsrc","grow").

newdata

New data to score. If omitted, the training design matrix is used. For varpro objects, encodings are aligned to training with get.hotencode.test.

neighbor

Number of training neighbors per case, as determined by the model structure. If NULL, a default of min(n/10, 5000) is used where n is the number of training rows.
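
As a quick check of the default rule, the helper below (a hypothetical function, not part of the package) evaluates min(n/10, 5000) for two training sizes. Whether the result is rounded internally is not stated here, so the sketch leaves it unrounded.

```r
## Hypothetical helper reproducing the documented default rule
default.neighbor <- function(n) min(n / 10, 5000)

small.n <- default.neighbor(379)   ## about one tenth of the training rows
large.n <- default.neighbor(1e5)   ## capped at 5000
```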

distancef

Distance function for aggregation. One of "prod", "euclidean", "mahalanobis", "manhattan", "minkowski", "kernel". The default is "prod".

reduce

Controls variable selection. If TRUE with a varpro object, uses model based prioritization with threshold cutoff. A character vector selects variables by name. A named numeric vector supplies variable weights. Otherwise all predictors are used with unit weights.

cutoff

Threshold applied to the varpro variable importance z. If NULL, the default depends on the number of predictors: .79 when the number of predictors is not large, otherwise 0.

max.rules.tree

Maximum number of rules per tree for neighbor extraction.

max.tree

Maximum number of trees to use for neighbor extraction.

nulldata

For outpro.null, optional data representing an in distribution reference. If omitted, the training design matrix is used.

Value

outpro returns a list with components:

  • distance: numeric vector of length nrow(newdata) with one score per case.

  • distance.object: ingredients used for distance computation, including

    • score: neighbor frames returned by varpro.strength.

    • neighbor: neighbor count per case.

    • xvar.names: selected variable names after zero sd removal.

    • xvar.wt: variable weights used after normalization.

    • dist.xvar: list of absolute coordinate difference matrices (neighbors by cases) in standardized units.

    • xorg.scale, xnew.scale: standardized training and test matrices for the selected variables.

    • means, sds: training means and scales for the selected variables.

    • dropped.zero.sd.variables: variables removed due to zero standard deviation in training.

  • distance.args: list of metric arguments actually used, including distancef, weights.used, normalize.weights, p, and epsilon.used.

  • score: the neighbor information returned by varpro.strength.

  • neighbor: neighbor setting used.

  • cutoff: cutoff used for variable prioritization.

  • oob.bits: indicator of whether scoring was done on training rows or new data.

  • selected.variables: the variables used in scoring after all filters.

  • selected.weights: the normalized squared weights for the selected variables.

  • means, sds: copies of the training means and scales stored in distance.object, duplicated for convenience.

  • call: the matched call.

outpro.null returns the same list with two additional components:

  • cdf: the empirical distribution function of distance.

  • quantile: the empirical cumulative probability for each scored case.
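
The relationship between the two components can be sketched with base R's ecdf. The distances below are simulated rather than produced by outpro.null, and the 0.95 threshold is an illustrative assumption, not a package default.

```r
set.seed(1)

## Hypothetical in-distribution (null) distances and two new scores
null.distance <- rexp(500)
new.distance  <- c(0.1, 6)

## Empirical distribution function of the null distances
null.cdf <- ecdf(null.distance)

## Empirical cumulative probability for each null case
null.quantile <- null.cdf(null.distance)

## Position of new scores relative to the null distribution
new.prob <- null.cdf(new.distance)
flag <- new.prob > 0.95   ## ASSUMPTION: illustrative 95% threshold
```

A new case whose score falls in the upper tail of the null distribution is a candidate OOD input; the threshold trades off false alarms against missed detections and should be chosen for the application.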

Background

The method follows a model centered view of out-of-distribution (OOD) detection that is both model aware and subspace aware. Variable prioritization is embedded directly in the detection process to focus on coordinates that matter for prediction and to discount nuisance directions. Scoring does not rely on global feature density estimation. The implementation uses a random forest engine whose rule based structure provides localized neighborhoods reflecting the learned predictive mapping.

Examples

## ------------------------------------------------

## fit a varPro model
library(varPro)
data(BostonHousing, package = "mlbench")
set.seed(1)  ## make the train/test split reproducible
smp <- sample(seq_len(nrow(BostonHousing)), size = floor(nrow(BostonHousing) * .75))
train.data <- BostonHousing[smp, ]
test.data <- BostonHousing[-smp, ]
vp <- varpro(medv ~ ., data = train.data)

## Score new data with default multiplicative metric
op <- outpro(vp, newdata = test.data)
head(op$distance)

## Calibrate a null distribution using training data
op.null <- outpro.null(vp)
head(op.null$quantile)