Unsupervised feature selection plays a vital role in high-dimensional data analysis, especially in fields such as bioinformatics, imaging, and text mining, where labeled outcomes are often unavailable or costly to obtain. In the absence of supervision, identifying informative variables becomes a fundamentally different and more challenging task: without a response to guide selection, variable importance must be inferred from the intrinsic structure of the data itself.
We address this challenge by extending the Variable Priority (VarPro) framework, originally developed for supervised learning, to the unsupervised domain. The resulting approach, known as Unsupervised Variable Priority (UVarPro), reframes feature selection as a sequence of localized two-class classification tasks, where the class labels arise from rule-based data partitions and their complement regions. This transformation introduces a layer of supervision grounded in statistical dependence and Bayes’ theorem, enabling principled importance estimation without labeled data.
Unlike purely heuristic approaches based on variance or clustering alignment, UVarPro draws from a more formal notion of informativeness: a variable is considered a signal if it explains the joint structure of the data and contributes to dependencies among features. This aligns with the concept of a Markov blanket in graphical models, where a minimal set of variables renders others conditionally independent. By leveraging this principle, UVarPro selects variables that preserve essential structure while filtering out noise.
UVarPro is implemented by the uvarpro
function which
provides multiple strategies for unsupervised forest construction,
controlled by the method
argument:
method = "auto"
(default): Constructs a
random forest autoencoder by regressing selected variables against
themselves. This specialized multivariate forest effectively learns the
internal structure of the data but may be slower on large
datasets.
method = "unsupv"
: Builds forests using standard
unsupervised random forest techniques that do not rely on any response
variable. This approach offers a good balance between scalability and
structural fidelity.
method = "rnd"
: Constructs forests using pure random
splitting, independent of data-driven criteria. While this sacrifices
structural precision, it offers the fastest performance and is useful
for extremely large datasets.
Feature importance in uvarpro
is quantified using an
entropy-based criterion derived from the overall variability explained
by each variable. Users may also supply custom entropy functions to
tailor the importance scoring mechanism to specific applications. See
the examples below.
For further details, see Zhou, Lu and Ishwaran (2025). Variable Priority for Unsupervised Variable Selection.
library(varPro)
data(BostonHousing, package = "mlbench")
## default call
o <- uvarpro(BostonHousing)
print(importance(o))
mean std z
nox 0.5880322 0.3838791 1.5318162
indus 0.4964445 0.3893825 1.2749533
rad 0.4051130 0.3518696 1.1513157
zn 0.4124063 0.4104604 1.0047407
medv 0.3700886 0.4275897 0.8655226
tax 0.2813970 0.3510989 0.8014749
crim 0.2791197 0.3866170 0.7219540
dis 0.2617584 0.3878109 0.6749641
lstat 0.2666589 0.3990492 0.6682358
ptratio 0.2297203 0.3440373 0.6677192
rm 0.2032958 0.3678408 0.5526733
age 0.1778247 0.3459164 0.5140685
b 0.0000000 0.0000000 NaN
chas 0.0000000 0.0000000 NaN
## unsupervised splitting
o <- uvarpro(BostonHousing, method = "unsupv")
print(importance(o))
mean std z
nox 0.4869640 0.3995163 1.2188839
indus 0.4495944 0.4044867 1.1115182
lstat 0.4240813 0.4392637 0.9654369
crim 0.3803382 0.3983700 0.9547361
dis 0.3802262 0.4123258 0.9221499
tax 0.3220476 0.3811110 0.8450231
ptratio 0.3191808 0.3920556 0.8141211
medv 0.3302492 0.4193045 0.7876120
rm 0.3270985 0.4244647 0.7706141
rad 0.2503864 0.3309411 0.7565889
age 0.3014338 0.4085133 0.7378802
chas 0.2578039 0.3717827 0.6934263
zn 0.2466710 0.3703855 0.6659845
b 0.0000000 0.0000000 NaN
## load the data
data(BostonHousing, package = "mlbench")
## convert some of the features to factors
Boston <- BostonHousing
Boston$zn <- factor(Boston$zn)
Boston$chas <- factor(Boston$chas)
Boston$lstat <- factor(round(0.2 * Boston$lstat))
Boston$nox <- factor(round(20 * Boston$nox))
Boston$rm <- factor(round(Boston$rm))
## call unsupervised varpro and print importance
print(importance(o <- uvarpro(Boston)))
mean std z
indus 0.54117940 0.3271978 1.6539825
rm6 0.56326599 0.3975106 1.4169835
zn0 0.51511629 0.3794794 1.3574289
rad 0.36567070 0.3642130 1.0040022
tax 0.26527079 0.3446583 0.7696632
dis 0.28795676 0.3760424 0.7657561
lstat1 0.25008573 0.3660272 0.6832436
medv 0.24229301 0.3756820 0.6449418
ptratio 0.18427676 0.3168359 0.5816159
crim 0.18407113 0.3258101 0.5649645
nox8 0.17606129 0.3185826 0.5526393
nox11 0.11161593 0.2519212 0.4430588
nox9 0.08109181 0.2257001 0.3592901
nox12 0.06632392 0.1892204 0.3505114
nox10 0.07201069 0.2107668 0.3416604
nox17 0.06073111 0.1777529 0.3416604
age 0.00000000 0.0000000 NaN
chas 0.00000000 0.0000000 NaN
lstat0 0.00000000 0.0000000 NaN
lstat2 0.00000000 0.0000000 NaN
lstat3 0.00000000 0.0000000 NaN
lstat6 0.00000000 0.0000000 NaN
nox13 0.00000000 0.0000000 NaN
nox14 0.00000000 0.0000000 NaN
rm5 0.00000000 0.0000000 NaN
rm7 0.00000000 0.0000000 NaN
rm8 0.00000000 0.0000000 NaN
zn12.5 0.00000000 0.0000000 NaN
zn20 0.00000000 0.0000000 NaN
zn28 0.00000000 0.0000000 NaN
zn75 0.00000000 0.0000000 NaN
## get top variables
get.topvars(o)
[1] "indus" "rm6" "zn0" "rad" "tax" "dis" "lstat1" "medv"
[9] "ptratio" "crim" "nox8" "nox11" "nox9" "nox12" "nox10" "nox17"
[17] "age" "chas" "lstat0" "lstat2" "lstat3" "lstat6" "nox13" "nox14"
[25] "rm5" "rm7" "rm8" "zn12.5" "zn20" "zn28" "zn75"
## map importance values back to original features
print(get.orgvimp(o))
variable z
3 indus 1.6539825
6 rm 1.4169835
2 zn 1.3574289
9 rad 1.0040022
10 tax 0.7696632
8 dis 0.7657561
12 lstat 0.6832436
13 medv 0.6449418
11 ptratio 0.5816159
1 crim 0.5649645
5 nox 0.5526393
4 chas 0.0000000
7 age 0.0000000
## same as above ... but for all variables
print(get.orgvimp(o, pretty = FALSE))
crim zn indus chas nox rm age dis rad
0.5649645 1.3574289 1.6539825 0.0000000 0.5526393 1.4169835 0.0000000 0.7657561 1.0040022
tax ptratio b lstat medv
0.7696632 0.5816159 0.0000000 0.6832436 0.6449418
VarPro allows users to define a custom importance function through
the (hidden) entropy
argument. This function is evaluated
internally during forest construction.
The custom function must return a list with two elements: 1. A scalar
importance value. 2. An entropy
list containing the
comp
and oob
membership vectors.
Example:
my.entropy <- function(xC, xO, ...) {
wss <- mean(apply(rbind(xO, xC), 2, sd, na.rm = TRUE))
bss <- mean(apply(xO, 2, sd, na.rm = TRUE)) +
mean(apply(xC, 2, sd, na.rm = TRUE))
imp <- 0.5 * bss / wss
entropy <- list(comp = list(...)$compMembership,
oob = list(...)$oobMembership)
list(imp = imp, entropy)
}
o <- uvarpro(BostonHousing, entropy = my.entropy)
print(importance(o))
mean std z
nox 0.5689125 0.3900506 1.4585608
indus 0.5087139 0.3900071 1.3043709
rad 0.3504640 0.3520091 0.9956106
zn 0.3997953 0.4147441 0.9639565
medv 0.4126304 0.4310544 0.9572582
tax 0.3490256 0.3727909 0.9362502
crim 0.3446691 0.3990397 0.8637464
dis 0.2564835 0.3807887 0.6735587
ptratio 0.2151499 0.3324902 0.6470863
lstat 0.2429619 0.3907914 0.6217177
age 0.2182291 0.3743837 0.5829022
rm 0.1556733 0.3331967 0.4672112
b 0.0000000 0.0000000 NaN
chas 0.0000000 0.0000000 NaN
As an alternative, users can compute variable importance
after model fitting by working directly with the
$entropy
output. This approach gives full control over the
definition of importance scores.
data(BostonHousing, package = "mlbench")
o <- uvarpro(BostonHousing, ntree = 3, max.rules.tree = 10)
## convert original/release region into two-class problem
## define importance as the lasso beta values
## For faster performance on Unix systems, consider using:
## library(parallel)
## imp <- do.call(rbind, mclapply(seq_along(o$entropy), function(j) { ... }))
imp <- do.call(rbind, lapply(seq_along(o$entropy), function(j) {
rO <- do.call(rbind, lapply(o$entropy[[j]], function(r) {
xC <- o$x[r[[1]],names(o$entropy),drop=FALSE]
xO <- o$x[r[[2]],names(o$entropy),drop=FALSE]
y <- factor(c(rep(0, nrow(xC)), rep(1, nrow(xO))))
x <- rbind(xC, xO)
x <- x[, colnames(x) != names(o$entropy)[j]]
fit <- tryCatch(
suppressWarnings(glmnet::cv.glmnet(as.matrix(x), y, family = "binomial")),
error = function(e) NULL
)
if (!is.null(fit)) {
beta <- setNames(rep(0, length(o$entropy)), names(o$entropy))
bhat <- abs(coef(fit)[-1, 1])
beta[names(bhat)] <- bhat
beta
} else {
NULL
}
}))
if (!is.null(rO)) {
val <- colMeans(rO, na.rm = TRUE)
names(val) <- colnames(rO)
return(val)
} else {
return(NULL)
}
}) |> setNames(names(o$entropy)))
print(imp)
crim zn indus nox rm dis rad ptratio
crim 0.0000000 0.00000000 0.000000000 0.5186556 0.0000000 0.04332184 0.119731180 0.000000000
zn 8.3120410 0.00000000 0.115124727 33.6707767 0.0000000 0.03767438 0.008917331 0.000000000
indus 0.4843001 0.08755108 0.000000000 25.0019427 0.9215305 0.81272780 0.192673483 0.001807552
nox 0.7031151 0.01256135 0.197637889 0.0000000 0.1103938 1.48287678 0.559274899 0.186624787
dis 0.0000000 0.00000000 0.000000000 15.4727857 0.0000000 0.00000000 0.000000000 0.000000000
rad 5.2414764 0.00000000 1.189979771 2.7000874 0.0000000 0.77578291 0.000000000 0.712066509
lstat 0.5619712 0.00000000 0.005360787 0.7473822 1.2940632 0.00000000 0.000000000 0.000000000
medv 0.0000000 0.00000000 0.000000000 0.0000000 2.9693986 0.00000000 0.000000000 0.298883360
lstat medv
crim 0.02642667 0.00000000
zn 0.00000000 0.00000000
indus 0.01050727 0.00000000
nox 0.03387207 0.02638144
dis 0.00000000 0.00000000
rad 0.01790731 0.02322370
lstat 0.00000000 0.05517093
medv 0.12894740 0.00000000
In the above examples, the th row serves as a ensembled regression model, representing the importance scores for the th variable.
For large datasets, consider using parallel::mclapply
to
improve computation time.
Users can also use the fast built in hidden lasso-beta entropy function:
varPro:::get.beta.entropy(o)
.
Summary |
Both approaches offer flexibility in defining importance scores tailored to domain-specific objectives. Option 1 is more integrated, while Option 2 offers full post-processing control. |
VarPro includes a powerful tool for unsupervised data visualization.
For each VarPro rule, a two-class analysis is performed by comparing the rule-defined region to its complementary region. This analysis produces regression coefficients that highlight variables most associated with the release variable.
These coefficients are then used to scale the centroids of the two regions. By saving all such pairs of scaled centroids, VarPro constructs an enhanced learning dataset specific to the release variable.
You can then apply standard visualization tools (e.g., PCA, t-SNE, UMAP) to this enhanced dataset to explore complex relationships and structures in the data.
if (library("MASS", logical.return=TRUE)) {
fourcsim <- function(n=500, sigma=2) {
cl1 <- mvrnorm(n,c(0,4),cbind(c(1,0),c(0,sigma)))
cl2 <- mvrnorm(n,c(4,0),cbind(c(1,0),c(0,sigma)))
cl3 <- mvrnorm(n,c(0,-4),cbind(c(1,0),c(0,sigma)))
cl4 <- mvrnorm(n,c(-4,0),cbind(c(1,0),c(0,sigma)))
dta <- data.frame(rbind(cl1,cl2,cl3,cl4))
colnames(dta) <- c("x","y")
data.frame(dta, noise=matrix(rnorm((n*4)*20),ncol=20))
}
d4c <- fourcsim()
o4c <- clusterpro(d4c)
par(mfrow=c(2,2));plot(o4c,1:4)
}
data(Glass, package = "mlbench")
dg <- Glass
## with class label
og <- clusterpro(dg)
par(mfrow=c(3,3));plot(og,1:16)
## without class label
dgU <- Glass; dgU$Type <- NULL
ogU <- clusterpro(dgU)
par(mfrow=c(3,3));plot(ogU,1:9)
Cite this vignette as
M. Lu, L. Zhou, A.
Shear, U. B. Kogalur, and H. Ishwaran. 2024. “varPro: variable selection
for unsupervised problems vignette.” http://www.varprotools.org/articles/unsupervise.html.
@misc{HemantInstall,
author = "Min Lu and Lili Zhou and Aster Shear and Udaya B. Kogalur and Hemant Ishwaran",
title = {{varPro}: variable selection for unsupervised problems vignette},
year = {2024},
url = {http://www.varprotools.org/articles/unsupervise.html},
howpublished = "\url{http://www.varprotools.org/articles/unsupervise.html}",
note = "[accessed date]"
}