Variable selection for unsupervised problems with unsupv

A quick start

data(Glass, package = "mlbench")

## remove the outcome
Glass$Type <- NULL

## get importance
o <- unsupv(Glass)
print(importance(o))

## compare to PCA
biplot(prcomp(o$x, scale. = TRUE))

Forest Methods in unsupv

VarPro computes variable importance by analyzing the branches of a random forest. You can choose from three types of forests using the method option:

  • method = "auto" (default): Runs a random forest autoencoder, fitting selected variables against themselves. This is a special form of multivariate forest that offers powerful structure learning but may be slow on very large datasets.

  • method = "unsupv": Runs unsupervised forests, a faster alternative for high-dimensional or unlabeled data.

  • method = "rnd": Uses random forests with pure random splitting, offering speed and simplicity, especially on massive datasets.

For large-scale data, consider using “unsupv” or “rnd” methods to improve runtime.
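As a sketch, the three methods can be run side by side on the same data (reusing the Glass data from the quick start, with the outcome removed) to compare their importance rankings:

```r
## load the data and drop the outcome
data(Glass, package = "mlbench")
Glass$Type <- NULL

## autoencoder (default): most thorough, slowest on big data
o.auto <- unsupv(Glass, method = "auto")

## unsupervised splitting: faster for high-dimensional data
o.unsupv <- unsupv(Glass, method = "unsupv")

## pure random splitting: fastest, suited to massive datasets
o.rnd <- unsupv(Glass, method = "rnd")

## compare the resulting importance rankings
print(importance(o.auto))
print(importance(o.unsupv))
print(importance(o.rnd))
```

The rankings typically agree on the strongest variables; the faster methods trade some refinement for runtime.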

Default call

data(BostonHousing, package = "mlbench")

## default call
o <- unsupv(BostonHousing)
print(importance(o))

The “unsupv” method

data(BostonHousing, package = "mlbench")

## unsupervised splitting 
o <- unsupv(BostonHousing, method = "unsupv")
print(importance(o))

Hot-encoding

## load the data
data(BostonHousing, package = "mlbench")

## convert some of the features to factors
Boston <- BostonHousing
Boston$zn <- factor(Boston$zn)
Boston$chas <- factor(Boston$chas)
Boston$lstat <- factor(round(0.2 * Boston$lstat))
Boston$nox <- factor(round(20 * Boston$nox))
Boston$rm <- factor(round(Boston$rm))

## call unsupervised varpro and print importance
print(importance(o <- unsupv(Boston)))

## get top variables
get.topvars(o)

## map importance values back to original features
print(get.orgvimp(o))

## same as above ... but for all variables
print(get.orgvimp(o, pretty = FALSE))

Measuring Importance

In the above examples, VarPro calculates variable importance using an entropy-based measure, defined in terms of overall feature variance. This quantifies how much information each variable contributes to the structure learned by the forest. In addition, a p × p matrix is provided in which the ith row holds the ensembled regression coefficients representing the importance scores for the ith variable. The p column sums of this p × p matrix provide an overall measure of variable importance.

## extract the p x p matrix of entropy regression coefficients
o <- unsupv(Glass)
oo <- varPro:::get.beta.entropy(o)
print(oo)

          RI        Na       Mg       Al         Si         K        Ca         Ba
RI   0.00000  1.865670 1.150622 1.922131  48.524728 0.2377953  8.467378 0.01350490
Na 309.65613  0.000000 6.980719 3.503467 239.313206 3.4321527 23.340351 0.21437366
Mg  99.02674 12.339392 0.000000 2.826864  82.989191 0.8368602 14.319128 0.59565304
Al 249.70613  7.985651 3.324894 0.000000  47.251765 0.6026326  7.367585 0.21102410
Si 626.25351 46.346045 8.631093 3.709279   0.000000 1.8263418 20.484964 0.11105199
K  202.12093 25.650772 2.875092 2.453549  48.153254 0.0000000  6.393179 0.08504921
Ca 711.40803 15.910878 6.379171 2.630699  68.109970 1.1812897  0.000000 0.17287874
Ba  51.52017  4.681302 1.341591 1.432208   6.755787 0.2048431  1.491477 0.00000000
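As described above, the column sums of this matrix summarize overall variable importance. A short sketch, continuing from the object `oo` computed above:

```r
## overall importance: column sums of the entropy coefficient matrix
beta.importance <- colSums(oo)

## rank variables from most to least important
print(sort(beta.importance, decreasing = TRUE))
```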

ClusterPro for Unsupervised Data Visualization

VarPro includes a powerful tool for unsupervised data visualization.

For each VarPro rule, a two-class analysis is performed by comparing the rule-defined region to its complementary region. This analysis produces regression coefficients that highlight variables most associated with the release variable.

These coefficients are then used to scale the centroids of the two regions. By saving all such pairs of scaled centroids, VarPro constructs an enhanced learning dataset specific to the release variable.

You can then apply standard visualization tools (e.g., PCA, t-SNE, UMAP) to this enhanced dataset to explore complex relationships and structures in the data.
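As a generic sketch of this last step, standard tools such as prcomp can be applied to the enhanced dataset. Here `Z` is a stand-in matrix for illustration only; substitute the actual enhanced data extracted from the clusterpro object:

```r
## Z: placeholder for the enhanced learning dataset built by clusterpro
## (random stand-in data used here purely for illustration)
Z <- matrix(rnorm(200 * 5), ncol = 5)

## principal components of the enhanced data
pc <- prcomp(Z, scale. = TRUE)

## plot the first two components to explore structure
plot(pc$x[, 1:2], xlab = "PC1", ylab = "PC2")
```

t-SNE or UMAP implementations can be substituted for prcomp in the same way.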

Example of 4-cluster simulation

if (library("MASS", logical.return = TRUE)) {

  ## simulate four Gaussian clusters in (x, y) plus 20 pure-noise features
  fourcsim <- function(n = 500, sigma = 2) {
    cl1 <- mvrnorm(n, c(0, 4), cbind(c(1, 0), c(0, sigma)))
    cl2 <- mvrnorm(n, c(4, 0), cbind(c(1, 0), c(0, sigma)))
    cl3 <- mvrnorm(n, c(0, -4), cbind(c(1, 0), c(0, sigma)))
    cl4 <- mvrnorm(n, c(-4, 0), cbind(c(1, 0), c(0, sigma)))
    dta <- data.frame(rbind(cl1, cl2, cl3, cl4))
    colnames(dta) <- c("x", "y")
    data.frame(dta, noise = matrix(rnorm((n * 4) * 20), ncol = 20))
  }

  ## run clusterpro and plot the first four release variables
  d4c <- fourcsim()
  o4c <- clusterpro(d4c)
  par(mfrow = c(2, 2)); plot(o4c, 1:4)

}

Example of real data

data(Glass, package = "mlbench")
dg <- Glass

## with class label
og <- clusterpro(dg)
par(mfrow = c(3, 3)); plot(og, 1:16)

## without class label
dgU <- Glass; dgU$Type <- NULL
ogU <- clusterpro(dgU)
par(mfrow = c(3, 3)); plot(ogU, 1:9)
Cite this vignette as
M. Lu, L. Zhou, A. Shear, U. B. Kogalur, and H. Ishwaran. 2024. “varPro: variable selection for unsupervised problems vignette.” http://www.varprotools.org/articles/unsupervise.html.

@misc{LuUnsup,
    author = {Min Lu and Lili Zhou and Aster Shear and Udaya B. Kogalur and Hemant Ishwaran},
    title = {{varPro}: variable selection for unsupervised problems vignette},
    year = {2024},
    howpublished = {\url{http://www.varprotools.org/articles/unsupervise.html}},
    note = {[accessed date]}
}