Title: | Hierarchical Shrinkage Stan Models for Biomarker Selection |
---|---|
Description: | Linear and logistic regression models penalized with hierarchical shrinkage priors for selection of biomarkers (or more general variable selection), which can be fitted using Stan (Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>). It implements the horseshoe and regularized horseshoe priors (Piironen and Vehtari (2017) <doi:10.1214/17-EJS1337SI>), as well as the projection predictive selection approach to recover a sparse set of predictive biomarkers (Piironen, Paasiniemi and Vehtari (2020) <doi:10.1214/20-EJS1711>). |
Authors: | Marco Colombo [aut, cre] , Paul McKeigue [aut] , Athina Spiliopoulou [ctb] |
Maintainer: | Marco Colombo <[email protected]> |
License: | GPL-3 |
Version: | 0.8.2.9000 |
Built: | 2024-10-26 04:57:03 UTC |
Source: | https://github.com/mcol/hsstan |
The hsstan package provides linear and logistic regression models penalized with hierarchical shrinkage priors for selection of biomarkers. Models are fitted with Stan (Carpenter et al. (2017)), which allows to perform full Bayesian inference.
The package implements the horseshoe and regularized horseshoe priors (Piironen and Vehtari (2017)), and the projection predictive selection approach to recover a sparse set of predictive biomarkers (Piironen, Paasiniemi and Vehtari (2020)).
The approach is particularly suited to selection from high-dimensional panels of biomarkers, such as those that can be measured by MSMS or similar technologies (Colombo, Valo, McGurnaghan et al. (2019), Colombo, McGurnaghan, Blackbourn et al. (2020)).
B. Carpenter et al. (2017), Stan: a probabilistic programming language, Journal of Statistical Software, 76 (1). doi:10.18637/jss.v076.i01
J. Piironen and A. Vehtari (2017), Sparsity information and regularization in the horseshoe and other shrinkage priors, Electronic Journal of Statistics, 11 (2), 5018-5051. doi:10.1214/17-EJS1337SI
J. Piironen, M. Paasiniemi and A. Vehtari (2020), Projective inference in high-dimensional problems: prediction and feature selection, Electronic Journal of Statistics, 14 (1), 2155-2197. doi:10.1214/20-EJS1711
M. Colombo, E. Valo, S.J. McGurnaghan et al. (2019), Biomarkers associated with progression of renal disease in type 1 diabetes, Diabetologia, 62 (9), 1616-1627. doi:10.1007/s00125-019-4915-0
M. Colombo, S.J. McGurnaghan, L.A.K. Blackbourn et al. (2020), Comparison of serum and urinary biomarker panels with albumin creatinin ratio in the prediction of renal function decline in type 1 diabetes, Diabetologia, 63 (4), 788-798. doi:10.1007/s00125-019-05081-8
M. Colombo, A. Asadi Shehni, I. Thoma et al. (2021), Quantitative levels of serum N-glycans in type 1 diabetes and their association with kidney disease, Glycobiology, 31 (5), 613-623. doi:10.1093/glycob/cwaa106
Compute the Bayesian and the LOO-adjusted R-squared from the posterior samples. For Bayesian R-squared it uses the modelled residual variance (rather than the variance of the posterior distribution of the residuals). The LOO-adjusted R-squared uses Pareto smoothed importance sampling LOO residuals and Bayesian bootstrap.
## S3 method for class 'hsstan' bayes_R2(object, prob = 0.95, summary = TRUE, ...) ## S3 method for class 'hsstan' loo_R2(object, prob = 0.95, summary = TRUE, ...)
## S3 method for class 'hsstan' bayes_R2(object, prob = 0.95, summary = TRUE, ...) ## S3 method for class 'hsstan' loo_R2(object, prob = 0.95, summary = TRUE, ...)
object |
An object of class |
prob |
Width of the posterior interval (0.95, by default). It is
ignored if |
summary |
Whether a summary of the distribution of the R-squared
should be returned rather than the pointwise values ( |
... |
Currently ignored. |
The mean, standard deviation and posterior interval of R-squared if
summary=TRUE
, or a vector of R-squared values with length equal to
the size of the posterior sample if summary=FALSE
.
Andrew Gelman, Ben Goodrich, Jonah Gabry and Aki Vehtari (2019), R-squared for Bayesian regression models, The American Statistician, 73 (3), 307-309. doi:10.1080/00031305.2018.1549100
Aki Vehtari, Andrew Gelman, Ben Goodrich and Jonah Gabry (2019), Bayesian R2 and LOO-R2. https://avehtari.github.io/bayes_R2/bayes_R2.html
# continued from ?hsstan bayes_R2(hs.biom) loo_R2(hs.biom)
# continued from ?hsstan bayes_R2(hs.biom) loo_R2(hs.biom)
The dataset consists of observations on 442 individuals for which a
quantitative measure of diabetes progression is recorded in variable Y
.
Predictors include 10 baseline measurements, 45 interactions and 9 quadratic
terms, for a total of 64 variables for each individual. Each variable has
been standardized by subtracting the mean and then dividing it by its
standard deviation.
A data frame with 442 rows and 65 columns (centred and scaled).
B. Efron, T. Hastie, I. Johnstone and R. Tibshirani (2004), Least angle regression, The Annals of Statistics, 32 (2), 407-499. doi:10.1214/009053604000000067
The original dataset is available from https://web.stanford.edu/~hastie/Papers/LARS/data64.txt
data(diabetes, package="hsstan")
data(diabetes, package="hsstan")
Run the No-U-Turn Sampler (NUTS) as implemented in Stan to fit a hierarchical shrinkage model.
hsstan( x, covs.model, penalized = NULL, family = gaussian, iter = 2000, warmup = floor(iter/2), scale.u = 2, regularized = TRUE, nu = ifelse(regularized, 1, 3), par.ratio = 0.05, global.df = 1, slab.scale = 2, slab.df = 4, qr = TRUE, seed = 123, adapt.delta = NULL, keep.hs.pars = FALSE, ... )
hsstan( x, covs.model, penalized = NULL, family = gaussian, iter = 2000, warmup = floor(iter/2), scale.u = 2, regularized = TRUE, nu = ifelse(regularized, 1, 3), par.ratio = 0.05, global.df = 1, slab.scale = 2, slab.df = 4, qr = TRUE, seed = 123, adapt.delta = NULL, keep.hs.pars = FALSE, ... )
x |
Data frame containing outcome, covariates and penalized predictors. Continuous predictors and outcome variable should be standardized before fitting the models as priors assume them to have mean zero and unit variance. |
covs.model |
Formula containing the unpenalized covariates. |
penalized |
Names of the variables to be used as penalized predictors.
Any variable that is already part of the |
family |
Type of model fitted: either |
iter |
Total number of iterations in each chain, including warmup (2000 by default). |
warmup |
Number of warmup iterations per chain (by default, half the total number of iterations). |
scale.u |
Prior scale (standard deviation) for the unpenalized covariates. |
regularized |
If |
nu |
Number of degrees of freedom of the half-Student-t prior on the
local shrinkage parameters (by default, 1 if |
par.ratio |
Expected ratio of non-zero to zero coefficients (ignored
if |
global.df |
Number of degrees of freedom for the global shrinkage
parameter (ignored if |
slab.scale |
Scale of the regularization parameter (ignored if
|
slab.df |
Number of degrees of freedom of the regularization parameter
(ignored if |
qr |
Whether the thin QR decomposition should be used to decorrelate the
predictors ( |
seed |
Optional integer defining the seed for the pseudo-random number generator. |
adapt.delta |
Target average proposal acceptance probability for adaptation, a value between 0.8 and 1 (excluded). If unspecified, it's set to 0.99 for hierarchical shrinkage models and to 0.95 for base models. |
keep.hs.pars |
Whether the parameters for the horseshoe prior should be
kept in the |
... |
Further arguments passed to |
An object of class hsstan
containing the following fields:
stanfit |
an object of class |
betas |
posterior means of the unpenalized and penalized regression parameters. |
call |
the matched call. |
data |
the dataset used in fitting the model. |
model.terms |
a list of names for the outcome variable, the unpenalized covariates and the penalized predictors. |
family |
the |
hsstan.settings |
the optional settings used in the model. |
kfold()
for cross-validating a fitted object.
data(diabetes) # non-default settings for speed of the example df <- diabetes[1:50, ] hs.biom <- hsstan(df, Y ~ age + sex, penalized=colnames(df)[5:10], chains=2, iter=250)
data(diabetes) # non-default settings for speed of the example df <- diabetes[1:50, ] hs.biom <- hsstan(df, Y ~ age + sex, penalized=colnames(df)[5:10], chains=2, iter=250)
Perform K-fold cross-validation using the same settings used when fitting the model on the whole data.
## S3 method for class 'hsstan' kfold( x, folds, chains = 1, store.fits = TRUE, cores = getOption("mc.cores", 1), ... )
## S3 method for class 'hsstan' kfold( x, folds, chains = 1, store.fits = TRUE, cores = getOption("mc.cores", 1), ... )
x |
An object of class |
folds |
Integer vector with one element per observation indicating the cross-validation fold in which the observation should be withdrawn. |
chains |
Number of Markov chains to run. By default this is set to 1,
independently of the number of chains used for |
store.fits |
Whether the fitted models for each fold should be stored
in the returned object ( |
cores |
Number of cores to use for parallelization (the value of
|
... |
Further arguments passed to |
An object with classes kfold
and loo
that has a similar structure as the
objects returned by loo()
and waic()
and is compatible with the
loo_compare
function for
comparing models. The object contains the following fields:
estimates |
a matrix containing point estimates and standard errors of
the expected log pointwise predictive density ("elpd_kfold"),
the effective number of parameters ("p_kfold", always |
pointwise |
a matrix containing the pointwise contributions of "elpd_kfold", "p_kfold" and "kfoldic". |
fits |
a matrix with two columns and number of rows equal to the number
of cross-validation folds. Column |
data |
the dataset used in fitting the model (before withdrawing
observations). This is not present if |
# continued from ?hsstan # only 2 folds for speed of example folds <- rep(1:2, length.out=length(df$Y)) cv.biom <- kfold(hs.biom, folds=folds, cores=2)
# continued from ?hsstan # only 2 folds for speed of example folds <- rep(1:2, length.out=length(df$Y)) cv.biom <- kfold(hs.biom, folds=folds, cores=2)
Compute the pointwise log-likelihood.
## S3 method for class 'hsstan' log_lik(object, newdata = NULL, ...)
## S3 method for class 'hsstan' log_lik(object, newdata = NULL, ...)
object |
An object of class |
newdata |
Optional data frame to use to evaluate the log-likelihood.
If |
... |
Currently ignored. |
A matrix of size S
by N
, where S
is number of draws from the posterior
distribution, and N
is the number of data points.
# continued from ?hsstan log_lik(hs.biom)
# continued from ?hsstan log_lik(hs.biom)
Compute an efficient approximate leave-one-out cross-validation using Pareto smoothed importance sampling (PSIS-LOO), or the widely applicable information criterion (WAIC), also known as the Watanabe-Akaike information criterion.
## S3 method for class 'hsstan' loo(x, cores = getOption("mc.cores"), ...) ## S3 method for class 'hsstan' waic(x, cores = getOption("mc.cores"), ...)
## S3 method for class 'hsstan' loo(x, cores = getOption("mc.cores"), ...) ## S3 method for class 'hsstan' waic(x, cores = getOption("mc.cores"), ...)
x |
An object of class |
cores |
Number of cores used for parallelisation (the value of
|
... |
Currently ignored. |
A loo
object.
# continued from ?hsstan loo(hs.biom) waic(hs.biom)
# continued from ?hsstan loo(hs.biom) waic(hs.biom)
Extracts the number of posterior samples stored in a fitted model.
## S3 method for class 'hsstan' nsamples(object, ...)
## S3 method for class 'hsstan' nsamples(object, ...)
object |
An object of class |
... |
Currently ignored. |
The total number of posterior samples across the chains after discarding the warmup iterations.
# continued from ?hsstan nsamples(hs.biom)
# continued from ?hsstan nsamples(hs.biom)
The function plots the relative explanatory power of each predictor in order of selection. The relative explanatory power of predictors is computed according to the KL divergence from the full model to each submodel, scaled in such a way that the baseline model (either the null model or the model containing only unpenalized covariates) is at 0, while the full model is at 1.
## S3 method for class 'projsel' plot( x, title = NULL, max.points = NULL, max.labels = NULL, from.covariates = TRUE, font.size = 12, hadj = 0.05, vadj = 0, ... )
## S3 method for class 'projsel' plot( x, title = NULL, max.points = NULL, max.labels = NULL, from.covariates = TRUE, font.size = 12, hadj = 0.05, vadj = 0, ... )
x |
A data frame created by |
title |
Title of the plot. If |
max.points |
Maximum number of predictors to be plotted. If |
max.labels |
Maximum number of predictors to be labelled. If |
from.covariates |
Whether the plotting should start from the unpenalized
covariates ( |
font.size |
Size of the textual elements (labels and axes). |
hadj , vadj
|
Horizontal and vertical adjustment for the labels. |
... |
Currently ignored. |
A ggplot2 object showing the relative incremental contribution of each predictor starting from the initial set of unpenalized covariates.
Compute posterior uncertainty intervals for hsstan
objects.
## S3 method for class 'hsstan' posterior_interval(object, pars = NULL, prob = 0.95, ...)
## S3 method for class 'hsstan' posterior_interval(object, pars = NULL, prob = 0.95, ...)
object |
An object of class |
pars |
Names of parameters for which posterior intervals should be
returned, which can be specified as regular expressions. If |
prob |
A value between 0 and 1 indicating the desired probability to be covered by the uncertainty intervals (0.95, by default). |
... |
Currently ignored. |
A matrix with lower and upper interval bounds as columns and as many rows as selected parameters.
# continued from ?hsstan posterior_interval(hs.biom)
# continued from ?hsstan posterior_interval(hs.biom)
Extract the posterior draws of the linear predictor, possibly transformed by the inverse-link function.
## S3 method for class 'hsstan' posterior_linpred(object, transform = FALSE, newdata = NULL, ...)
## S3 method for class 'hsstan' posterior_linpred(object, transform = FALSE, newdata = NULL, ...)
object |
An object of class |
transform |
Whether the linear predictor should be transformed using
the inverse-link function ( |
newdata |
Optional data frame containing the variables to use to
predict. If |
... |
Currently ignored. |
A matrix of size S
by N
, where S
is the number of draws from the
posterior distribution of the (transformed) linear predictor, and N
is
the number of data points.
# continued from ?hsstan posterior_linpred(hs.biom)
# continued from ?hsstan posterior_linpred(hs.biom)
Compute the log-likelihood and a relevant measure of performance (R-squared or AUC) from the posterior samples.
posterior_performance( obj, prob = 0.95, sub.idx = NULL, summary = TRUE, cores = getOption("mc.cores", 1) )
posterior_performance( obj, prob = 0.95, sub.idx = NULL, summary = TRUE, cores = getOption("mc.cores", 1) )
obj |
An object of class |
prob |
Width of the posterior interval (0.95, by default). It is
ignored if |
sub.idx |
Vector of indices of observations in the dataset to be used
in computing the performance measures. If |
summary |
Whether a summary of the distribution of the performance
measure should be returned rather than the pointwise values
( |
cores |
Number of cores to use for parallelization (the value of
|
The mean, standard deviation and posterior interval of the performance
measure (R-squared or AUC) if summary=TRUE
, or a vector of values
of the performance measure with length equal to the size of the posterior
sample if summary=FALSE
. Attribute type
reports whether the performance
measures are cross-validated or not. If sub.idx
is not NULL
, attribute
subset
reports the index of observations used in the computations.
# continued from ?hsstan posterior_performance(hs.biom, cores=1)
# continued from ?hsstan posterior_performance(hs.biom, cores=1)
Draw from the posterior predictive distribution of the outcome.
## S3 method for class 'hsstan' posterior_predict(object, newdata = NULL, nsamples = NULL, seed = NULL, ...)
## S3 method for class 'hsstan' posterior_predict(object, newdata = NULL, nsamples = NULL, seed = NULL, ...)
object |
An object of class |
newdata |
Optional data frame containing the variables to use to
predict. If |
nsamples |
A positive integer indicating the number of posterior samples
to use. If |
seed |
Optional integer defining the seed for the pseudo-random number generator. |
... |
Currently ignored. |
A matrix of size S
by N
, where S
is the number of simulations from
the posterior predictive distribution, and N
is the number of data points.
# continued from ?hsstan posterior_predict(hs.biom)
# continued from ?hsstan posterior_predict(hs.biom)
Produce a summary of the posterior samples for the quantities of interest.
posterior_summary(x, prob, ...) ## Default S3 method: posterior_summary(x, prob = 0.95, ...) ## S3 method for class 'hsstan' posterior_summary(x, prob = 0.95, pars = NULL, ...)
posterior_summary(x, prob, ...) ## Default S3 method: posterior_summary(x, prob = 0.95, ...) ## S3 method for class 'hsstan' posterior_summary(x, prob = 0.95, pars = NULL, ...)
x |
An object containing or representing posterior samples. If |
prob |
Width of the posterior intervals (0.95, by default). |
... |
Further arguments passed to or from other methods. |
pars |
Vector of parameter names to be extracted. If |
A matrix with columns containing mean, standard deviation and posterior intervals for the given quantities.
summary()
to produce summaries of hsstan
objects that include the number
of effective samples and the split-Rhat diagnostic.
# continued from ?hsstan posterior_summary(hs.biom)
# continued from ?hsstan posterior_summary(hs.biom)
Print a summary for the fitted model
## S3 method for class 'hsstan' print(x, ...)
## S3 method for class 'hsstan' print(x, ...)
x |
An object of class |
... |
Further arguments to |
Forward selection minimizing KL-divergence in projection
projsel(obj, max.iters = 30, start.from = NULL, out.csv = NULL)
projsel(obj, max.iters = 30, start.from = NULL, out.csv = NULL)
obj |
Object of class |
max.iters |
Maximum number of iterations (number of predictors selected) after which the selection procedure should stop. |
start.from |
Vector of variable names to be used in the starting
submodel. If |
out.csv |
If not |
A data frame of class projsel
where each row corresponds to a
forward-selected submodel that contains all variables listed up to that row.
Attribute start.from
reports the predictors in the initial model.
The data frame contains the following columns:
var |
names of the variables selected. |
kl |
KL-divergence from the full model to the submodel. |
rel.kl.null |
relative explanatory power of predictors starting from the intercept-only model. |
rel.kl |
relative explanatory power of predictors starting from the initial submodel. |
elpd |
the expected log predictive density of the submodels. |
delta.elpd |
the difference in elpd from the full model. |
# continued from ?hsstan sel <- projsel(hs.biom, max.iters=3) plot(sel)
# continued from ?hsstan sel <- projsel(hs.biom, max.iters=3) plot(sel)
Report statistics on the parameters used in the sampler, the sampler behaviour and the sampling time.
sampler.stats(object)
sampler.stats(object)
object |
An object of class |
A matrix with C + 1
rows, where C
is the number of Markov
chains, reporting average acceptance probability, average stepsize, number
of divergent transitions, maximum tree depth, total number of gradient
evaluations, warmup and sampling times in seconds.
# continued from ?hsstan sampler.stats(hs.biom)
# continued from ?hsstan sampler.stats(hs.biom)
Summary for the fitted model
## S3 method for class 'hsstan' summary( object, pars = NULL, prob = 0.95, digits = 2, sort = NULL, decreasing = TRUE, max.rows = NULL, ... )
## S3 method for class 'hsstan' summary( object, pars = NULL, prob = 0.95, digits = 2, sort = NULL, decreasing = TRUE, max.rows = NULL, ... )
object |
An object of class |
pars |
Vector of parameter names to be extracted. If |
prob |
Width of the posterior intervals (0.95, by default). |
digits |
Number of decimal places to be reported (2 by default). |
sort |
Column name used to sort the results according to the absolute
value of the column. If |
decreasing |
Whether the results should be sorted in decreasing order
when a valid column name is specified in |
max.rows |
Maximum number of rows to be returned. If |
... |
Currently ignored. |
A matrix with summaries from the posterior distribution of the parameters of interest.
# continued from ?hsstan summary(hs.biom)
# continued from ?hsstan summary(hs.biom)