Sparse Bayesian modelling for categorical predictors
Disciplines
Other Social Sciences (60%); Computer Sciences (40%)
Keywords
Sparse Bayesian Modelling,
Categorical Covariate,
Variable Selection,
MCMC,
Regression Model,
Effect Fusion
Sparse modelling and variable selection are among the most important issues in regression-type models, because applications often provide a large number of covariates on comparatively few subjects. Estimation of regression effects in such "large p, small n" problems is ill-conditioned: estimated regression effects typically have large standard errors, estimation results are unstable and fitted models have poor predictive performance. Many methods to regularize inference and perform variable selection have been developed which are adequate for metric covariates, whereas the special issues arising for categorical covariates have not received much attention so far. The goal of this project is to develop and investigate Bayesian methods which allow sparse modelling of the effect of categorical covariates in regression-type models. Particularly in medical and socio-economic applications, categorical covariates measured on an ordinal or nominal scale, e.g. gender, educational level or health status, are commonly used. For a metric covariate, where the (linear) effect of the covariate is captured by only one regression coefficient, sparsity is achieved by restricting this coefficient to zero. In contrast, the effect of a categorical covariate with k+1 categories is modelled by a set of k regression coefficients, one for each covariate level except the baseline category. Sparsity can therefore be achieved not only by removing the variable from the model, i.e. restricting all k regression coefficients to zero, but whenever the effect of the categorical predictor can be represented by fewer than k regression effects. This is the case when only some of the coefficients are zero or when the effect of two or more categories on the response is the same: a sparse representation would then use a joint effect for these categories.
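The coding described above can be illustrated with a small sketch. It builds the k indicator columns for a covariate with k+1 levels and shows how fusing two levels with the same effect reduces the number of coefficients to estimate. The function names, the covariate and its levels are hypothetical and chosen only for illustration; they are not taken from the project's software.

```python
import numpy as np

def dummy_code(levels, categories):
    """Return an n x k indicator matrix; categories[0] is the baseline."""
    levels = np.asarray(levels)
    return np.column_stack(
        [(levels == c).astype(float) for c in categories[1:]]
    )

def fuse_levels(levels, groups):
    """Relabel levels so that all levels in one group share the first label."""
    mapping = {lvl: grp[0] for grp in groups for lvl in grp}
    return [mapping[lvl] for lvl in levels]

# Hypothetical covariate with k + 1 = 4 levels observed on 6 subjects.
education = ["primary", "secondary", "tertiary",
             "secondary", "doctoral", "primary"]
categories = ["primary", "secondary", "tertiary", "doctoral"]

# Full model: k = 3 regression coefficients (baseline "primary").
X_full = dummy_code(education, categories)

# Assumed fusion: "tertiary" and "doctoral" share one joint effect.
fused = fuse_levels(
    education, [["primary"], ["secondary"], ["tertiary", "doctoral"]]
)
X_fused = dummy_code(fused, ["primary", "secondary", "tertiary"])

print(X_full.shape, X_fused.shape)
```

The fused design matrix has one column fewer, and its joint column is the sum of the two indicator columns it replaces.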
Performing effect fusion during estimation would be attractive, as the question of which levels of the categorical covariate have essentially the same effect could be answered directly from the estimation output. In a Bayesian approach sparsity can be introduced by choosing appropriate prior distributions, e.g. spike and slab prior distributions, which are very popular for Bayesian variable selection. We will extend these prior distributions to also allow selection of a group of regressors and effect fusion. Different variants of prior distributions will be proposed, in particular for nominal covariates. MCMC methods for Bayesian inference will be developed and compared to frequentist methods in an extensive simulation study. The methods developed within this project will be implemented in the free software Bayes-X (http://www.stat.uni-muenchen.de/~bayesx/), which is the most advanced software package for Bayesian inference in structured additive regression models. The usefulness of the new methods for applied statistics will be demonstrated by employing them in the analysis of data from EU-SILC and PISA.
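As a rough illustration of how a spike and slab prior can encourage fusion, the sketch below places a two-component normal mixture on the difference between two level effects: a narrow "spike" around zero (effects practically equal) and a wide "slab" (effects differ). This is a generic textbook form, not necessarily the prior developed in the project; the function names, standard deviations and prior weight are assumptions made for the example.

```python
import math

def normal_pdf(x, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def slab_probability(delta, spike_sd=0.05, slab_sd=2.0, prior_slab=0.5):
    """Conditional probability that an effect difference delta belongs to
    the slab under the assumed two-component normal mixture prior."""
    slab = prior_slab * normal_pdf(delta, slab_sd)
    spike = (1.0 - prior_slab) * normal_pdf(delta, spike_sd)
    return slab / (slab + spike)

# A difference near zero is attributed to the spike (candidate for fusion),
# a clearly non-zero difference to the slab (levels kept separate).
print(slab_probability(0.0), slab_probability(1.0))
```

In an MCMC sampler, a probability of this kind would typically be used to draw a binary indicator for each pair of levels in every sweep.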
In this project we developed Bayesian methods which allow a sparse representation of the effect of covariates in regression-type models. Particularly in medical or socio-economic applications, a researcher is often interested in modelling the dependence of a response variable on categorical covariates which are measured either on an ordinal or on a nominal scale. Examples of such categorical covariates are gender, educational level or health status. Typically, a regression analysis of the variable of interest has to rely on data from relatively few subjects on many potential covariates. If a categorical variable is used as a covariate in a regression-type model, one regression effect has to be estimated for each category (except the reference category). Thus in models with categorical predictors the estimation problem is often ill-posed, resulting in estimates with high standard errors, instability of estimation results and poor predictive properties of the fitted model. A sparser model with fewer effects to be estimated can be achieved by excluding irrelevant variables or irrelevant categories from the model, but also by fusing categories which have essentially the same effect on the response. In this project a Bayesian approach was adopted to achieve sparsity of a regression model with respect to the number of regression effects that have to be estimated. To achieve this goal, two different prior distributions were developed which encourage fusion of effects that are essentially the same. The first prior extends the spike and slab approach, which has proved useful for variable selection, to effect fusion. The second prior relies on approaches to model-based clustering based on finite mixtures and extends these to clustering of regression effects. MCMC methods were developed to fit regression models under these priors. The MCMC output can be used either to compute model-averaged estimates or to perform model selection, i.e. to determine which level effects are practically zero and can be completely removed from the model and which are practically identical and hence can be fused. For the predominant regression models in practical applications, linear and logistic regression, the methods are implemented in the R package effectFusion and are thus freely available to interested statisticians and applied researchers.
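The two uses of the MCMC output mentioned above can be sketched as follows. The example assumes, purely for illustration, that a sampler has stored for each draw the level effects together with binary indicators for "effect is zero" and "pair of levels is fused". This storage layout, the function name and the toy draws are hypothetical and do not reflect the interface of the effectFusion package.

```python
import numpy as np

def summarize_output(beta_draws, zero_draws, fused_draws, threshold=0.5):
    """Summarize hypothetical MCMC output.

    beta_draws  : (n_draws, k) sampled level effects
    zero_draws  : (n_draws, k) 0/1 indicators, 1 = effect set to zero
    fused_draws : (n_draws, n_pairs) 0/1 indicators, 1 = pair fused
    """
    # Model averaging: average the effects over all sampled models.
    model_averaged = np.mean(beta_draws, axis=0)
    # Model selection: keep restrictions with posterior probability > threshold.
    zero_prob = np.mean(zero_draws, axis=0)
    fused_prob = np.mean(fused_draws, axis=0)
    return model_averaged, zero_prob > threshold, fused_prob > threshold

# Toy draws for k = 3 level effects; pairs are ordered (0,1), (0,2), (1,2).
beta_draws = np.array([[0.0, 0.8, 0.8],
                       [0.1, 0.9, 0.9],
                       [0.0, 0.7, 1.1],
                       [-0.1, 0.8, 0.8]])
zero_draws = np.array([[1, 0, 0], [0, 0, 0], [1, 0, 0], [1, 0, 0]])
fused_draws = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 1]])

averaged, removed, fused = summarize_output(beta_draws, zero_draws, fused_draws)
print(averaged, removed, fused)
```

With these toy draws the first level effect would be removed and the second and third levels fused, since each restriction holds in three of the four draws.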
- Universität Linz - 100%
Research Output
- 14 Citations
- 2 Publications
2018
Title: Effect fusion using model-based clustering; DOI: 10.1177/1471082x17739058; Type: Journal Article; Author: Malsiner-Walli G; Journal: Statistical Modelling; Pages: 175-196
2016
Title: Discussion: Bayesian regularization and effect smoothing for categorical predictors; DOI: 10.1177/1471082x16642655; Type: Journal Article; Author: Wagner H; Journal: Statistical Modelling; Pages: 220-227