Scientific article
English

Fast Robust Model Selection in Large Datasets

Published in: Journal of the American Statistical Association, vol. 106, p. 203-212
Publication date: 2011
Abstract

Large datasets are increasingly common in many research fields. In particular, in the linear regression context, it is often the case that a huge number of potential covariates are available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done in a forward selection procedure that comprises three steps: selecting the variable to enter, deciding whether to retain it or stop the selection, and estimating the augmented model. Least squares plus t-tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this paper, we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Since simply replacing the classical statistical criteria by robust ones is not computationally feasible, we develop simplified robust estimators, selection criteria and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t-statistic that we compare to a false discovery rate adjusted level. We carry out a simulation study to show the good performance of our approach. We also analyze two datasets and show that the results obtained by our method outperform those from robust LARS and random forests.
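
The two ingredients named in the abstract, a one-step weighted estimate and a false discovery rate adjusted level, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes Huber weights with a MAD-based scale, a single covariate with no intercept, and the standard Benjamini-Hochberg step-up rule, none of which is specified in the abstract. The function names (huber_weights, one_step_slope, bh_level) and the toy data are hypothetical.

```python
from statistics import median


def huber_weights(residuals, c=1.345):
    """Huber weights min(1, c*s/|r|), with s a MAD-based scale (assumed choice)."""
    med = median(residuals)
    scale = 1.4826 * median([abs(r - med) for r in residuals])
    if scale == 0:
        # Degenerate scale: fall back to full weight for every observation.
        return [1.0] * len(residuals)
    return [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in residuals]


def one_step_slope(x, y, slope0, c=1.345):
    """One weighted least-squares step for y = slope * x, started at slope0."""
    residuals = [yi - slope0 * xi for xi, yi in zip(x, y)]
    w = huber_weights(residuals, c)
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den


def bh_level(pvalues, q=0.05):
    """Benjamini-Hochberg level: largest k*q/m with p_(k) <= k*q/m, else 0.0."""
    m = len(pvalues)
    level = 0.0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * q / m:
            level = k * q / m
    return level


# Toy data (made up): slope close to 2, with one gross outlier in the last point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 40.0]

ols_slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
start = median([b / a for a, b in zip(x, y)])  # crude robust starting value
robust_slope = one_step_slope(x, y, start)

print("least squares slope:", round(ols_slope, 3))
print("one-step weighted slope:", round(robust_slope, 3))
print("BH level:", bh_level([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

On this toy data the outlier pulls the least squares slope well away from 2, while the weighted step stays close to it. Calling one_step_slope again with its own output as slope0 mimics the idea of iterating the estimator further, although with a single covariate it cannot show the non-orthogonality bias discussed in the abstract.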

Keywords
  • Linear regression
  • Multicollinearity
  • M-estimator
  • Robust t-test
  • Partial correlation
  • False discovery rate
  • LARS
  • Random forests
Citation (ISO format)
DUPUIS, Debbie, VICTORIA-FESER, Maria-Pia. Fast Robust Model Selection in Large Datasets. In: Journal of the American Statistical Association, 2011, vol. 106, p. 203–212. doi: 10.1198/jasa.2011.tm09650
Main files (1)
Article (Submitted version)
Access level: Public
Identifiers
ISSN of the journal: 0162-1459

Technical information

Creation: 11/26/2010 11:04:00 AM
First validation: 11/26/2010 11:04:00 AM
Update time: 03/14/2023 4:09:52 PM
Status update: 03/14/2023 4:09:52 PM
Last indexation: 01/15/2024 9:54:39 PM
All rights reserved by Archive ouverte UNIGE and the University of Geneva