Indirect Estimators and Computational Methods for Models with Unobserved Variables in High Dimensions

Blanc, Guillaume

doi:10.13097/archive-ouverte/unige:169345

Dimension reduction for high-dimensional data is an important and challenging task, relevant to both machine learning and statistical applications. Generalized linear latent variable models (GLLVMs) provide a probabilistic alternative to matrix factorization when the data are of mixed types, whether discrete, continuous, or a mixture of both. They reduce dimensionality by mapping the correlated multivariate data to so-called latent variables, defined in a lower-dimensional space. The benefit of GLLVMs is twofold: the latent variables can be estimated and used as features in another model, and the model parameters themselves are interpretable and give meaningful insight into the structure of the data. For all their merits, GLLVMs suffer from a serious drawback: their estimation is a tremendous challenge even for moderately large dimensions, essentially because of the multiple integrals involved in the likelihood function. Numerous methods based on approximations of the likelihood have been proposed: Laplace approximation, adaptive quadrature, or, recently, variational approximation (VA). For GLLVMs, however, these methods do not scale well to high dimensions, and they may also introduce a large bias in the estimates. An alternative approach is to use the EM algorithm; unfortunately, for GLLVMs, the update does not admit an analytical solution and requires computationally expensive Monte Carlo procedures, which ultimately results in slow progress towards the solution. Of all these methods, the best performing, and indeed the current state-of-the-art implementation, is the VA approach. In this paper, we propose an alternative route, which consists in drastically simplifying the estimating equations and applying numerically efficient bias reduction methods in order to recover a consistent estimator for the GLLVM parameters.
The resulting estimator, for which we derive the statistical properties, is an M-estimator with a small efficiency loss compared to the (exact) maximum likelihood estimator. It can, however, provide bias-reduced estimates for GLLVMs with over n = 10,000 observations, p = 500 (non-Gaussian) manifest variables and over q = 10 latent variables in reasonable time on commodity hardware. For larger data sets, the resulting method, whose computational burden is linear in npq, remains applicable when the state-of-the-art method fails to converge. To compute the M-estimator, we derive a stochastic-approximation algorithm that allows us to estimate GLLVMs with mixed response types, and we provide an accompanying R package.
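
To make the model structure described above concrete, the following is a minimal sketch of how data from a GLLVM with mixed response types could be simulated. It is not the estimator or the R package from the thesis. The function name `simulate_gllvm`, the choice of Gaussian and Bernoulli (logit link) responses, unit residual variance, and the scales of the loadings and intercepts are all assumptions made purely for illustration.

```python
import numpy as np


def simulate_gllvm(n, p_gauss, p_binary, q, seed=0):
    """Simulate mixed-type data from a simple GLLVM (illustrative assumption).

    Each of the n observations has q standard-normal latent variables z.
    The linear predictor of manifest variable j is
        eta_j = intercept_j + loadings_j . z
    The first p_gauss variables are Gaussian with mean eta_j (unit variance),
    the remaining p_binary variables are Bernoulli with a logit link.
    """
    rng = np.random.default_rng(seed)
    p = p_gauss + p_binary

    # Hypothetical parameter values, drawn at random for the example.
    loadings = rng.normal(scale=0.5, size=(p, q))
    intercepts = rng.normal(scale=0.2, size=p)

    # Latent variables live in the lower-dimensional space (q < p).
    z = rng.standard_normal((n, q))

    # Linear predictors for all manifest variables, shape (n, p).
    eta = intercepts + z @ loadings.T

    y = np.empty((n, p))
    # Gaussian responses: identity link, unit residual variance.
    y[:, :p_gauss] = eta[:, :p_gauss] + rng.standard_normal((n, p_gauss))
    # Binary responses: success probability via the inverse logit.
    prob = 1.0 / (1.0 + np.exp(-eta[:, p_gauss:]))
    y[:, p_gauss:] = rng.binomial(1, prob)

    return y, z, loadings, intercepts


if __name__ == "__main__":
    y, z, loadings, intercepts = simulate_gllvm(n=1000, p_gauss=3, p_binary=2, q=2)
    print(y.shape, z.shape, loadings.shape)
```

The marginal likelihood of such data integrates over the q latent variables of every observation, which is the source of the computational difficulty mentioned in the abstract.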

Archive ouverte UNIGE
