Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample” by Andrea Cerioli, Marco Riani, Anthony C. Atkinson and Aldo Corbellini

This paper discusses the contribution of Cerioli et al. (Stat Methods Appl, 2018), who propose robust monitoring based on high breakdown point estimators for multivariate data. Their results follow years of development in robust diagnostic techniques. We discuss the issues involved in extending data monitoring to other models with complex structure, e.g. factor analysis and mixed linear models, for which S- and MM-estimators exist, as well as to deviating data cells. We emphasise the importance of robust testing, which is often overlooked despite robust tests being readily available once S- and MM-estimators have been defined. We also mention open questions, such as out-of-sample inference and big data issues, that would benefit from monitoring.


Introduction
We congratulate the authors for their comprehensive and convincing presentation of the data monitoring approach based on the forward search (FS) algorithm using high breakdown robust estimation of covariance matrices. The methods presented can be tuned to various degrees of robustness via the choice of constants controlling the breakdown point and/or the efficiency of the robust estimators. They are de facto used as diagnostic tools to determine important tuning parameters, such as the breakdown point, for the robust covariance estimator: in particular, monitoring reveals the proportion of the data that can be considered as outliers, hence providing information about the breakdown point for possible parameter tuning. Cerioli et al. (2018) also compare several robust estimators which, incidentally, do not lead to the same estimated breakdown point.
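To fix ideas, monitoring can be mimicked in a few lines of R; the sketch below is our own toy example, not the authors' forward search implementation. It traces robust Mahalanobis distances as the subset fraction of the MCD estimator of robustbase, which governs the breakdown point, is varied.

```r
## Toy monitoring sketch: robust Mahalanobis distances as the MCD
## breakdown point varies (assumes the 'robustbase' package).
library(robustbase)

set.seed(1)
X <- rbind(matrix(rnorm(180), ncol = 3),            # 60 clean observations
           matrix(rnorm(60, mean = 4), ncol = 3))   # 20 contaminated rows (25%)

alphas <- seq(0.50, 0.95, by = 0.05)  # subset fraction: 0.5 = maximal breakdown
dists <- sapply(alphas, function(a) {
  fit <- covMcd(X, alpha = a)
  sqrt(mahalanobis(X, fit$center, fit$cov))
})

## One curve per observation; the outlying rows stay above the chi-square
## cutoff until the estimator breaks down as alpha grows.
matplot(alphas, t(dists), type = "l", lty = 1, col = "grey40",
        xlab = "MCD subset fraction (alpha)",
        ylab = "robust Mahalanobis distance")
abline(h = sqrt(qchisq(0.975, df = ncol(X))), lty = 2)
```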
Early attempts to robustly estimate covariance matrices when some of the data at hand are missing date back to Little and Smith (1987) and Little and Rubin (1987). High breakdown estimation of multivariate location and scale was then proposed by Cheng and Victoria-Feser (2002), Copt and Victoria-Feser (2004), Alqallaf et al. (2009) and Danilov et al. (2012).
Like Cerioli et al. (2018), we consider estimation of a covariance matrix as a first step towards a more thorough data analysis. In what follows, we briefly mention robust methods that have been developed in multivariate settings, most of them under the multivariate normality assumption, to which the robust monitoring approach of Cerioli et al. (2018) could possibly be extended. We also discuss the important issue of robust inference (testing) and out-of-sample validity.

Deviating data cells in multivariate samples
The identification of outliers in multivariate settings has been extended to cellwise contamination. The problem is complex, as this type of contamination cannot be identified easily using purely columnwise/rowwise methods such as S- or MM-estimators. Substantial progress has been made in the last decade; see for instance Agostinelli et al. (2015), Öllerer (2016) and Leung et al. (2016). The snipping approach of Farcomeni (2014a, b), developed to deal with robust clustering under cellwise contamination, can also accommodate the single-cluster case. Very recently, Rousseeuw and Van den Bossche (2017) proposed a method that detects deviating cells while taking correlation into account; it has no restriction on the number of clean rows and can deal with high dimensions. An R package called cellWise is now available on CRAN. Monitoring cellwise contamination can certainly be of interest and could bring new insight on the data and on the estimators to be used.
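To give a flavour, the sketch below applies the DDC() routine of that package to a matrix with a few contaminated cells; the call reflects our reading of the package interface, which may evolve across versions.

```r
## Sketch: cellwise outlier detection with cellWise::DDC
## (Rousseeuw and Van den Bossche 2017).
library(cellWise)

set.seed(2)
X <- matrix(rnorm(100 * 5), ncol = 5)
X[sample(length(X), 25)] <- 10     # contaminate 5% of the cells

out <- DDC(X)
head(out$indcells)   # linear indices of the flagged deviating cells
```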

Principal component and factor analysis
Robust estimation of the covariance matrix is an essential step for principal component analysis (PCA) and, similarly, factor analysis (FA), both being dimension reduction techniques. Various robust alternatives have been proposed; see for example Devlin et al. (1981), Croux and Haesbroeck (2000), Croux and Ruiz-Gazen (2005, 2012) and Xu et al. (2013). Monitoring outliers in this setting is also important in practice, especially when the results of PCA or FA are used to compute individual scores, for example to classify participants according to their response pattern. Mavridis and Moustaki (2008) implement the forward search algorithm in FA models and Mavridis and Moustaki (2009) extend it to FA with binary data (for non-normal cases, see also Moustaki and Victoria-Feser 2006).
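One concrete plug-in route is to estimate the covariance robustly and then diagonalise it; the R sketch below uses the MCD estimator of rrcov purely for illustration (dedicated robust PCA methods, such as PcaHubert in the same package, are also available).

```r
## Plug-in robust PCA sketch: eigen-decompose a robust covariance estimate
## (assumes the 'rrcov' package).
library(rrcov)

set.seed(3)
X <- matrix(rnorm(200 * 4), ncol = 4)
X[1:10, ] <- X[1:10, ] + 6          # 5% rowwise contamination

rob <- CovMcd(X)
eig <- eigen(getCov(rob))           # robust loadings and eigenvalues
scores <- sweep(X, 2, getCenter(rob)) %*% eig$vectors
eig$values / sum(eig$values)        # robust proportions of explained variance
```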

Mixed linear models
Another notable multivariate setting is the framework of mixed linear models. Common examples are repeated measures taken over time on the same individual and responses from patients in group randomised trials. In this situation, the clusters are independent and a multivariate normal formulation is available at the cluster level. Copt and Victoria-Feser (2006) exploited this equivalence and proposed S-estimators in the balanced case, i.e. when all clusters have the same dimension $p$. The difference with the multivariate model considered by Cerioli et al. (2018) is that (i) $\mu_i$, the mean of the outcome $y_i$ for cluster $i$, can be written as a linear combination of covariates, i.e. $\mu_i = x_i^T \alpha$; (ii) $\Sigma$, the variance of $y_i$, also has a specific structure, i.e. $\Sigma = \sum_{j=0}^{r} \sigma_j^2 z_j z_j^T$, where the $z_j$'s are design matrices for the $r$ random effects. As in the unstructured case, these S-estimators have a high breakdown point, assuming that the $z_j$'s are well controlled. In addition, once a high breakdown estimate of the covariance matrix has been obtained via an S-estimator, MM-estimators follow, as later suggested by Copt and Heritier (2007); see Heritier et al. (2009) and the book website, and also Koller (2016) for R code. More recently, Chervoneva and Vishnyakov (2014) extended the theory developed in Copt and Victoria-Feser (2006) by relaxing the assumption of the same number of observations per cluster. Their general S-estimator shares similar properties with the ones proposed earlier but can accommodate unbalanced clustered data. A nice illustration of this approach can be found in Chervoneva and Vishnyakov (2011). The monitoring approach proposed by Cerioli et al. (2018), and particularly the plots of squared Mahalanobis distances obtained from monitoring S- or MM-estimation, could valuably be used in this setting.
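For readers wishing to experiment, robust mixed model fits are directly available in R; the sketch below uses rlmer() from the robustlmm package accompanying Koller (2016), which implements a related robust approach rather than the S- and MM-estimators discussed above, on the classical sleepstudy data of lme4.

```r
## Robust mixed linear model fit with robustlmm::rlmer (Koller 2016);
## the syntax mirrors lme4::lmer.
library(robustlmm)
data(sleepstudy, package = "lme4")

fit <- rlmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)   # estimates plus robustness weights

## Small robustness weights flag outlying observations or clusters;
## tracing them across tuning constants would mimic the monitoring idea.
```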
It should be stressed that the above references do not only cover high breakdown estimation in mixed linear models but also robust testing, a point that tends to be overlooked. In the context of this discussion, we would like to emphasise that robust monitoring is nice, but robust monitoring done jointly with robust inference (testing) is better, especially when the tools are available. Heritier and Ronchetti (1994) showed that robust M-estimators can be used to build Wald, score and likelihood ratio type tests that have a stable level (robustness of validity) and power (robustness of efficiency) in a neighbourhood of the model. Examples of such tests include the robust score test based on the S-estimator of Copt and Victoria-Feser (2006), the robust likelihood ratio type test derived from the MM-estimator of Copt and Heritier (2007) and their equivalent for one-way multivariate analysis of variance (Van Aelst and Willems 2012); a readily available example in R is sketched at the end of this section. Cerioli et al. (2018) propose to use the information available in the sample to evaluate the proportion of outliers and to better calibrate important quantities such as the breakdown point, a tuning parameter for the robust covariance estimator. This approach is less convincing due to the well-known related problem of overfitting. Broadly speaking, the results of a statistical analysis should be generalisable to other outcomes or, equivalently, to the population from which the sample is drawn. In other words, out-of-sample validity should somehow be assessed. Without out-of-sample validation, a statistical procedure will necessarily tend to focus excessively on the sample data, creating a type of overfitting in which the sampling error (noise) is fitted together with the true underlying quantity of interest.
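As announced above, one readily available robust test in R is the MCD-based robust Wilks' Lambda of rrcov, a robust one-way multivariate analysis of variance test related in spirit, though not identical, to the S/MM-based tests cited in this section:

```r
## Robust one-way MANOVA via an MCD-based Wilks' Lambda
## (assumes the 'rrcov' package; see ?Wilks.test).
library(rrcov)
data(iris)

Wilks.test(Species ~ ., data = iris, method = "mcd")
```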

Out-of-sample inference
Out-of-sample inference is, in particular, at the basis of model selection procedures, where the problem of overfitting is well recognised. Model selection criteria have been developed that contain a penalty accounting for out-of-sample validity: Mallows' (1973) $C_p$, Akaike's (1974) information criterion (AIC), Schwarz's (1978) Bayesian information criterion (BIC) and related refinements (see e.g. McQuarrie and Tsai 1998) are the most popular. Efron (2004) developed a general framework for covariance penalty criteria that allows one, for a given loss function (such as the squared loss), to derive an estimator of the penalty for out-of-sample validity. In particular, Efron (2004) shows that, given a predicted value $\hat{Y}_i$ and considering the squared loss, the penalty term for the sample prediction error is $2\,\mathrm{cov}(\hat{Y}_i, Y_i)$. The loss function can also be chosen as a weighted prediction error, as is done for the linear regression model in e.g. Ronchetti and Staudte (1994), who derived the penalty associated with the sample weighted prediction error, thereby proposing a robust version of $C_p$.
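To fix ideas, the covariance penalty can be estimated by a simple parametric bootstrap; the R sketch below is our own toy example for an ordinary least squares fit, for which the total penalty $2\sum_i \mathrm{cov}(\hat{Y}_i, Y_i)$ is known to equal $2p\sigma^2$, with $p$ the number of regression coefficients.

```r
## Toy parametric-bootstrap estimate of Efron's (2004) covariance penalty
## for an OLS fit; the bootstrap estimate should approximate 2 * p * sigma^2.
set.seed(4)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit    <- lm(y ~ x)
sigma2 <- summary(fit)$sigma^2

B  <- 2000
Yb <- matrix(fitted(fit), n, B) +
      matrix(rnorm(n * B, sd = sqrt(sigma2)), n, B)    # resampled responses
Yhatb <- apply(Yb, 2, function(yb) fitted(lm(yb ~ x))) # refitted predictions

penalty  <- 2 * sum(sapply(1:n, function(i) cov(Yhatb[i, ], Yb[i, ])))
apparent <- mean((y - fitted(fit))^2)
apparent + penalty / n                      # covariance-penalised error estimate
c(bootstrap = penalty, theory = 2 * 2 * sigma2)   # p = 2 coefficients here
```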
Other robust model selection methods have been proposed in the literature, including Machado (1993) for the BIC, Ronchetti et al. (1997) for cross-validation, Khan et al. (2007) for a robust LARS algorithm, Dupuis and Victoria-Feser (2011, 2013) for fast robust model selection and, very recently, Avella Medina and Ronchetti (2017) for generalized linear models.
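As a toy illustration of the cross-validation entry in this list (our own simplification, not the actual procedure of Ronchetti et al. 1997), held-out predictions from an MM-regression can be scored with a bounded loss, so that single outlying test points cannot dominate model choice.

```r
## Robustified 5-fold cross-validation sketch: MM-regression fits
## (MASS::rlm) scored with a Huber-type loss on standardised residuals.
library(MASS)

set.seed(5)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$y[1:5] <- d$y[1:5] + 15                 # a few vertical outliers

huber_loss <- function(r, k = 1.345)
  ifelse(abs(r) <= k, r^2 / 2, k * abs(r) - k^2 / 2)

K <- 5
folds <- sample(rep(1:K, length.out = n))
cv_score <- mean(unlist(lapply(1:K, function(k) {
  fit <- rlm(y ~ x, data = d[folds != k, ], method = "MM")
  r <- d$y[folds == k] - predict(fit, newdata = d[folds == k, , drop = FALSE])
  huber_loss(r / mad(r))
})))
cv_score   # compare across candidate models instead of the raw MSE
```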

Final remarks
In this discussion, we limited our extensions to situations where the multivariate normal model, possibly with a structure on the mean and covariance matrix, is of interest. As robust monitoring is particularly effective with high breakdown estimators, it is natural to wonder what can be done beyond the multivariate normal case. References are sparse but include Bianco et al. (2005), who proposed S- and MM-estimators for a class of asymmetric models, e.g. the log-gamma distribution, and Salibian-Barrera and Yohai (2008), who extended S-estimators to robust regression with censored data. Once again, robust monitoring is worth considering in these settings, as it would bring new insight on the data. Finally, in high dimensional data, multivariate contamination may take a different form from the one usually assumed, i.e. outlying measurements may occur in such a way that the majority of observations are contaminated in at least one of their components. Van Aelst et al. (2012) adapt the Stahel-Donoho estimator by huberizing the data before the outlyingness is computed (a minimal sketch is given below), and show that their proposal can better withstand large numbers of outliers. It would be interesting to see whether monitoring can be adapted to this problem. Very recently, Van Aelst and Wang (2017) robustified sure independence screening, a procedure used for variable selection in ultra-high dimensional regression analysis. Their approach is a very fast screening method using least trimmed squares principal component analysis to estimate the latent factors and the factor-profiled variables; variable screening is then performed on the factor-profiled variables using regression MM-estimators. This brand new procedure may also be amenable to monitoring.
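The huberization step mentioned above is simple to sketch; the minimal R version below is our own rendering of the coordinatewise transformation described by Van Aelst et al. (2012), with an illustrative cutoff constant.

```r
## Coordinatewise huberization: winsorise each column at median +/- c * MAD
## before the Stahel-Donoho outlyingness is computed.
huberize <- function(X, cutoff = 2) {
  apply(X, 2, function(x) {
    m <- median(x)
    s <- mad(x)
    pmin(pmax(x, m - cutoff * s), m + cutoff * s)
  })
}

## Example: cells beyond the cutoff are pulled back to the boundary.
X <- cbind(c(rnorm(20), 50), rnorm(21))
range(huberize(X)[, 1])
```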