Comparison of statistical methods for multivariate outlier detection

Archimbaud A., Nordhausen K. and Ruiz-Gazen A.

Date

May 18 – 19, 2015

Location

The Open University, UK

Abstract

In this poster, we are interested in detecting outliers, such as manufacturing defects, in multivariate numerical data sets. Several unsupervised methods based on robust and non-robust covariance matrix estimators exist in the statistical literature. Our first aim is to exhibit the links between three outlier detection methods: the Invariant Coordinate Selection method as proposed by Caussinus and Ruiz-Gazen (1993) and generalized by Tyler et al. (2009), the method based on the Mahalanobis distance as detailed in Rousseeuw and Van Zomeren (1990), and the robust Principal Component Analysis (PCA) method with its diagnostic plot as proposed by Hubert et al. (2005).
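To illustrate the distance-based approach, here is a minimal numpy sketch that flags observations with large Mahalanobis distances. For simplicity it uses the empirical mean and covariance, whereas Rousseeuw and Van Zomeren (1990) advocate robust (e.g. MCD) estimates; the data and the cutoff below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 100 bivariate observations, the first 3 shifted (planted outliers)
X = rng.normal(size=(100, 2))
X[:3] += 5.0

m = X.mean(axis=0)
# Empirical covariance; a robust estimator (e.g. MCD) would replace it in practice
S = np.cov(X, rowvar=False)
# Squared Mahalanobis distances of all observations
d2 = np.einsum('ij,jk,ik->i', X - m, np.linalg.inv(S), X - m)

cutoff = 7.38  # approx. chi-square 97.5% quantile with p = 2 degrees of freedom
flagged = np.flatnonzero(d2 > cutoff)
print(flagged)
```

With the non-robust covariance the planted outliers are still flagged here, but in less favorable configurations they can mask each other, which is precisely the motivation for the robust estimates.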

Caussinus and Ruiz-Gazen (1993) proposed a Generalized PCA which diagonalizes a scatter matrix relative to another: $V_1V_2^{-1}$, where $V_2$ is a more robust covariance estimator than $V_1$, the usual empirical covariance estimator. These authors compute scores by projecting all the observations $V_2^{-1}$-orthogonally onto some of the components, and high scores are associated with potential outliers. We note that computing Euclidean distances between observations using all the components is equivalent to computing robust Mahalanobis distances, with respect to the matrix $V_2$, on the initial data. Tyler et al. (2009) generalized this method and called it Invariant Coordinate Selection (ICS). Contrary to Caussinus and Ruiz-Gazen (1993), they diagonalize $V_1^{-1}V_2$, which leads to the same eigenelements but to different scores that are proportional to each other. As explained in Tyler et al. (2009), the method is equivalent to a robust PCA with the scatter matrix $V_2$ after making the data spherical using $V_1$. However, the Euclidean distances between observations based on all the components of ICS now correspond to Mahalanobis distances with respect to $V_1$ and not to $V_2$.
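The mechanics of ICS can be sketched in a few lines of numpy: simultaneously diagonalize $V_1$ and $V_2$ with the transformation scaled to be $V_1$-orthonormal, so that the squared Euclidean norm of a full score vector equals the squared Mahalanobis distance with respect to $V_1$, as stated above. The simulated data and the crude reweighted covariance standing in for a truly robust $V_2$ are purely illustrative, not the estimators used in the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 trivariate observations, the first 5 shifted (outliers)
X = rng.normal(size=(200, 3))
X[:5] += 6.0

def ics_scores(X, V1, V2):
    """ICS scores from the simultaneous diagonalization of V1 and V2.

    The transformation H satisfies H' V1 H = I, so the squared Euclidean
    norm of a full score vector equals the squared Mahalanobis distance
    of the observation with respect to V1.
    """
    w1, U1 = np.linalg.eigh(V1)
    V1_isqrt = U1 @ np.diag(1.0 / np.sqrt(w1)) @ U1.T   # symmetric V1^{-1/2}
    # V1^{-1/2} V2 V1^{-1/2} has the same eigenvalues as V1^{-1} V2
    lam, U = np.linalg.eigh(V1_isqrt @ V2 @ V1_isqrt)
    order = np.argsort(lam)[::-1]                       # descending eigenvalues
    H = V1_isqrt @ U[:, order]
    return (X - X.mean(axis=0)) @ H

V1 = np.cov(X, rowvar=False)                 # usual empirical covariance
# Crude one-step reweighted covariance as a stand-in for a robust estimator
# (e.g. MCD); purely illustrative
Xc = X - X.mean(axis=0)
d2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(V1), Xc)
V2 = np.cov(X[d2 < 2 * np.median(d2)], rowvar=False)

Z = ics_scores(X, V1, V2)
# Squared norms over all components equal squared Mahalanobis distances w.r.t. V1
print(np.allclose((Z ** 2).sum(axis=1), d2))  # True
```

Working through the symmetric form $V_1^{-1/2}V_2V_1^{-1/2}$ keeps the computed transformation exactly $V_1$-orthonormal even when eigenvalues are close, which a direct eigendecomposition of the non-symmetric $V_1^{-1}V_2$ does not guarantee numerically.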

Note that each of the three methods leads to a score for each observation, and high scores are associated with potential outliers. We compare the three methods on simulated and real data sets and show in particular that ICS is the only method that allows a selection of the components relevant for detecting outliers.

References

[1] Caussinus, H. and Ruiz-Gazen, A. (1993), Projection pursuit and generalized principal component analysis, In New Directions in Statistical Data Analysis and Robustness (eds S. Morgenthaler, E. Ronchetti and W. A. Stahel), 35–46, Basel: Birkhäuser.

[2] Hubert, M., Rousseeuw, P. J. and Vanden Branden, K. (2005), ROBPCA: a new approach to robust principal component analysis, Technometrics, 47(1), 64–79.

[3] Rousseeuw, P. J. and Van Zomeren, B. C. (1990), Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85(411), 633–639.

[4] Tyler, D. E., Critchley, F., Dümbgen, L. and Oja, H. (2009), Invariant coordinate selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 549–592.
