Multivariate outlier detection with Mahalanobis distance and ICS (Invariant Coordinate Selection) for standard and high-dimensional data

Archimbaud A., Nordhausen K. and Ruiz-Gazen A.

Date

September 24, 2019

Time

12:00 AM

Location

Institut de Mathématiques de Toulouse (IMT), Toulouse, France

Abstract

In this presentation, we are interested in detecting outliers in an unsupervised way in multivariate numerical data sets. We focus on the case of a small proportion of outlying observations, such as fraud or manufacturing faults. In the industrial context of fault detection, this task is of great importance for ensuring high-quality production. In addition, with the exponential increase in the number of measurements on electronic components, the problem of high-dimensional data arises in the identification of outlying observations. The ippon innovation company, an expert in industrial statistics and anomaly detection, wanted to address this new situation and collaborated with the TSE-R research laboratory by funding a PhD thesis. This work led to several publications, some R packages and a proprietary algorithm already used by some customers. The main ideas, propositions and results will be presented.

The well-known Mahalanobis distance computes a score for each observation that takes into account the covariance structure of the data set; high scores indicate possible outliers. However, this method reaches its limits when the dimension of the data increases while the structure of interest remains in a fixed-dimensional subspace. The ICS (Invariant Coordinate Selection) method overcomes this drawback by selecting the components relevant for outlier detection. The results will be illustrated on simulated and real data sets with the R package ICSOutlier and the shiny app ICSShiny we implemented.
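As a rough sketch of these two scores (a Python/numpy illustration under simplified assumptions, not the authors' implementation, which is the R package ICSOutlier), Mahalanobis distances and a basic ICS step with the classical scatter pair covariance/fourth-moment scatter can be written as:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Simulated data: 195 regular observations plus 5 outliers shifted
# along the first coordinate (a small contamination proportion).
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:5, 0] += 10.0

# Classical squared Mahalanobis distances: (x - m)' S^{-1} (x - m).
m = X.mean(axis=0)
S = np.cov(X, rowvar=False)
centered = X - m
d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S), centered)

# A basic ICS step with the scatter pair (Cov, Cov4): solve the
# generalized eigenproblem  Cov4 v = lambda Cov v.  Invariant
# coordinates with extreme eigenvalues carry the outlier structure.
cov4 = (centered * d2[:, None]).T @ centered / ((p + 2) * n)
eigvals, eigvecs = eigh(cov4, S)      # ascending eigenvalues
scores = centered @ eigvecs[:, ::-1]  # coordinates, largest eigenvalue first
```

Here the outlying observations get the largest Mahalanobis scores, and in this low-dimensional toy example the first invariant coordinate concentrates the shift; ICS becomes valuable precisely when the dimension grows and only a few such coordinates remain informative.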

Going further, in high dimension the scatter matrices may be singular because of multicollinearity. In such a context, ICS can be generalized using a Generalized Singular Value Decomposition (GSVD). This approach has advantages over an alternative based on generalized inverses of the scatter matrices: in examples where the structure of interest is contained in a subspace, the proposed method is able to recover that subspace while other approaches may fail to identify it. These advantages are discussed in detail from a theoretical point of view and on simulated examples.
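The singularity issue that motivates this generalization is easy to exhibit (a minimal Python sketch, not part of the authors' method): with fewer observations than variables, the sample covariance matrix is rank-deficient, so neither its inverse nor classical ICS is available.

```python
import numpy as np

# With more variables than observations (p > n), the sample covariance
# matrix has rank at most n - 1, so it is singular and Cov^{-1}
# (and hence the classical Mahalanobis distance and ICS) is undefined.
rng = np.random.default_rng(1)
n, p = 20, 50
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)

rank = np.linalg.matrix_rank(S)  # at most n - 1 = 19, well below p = 50
```

This is the setting where a GSVD-based formulation, which never inverts the scatter matrices, remains well defined.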

Details
Posted on:
September 24, 2019
Length:
2 minute read, 368 words