Tandem clustering with invariant coordinate selection (ICS)

Alfons, A., Archimbaud, A., Nordhausen, K. and Ruiz-Gazen, A.

Date

June 2, 2023

Time

12:00 AM

Location

Erasmus University Rotterdam, The Netherlands

Abstract

Tandem clustering is a well-known technique for dealing with high-dimensional or noisy data to better identify clusters. It is a sequential approach: first reduce the dimension of the data, then perform the clustering. The most common method, based on principal component analysis (PCA), has been criticized for focusing only on maximizing inertia and not necessarily preserving the structure of interest for clustering. Therefore, we suggest a new tandem clustering approach based on invariant coordinate selection (ICS). This multivariate method is designed to identify the structure of the data by jointly diagonalizing two scatter matrices, while maintaining the affine invariance of the new coordinates. More specifically, theoretical results show that under some elliptical mixture models, the first and/or last components carry the information regarding the clustering structure. However, despite these attractive properties, ICS has been studied mostly for outlier detection and rarely in the context of clustering. Two challenges must be addressed: choosing the pair of scatter matrices and choosing the components to keep. For clustering purposes, we suggest that the best scatter pairs consist of one matrix that captures the within-cluster structure and another that captures the global structure. To this end, we find the local shape or pairwise scatters to be good choices for estimating the within-cluster structure. We also investigate the use of the well-known minimum covariance determinant (MCD) estimator based on a smaller-than-usual subset size. The performance of ICS as a dimension reduction method is evaluated by its ability to preserve the cluster structure of the data. We conducted a large simulation study and applied the method to benchmark data sets, testing various combinations of scatter matrices and component selection criteria, as well as the effect of outliers. Results indicate that ICS-based tandem clustering outperforms the PCA-based approach and is thus a promising method.
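
To make the joint diagonalization step concrete, here is a minimal NumPy sketch of ICS. It is an illustration under stated assumptions, not the implementation used in the talk: it uses the classical COV-COV4 scatter pair (the sample covariance and a scatter based on fourth moments) rather than the local shape, pairwise, or MCD scatters discussed in the abstract, and the function names `cov4` and `ics` are hypothetical.

```python
import numpy as np


def cov4(X):
    """Fourth-moment scatter matrix (COV4), assumed here as the second scatter.

    Each observation is weighted by its squared Mahalanobis distance.
    The 1 / (p + 2) factor makes COV4 match the covariance matrix
    at the population level for Gaussian data.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S1_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
    # Squared Mahalanobis distance of every row to the mean.
    r2 = np.einsum("ij,jk,ik->i", Xc, S1_inv, Xc)
    return (Xc * r2[:, None]).T @ Xc / (n * (p + 2))


def ics(X):
    """Jointly diagonalize COV and COV4 and return the invariant coordinates.

    Returns (Z, rho, B) such that B.T @ S1 @ B is the identity and
    B.T @ S2 @ B is diag(rho), with rho sorted in decreasing order.
    """
    Xc = X - X.mean(axis=0)
    S1 = np.cov(Xc, rowvar=False)
    S2 = cov4(X)

    # Whiten with respect to S1 using its symmetric inverse square root.
    evals, evecs = np.linalg.eigh(S1)
    S1_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    # Eigendecomposition of the whitened second scatter.
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt
    M = (M + M.T) / 2  # enforce symmetry against rounding error
    rho, U = np.linalg.eigh(M)

    # Sort generalized eigenvalues (and components) in decreasing order.
    order = np.argsort(rho)[::-1]
    rho, U = rho[order], U[:, order]

    B = S1_inv_sqrt @ U
    return Xc @ B, rho, B
```

In a tandem workflow, one would keep only some of the first and/or last columns of `Z` and pass them to a clustering algorithm of choice. Which components to keep and which scatter pair to use are exactly the open choices the abstract describes.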
