New technologies and methods of data collection make it necessary to summarise the large quantity of information that is available. Typically, we face a data matrix X of size (n x J), corresponding to n statistical units and J quantitative variables, where both n and J are very large. Clustering identifies homogeneous clusters of units, so it can be seen as a way to reduce the n units to a smaller number of clusters. Dimensionality reduction techniques obtain latent dimensions (fewer than the manifest variables), so they reduce the dimensionality of the variable space. In this paper, we apply Double Hierarchical Parsimonious Means Clustering (Cavicchia et al., 2019) to the Asia-Europe Meeting (ASEM) data set in order to obtain, simultaneously, a hierarchical parsimonious clustering of units, aggregated around centroids, and a dimensionality reduction of variables, aggregated around components. The model is estimated by least squares (LS), and an efficient coordinate descent algorithm is given. The goodness of fit of the double hierarchical parsimonious trees can be computed to assess the quality of the two hierarchical partitions.
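To make the idea of partitioning units and variables at the same time more concrete, the following is a minimal sketch of a flat (non-hierarchical) two-way partitioning fitted by alternating least squares. It is not the Double Hierarchical Parsimonious Means Clustering algorithm of the paper: the hierarchy and parsimony constraints are omitted, and the function names `block_means` and `double_partition`, the numbers of clusters K and Q, and the simulated data are all assumptions made for illustration only.

```python
import numpy as np


def block_means(X, rows, cols, K, Q):
    """Least-squares centroid matrix (K x Q): mean of X within each block."""
    U = np.eye(K)[rows]                      # n x K unit-membership indicators
    V = np.eye(Q)[cols]                      # J x Q variable-membership indicators
    sums = U.T @ X @ V                       # block-wise sums of X
    counts = np.outer(U.sum(axis=0), V.sum(axis=0))
    return sums / np.maximum(counts, 1)      # empty blocks get centroid 0


def double_partition(X, K, Q, n_iter=20, seed=0):
    """Hypothetical sketch: alternately update unit labels, variable labels
    and block centroids, each step minimising the squared error
    ||X - U Ybar V'||^2 with the other two blocks held fixed."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    rows = rng.integers(K, size=n)           # random initial unit labels
    cols = rng.integers(Q, size=J)           # random initial variable labels
    history = []
    for _ in range(n_iter):
        Ybar = block_means(X, rows, cols, K, Q)
        # Assign each unit to the cluster whose centroid profile is closest.
        d = ((X[:, None, :] - Ybar[:, cols][None, :, :]) ** 2).sum(axis=2)
        rows = d.argmin(axis=1)
        Ybar = block_means(X, rows, cols, K, Q)
        # Assign each variable to the component whose profile is closest.
        d = ((X.T[:, None, :] - Ybar[rows, :].T[None, :, :]) ** 2).sum(axis=2)
        cols = d.argmin(axis=1)
        Ybar = block_means(X, rows, cols, K, Q)
        history.append(float(((X - Ybar[rows][:, cols]) ** 2).sum()))
    return rows, cols, Ybar, history


# Simulated data (an assumption): 60 units in 3 groups, 10 variables in 2 groups.
rng = np.random.default_rng(1)
true_rows = np.repeat([0, 1, 2], 20)
true_cols = np.repeat([0, 1], 5)
centres = np.array([[0.0, 5.0], [5.0, 0.0], [10.0, 10.0]])
X = centres[true_rows][:, true_cols] + 0.1 * rng.standard_normal((60, 10))

rows, cols, Ybar, history = double_partition(X, K=3, Q=2)
```

Each of the three updates can only lower (or leave unchanged) the squared error, so `history` is non-increasing. Like any alternating scheme with a random start, this sketch may stop at a local minimum, so in practice several starts would be compared.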

Cavicchia, C., Vichi, M., Zaccaria, G. (2019). Hierarchical clustering and dimensionality reduction for big data. In Smart statistics for smart applications. Book of short paper SIS2019 (pp. 173-180). Pearson.

Hierarchical clustering and dimensionality reduction for big data

Giorgia Zaccaria
2019

Type: paper
Keywords: Clustering; Dimensionality reduction; Big data; Hierarchy
Language: English
Conference: SIS 2019
Book title: Smart statistics for smart applications. Book of short paper SIS2019
ISBN: 9788891915108
Year: 2019
Pages: 173-180
Access: open
Files in this item:
Cavicchia_Hierarchical-Clustering_2019.pdf.pdf
Access: open access
Description: Conference contribution
Attachment type: Publisher's Version (Version of Record, VoR)
Size: 703.74 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/394540