New technologies and methods of data collection create the need to summarise the large quantity of information that is available. Typically, we face a data matrix X of size (n × J), corresponding to n statistical units and J quantitative variables, where both n and J are very large. Clustering identifies homogeneous groups of units, so it can be seen as a way to reduce the dimensionality of the units. Dimensionality reduction techniques obtain latent dimensions (fewer than the manifest variables), so they reduce the dimensionality of the variable space. In this paper, we apply Double Hierarchical Parsimonious Means Clustering (Cavicchia et al., 2019) to the Asia-Europe Meeting (ASEM) data set in order to obtain, simultaneously, a hierarchical parsimonious clustering of units, aggregated around centroids, and a dimensionality reduction of variables, aggregated around components. The model is estimated by the least-squares (LS) method, and an efficient coordinate descent algorithm is given. The goodness of fit of the double hierarchical parsimonious trees can be computed to assess the quality of the two hierarchical partitions.
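To make the idea of clustering units and variables at the same time more concrete, the following is a minimal sketch of a plain, non-hierarchical double k-means fitted by alternating least-squares updates. It is not the authors' Double Hierarchical Parsimonious Means Clustering: the hierarchy and parsimony constraints are omitted, and the function names (`block_means`, `ls_loss`, `double_kmeans`), the random initialisation and the stopping rule are all assumptions made for illustration.

```python
import numpy as np


def block_means(X, u, v, K, Q):
    """Mean of X within each (unit cluster k, variable cluster q) block."""
    Y = np.zeros((K, Q))
    for k in range(K):
        for q in range(Q):
            block = X[np.ix_(u == k, v == q)]
            if block.size:  # an empty block keeps the placeholder value 0
                Y[k, q] = block.mean()
    return Y


def ls_loss(X, u, v, Y):
    """Least-squares criterion: squared distance between X and its block-mean approximation."""
    return float(((X - Y[np.ix_(u, v)]) ** 2).sum())


def double_kmeans(X, K, Q, n_iter=50, seed=0):
    """Illustrative double k-means: K clusters of units (rows), Q clusters of variables (columns)."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    u = rng.integers(K, size=n)  # random initial unit labels (assumption)
    v = rng.integers(Q, size=J)  # random initial variable labels (assumption)
    Y = block_means(X, u, v, K, Q)
    losses = [ls_loss(X, u, v, Y)]
    for _ in range(n_iter):
        # Reassign each unit to the row centroid that minimises its squared error.
        Yv = Y[:, v]  # K x J: centroid profile expanded over the variables
        u = ((X[:, None, :] - Yv[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Reassign each variable to the column centroid that minimises its squared error.
        Yu = Y[u, :]  # n x Q: centroid profile expanded over the units
        v = ((X[:, :, None] - Yu[:, None, :]) ** 2).sum(axis=0).argmin(axis=1)
        # Recompute the centroid matrix for the new partitions.
        Y = block_means(X, u, v, K, Q)
        losses.append(ls_loss(X, u, v, Y))
        if losses[-2] - losses[-1] < 1e-12:  # stop once the criterion no longer decreases
            break
    return u, v, Y, losses
```

Each of the three updates minimises the least-squares criterion with the other two blocks held fixed, so the recorded loss cannot increase from one iteration to the next. The distance computations build full n × J × Q and n × K × J arrays, so this form is only suitable for small examples, not for the very large n and J the abstract has in mind.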
Cavicchia, C., Vichi, M., & Zaccaria, G. (2019). Hierarchical clustering and dimensionality reduction for big data. In Smart Statistics for Smart Applications. Book of Short Papers SIS2019 (pp. 173-180). Pearson.
Hierarchical clustering and dimensionality reduction for big data
Giorgia Zaccaria
2019
File | Size | Format
---|---|---
Cavicchia_Hierarchical-Clustering_2019.pdf.pdf (open access; description: conference contribution; attachment type: Publisher's Version, Version of Record, VoR) | 703.74 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.