Ianni, M., Masciari, E., Mazzeo, G., Mezzanzanica, M., Zaniolo, C. (2020). Fast and effective Big Data exploration by clustering. FUTURE GENERATION COMPUTER SYSTEMS, 102, 84-94 [10.1016/j.future.2019.07.077].

Fast and effective Big Data exploration by clustering

Mazzeo, GM; Mezzanzanica, M
2020

Abstract

The rise of the Big Data era calls for more efficient and effective data exploration and analysis tools. In this respect, the need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned in order to use distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. By using four stages of successive refinements, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+ on platforms tailored for Big Data management.
Journal article - Scientific article
Big Data; Clustering; Data exploration;
English
6 Aug 2019
2020
102
84
94
none

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/241526
Citations
  • Scopus 51
  • Web of Science (ISI) 31