Bicocca Open Archive

Gerli, S., Ascari, R., Migliorati, S., Cigna, T., Borrotti, M. (2024). Beyond human labelling: an automatic topic identification framework for big web data. ELECTRONIC JOURNAL OF APPLIED STATISTICAL ANALYSIS, 17(3), 545-571 [10.1285/i20705948v17n3p545].

Beyond human labelling: an automatic topic identification framework for big web data

Gerli S.; Ascari R.; Migliorati S.; Cigna T.; Borrotti M.

2024

Abstract

The global volume of written text is growing at an ever faster rate: since 2011, the number of posts per minute on Facebook has increased from 650K to 3M. These unstructured data are a source of an enormous amount of information that must be extracted with automatic engines. This is mainly accomplished through Natural Language Processing (NLP), a field of Artificial Intelligence devoted to analyzing and understanding human language as it is spoken and written. One common NLP task is topic identification, that is, recognizing the topic or topics of a text. Two popular methods for modeling latent topics are latent Dirichlet allocation (LDA) and the correlated topic model (CTM). Both assume that each word in a document is associated with a latent topic, but they differ in the prior distribution assigned to the topics, and therefore have different strengths and weaknesses. In this work, LDA and CTM are tested and compared in a big-data context by analyzing a large set of short documents automatically downloaded from the web with a modern crawler. In addition, under the assumption that each document is associated with a single topic, two new methods for automatically classifying documents according to their real topic are proposed and tested, relying on LDA and CTM as (latent) topic model engines. Finally, under the more realistic hypothesis of multiple topics within a document, the two new methods, together with some combinations of the two, are tested as multi-class classification tools.
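
The difference in priors mentioned in the abstract can be made concrete with the standard textbook formulations of the two models. The sketch below is not taken from the article itself: the symbols (theta_d for the topic proportions of document d, z_dn and w_dn for the topic and word at position n, beta_k for the word distribution of topic k, alpha, mu, Sigma for the prior parameters, K for the number of topics) are notation assumed here for illustration, and the authors may use a different parameterization.

```latex
% Shared word-level assumption: every word has its own latent topic.
\begin{align*}
  z_{dn} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), &
  w_{dn} \mid z_{dn}   &\sim \mathrm{Categorical}(\beta_{z_{dn}}).
\end{align*}

% LDA: Dirichlet prior on the topic proportions.
\begin{equation*}
  \theta_d \sim \mathrm{Dirichlet}(\alpha)
\end{equation*}

% CTM: logistic-normal prior on the topic proportions.
\begin{equation*}
  \eta_d \sim \mathcal{N}_K(\mu, \Sigma), \qquad
  \theta_{dk} = \frac{\exp(\eta_{dk})}{\sum_{j=1}^{K} \exp(\eta_{dj})}
\end{equation*}
```

Under the Dirichlet prior, the components of theta_d can only be negatively correlated, so LDA cannot express that two topics tend to appear together. The covariance matrix Sigma of the logistic-normal prior lets CTM capture such correlations, at the price of losing the conjugacy that makes inference for LDA comparatively simple.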

Subtype: Journal article - Scientific article
Keywords: latent Dirichlet allocation, correlated topic model, automatic classification, textual data, topic identification
Content language: English
Ahead-of-print date or first online publication date: 15 Dec 2024
Publication date: 2024
Journal: ELECTRONIC JOURNAL OF APPLIED STATISTICAL ANALYSIS
Volume: 17
Issue: 3
First page: 545
Last page: 571
Article DOI: https://dx.doi.org/10.1285/i20705948v17n3p545
Alternative URL: http://siba-ese.unisalento.it/index.php/ejasa/article/view/26959
Full text: open
Appears in types: 01 - Journal article

Files in this record:

File	Size	Format
Ascari-2024-Electronic Journal of Applied Statistical Analysis-VoR.pdf (open access; Publisher's Version, Version of Record (VoR); license: Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy, CC BY NC ND)	715.34 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/446358
