Gerli, S., Ascari, R., Migliorati, S., Cigna, T., Borrotti, M. (2024). Beyond human labelling: an automatic topic identification framework for big web data. Electronic Journal of Applied Statistical Analysis, 17(3), 545-571 [10.1285/i20705948v17n3p545].

Beyond human labelling: an automatic topic identification framework for big web data

Gerli S.; Ascari R.; Migliorati S.; Cigna T.; Borrotti M.
2024

Abstract

The global volume of written text is growing at an accelerating pace: since 2011, the number of posts per minute on Facebook has increased from 650K to 3M. These unstructured data are the source of an enormous amount of information that should be extracted with automatic engines. This can mainly be accomplished through Natural Language Processing (NLP), a field of Artificial Intelligence devoted to analyzing and understanding human language as it is spoken and written. One common NLP task is topic identification, that is, the recognition of a text’s topic(s). Two popular methods for modeling latent topics are latent Dirichlet allocation (LDA) and the correlated topic model (CTM). Both assume that each word in a document is associated with a latent topic, but they differ in the prior distribution assigned to topics and therefore have different strengths and weaknesses. In this work, LDA and CTM are tested and compared in a big-data context by analyzing a large set of short documents automatically downloaded from the web with a modern crawler. In addition, under the assumption that each document is associated with a single topic, two new methods for the automatic classification of documents according to their real topic are proposed and tested, relying on LDA and CTM as (latent) topic model engines. Finally, under the more realistic hypothesis of multiple topics within a document, the two new methods, together with some combinations of the two, are tested as multi-class classification tools.
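The abstract notes that LDA and CTM differ in the prior placed on topics. As a minimal sketch of that difference (not the authors' code), the snippet below uses the standard textbook definitions: LDA draws per-document topic proportions from a Dirichlet distribution, while CTM draws them from a logistic-normal distribution whose covariance can link topics. All function names, the three-topic setup, and the parameter values (`ALPHA`, `MU`, `CHOL`) are illustrative assumptions.

```python
import math
import random

# Illustrative parameters for three topics (assumptions, not taken from the paper).
ALPHA = [1.0, 1.0, 1.0]   # Dirichlet concentration parameters
MU = [0.0, 0.0, 0.0]      # mean of the Gaussian behind the logistic-normal
# Lower-triangular Cholesky factor of the covariance; topics 0 and 1 share a strong factor.
CHOL = [[2.0, 0.0, 0.0],
        [1.9, 0.3, 0.0],
        [0.0, 0.0, 0.1]]


def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet prior (LDA-style assumption)."""
    # Independent Gamma(alpha_k, 1) draws, normalised to sum to one.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_logistic_normal(mu, chol, rng):
    """Draw topic proportions from a logistic-normal prior (CTM-style assumption)."""
    k = len(mu)
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # eta = mu + chol @ z is a correlated Gaussian vector.
    eta = [mu[i] + sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
    # Softmax maps eta onto the simplex; subtracting the max keeps exp() stable.
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def pearson(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def topic_correlation(sampler, n_docs=5000):
    """Correlation between the proportions of topics 0 and 1 across simulated documents."""
    draws = [sampler() for _ in range(n_docs)]
    return pearson([d[0] for d in draws], [d[1] for d in draws])


if __name__ == "__main__":
    rng = random.Random(0)
    print("Dirichlet:", topic_correlation(lambda: sample_dirichlet(ALPHA, rng)))
    print("Logistic-normal:", topic_correlation(lambda: sample_logistic_normal(MU, CHOL, rng)))
```

Under a Dirichlet prior, the proportions of any two topics are negatively correlated by construction, whereas the covariance of the logistic-normal allows two topics to rise and fall together across documents. This is one of the trade-offs the abstract alludes to.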
Type: Journal article - Scientific article
Keywords: latent Dirichlet allocation, correlated topic model, automatic classification, textual data, topic identification
Language: English
Publication date: 15-Dec-2024
Year: 2024
Volume: 17
Issue: 3
First page: 545
Last page: 571
Access: open
Files in this item:
File: Ascari-2024-Electronic Journal of Applied Statistical Analysis-VoR.pdf
Access: open access
Description: CC BY NC ND. This work is licensed under a Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy License
Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
Size: 715.34 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/10281/446358