Gerli, S., Ascari, R., Migliorati, S., Cigna, T., Borrotti, M. (2024). Beyond human labelling: an automatic topic identification framework for big web data. Electronic Journal of Applied Statistical Analysis, 17(3), 545-571 [10.1285/i20705948v17n3p545].
Beyond human labelling: an automatic topic identification framework for big web data
Gerli S.; Ascari R.; Migliorati S.; Borrotti M.
2024
Abstract
The global volume of written text is growing at an accelerating pace: since 2011, the number of posts per minute on Facebook has increased from 650K to 3M. These unstructured data are a source of an enormous amount of information that must be extracted with automatic engines. This is mainly accomplished through Natural Language Processing (NLP), a field of Artificial Intelligence devoted to analyzing and understanding human language as it is spoken and written. One common NLP task is topic identification, that is, recognizing the topic(s) of a text. Two popular methods for modeling latent topics are latent Dirichlet allocation (LDA) and the correlated topic model (CTM). Both assume that each word in a document is associated with a latent topic, but they differ in the prior distribution assigned to topics, and therefore have different strengths and weaknesses. In this work, LDA and CTM are tested and compared in a big-data context by analyzing a large set of short documents automatically downloaded from the web with a modern crawler. In addition, under the assumption that each document is associated with a single topic, two new methods for automatically classifying documents according to their real topic are proposed and tested, relying on LDA and CTM as (latent) topic model engines. Finally, under the more realistic hypothesis of multiple topics within a document, the two new methods, together with some combinations of the two, are tested as multi-class classification tools.

File | Size | Format |
---|---|---|
Ascari-2024-Electronic Journal of Applied Statistical Analysis-VoR.pdf (open access) | 715.34 kB | Adobe PDF |

Description: CC BY-NC-ND. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Italy License.
Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
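The abstract notes that LDA and CTM differ in the prior placed on topics. As a rough illustration only, and not the authors' implementation, the sketch below draws per-document topic proportions from a Dirichlet prior (the standard LDA choice) and from a logistic-normal prior (the standard CTM choice), then compares the sample correlation between the first two topics. The function names, the three-topic setup, the covariance values and the sample size are all assumptions made for this example.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """One draw of topic proportions from a Dirichlet prior (LDA-style)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_logistic_normal(mu, chol, rng):
    """One draw of topic proportions from a logistic-normal prior (CTM-style).

    `chol` is a lower-triangular Cholesky factor of the covariance matrix.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [m + sum(chol[i][j] * z[j] for j in range(i + 1))
           for i, m in enumerate(mu)]
    peak = max(eta)  # subtract the maximum to keep the softmax stable
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)
n_docs = 20000  # hypothetical number of simulated documents

# LDA-style prior: symmetric Dirichlet over three topics.
lda_draws = [sample_dirichlet([1.0, 1.0, 1.0], rng) for _ in range(n_docs)]

# CTM-style prior: topics 0 and 1 share a latent correlation of 0.9 (assumed),
# topic 2 is independent. This is the Cholesky factor of
# [[1, 0.9, 0], [0.9, 1, 0], [0, 0, 1]].
chol = [
    [1.0, 0.0, 0.0],
    [0.9, math.sqrt(1.0 - 0.9 ** 2), 0.0],
    [0.0, 0.0, 1.0],
]
ctm_draws = [sample_logistic_normal([0.0, 0.0, 0.0], chol, rng)
             for _ in range(n_docs)]

corr_lda = pearson([d[0] for d in lda_draws], [d[1] for d in lda_draws])
corr_ctm = pearson([d[0] for d in ctm_draws], [d[1] for d in ctm_draws])
print(f"Dirichlet prior, corr(topic 0, topic 1):       {corr_lda:.2f}")
print(f"Logistic-normal prior, corr(topic 0, topic 1): {corr_ctm:.2f}")
```

Under a Dirichlet prior, any two topic proportions are negatively correlated by construction. The logistic-normal prior carries an explicit covariance matrix, so it can represent topics that tend to appear together, at the cost of losing the Dirichlet's conjugacy with the multinomial.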
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.