Bicocca Open Archive

A software system for topic discovery and document tagging is described. The system discovers the topics hidden in a given document collection, labels them according to a user-supplied taxonomy, and tags new documents. It implements an information processing pipeline consisting of document preprocessing, topic extraction, automatic topic labeling, and multi-label document classification. The preprocessing module imports several kinds of documents and offers different document representations: binary, term frequency, and term frequency-inverse document frequency. The topic extraction module is implemented through a proprietary version of the Latent Dirichlet Allocation model. The optimal number of topics is selected through hierarchical clustering. The topic labeling module optimizes a set of similarity measures defined over the user-supplied taxonomy and is implemented through an algorithm over a topic tree. The document tagging module solves a multi-label classification problem through multi-net Naïve Bayes without the need to perform any learning tasks.

Magatti, D., Stella, F. (2011). Probabilistic Topic Discovery and Automatic Document Tagging. In R. Brena, A. Guzman (Eds.), Quantitative Semantics and Soft Computing Methods for the Web Perspectives and Applications (pp. 25-50). Information Science Pub [10.4018/978-1-60960-881-1].

Probabilistic Topic Discovery and Automatic Document Tagging

Magatti, D.; Stella, Fabio Antonio

2011

Subtype: Chapter or essay
Keywords: Topic models; document tagging; Bayesian learning; text mining
Content language: English
Volume title: Quantitative Semantics and Soft Computing Methods for the Web Perspectives and Applications
Volume editors: Brena, R; Guzman, A
Publication date: 2011
Volume ISBN: 9781609608811
Publisher: Information Science Pub
First page: 25
Last page: 50
Contribution DOI: https://dx.doi.org/10.4018/978-1-60960-881-1
Full text: none
Appears in types: 03 - Book contribution

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/25316
