Principe, R., Chiarini, N., Viviani, M. (2024). An LCF-IDF Document Representation Model Applied to Long Document Classification. In 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings (pp. 1129-1135). European Language Resources Association (ELRA).
An LCF-IDF Document Representation Model Applied to Long Document Classification
Principe R. A.; Viviani M.
2024
Abstract
TF-IDF (Term Frequency-Inverse Document Frequency) is a document representation model that has been used for years in NLP and Text Mining tasks. It is effective for various tasks such as Information Retrieval and Document Classification, but it may fall short when it comes to capturing the deeper semantic and contextual meaning of a text, which is where Transformer-based Pre-trained Language Models (PLMs) such as BERT have gained significant traction in recent years. These models, however, face their own challenges stemming from the limits of the Transformer attention mechanism, especially when dealing with long documents. This paper therefore proposes a novel approach that retains the advantages of the TF-IDF representation while incorporating semantic context, by introducing a Latent Concept Frequency-Inverse Document Frequency (LCF-IDF) document representation model. Its effectiveness is evaluated on the Long Document Classification task. The results show promising performance of the proposed solution compared with TF-IDF and BERT-like representation models, including those designed specifically for long documents, such as Longformer, as well as those tailored to particular domains, especially for Single Label Multi-Class (SLMC) classification.
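To make the idea concrete, below is a minimal, hypothetical sketch of how an LCF-IDF-style representation could be computed: chunk embeddings from a BERT-like encoder are clustered into latent concepts, and each document is then weighted by latent concept frequency times inverse document frequency, in direct analogy with TF-IDF over terms. This is only an illustration inferred from the abstract, not the authors' implementation; the `embed_chunks` stand-in encoder, the chunk size, the KMeans clustering step, and the number of concepts `K` are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def embed_chunks(doc, dim=32):
    # Stand-in encoder: one vector per fixed-size chunk. A real pipeline
    # would embed each chunk with a PLM such as BERT; random vectors are
    # used here only to keep the sketch self-contained and runnable.
    chunks = [doc[i:i + 200] for i in range(0, len(doc), 200)]
    return rng.normal(size=(len(chunks), dim))

# Toy corpus of "long" documents (repeated text just to yield several chunks).
docs = [
    "markets, bonds and quarterly earnings reports " * 30,
    "patients, clinical trials and dosage guidelines " * 30,
    "appeals, statutes and contractual obligations " * 30,
]

# 1) Encode every chunk of every document, remembering which doc owns it.
chunk_vecs, owners = [], []
for d_idx, doc in enumerate(docs):
    vecs = embed_chunks(doc)
    chunk_vecs.append(vecs)
    owners.extend([d_idx] * len(vecs))
chunk_vecs = np.vstack(chunk_vecs)

# 2) Induce K latent concepts by clustering the chunk embeddings.
K = 8  # assumed hyperparameter
concept_of_chunk = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(chunk_vecs)

# 3) Latent Concept Frequency: per-document concept counts, normalized like TF.
lcf = np.zeros((len(docs), K))
for d_idx, c in zip(owners, concept_of_chunk):
    lcf[d_idx, c] += 1.0
lcf /= np.maximum(lcf.sum(axis=1, keepdims=True), 1.0)

# 4) Inverse Document Frequency over concepts (smoothed, sklearn-style).
df = (lcf > 0).sum(axis=0)
idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0

lcf_idf = lcf * idf  # shape (n_docs, K): one dense vector per document
print(lcf_idf.round(3))
```

Under these assumptions, the resulting dense LCF-IDF vectors could then be fed to any standard classifier (e.g., logistic regression or an SVM) for the SLMC setting the paper evaluates.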
File | Size | Format
---|---|---
Principe-2024-LREC-COLING-PrePrint.pdf (Submitted Version / Pre-print; license: all rights reserved; access: archive managers only) | 398.55 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.