Bicocca Open Archive

Dessi, D., Osborne, F., Reforgiato Recupero, D., Buscaldi, D., Motta, E. (2022). SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain. KNOWLEDGE-BASED SYSTEMS, 258(22 December 2022) [10.1016/j.knosys.2022.109945].

SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain

Dessi D.; Osborne F.; Reforgiato Recupero D.; Buscaldi D.; Motta E.

2022

Abstract

Science communication faces several bottlenecks, including the rising number of published research papers and a document-based paradigm that is not machine-accessible, which makes exploring, reading, and reusing research outcomes rather inefficient. Recently, Knowledge Graphs (KGs), i.e., semantically interlinked networks of entities, have been proposed as a new core technology for describing and curating scholarly information so that it becomes machine-readable and machine-understandable. The main drawback of this technology, however, is that researchers are asked to manually annotate their research papers and add their contributions to the KGs. To address this problem, we propose SCICERO, a novel KG generation approach that takes text from research articles as input and generates a KG of research entities. SCICERO uses Natural Language Processing techniques to parse the content of scientific papers and discover entities and relationships, exploits state-of-the-art Deep Learning Transformer models to make sense of and validate the extracted information, and applies Semantic Web best practices to formally represent the extracted entities and relationships, making the written content of research papers machine-actionable. SCICERO has been tested on a dataset of 6.7M Computer Science papers, generating a KG of about 10M entities. It has been evaluated on a manually generated gold standard of 3,600 triples covering three Computer Science subdomains (Information Retrieval, Natural Language Processing, and Machine Learning), obtaining remarkable results.
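As a rough illustration of the last two stages described in the abstract (validating extracted statements and representing them formally), the sketch below filters candidate triples by a confidence score and serializes the survivors as N-Triples. It is not SCICERO's implementation: the `Statement` class, the `http://example.org/kg/` namespace, the 0.5 threshold, the scores, and the example entities are all assumptions made for illustration, and the score merely stands in for whatever a Transformer-based validator would return.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Hypothetical namespace; the ontology and URIs actually used are not given here.
BASE = "http://example.org/kg/"


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    score: float  # placeholder for a validator's confidence, in [0, 1]


def to_iri(label: str) -> str:
    """Turn a free-text label into an IRI under the hypothetical namespace."""
    return "<" + BASE + quote(label.strip().replace(" ", "_"), safe="") + ">"


def to_ntriples(statements, threshold=0.5):
    """Keep statements scoring at or above the threshold; emit N-Triples lines."""
    return [
        f"{to_iri(s.subject)} {to_iri(s.predicate)} {to_iri(s.obj)} ."
        for s in statements
        if s.score >= threshold
    ]


# Illustrative candidates only; scores are made up for the example.
statements = [
    Statement("neural network", "usesMethod", "back propagation", 0.9),
    Statement("neural network", "usesMethod", "banana", 0.1),
]
ntriples = to_ntriples(statements)
print("\n".join(ntriples))
```

Only the first candidate passes the threshold, so a single line is printed. A real pipeline would likely rely on an RDF library and a published ontology rather than hand-built IRIs.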

Subtype: Journal article - Scientific article
Keywords: Artificial intelligence; Knowledge graph; Scholarly domain; Scientific facts
Content language: English
Ahead-of-print date or first online publication date: 5 October 2022
Publication date: 2022
Journal: KNOWLEDGE-BASED SYSTEMS
Volume number: 258
Issue: 22 December 2022
Article number: 109945
Article DOI: https://dx.doi.org/10.1016/j.knosys.2022.109945
Fulltext: open
Appears in types: 01 - Journal article

Files in this record:

File	Access	Description	Attachment type	License	Size	Format
Dessi-2022-Knowledge Based Sys-preprint.pdf	Open access	Research Article	Submitted Version (Pre-print)	Creative Commons	1.25 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/411695
