Bicocca Open Archive

Computational graph pangenomics: a tutorial on data structures and their applications

Baaijens, Jasmijn A.; Bonizzoni, Paola; Boucher, Christina; Della Vedova, Gianluca; Pirola, Yuri; Rizzi, Raffaella; Sirén, Jouni

2022

Abstract

Computational pangenomics is an emerging research field that is changing how computer scientists approach challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential to the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account both the high variability in population genomes and the specificity of an individual genome in a personalized approach to medicine is rapidly driving the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial for tackling the computational tasks of ambitious healthcare projects that aim to characterize human diversity by sequencing one million individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

Subtype: Journal article - Review Essay
Keywords: pangenomics; algorithms
Content language: English
Ahead-of-print or first online publication date: 4 March 2022
Publication date: 2022
Journal: NATURAL COMPUTING
Volume: 21
Issue: 1
First page: 81
Last page: 108
Article DOI: https://dx.doi.org/10.1007/s11047-022-09882-6
Fulltext: open
Citation: Baaijens, J., Bonizzoni, P., Boucher, C., Della Vedova, G., Pirola, Y., Rizzi, R., et al. (2022). Computational graph pangenomics: a tutorial on data structures and their applications. NATURAL COMPUTING, 21(1), 81-108 [10.1007/s11047-022-09882-6].
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
s11047-022-09882-6.pdf (open access; description: Published Version; attachment type: Publisher’s Version (Version of Record, VoR))	1.54 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/357987
