Graph pangenomics is an emerging field in computational biology that is replacing the traditional view of a reference genome as a linear sequence with a new paradigm: a sequence graph (pangenome graph, or simply pangenome) that represents the main similarities and differences among multiple evolutionarily related genomes. Advances in sequencing technologies produce genome data at a pace that far outstrips the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field still build on notions rooted in the established, and by now quite old, literature on combinatorics on words, formal languages, and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and makes it possible to represent sequence similarities in a sample of reads; the second is that of a sample-specific string, a tool for detecting differences in a sample of reads. New perspectives opened by these two notions are discussed.
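For readers unfamiliar with the Lyndon factorization mentioned in the abstract, the following is a minimal sketch rather than the paper's method. It computes the standard Lyndon factorization with Duval's classical algorithm, whereas the paper discusses variants of this factorization. The function name `lyndon_factorization` and the example strings are illustrative assumptions, not taken from the paper.

```python
from typing import List


def lyndon_factorization(s: str) -> List[str]:
    """Return the standard Lyndon factorization of s (Duval's algorithm).

    The result is the unique sequence of Lyndon words w1 >= w2 >= ... >= wk
    (lexicographically non-increasing) whose concatenation equals s.
    Illustrative sketch only: the paper concerns *variants* of this notion.
    """
    factors = []
    n = len(s)
    i = 0
    while i < n:
        # s[i:j] is the current run of repeated copies of one Lyndon word
        # (possibly followed by a proper prefix of it); k trails j by the
        # length of that word.
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # the run extends to a single, longer Lyndon word
            else:
                k += 1     # still matching the current period
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


if __name__ == "__main__":
    print(lyndon_factorization("GATTACA"))
```

Under the usual lexicographic order on characters, the factors of a string are non-increasing and concatenate back to the original string, which is the property that makes the factorization a candidate signature for comparing reads.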

Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., Zizza, R. (2022). Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes? In Developments in Language Theory (pp. 3-12). Cham: Springer [10.1007/978-3-031-05578-2_1].

Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

Bonizzoni, Paola; Pirola, Yuri; Rizzi, Raffaella
2022

Type: paper
Keywords: Lyndon factorization, pangenomics, bioinformatics, formal languages
Language: English
Conference: 26th International Conference on Developments in Language Theory, DLT 2022, 9 May 2022 through 13 May 2022
Year: 2022
Book title: Developments in Language Theory
ISBN: 978-3-031-05577-5
Volume: 13257
Pages: 3-12
Access: open
Files in this item:
File: main.pdf (open access)
Description: Author submitted version
Attachment type: Submitted Version (Pre-print)
Size: 282.32 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/10281/378700