Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph using heuristics, without stating explicit optimization criteria. Thus, it is unclear how a specific optimization criterion affects the graph topology and downstream analyses, such as read mapping and variant calling. Results: In this paper, by leveraging the notion of maximal blocks in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase. Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
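The exact-cover view described in the abstract can be illustrated on a toy alignment. The sketch below is not the authors' implementation: it assumes a block is a set of rows that share the same string over a column interval, it enumerates all such blocks rather than only maximal ones, and it replaces the ILP with a brute-force search that is only feasible for tiny inputs. The names `candidate_blocks`, `cells`, and `min_weight_exact_cover` are hypothetical.

```python
def candidate_blocks(msa):
    """Enumerate blocks (rows, start, end, label): rows sharing `label` on columns start..end.

    Assumption: this lists every such block, not only the maximal ones.
    """
    n_cols = len(msa[0])
    blocks = []
    for i in range(n_cols):
        for j in range(i, n_cols):
            groups = {}
            for r, seq in enumerate(msa):
                groups.setdefault(seq[i:j + 1], []).append(r)
            for label, rows in groups.items():
                blocks.append((frozenset(rows), i, j, label))
    return blocks


def cells(block):
    """Return the set of MSA cells (row, column) covered by a block."""
    rows, i, j, _ = block
    return {(r, c) for r in rows for c in range(i, j + 1)}


def min_weight_exact_cover(msa, blocks, weight):
    """Brute-force search for a minimum-weight set of blocks covering each cell exactly once.

    Assumption: weights are positive; this stands in for the ILP on toy inputs only.
    """
    universe = {(r, c) for r in range(len(msa)) for c in range(len(msa[0]))}
    block_cells = [(b, cells(b)) for b in blocks]
    best = None

    def search(uncovered, chosen, cost):
        nonlocal best
        if best is not None and cost >= best[0]:
            return  # prune: cannot improve on the best cover found so far
        if not uncovered:
            best = (cost, list(chosen))
            return
        target = min(uncovered)  # some cell that still has to be covered
        for b, cb in block_cells:
            # A block is usable only if it covers the target and overlaps nothing chosen.
            if target in cb and cb <= uncovered:
                chosen.append(b)
                search(uncovered - cb, chosen, cost + weight(b))
                chosen.pop()

    search(universe, [], 0)
    return best


if __name__ == "__main__":
    msa = ["ACGT", "ACCT", "ACGT"]
    blocks = candidate_blocks(msa)
    # Objective 1: minimize the number of selected blocks (graph nodes).
    print(min_weight_exact_cover(msa, blocks, lambda b: 1))
    # Objective 2: minimize the total length of the block labels.
    print(min_weight_exact_cover(msa, blocks, lambda b: len(b[3])))
```

On this toy alignment the two objectives select different covers: minimizing the number of blocks yields 2 full-length blocks, while minimizing total label length yields a cover of cost 5 built from shorter shared blocks. This mirrors, in miniature, the abstract's point that the choice of objective function shapes the resulting graph.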

Avila Cartes, J., Bonizzoni, P., Ciccolella, S., Della Vedova, G., Denti, L. (2024). PangeBlocks: customized construction of pangenome graphs via maximal blocks. BMC BIOINFORMATICS, 25(1) [10.1186/s12859-024-05958-5].

PangeBlocks: customized construction of pangenome graphs via maximal blocks

Avila Cartes, Jorge; Bonizzoni, Paola; Ciccolella, Simone; Della Vedova, Gianluca; Denti, Luca
2024

Type: Journal article - Scientific article
Keywords: Integer linear programming; Maximal blocks; Pangenome graphs; Variation graph construction
Language: English
Publication date: 4 Nov 2024
Year: 2024
Volume: 25
Issue: 1
Article number: 344
Access: open
Files in this item:
Avila Cartes-2024-BMC Bioinformatics-VoR.pdf

open access

Description: This article is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
Size: 2.63 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/524779