Sequencing organisms is now essentially routine, as witnessed during the SARS-CoV-2 pandemic, when millions of viral genomes were sequenced. Indeed, the introduction of Next-Generation Sequencing (NGS) technologies in 2006 made sequencing cheaper and more accessible. Later on, a new sub-area of computational biology, named computational pangenomics, was consolidated to address the intrinsic challenges introduced by the availability of many genomes. In computational pangenomics, a pangenome is a collection of genomic sequences to be analyzed jointly or to be used as a reference. Pangenome graphs have demonstrated their ability to encompass more comprehensive information, notably for crop, bovine, and human data, with important implications for the accurate identification of structural variations, especially when contrasted with conventional linear reference genome assemblies. Pangenomes, either as a graph or as a collection of genomes, inherently capture more variability than a single reference genome. To make the transition from a reference genome as a string to a pangenome graph, it is important to have procedures for the construction of pangenome graphs that are suitable for the application of sequence-to-graph tools. The construction of pangenome graphs has been addressed mainly through heuristics, and the comparison of their quality has relied upon downstream analyses instead of the graph itself. The establishment of an optimal representation of genomic sequences as a graph has been discussed only to extend good properties known for indexing strings to these graphs. In this direction, one task that is not trivial to extend to pangenome graphs is the alignment between graphs. We present pangeblocks, an approach to construct variation graphs starting from a multiple sequence alignment (MSA) that leverages the notion of maximal blocks.
The MSA naturally highlights similarities and differences among a set of genomic sequences, and a block captures a subset of sequences that share a substring over an interval of columns of the MSA. pangeblocks is an Integer Linear Programming approach that finds a tiling of the MSA using blocks. The construction is guided by several objective functions that aim to enforce desired properties of the final graph: the most natural criteria, such as the number of nodes and the length of node labels, and others intended to ensure good properties of the graph for downstream analyses, such as optimizing the number of seeds for sequence-to-graph tools. The second contribution of this thesis combines the best of indexes and deep learning approaches. We exploit an encoding called the Chaos Game Representation of DNA (CGR), on top of which a k-mer-based representation of a sequence, known as the Frequency Matrix of the CGR (FCGR), is built. We develop architectures for exploiting the FCGR and propose an embedding-based index for the largest bacterial dataset, allowing data curation, fast queries at the assembly level, and accurate taxonomic classification at the species and genus levels. In the last part of this thesis, we explore the alignment between variation graphs through the lens of Riemannian geometry, specifically by modeling pangenome graphs as discrete manifolds and proposing a mathematical model, based on an Integer Linear Programming formulation, for assessing isomorphism between manifolds by leveraging the Ricci-Flow algorithm on discrete manifolds.
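As an illustration of the FCGR mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes a particular assignment of nucleotides to the corners of the CGR square (conventions differ between tools), and the names `CORNER`, `kmer_to_cell`, and `fcgr` are hypothetical. Each k-mer is mapped to one cell of a 2^k x 2^k grid, and the matrix counts how often each k-mer occurs in the sequence.

```python
# Assumed corner layout (x_bit, y_bit); other conventions exist.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Return the (row, col) of a k-mer in a 2^k x 2^k grid.

    In the CGR, each new base moves the point halfway toward its corner,
    so the last base of the k-mer selects the coarsest quadrant. We
    therefore read the k-mer from its last base to its first.
    """
    row = col = 0
    for base in reversed(kmer):
        x_bit, y_bit = CORNER[base]
        col = 2 * col + x_bit
        row = 2 * row + y_bit
    return row, col


def fcgr(sequence, k):
    """Count k-mer occurrences into a 2^k x 2^k matrix (list of lists)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip k-mers containing symbols outside A, C, G, T (for example N).
        if all(base in CORNER for base in kmer):
            row, col = kmer_to_cell(kmer)
            matrix[row][col] += 1
    return matrix
```

For example, `fcgr("ACGT", 1)` yields `[[1, 1], [1, 1]]` under the assumed corner layout, since each nucleotide occurs once. A matrix of this kind can be treated as a single-channel image, which is one reason it pairs naturally with deep learning architectures.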
Avila Cartes, J. E. (2025). COMPUTATIONAL METHODS IN EVOLUTION-AWARE PANGENOMICS FOR GRAPH AND SEQUENCE ANALYSES. (Doctoral thesis, 2025).
COMPUTATIONAL METHODS IN EVOLUTION-AWARE PANGENOMICS FOR GRAPH AND SEQUENCE ANALYSES
AVILA CARTES, JORGE EDUARDO
2025
File | Size | Format |
---|---|---|
phd_unimib_892428.pdf (open access; description: Thesis of Avila Cartes Jorge - 892428; attachment type: Doctoral thesis) | 4.61 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.