Bicocca Open Archive

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data

Cozzi, D.; Rossi, M.; Rubinacci, S.; Gagie, T.; Köppl, D.; Boucher, C.; Bonizzoni, P.

2023

Abstract

Motivation: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences so that maximal haplotype matches among h sequences containing w variation sites can be found in O(hw) time. This is a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not support queries over biobank panels consisting of several million haplotypes when the index of the haplotypes must be kept entirely in memory.

Results: In this article, we leverage the notion of the r-index, proposed for the BWT, to present a memory-efficient method for constructing and storing the run-length encoded PBWT and for computing set maximal exact match (SMEM) queries on haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and the UK Biobank. Our experiments demonstrate that μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT adapts techniques developed for the run-length compressed BWT to the PBWT (RLPBWT); it keeps in memory only a succinct representation of the RLPBWT that still allows the efficient computation of SMEMs over the original panel.
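To illustrate why run-length encoding suits the PBWT, the following is a minimal Python sketch of the textbook PBWT column construction for a biallelic panel, followed by a run count per column. It is not the μ-PBWT implementation: the function names (pbwt_columns, run_length_encode, total_runs) and the toy panel are assumptions made for illustration only, and the sketch keeps everything in memory and omits the r-index machinery and SMEM queries.

```python
from typing import List, Tuple


def pbwt_columns(panel: List[List[int]]) -> List[List[int]]:
    """Return the PBWT columns of a biallelic panel (rows are haplotypes).

    At site k the haplotypes are ordered by their reversed prefixes up to
    site k-1; the column is the alleles at site k read in that order.
    """
    if not panel:
        return []
    h, w = len(panel), len(panel[0])
    order = list(range(h))  # ordering before the first site: input order
    columns = []
    for k in range(w):
        columns.append([panel[i][k] for i in order])
        # Stable partition: haplotypes carrying 0 at site k first, then 1.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] == 1]
        order = zeros + ones
    return columns


def run_length_encode(column: List[int]) -> List[Tuple[int, int]]:
    """Encode a column as (allele, run length) pairs."""
    runs: List[Tuple[int, int]] = []
    for allele in column:
        if runs and runs[-1][0] == allele:
            runs[-1] = (allele, runs[-1][1] + 1)
        else:
            runs.append((allele, 1))
    return runs


def total_runs(columns: List[List[int]]) -> int:
    """Total number of runs over all columns."""
    return sum(len(run_length_encode(col)) for col in columns)


if __name__ == "__main__":
    # Hypothetical toy panel: 6 haplotypes, 4 sites.
    panel = [
        [0, 1, 0, 1],
        [1, 1, 0, 1],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
    ]
    raw_columns = [list(col) for col in zip(*panel)]
    print("runs in raw columns: ", total_runs(raw_columns))
    print("runs in PBWT columns:", total_runs(pbwt_columns(panel)))
```

On this toy panel the PBWT ordering groups haplotypes that share a recent history, so the permuted columns contain fewer runs than the raw columns. The number of runs is the quantity that a run-length encoded index depends on.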

	Subtype
				Journal article - Scientific article
	Keywords
				Succinct data structures, Burrows-Wheeler transform, Positional Burrows-Wheeler transform, Pattern matching
	Content language
				English
	Ahead-of-print date or date of first online publication
				9-Sep-2023
	Publication date
				2023
	Journal
				BIOINFORMATICS
	Volume number
				39
	Issue
				9
	Article number
				btad552
	Article DOI
				https://dx.doi.org/10.1093/bioinformatics/btad552
	Alternative URL
				https://academic.oup.com/bioinformatics/article/39/9/btad552/7265394?
	Fulltext
				open
	Citation
				Cozzi, D., Rossi, M., Rubinacci, S., Gagie, T., Köppl, D., Boucher, C., et al. (2023). μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. BIOINFORMATICS, 39(9) [10.1093/bioinformatics/btad552].
	Appears in types:
				01 - Journal article

Files in this item:

File	Size	Format
Cozzi-2023-Bioinformatics-VoR.pdf (open access; description: Article; attachment type: Publisher's Version (Version of Record, VoR); license: Creative Commons)	801.6 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/446438
