Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k+m) main memory, where l is the maximum value in the LCP array.

Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R. (2021). Computing the multi-string BWT and LCP array in external memory. THEORETICAL COMPUTER SCIENCE, 862(16 March 2021), 42-58 [10.1016/j.tcs.2020.11.041].

Computing the multi-string BWT and LCP array in external memory

Bonizzoni, P
;
Della Vedova, G;Pirola, Y;Previtali, M;Rizzi, R
2021

Abstract

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k+m) main memory, where l is the maximum value in the LCP array.
Articolo in rivista - Articolo scientifico
Burrows-Wheeler transform, Longest common prefix array, External-memory algorithms;
Burrows–Wheeler transform; External-memory algorithms; Longest common prefix array;
English
30-nov-2020
2021
862
16 March 2021
42
58
none
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R. (2021). Computing the multi-string BWT and LCP array in external memory. THEORETICAL COMPUTER SCIENCE, 862(16 March 2021), 42-58 [10.1016/j.tcs.2020.11.041].
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/296482
Citazioni
  • Scopus 5
  • ???jsp.display-item.citation.isi??? 3
Social impact