Representations of biological sequences that facilitate sequence comparison are crucial in several bioinformatics tasks. Recently, the Lyndon factorization has been shown to preserve common factors in overlapping reads, which suggests using factorizations of sequences to define measures of similarity between reads. In this paper we propose the fingerprint as a signature of sequencing reads, i.e., the sequence of lengths of consecutive factors in Lyndon-based factorizations of the reads. Surprisingly, fingerprints of reads are effective in preserving sequence similarities while providing a compact representation of the read; consequently, k-mers extracted from a fingerprint, called k-fingers, can be used to capture sequence similarity between reads. We first provide a probabilistic framework to estimate the behaviour of fingerprints. We then experimentally evaluate the effectiveness of this representation for machine learning algorithms that classify biological sequences. In particular, we consider the problem of assigning RNA-Seq reads to the most likely gene from which they were generated. Our results show that fingerprints can provide an effective, interpretable representation for machine learning that successfully preserves sequence similarity.
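To make the definitions in the abstract concrete, the following is a minimal sketch of how a fingerprint and its k-fingers could be computed. It assumes the standard Lyndon factorization computed with Duval's algorithm; the paper speaks more generally of "Lyndon-based factorizations", so this may not match the exact variants the authors use. The function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def lyndon_factorization(s: str) -> List[str]:
    """Standard Lyndon factorization via Duval's algorithm (linear time).

    Returns Lyndon words w1 >= w2 >= ... >= wm (lexicographically
    non-increasing) whose concatenation is s.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while s[i:j] is a prefix of a power
        # of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # strictly larger symbol: the whole run is one Lyndon word
            else:
                k += 1     # equal symbol: keep matching the current period
            j += 1
        # Emit every full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read: str) -> List[int]:
    """Sequence of lengths of consecutive factors of the read."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(fp: List[int], k: int) -> List[Tuple[int, ...]]:
    """All windows of k consecutive lengths in a fingerprint (k-mers of the fingerprint)."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


if __name__ == "__main__":
    read = "GATTACA"                     # toy read, not real data
    print(lyndon_factorization(read))    # ['G', 'ATT', 'AC', 'A']
    print(fingerprint(read))             # [1, 3, 2, 1]
    print(k_fingers(fingerprint(read), 2))
```

On the toy read GATTACA the factors are G, ATT, AC, A, so the fingerprint is the integer sequence 1, 3, 2, 1, and its 2-fingers are (1, 3), (3, 2), (2, 1). Two reads can then be compared through the k-fingers they share rather than through their nucleotide sequences.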

Bonizzoni, P., De Felice, C., Petescia, A., Pirola, Y., Rizzi, R., Stoye, J., et al. (2021). Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning. In Algorithms for Computational Biology: 8th International Conference, AlCoB 2021, Missoula, MT, USA, June 7–11, 2021, Proceedings (pp. 16-28). Cham: Springer Science and Business Media Deutschland GmbH [10.1007/978-3-030-74432-8_2].

Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning

Bonizzoni, Paola; Pirola, Yuri; Rizzi, Raffaella
2021

Type: paper
Keywords: Lyndon factorization; Machine learning; Read representation; Sequence analysis; Sequence mining
Language: English
Conference: 8th International Conference on Algorithms for Computational Biology, AlCoB 2021
Conference year: 2021
Editors: Martín-Vide, C; Vega-Rodríguez, MA; Wheeler, T
Book title: Algorithms for Computational Biology: 8th International Conference, AlCoB 2021, Missoula, MT, USA, June 7–11, 2021, Proceedings
ISBN: 978-3-030-74431-1
Publication year: 2021
Volume: 12715
Pages: 16-28
Access: partially open
Files in this item:

conf-paper-21-alcob.pdf
Access: archive managers only
Attachment type: Publisher's Version (Version of Record, VoR)
Size: 316.68 kB
Format: Adobe PDF

postprint.pdf
Access: open access
Description: Author's Accepted Manuscript (AAM)
Attachment type: Author's Accepted Manuscript, AAM (Post-print)
Size: 402.47 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/315998