Bicocca Open Archive

The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties contrary to existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.

Chourasia, P., Ali, S., Ciccolella, S., Della Vedova, G., Patterson, M. (2023). Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. JOURNAL OF COMPUTATIONAL BIOLOGY, 30(4), 469-491 [10.1089/cmb.2022.0424].

Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data

Chourasia P.;Ali S.;Ciccolella S.;Della Vedova G.;Patterson M.

2023

Abstract

The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties contrary to existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.

Scheda breve

Scheda completa

Scheda completa (DC)

	Sottotipologia
	
				Articolo in rivista - Articolo scientifico
			
	Parole chiave
	
				alignment-free; assembly; classification; clustering; high-Throughput sequencing; SARS-CoV-2;
			
	Lingua del contenuto
	
				English
			
	Data ahead of print o Data prima pubblicazione Online
	
				18-apr-2023
			
	Data di pubblicazione
	
				2023
			
	Rivista
	
				JOURNAL OF COMPUTATIONAL BIOLOGY
			
	Numero del volume
	
				30
			
	Fascicolo
	
				4
			
	Pagina iniziale
	
				469
			
	Pagina finale
	
				491
			
	DOI dell'articolo
	
				https://dx.doi.org/10.1089/cmb.2022.0424
			
	Fulltext
	
				none
			
	Citazione
	
				Chourasia, P., Ali, S., Ciccolella, S., Della Vedova, G., Patterson, M. (2023). Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. JOURNAL OF COMPUTATIONAL BIOLOGY, 30(4), 469-491 [10.1089/cmb.2022.0424].
			
	Appare nelle tipologie:
	
				01 - Articolo su rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/516382

Citazioni

6

3

Social impact