Chourasia, P., Ali, S., Ciccolella, S., Della Vedova, G., Patterson, M. (2022). Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data. In Computational Advances in Bio and Medical Sciences: 11th International Conference, ICCABS 2021 (pp. 133-148). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-031-17531-2_11].
Clustering SARS-CoV-2 Variants from Raw High-Throughput Sequencing Reads Data
Ciccolella, S; Della Vedova, G
2022
Abstract
The massive amount of genomic data appearing over the past two years for SARS-CoV-2 has challenged traditional methods for studying the dynamics of the COVID-19 pandemic. As a result, new methods, such as the Pangolin tool, have appeared which can scale to the millions of samples of SARS-CoV-2 currently available. Such a tool is tailored to take assembled, aligned and curated full-length sequences, such as those provided by GISAID, as input. As high-throughput sequencing technologies continue to advance, such assembly, alignment and curation may become a bottleneck, creating a need for methods which can process raw sequencing reads directly. In this paper, we propose several alignment-free embedding approaches, which can generate a fixed-length feature vector representation directly from the raw sequencing reads, without the need for assembly. Moreover, because such an embedding is a numerical representation, it can be passed to already highly optimized clustering methods such as k-mea...
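To make the idea of a fixed-length, alignment-free representation concrete, here is a minimal sketch in Python. It assumes a normalized k-mer frequency vector as the embedding, which is one common alignment-free choice; the abstract does not say which embeddings the paper actually uses, so this is an illustration and not the authors' method. The function name `kmer_embedding`, the parameter `k`, and the toy reads are all hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def kmer_embedding(reads, k=3):
    """Map a collection of raw reads to a normalized k-mer frequency vector.

    The vector always has length 4**k, regardless of how many reads there are
    or how long they are, and no assembly or alignment step is involved.
    """
    # Fixed position for every possible k-mer, in lexicographic order.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            # k-mers containing N or other ambiguous symbols are skipped.
            slot = index.get(read[i:i + k])
            if slot is not None:
                counts[slot] += 1
    total = sum(counts)
    # Normalize so that samples with different sequencing depth are comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical toy sample: two short reads, one containing ambiguous bases.
vec = kmer_embedding(["ACGTAC", "NNACG"], k=2)
```

Because every sample yields a vector of the same length, the vectors for many samples can be stacked into a matrix and handed to an off-the-shelf clustering routine.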