Piccoli, F., Olearo, L., Bianco, S. (2022). A comparison of temporal aggregators for speaker verification. In IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin (pp.1-6). IEEE Computer Society [10.1109/ICCE-Berlin56473.2022.9937132].
A comparison of temporal aggregators for speaker verification
Piccoli F.; Bianco S.
2022
Abstract
Speaker verification is the task of examining a speech signal to decide whether the identity claimed by a speaker is true or false. To handle utterances of different lengths and to accumulate information along the time dimension, various temporal aggregators have been proposed for use inside speaker verification pipelines. In this paper we investigate how five state-of-the-art temporal aggregators, namely Temporal Average Pooling (TAP), Global Statistical Pooling (GSP), Self-Attentive Pooling (SAP), Attentive Statistical Pooling (ASP), and Vector of Locally Aggregated Descriptors (VLAD), behave as the lengths of the two utterances being compared vary. Building on a state-of-the-art speaker verification method, our experiments on the VoxCeleb2 dataset show that there is a sweet spot for utterance length at which speaker verification performance is higher, regardless of the temporal aggregator used.
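To illustrate what a temporal aggregator does, the following is a minimal NumPy sketch of the two simplest aggregators named in the abstract, TAP and GSP, as they are commonly defined (mean over time, and mean concatenated with standard deviation over time). The function names, the feature dimension, and the random inputs are assumptions made for illustration and are not taken from the paper; the attentive variants (SAP, ASP) and VLAD are omitted.

```python
import numpy as np


def temporal_average_pooling(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame-level features into a (D,) vector
    by averaging over the time axis."""
    return frames.mean(axis=0)


def global_statistical_pooling(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence into a (2 * D,) vector by concatenating the
    per-dimension mean and standard deviation over the time axis."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


# Hypothetical inputs: two utterances with different numbers of frames (T)
# but the same feature dimension (D = 8). Values are random placeholders.
rng = np.random.default_rng(0)
short_utterance = rng.normal(size=(50, 8))
long_utterance = rng.normal(size=(400, 8))

# Both utterances map to vectors of the same size, whatever their length.
tap_short = temporal_average_pooling(short_utterance)
tap_long = temporal_average_pooling(long_utterance)
gsp_short = global_statistical_pooling(short_utterance)
gsp_long = global_statistical_pooling(long_utterance)

print(tap_short.shape, tap_long.shape)
print(gsp_short.shape, gsp_long.shape)
```

Because the output size depends only on the feature dimension, utterances of any length can be compared with a fixed-size representation, which is what allows the effect of utterance length itself to be studied.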