Alchieri, L., Celona, L., Bianco, S. (2024). Video-Based Emotion Estimation Using Deep Neural Networks: A Comparative Study. In Image Analysis and Processing - ICIAP 2023 Workshops, Udine, Italy, September 11–15, 2023, Proceedings, Part I (pp. 255-269). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-031-51023-6_22].
Video-Based Emotion Estimation Using Deep Neural Networks: A Comparative Study
Celona, Luigi; Bianco, Simone
2024
Abstract
In this study, we investigate the effectiveness of deep neural networks in predicting valence and arousal solely from the visual information of video sequences. Several recent Convolutional Neural Network (CNN) and Transformer architectures are used as the backbone of the proposed model. We also assess the impact of pre-training on model performance by comparing models trained from scratch with pre-trained ones. Experimental results on the One-Minute Gradual-Emotion Recognition Challenge dataset suggest that pre-training on emotion recognition datasets is beneficial for most models. Comparison with the state of the art reveals similar performance on valence Concordance Correlation Coefficient (CCC) and lower performance on arousal CCC. However, the predictions in our experiments are not statistically different in most cases. The study concludes by emphasizing the complexity of video emotion recognition and the need for further research to enhance the robustness and accuracy of emotion recognition models. The source code used for the experiments is made publicly available.