Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision

Mirko Agarla; Luigi Celona; Raimondo Schettini
2023

Abstract

Video memorability prediction aims to quantify how well a given video will be remembered over time. The main attributes driving memorability are not yet fully understood, and many methods in the literature rely on features extracted from content recognition models. In this paper we demonstrate that features extracted from a model trained with natural language supervision are effective for estimating video memorability. The proposed method exploits a Vision Transformer pretrained using Contrastive Language-Image Pretraining (CLIP) to encode video frames. A temporal attention mechanism then selects and aggregates the relevant frame representations into a video-level feature vector. Finally, a multi-layer perceptron maps the video-level features to a memorability score. We test several types of encoding and temporal aggregation modules and submit our best solution to the MediaEval 2022 Predicting Media Memorability task. We achieve a correlation of 0.707 on subtask 1 (i.e., the Memento10k dataset). On subtask 2, we obtain a Pearson correlation of 0.487 when training on Memento10k and testing on VideoMem, and of 0.529 when training on VideoMem and testing on Memento10k.
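As a rough illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch. It assumes frame embeddings have already been extracted with a CLIP Vision Transformer (e.g., 512-dimensional ViT-B/32 features); the module names, hidden sizes, and the sigmoid output range are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: CLIP frame embeddings -> temporal attention pooling -> MLP score.
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Scores each frame embedding and takes a softmax-weighted average over time."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim)
        weights = torch.softmax(self.scorer(x), dim=1)  # (batch, num_frames, 1)
        return (weights * x).sum(dim=1)                 # (batch, dim)

class MemorabilityHead(nn.Module):
    """Aggregates frame features and regresses a memorability score in [0, 1]."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.pool = TemporalAttentionPooling(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: precomputed CLIP ViT embeddings of sampled frames, (batch, T, dim)
        video_feat = self.pool(frame_feats)
        return self.mlp(video_feat).squeeze(-1)

# Usage with dummy tensors standing in for CLIP ViT-B/32 frame embeddings.
model = MemorabilityHead(dim=512)
scores = model(torch.randn(4, 8, 512))  # 4 videos, 8 sampled frames each
print(scores.shape)  # torch.Size([4])
```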
Type: slide + paper
Keywords: Video memorability; CLIP; Temporal attention module
Language: English
Event: MediaEval 2022 Workshop, 12-13 January 2023
Published in: Working Notes Proceedings of the MediaEval 2022 Workshop, CEUR-WS, Vol. 3583, 2023
URL: https://2022.multimediaeval.com/paper2382.pdf
Citation: Agarla, M., Celona, L., & Schettini, R. (2023). Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision. In Working Notes Proceedings of the MediaEval 2022 Workshop. CEUR-WS.
Files in this record: No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/10281/403112
Citations
  • Scopus: 1
  • Web of Science (ISI): ND