Full text: PDF (Portable Document Format). A PDF file viewer such as GSview, Xpdf or Adobe Acrobat Reader is required. Download (1MB)
ORCID: https://orcid.org/0009-0004-2769-9752; Esteban Romero, Sergio, ORCID: https://orcid.org/0009-0008-6336-7877; Gil Martín, Manuel, ORCID: https://orcid.org/0000-0002-4285-6224; and Fernández Martínez, Fernando, ORCID: https://orcid.org/0000-0003-3877-0089 (2026). A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction. "Multimedia Tools and Applications", v. 85 (n. 30); ISSN 1380-7501. https://doi.org/10.1007/s11042-026-21260-3.
| Title: | A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction |
|---|---|
| Author(s): | |
| Document Type: | Article |
| Journal/Publication Title: | Multimedia Tools and Applications |
| Date: | 27 January 2026 |
| ISSN: | 1380-7501 |
| Volume: | 85 |
| Issue: | 30 |
| Subjects: | |
| SDGs: | |
| Informal Keywords: | Video memorability prediction; contrastive language image pre-training (CLIP); multimodal content analysis; semantic knowledge integration |
| School: | E.T.S.I. Telecomunicación (UPM) |
| Department: | Ingeniería Electrónica |
| Creative Commons Licenses: | Attribution |
Video memorability prediction has emerged as a key challenge for improving information retrieval, content design, and user engagement. Prior work has shown that semantic cues play a crucial role in determining memorability, with recent studies leveraging Contrastive Language-Image Pre-training (CLIP) encoders to incorporate semantic information. However, the specific improvements attributable to CLIP models remain unclear, as few studies systematically compare their performance against equivalent unimodal encoders or explore fine-tuning strategies. This work addresses that gap through a comprehensive, controlled evaluation of CLIP-based and unimodal encoders for video memorability prediction. We propose FCLIP, a domain-adapted extension of CLIP that undergoes additional contrastive pre-training on memorability-specific image-text pairs. Our experiments assess both feature extraction and supervised fine-tuning, ensuring fair comparisons across models with matched architecture and parameter count. Results show that FCLIP image encoders achieve a Spearman Rank Correlation Coefficient (SRCC) of 0.672 on the Memento10k dataset, significantly outperforming unimodal Vision Transformers. FCLIP text encoders similarly outperform unimodal baselines, reaching an SRCC of 0.632. These findings demonstrate that contrastive learning and domain adaptation substantially improve memorability prediction, highlighting the importance of semantic and multimodal pre-training in developing advanced content analysis systems.
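The abstract reports model quality as a Spearman Rank Correlation Coefficient (SRCC) between predicted and ground-truth memorability scores (e.g. 0.672 on Memento10k). As an illustrative sketch only — not code from the paper — SRCC can be computed by ranking both score lists (averaging ranks over ties) and taking the Pearson correlation of the ranks:

```python
def ranks(values):
    """Return 1-based ranks of `values`, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r


def spearman_srcc(predicted, target):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(predicted), ranks(target)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because SRCC depends only on ranks, any monotonic transformation of the predictions leaves it unchanged, which is why it is the usual metric for memorability regression: `spearman_srcc([1, 2, 3, 4], [1, 4, 9, 16])` is exactly 1.0.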
| Record ID: | 94134 |
|---|---|
| DC Identifier: | https://oa.upm.es/94134/ |
| OAI Identifier: | oai:oa.upm.es:94134 |
| Scientific Portal URL: | https://portalcientifico.upm.es/es/ipublic/item/10449384 |
| DOI: | 10.1007/s11042-026-21260-3 |
| Official URL: | https://link.springer.com/article/10.1007/s11042-0... |
| Deposited by: | iMarina Portal Científico |
| Deposited on: | 19 Feb 2026 08:35 |
| Last Modified: | 19 Feb 2026 08:35 |