A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction

Martín Fernández, Iván ORCID: https://orcid.org/0009-0004-2769-9752, Esteban Romero, Sergio ORCID: https://orcid.org/0009-0008-6336-7877, Gil Martín, Manuel ORCID: https://orcid.org/0000-0002-4285-6224 and Fernández Martínez, Fernando ORCID: https://orcid.org/0000-0003-3877-0089 (2026). A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction. "Multimedia Tools and Applications", v. 85 (n. 30); ISSN 1380-7501. https://doi.org/10.1007/s11042-026-21260-3.

Description

Title: A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction
Author(s): Martín Fernández, Iván; Esteban Romero, Sergio; Gil Martín, Manuel; Fernández Martínez, Fernando
Document Type: Article
Journal/Publication Title: Multimedia Tools and Applications
Date: 27 January 2026
ISSN: 1380-7501
Volume: 85
Issue: 30
Subjects:
SDGs:
Informal Keywords: Video memorability prediction; contrastive language image pre-training (CLIP); multimodal content analysis; semantic knowledge integration
School: E.T.S.I. Telecomunicación (UPM)
Department: Electronic Engineering
Creative Commons Licenses: Attribution

Full text

10449384.pdf (PDF, 1 MB)

Abstract

Video memorability prediction has emerged as a key challenge for improving information retrieval, content design, and user engagement. Prior work has shown that semantic cues play a crucial role in determining memorability, with recent studies leveraging Contrastive Language-Image Pre-training (CLIP) encoders to incorporate semantic information. However, the specific improvements attributable to CLIP models remain unclear, as few studies systematically compare their performance against equivalent unimodal encoders or explore fine-tuning strategies. This work addresses that gap through a comprehensive, controlled evaluation of CLIP-based and unimodal encoders for video memorability prediction. We propose FCLIP, a domain-adapted extension of CLIP that undergoes additional contrastive pre-training on memorability-specific image-text pairs. Our experiments assess both feature extraction and supervised fine-tuning, ensuring fair comparisons across models with matched architecture and parameter count. Results show that FCLIP image encoders achieve a Spearman Rank Correlation Coefficient (SRCC) of 0.672 on the Memento10k dataset, significantly outperforming unimodal Vision Transformers. FCLIP text encoders similarly outperform unimodal baselines, reaching an SRCC of 0.632. These findings demonstrate that contrastive learning and domain adaptation substantially improve memorability prediction, highlighting the importance of semantic and multimodal pre-training in developing advanced content analysis systems.
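The abstract describes two technical ingredients: CLIP-style contrastive pre-training on image-text pairs (the basis of the proposed FCLIP) and evaluation of memorability predictions with the Spearman Rank Correlation Coefficient (SRCC). The sketch below is only an illustrative approximation of those ideas, not the authors' released code: it implements the standard symmetric contrastive loss popularized by CLIP and scores a toy linear regression head on frozen features with scipy's spearmanr. All data, embedding dimensions, and hyperparameters here are placeholder assumptions.

# Minimal sketch (not the authors' implementation): CLIP-style symmetric
# contrastive loss over paired image/text embeddings, plus SRCC evaluation
# of a simple linear head trained on frozen features. Data are random
# placeholders standing in for FCLIP embeddings and memorability scores.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr


def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities, as in CLIP.

    img_emb, txt_emb: (batch, dim) embeddings of paired images and captions;
    matching pairs share the same row index.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def evaluate_srcc(features: torch.Tensor, scores: torch.Tensor) -> float:
    """Fit a linear regression head on frozen features, report Spearman's rho."""
    head = torch.nn.Linear(features.size(1), 1)
    optim = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(200):                                    # tiny training loop
        optim.zero_grad()
        pred = head(features).squeeze(-1)
        F.mse_loss(pred, scores).backward()
        optim.step()
    with torch.no_grad():
        pred = head(features).squeeze(-1)
    rho, _ = spearmanr(pred.numpy(), scores.numpy())
    return float(rho)


if __name__ == "__main__":
    torch.manual_seed(0)
    img, txt = torch.randn(32, 512), torch.randn(32, 512)  # toy embedding batch
    print("contrastive loss:", clip_contrastive_loss(img, txt).item())
    feats, mem = torch.randn(256, 512), torch.rand(256)    # toy features/labels
    print("SRCC:", evaluate_srcc(feats, mem))

In the paper's setting, the contrastive loss would be applied during additional pre-training on memorability-specific image-text pairs, and SRCC would be computed between predicted and annotated memorability scores on a held-out split such as the Memento10k test set; the example above evaluates on its own toy training data purely for brevity.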

Associated projects

Type: Horizon Europe
Code: 101071191
Acronym: ASTOUND
Principal Investigator: Not specified
Title: Improving social competences of virtual agents through artificial consciousness based on the Attention Schema Theory

Type: Government of Spain
Code: PID2020-118112RB-C22
Acronym: GOMINOLA
Principal Investigator: Not specified
Title: User-aware, adaptive and socio-affective conversational agents based on microservices

Type: Government of Spain
Code: PID2023-150584OB-C21
Acronym: TRUSTBOOST
Principal Investigator: Not specified
Title: Harmonizing flexibility and compliance in conversational artificial intelligence systems

Type: Government of Spain
Code: PID2021-126061OB-C43
Acronym: BeWORD
Principal Investigator: Not specified
Title: Discovering meaning and intent beyond the spoken word: towards an intelligent environment for addressing multimedia documents

More information

Record ID: 94134
DC Identifier: https://oa.upm.es/94134/
OAI Identifier: oai:oa.upm.es:94134
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10449384
DOI: 10.1007/s11042-026-21260-3
Official URL: https://link.springer.com/article/10.1007/s11042-0...
Deposited by: iMarina Portal Científico
Deposited on: 19 Feb 2026 08:35
Last Modified: 19 Feb 2026 08:35