Synthesizing olfactory understanding: multimodal language models for image-text smell matching

Esteban Romero, Sergio ORCID: https://orcid.org/0009-0008-6336-7877, Martín Fernández, Iván ORCID: https://orcid.org/0009-0004-2769-9752, Gil Martín, Manuel ORCID: https://orcid.org/0000-0002-4285-6224 and Fernández Martínez, Fernando ORCID: https://orcid.org/0000-0003-3877-0089 (2025). Synthesizing olfactory understanding: multimodal language models for image-text smell matching. "Symmetry", v. 17 (n. 8); p. 1349. ISSN 2073-8994. https://doi.org/10.3390/sym17081349.

Description

Title: Synthesizing olfactory understanding: multimodal language models for image-text smell matching
Author(s):
Document type: Article
Journal/Publication title: Symmetry
Date: 18 August 2025
ISSN: 2073-8994
Volume: 17
Issue: 8
Subjects:
SDGs:
Informal keywords: Olfactory understanding; multimodal perception; Contrastive Language–Image Pretraining (CLIP); Multimodal Large Language Models (MM-LLMs)
School: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería Electrónica
Creative Commons license: Attribution - NoDerivatives - NonCommercial

Abstract

Olfactory information, crucial for human perception, is often underrepresented compared to visual and textual data. This work explores methods for understanding smell descriptions within a multimodal context, where scent information is conveyed indirectly through text and images. We address the challenges of the Multimodal Understanding of Smells in Texts and Images (MUSTI) task by proposing novel approaches that leverage language-specific models and state-of-the-art multimodal large language models (MM-LLMs). Our core contribution is a multimodal framework using language-specific encoders for text and image data. This allows for a joint embedding space that explores the semantic symmetry between smells, texts, and images to identify olfactory-related connections shared across the modalities. While ensemble learning with language-specific models achieved good performance, MM-LLMs demonstrated exceptional potential. Fine-tuning a quantized version of the Qwen-VL-Chat model achieved a state-of-the-art macro F1-score of 0.7618 on the MUSTI task. This highlights the effectiveness of MM-LLMs in capturing task requirements and adapting to specific formats.
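The headline result in the abstract is a macro F1-score. As a reminder of what that metric measures, here is a minimal sketch that computes it by hand: F1 is calculated separately for each class and the per-class values are averaged with equal weight, so a rare class counts as much as a frequent one. The binary label encoding (1 = text and image share a smell source, 0 = they do not) and the toy labels are assumptions made for illustration; they are not data or code from the paper.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for a single class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention: F1 is 0 when the class never appears in labels or predictions.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        raise ValueError("no labels provided")
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels (assumed encoding: 1 = match, 0 = no match).
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")  # class 1 F1 = 0.50, class 0 F1 = 0.75
```

In this toy case the macro average is 0.625, lower than the accuracy of 4/6, because the minority class is weighted equally with the majority class.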

Associated projects

Type | Code | Acronym | Lead | Title
Horizon Europe | 101071191 | ASTOUND | Unspecified | Improving social competences of virtual agents through artificial consciousness based on the Attention Schema Theory
Government of Spain | PID2023-150584OB-C21 | TRUSTBOOST | Unspecified | Armonizando Flexibilidad y Conformidad en Sistemas de Inteligencia Artificial Conversacional
Government of Spain | PID2020-118112RB-C22 | GOMINOLA | Unspecified | Agentes conversacionales sensibles a usuario, adaptativos y socio-afectivos basados en microservicios
Government of Spain | PID2021-126061OB-C43 | BEWORD | Unspecified | Descubriendo el significado y la intención más allá de la palabra hablada: hacia un entorno inteligente para abordar los documentos multimedia

More information

Record ID: 90955
DC identifier: https://oa.upm.es/90955/
OAI identifier: oai:oa.upm.es:90955
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10384303
DOI: 10.3390/sym17081349
Official URL: https://www.mdpi.com/2073-8994/17/8/1349
Deposited by: iMarina Portal Científico
Deposited on: 02 Oct 2025 09:23
Last modified: 09 Apr 2026 14:42