Exploring Dimensionality Reduction Techniques in Multilingual Transformers

Huertas Tato, Javier (ORCID: https://orcid.org/0000-0003-4127-5505); Huertas García, Álvaro (ORCID: https://orcid.org/0000-0003-2165-0144); Martín García, Alejandro (ORCID: https://orcid.org/0000-0002-0800-7632); Camacho Fernández, David (ORCID: https://orcid.org/0000-0002-5051-3475) (2022). Exploring Dimensionality Reduction Techniques in Multilingual Transformers. "Cognitive Computation", v. 15, pp. 590-612. ISSN 1866-9956. https://doi.org/10.1007/s12559-022-10066-8.

Description

Title: Exploring Dimensionality Reduction Techniques in Multilingual Transformers
Author(s): Huertas Tato, Javier; Huertas García, Álvaro; Martín García, Alejandro; Camacho Fernández, David
Document Type: Article
Journal/Publication Title: Cognitive Computation
Date: 29 October 2022
ISSN: 1866-9956
Volume: 15
Subjects:
SDGs:
Uncontrolled Keywords: Dimensionality Reduction, Natural Language Processing, Semantic Textual Similarity, Multilingual Transformers, Language models
School: E.T.S.I. de Sistemas Informáticos (UPM)
Department: Sistemas Informáticos
Creative Commons License: Attribution

Full Text

PDF (9974149.pdf) - Download (916 kB)

Abstract

In the scientific literature and in industry, semantic and context-aware Natural Language Processing solutions have been gaining importance in recent years. The possibilities and performance shown by these models when dealing with complex Human Language Understanding tasks are unquestionable, from conversational agents to the fight against disinformation in social networks. In addition, considerable attention is being paid to developing multilingual models to tackle the language bottleneck. The growing need for models implementing all these features has been accompanied by an increase in their size, with little restraint in the number of dimensions required. This paper provides a comprehensive account of the impact of a wide variety of dimensionality reduction techniques on the performance of different state-of-the-art multilingual Siamese transformers, including unsupervised techniques such as linear and nonlinear feature extraction, feature selection, and manifold learning. To evaluate the effects of these techniques, we considered the multilingual extended version of the Semantic Textual Similarity Benchmark (mSTSb) and two baseline approaches, one using the embeddings from the pre-trained version of five models and another using their fine-tuned STS versions. The results show that it is possible to achieve an average reduction of 91.58% ± 2.59% in the number of dimensions of embeddings from pre-trained models, requiring a fitting time 96.68% ± 0.68% faster than the fine-tuning process. In addition, we achieve a 54.65% ± 32.20% dimensionality reduction in embeddings from fine-tuned models. The results of this study contribute to the understanding of how different tuning approaches affect performance on semantic-aware tasks, how dimensionality reduction techniques handle the high-dimensional embeddings computed for the STS task, and their potential for other highly demanding NLP tasks.
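
To make the evaluated pipeline concrete, the following minimal Python sketch reproduces its shape: multilingual sentence embeddings are projected to fewer dimensions with an unsupervised reducer (here PCA, one of the linear feature-extraction techniques the paper considers), and STS performance is scored as the Spearman correlation between the cosine similarities of sentence pairs and their gold labels. The model checkpoint, toy sentence pairs, and component count are illustrative assumptions, not the paper's exact five-model mSTSb setup.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.decomposition import PCA
    from sentence_transformers import SentenceTransformer

    # Any multilingual Siamese sentence transformer can stand in here; this
    # public checkpoint is an assumption, not necessarily one of the paper's five.
    model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

    # Hypothetical cross-lingual STS pairs with gold similarity scores in [0, 5].
    pairs = [
        ("A man is playing a guitar.", "Un hombre toca la guitarra.", 5.0),
        ("A woman is cooking dinner.", "Un perro corre por el parque.", 0.5),
        ("Children are playing soccer.", "Unos niños juegan al fútbol.", 4.8),
        ("The stock market fell today.", "Hoy hace un día soleado.", 0.2),
    ]
    sents_a, sents_b, gold = zip(*pairs)
    emb_a = model.encode(list(sents_a))  # shape (n, 512) for this checkpoint
    emb_b = model.encode(list(sents_b))

    # Fit the unsupervised reducer on the pooled embeddings, then project both
    # sides of each pair. n_components is capped by the number of fitting
    # samples; with a real mSTSb training split one would keep far more.
    reducer = PCA(n_components=4).fit(np.vstack([emb_a, emb_b]))
    red_a, red_b = reducer.transform(emb_a), reducer.transform(emb_b)

    def cosine_rows(u, v):
        # Row-wise cosine similarity between paired embeddings.
        return np.sum(u * v, axis=1) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
        )

    # STS score = Spearman correlation of predicted vs. gold similarities,
    # compared before and after dimensionality reduction.
    for name, (a, b) in {"full 512-d": (emb_a, emb_b), "PCA 4-d": (red_a, red_b)}.items():
        rho, _ = spearmanr(cosine_rows(a, b), gold)
        print(f"{name}: Spearman rho = {rho:.3f}")

Other reducer families studied in the paper (e.g. kernel PCA, ICA, feature selection, manifold learners) expose the same fit/transform interface in scikit-learn, so they can be swapped into this sketch unchanged.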

Associated Projects

Type                 Code                  Acronym            Responsible     Title
Gobierno de España   PID2020-117263GB-100  FightDIS           Not specified   Not specified
Horizonte Europa     PLEC2021-007681       XAI-Disinfodemics  Not specified   Not specified
Comunidad de Madrid  S2018/TCS-4566        CYNAMON            Not specified   Not specified
Horizonte Europa     2020-EU-IA-0252       IBERIFIER          Not specified   Not specified

More Information

Record ID: 88877
DC Identifier: https://oa.upm.es/88877/
OAI Identifier: oai:oa.upm.es:88877
Research Portal URL: https://portalcientifico.upm.es/es/ipublic/item/9974149
DOI: 10.1007/s12559-022-10066-8
Official URL: https://link.springer.com/article/10.1007/s12559-0...
Deposited by: iMarina Portal Científico
Deposited on: 05 May 2025 14:59
Last Modified: 05 May 2025 15:29