Establishing vocabulary tests as a benchmark for evaluating large language models

Martínez Ruiz, Gonzalo ORCID: https://orcid.org/0000-0002-9125-6225, Conde Díaz, Javier ORCID: https://orcid.org/0000-0002-5304-0626, Merino Gómez, Elena ORCID: https://orcid.org/0000-0003-4129-4626, Bermúdez Margaretto, Beatriz, Hernández Gutiérrez, José Alberto ORCID: https://orcid.org/0000-0002-9551-4308, Reviriego Vasallo, Pedro ORCID: https://orcid.org/0000-0003-2540-5234 and Brysbaert, Marc ORCID: https://orcid.org/0000-0002-3645-3189 (2024). Establishing vocabulary tests as a benchmark for evaluating large language models. "PLOS ONE", v. 19 (n. 12); pp. 1-17. https://doi.org/10.1371/journal.pone.0308259.

Description

Title: Establishing vocabulary tests as a benchmark for evaluating large language models
Author(s):
Document Type: Article
Journal/Publication Title: PLOS ONE
Date: December 2024
Volume: 19
Issue: 12
Subjects:
Informal Keywords: AI, LLMs, Evaluation
School: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería de Sistemas Telemáticos
UPM Research Group: Internet de Nueva Generación
Creative Commons License: Attribution

Full text

PDF: journal_pone_0308259.pdf
Download (723kB)

Abstract

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs’ language skills.
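One common vocabulary-test format is multiple-choice synonym matching. As a minimal, hypothetical sketch of how such an item could be posed to an LLM and scored (the item, options, and helper functions below are illustrative, not the paper's actual test materials):

```python
# Hypothetical sketch: formatting a multiple-choice vocabulary item as an
# LLM prompt and scoring the answers. Item content is illustrative only.

def build_prompt(word, options):
    """Format a vocabulary item as a multiple-choice question for an LLM."""
    lines = [f"Which option is closest in meaning to '{word}'?"]
    for label, option in zip("ABCD", options):
        lines.append(f"{label}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(answers, key):
    """Proportion of items answered correctly."""
    correct = sum(a == k for a, k in zip(answers, key))
    return correct / len(key)

prompt = build_prompt("lucid", ["clear", "heavy", "distant", "bitter"])
# Scoring three hypothetical model answers against an answer key:
accuracy = score(["A", "C", "B"], ["A", "B", "B"])  # 2 of 3 correct
```

Because items like this can be generated and graded automatically, the format scales to large item banks and multiple languages, which is the opportunity the abstract points to.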

Associated projects

Type | Code | Acronym | PI | Title
Universidad Politécnica de Madrid | Cybertutor | Not specified | Not specified | "Primeros Proyectos" call from ETSIT
Government of Spain | PID2022-136684OB-C22 | Not specified | Not specified | Not specified
Government of Spain | TED2021-130118B-I00 | Not specified | Not specified | Not specified
Horizon Europe | 101140087 | SMARTY | Not specified | Scalable and Quantum Resilient Heterogeneous Edge Computing enabling Trustworthy AI

More information

Record ID: 85330
DC Identifier: https://oa.upm.es/85330/
OAI Identifier: oai:oa.upm.es:85330
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10333324
DOI: 10.1371/journal.pone.0308259
Official URL: https://journals.plos.org/plosone/article?id=10.13...
Deposited by: Javier Conde Díaz
Deposited on: 15 Dec 2024 18:37
Last Modified: 15 Oct 2025 01:01