Establishing vocabulary tests as a benchmark for evaluating large language models

Martínez Ruiz, Gonzalo ORCID: https://orcid.org/0000-0002-9125-6225, Conde Díaz, Javier ORCID: https://orcid.org/0000-0002-5304-0626, Merino Gómez, Elena ORCID: https://orcid.org/0000-0003-4129-4626, Bermúdez Margaretto, Beatriz, Hernández Gutiérrez, José Alberto ORCID: https://orcid.org/0000-0002-9551-4308, Reviriego Vasallo, Pedro ORCID: https://orcid.org/0000-0003-2540-5234 and Brysbaert, Marc ORCID: https://orcid.org/0000-0002-3645-3189 (2024). Establishing vocabulary tests as a benchmark for evaluating large language models. "PLOS ONE", v. 19 (n. 12); pp. 1-17. https://doi.org/10.1371/journal.pone.0308259.

Description

Title: Establishing vocabulary tests as a benchmark for evaluating large language models
Author(s):
Document Type: Article
Journal/Publication Title: PLOS ONE
Date: December 2024
Volume: 19
Issue: 12
Subjects:
Informal Keywords: AI, LLMs, Evaluation
School: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería de Sistemas Telemáticos
UPM Research Group: Internet de Nueva Generación
Creative Commons License: Attribution

Full text

PDF: journal_pone_0308259.pdf
Download (723kB)

Abstract

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs’ language skills.
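One common vocabulary-test format is multiple-choice synonym matching. As a minimal, hypothetical sketch of how such an item could be posed to an LLM and scored (the item, options, and helper functions below are illustrative, not the paper's actual test materials):

```python
# Hypothetical sketch: formatting a multiple-choice vocabulary item as an
# LLM prompt and scoring the answers. Item content is illustrative only.

def build_prompt(word, options):
    """Format a vocabulary item as a multiple-choice question for an LLM."""
    lines = [f"Which option is closest in meaning to '{word}'?"]
    for label, option in zip("ABCD", options):
        lines.append(f"{label}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(answers, key):
    """Proportion of items answered correctly."""
    correct = sum(a == k for a, k in zip(answers, key))
    return correct / len(key)

prompt = build_prompt("lucid", ["clear", "heavy", "distant", "bitter"])
# Scoring three hypothetical model answers against an answer key:
accuracy = score(["A", "C", "B"], ["A", "B", "B"])  # 2 of 3 correct
```

Because items like this can be generated and graded automatically, the format scales to large item banks and multiple languages, which is the opportunity the abstract points to.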

Associated projects

Type | Code | Acronym | PI | Title
Universidad Politécnica de Madrid | Cybertutor | Not specified | Not specified | "Primeros Proyectos" call from ETSIT
Government of Spain | PID2022-136684OB-C22 | Not specified | Not specified | Not specified
Government of Spain | TED2021-130118B-I00 | Not specified | Not specified | Not specified
Horizon Europe | 101140087 | SMARTY | Not specified | Scalable and Quantum Resilient Heterogeneous Edge Computing enabling Trustworthy AI

More information

Record ID: 85330
DC Identifier: https://oa.upm.es/85330/
OAI Identifier: oai:oa.upm.es:85330
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10333324
DOI: 10.1371/journal.pone.0308259
Official URL: https://journals.plos.org/plosone/article?id=10.13...
Deposited by: Javier Conde Díaz
Deposited on: 15 Dec 2024 18:37
Last Modified: 15 Oct 2025 01:01