Measuring and improving the energy efficiency of large language models inference

Argerich, Mauricio Fadel ORCID: https://orcid.org/0009-0008-9348-8426 and Patiño Martínez, Marta ORCID: https://orcid.org/0000-0003-2997-3722 (2024). Measuring and improving the energy efficiency of large language models inference. "IEEE Access", v. 12 ; pp. 80194-80207. ISSN 2169-3536. https://doi.org/10.1109/ACCESS.2024.3409745.

Description

Title: Measuring and improving the energy efficiency of large language models inference
Author(s):
Document Type: Article
Journal/Publication Title: IEEE Access
Date: 5 June 2024
ISSN: 2169-3536
Volume: 12
Subjects:
Informal Keywords: Computational modeling, Deep learning, Energy consumption, Energy efficiency, Energy measurement, Graphics processing units, Large language models, Machine learning, Software, Software measurement, Training
School: E.T.S. de Ingenieros Informáticos (UPM)
Department: Lenguajes y Sistemas Informáticos e Ingeniería del Software
Creative Commons License: Attribution - NonCommercial - NoDerivatives

Full text

PDF: 10226248.pdf (5 MB)

Abstract

Recent improvements in the accuracy of machine learning (ML) models in the language domain have propelled their use in a multitude of products and services, touching millions of lives daily. These new levels of accuracy have been attained mainly through exponential growth in model size, creating a new category of models known as Large Language Models (LLMs) and leading to a substantial increase in computing and energy demands. While recent studies have focused on measuring and improving the energy consumption of LLMs during training, inference has received little attention. In this article, we present an approach to profile the energy consumption of LLMs during inference and leverage it to improve energy efficiency. For this, we deploy several state-of-the-art LLMs and observe how model size, number of layers, parallelized attention, and even vocabulary size affect their energy consumption. In addition, we leverage input batch size and different quantization levels to optimize their inference energy efficiency and latency.
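The core of the profiling approach sketched in the abstract is to sample instantaneous power draw while the model generates tokens, integrate those samples into energy, and normalize by tokens produced. A minimal illustration of that bookkeeping is below; it is a sketch under assumptions, not the paper's implementation. The helper names `integrate_energy` and `energy_per_token` are hypothetical, and the power samples would in practice come from a source such as NVIDIA's NVML power readings (here they are supplied as plain numbers).

```python
def integrate_energy(samples):
    """Integrate (timestamp_s, power_w) samples into joules.

    Uses the trapezoidal rule between consecutive samples, so
    non-uniform sampling intervals are handled correctly.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules


def energy_per_token(joules, tokens_generated):
    """Normalize total inference energy by the number of tokens produced."""
    return joules / tokens_generated


# Hypothetical trace: a GPU drawing a steady 100 W for 2 s while
# generating 100 tokens -> 200 J total, 2 J per token.
samples = [(0.0, 100.0), (1.0, 100.0), (2.0, 100.0)]
total_j = integrate_energy(samples)          # 200.0 J
per_token_j = energy_per_token(total_j, 100)  # 2.0 J/token
```

Comparing `energy_per_token` across batch sizes and quantization levels is what lets the trade-off between energy efficiency and latency, as studied in the article, be quantified.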

Associated projects

Type: Horizon 2020
Code: 101004480
Acronym: Not specified
Lead: Not specified
Title: Not specified

More information

Record ID: 86674
DC Identifier: https://oa.upm.es/86674/
OAI Identifier: oai:oa.upm.es:86674
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10226248
DOI: 10.1109/ACCESS.2024.3409745
Official URL: https://ieeexplore.ieee.org/document/10549890
Deposited by: iMarina Portal Científico
Deposited on: 23 Jan 2025 10:59
Last Modified: 23 Jan 2025 10:59