Robustness against Faults in Configuration Memories of FPGA-based LLMs

Gao, Zhen ORCID: https://orcid.org/0000-0001-9887-1418, Yuan, Lini, Wang, Jingya, Liu, Qiang, Conde Díaz, Javier ORCID: https://orcid.org/0000-0002-5304-0626, Reviriego Vasallo, Pedro ORCID: https://orcid.org/0000-0003-2540-5234, Zeng, Shulin, Wang, Yu, Liu, Shanshan and Lombardi, Fabrizio ORCID: https://orcid.org/0000-0003-3152-3245 (2025). Robustness against Faults in Configuration Memories of FPGA-based LLMs. "IEEE Transactions on Circuits and Systems for Artificial Intelligence" ; pp. 1-12. https://doi.org/10.1109/TCASAI.2025.3552735.

Description

Title: Robustness against Faults in Configuration Memories of FPGA-based LLMs
Author(s):
Document Type: Article
Journal/Publication Title: IEEE Transactions on Circuits and Systems for Artificial Intelligence
Date: March 2025
Subjects:
SDGs:
Informal Keywords: Field programmable gate arrays; Robustness; Hardware; Artificial intelligence; Transformers; Graphics processing units; Integrated circuit modeling; Fault location; Circuit faults; Sparse matrices; Dependability; Large Language Models; FPGAs
School: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería de Sistemas Telemáticos
UPM Research Group: Internet de Nueva Generación
Creative Commons Licenses: Attribution - ShareAlike

Full text

PDF: Safe_FlightLLM-final.pdf (1 MB)

Abstract

Large Language Models (LLMs) pose significant challenges in terms of the speed and energy dissipation of AI systems. Dependability is a further important issue for LLM implementations; this is especially relevant for FPGAs, which are vulnerable to soft errors in the configuration memory. Moreover, as current GPU-based implementations are not energy efficient, there is interest in running LLMs on different technology platforms, such as FlightLLM (an FPGA-based accelerator designed to run LLMs energy-efficiently). In this paper, we analyze and evaluate the robustness of FPGA-based LLMs against faults/errors in the configuration memories. For the evaluation, we first propose a PyTorch-based fault injection simulator built on an analysis of FlightLLM, and we use it to study FlightLLM's robustness against stuck-at faults in the configuration memory. Furthermore, we propose an efficient error detection technique based on a concurrent classifier. Evaluation results show that stuck-at errors on the high bits of the logic units can dramatically degrade LLM performance, and that the proposed concurrent classifier can effectively detect errors with negligible complexity and overhead. Finally, a low-cost fault location scheme is proposed, so that the fault can be easily repaired by dynamic partial reconfiguration. The combination of concurrent-classifier error detection and fault location can be used to efficiently improve the robustness of an FPGA-based LLM, such as FlightLLM.
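
To make the abstract's observation about high bits concrete, the following minimal sketch forces one bit of a fixed-point value to a stuck-at level and measures the resulting error. It is an illustration only and is not the paper's PyTorch-based simulator: the function names `stuck_at` and `worst_case_error`, the unsigned 8-bit width, and the choice to apply the fault directly to a data value are all assumptions made for this example.

```python
def stuck_at(value: int, bit: int, level: int, width: int = 8) -> int:
    """Return `value` with bit `bit` forced to `level` (0 or 1).

    Hypothetical helper: models a stuck-at fault on one bit of an
    unsigned `width`-bit value (the width is an assumption).
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    if level not in (0, 1):
        raise ValueError("level must be 0 or 1")
    value &= (1 << width) - 1  # keep only the low `width` bits
    mask = 1 << bit
    return value | mask if level else value & ~mask


def worst_case_error(bit: int, width: int = 8) -> int:
    """Largest absolute error a stuck-at fault on `bit` can cause."""
    return max(
        abs(stuck_at(v, bit, level, width) - v)
        for v in range(1 << width)
        for level in (0, 1)
    )


if __name__ == "__main__":
    # Compare the impact of a fault on each bit position.
    for bit in range(8):
        print(f"bit {bit}: worst-case error {worst_case_error(bit)}")
```

Under these assumptions the worst-case error of a fault on bit k is 2^k, so a fault on the most significant bit perturbs a value far more than one on the least significant bit. That is in line with the abstract's finding that stuck-at errors on high bits are the most damaging, although the real effect on an LLM depends on where in the accelerator the fault occurs.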

Associated projects

Type: Government of Spain; Code: PID2022-136684OB-C22; Acronym: Unspecified; Lead: Unspecified; Title: Unspecified
Type: Government of Spain; Code: PCI2024-153434; Acronym: Unspecified; Lead: Unspecified; Title: Unspecified
Type: Horizon Europe; Code: 101140087; Acronym: Unspecified; Lead: Unspecified; Title: Unspecified

More information

Record ID: 88428
DC Identifier: https://oa.upm.es/88428/
OAI Identifier: oai:oa.upm.es:88428
DOI: 10.1109/TCASAI.2025.3552735
Official URL: https://ieeexplore.ieee.org/document/10932828
Deposited by: Professor Pedro Reviriego
Deposited on: 23 Mar 2025 09:04
Last Modified: 23 Mar 2025 09:04