Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems

Wang, Ziheng ORCID: https://orcid.org/0000-0001-9668-7318, Reviriego Vasallo, Pedro ORCID: https://orcid.org/0000-0003-2540-5234, Liu, Shanshan ORCID: https://orcid.org/0000-0001-6226-2880, Niknia, Farzad ORCID: https://orcid.org/0000-0002-4062-3638, Tang, Xiaochen ORCID: https://orcid.org/0000-0003-2590-5810, Gao, Zhen ORCID: https://orcid.org/0000-0001-9887-1418 and Lombardi, Fabrizio ORCID: https://orcid.org/0000-0003-3152-3245 (2025). Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems. "Future Generation Computer Systems", v. 173 ; p. 107928. ISSN 0167-739X. https://doi.org/10.1016/j.future.2025.107928.

Description

Title: Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems
Authors: Wang, Ziheng; Reviriego Vasallo, Pedro; Liu, Shanshan; Niknia, Farzad; Tang, Xiaochen; Gao, Zhen; Lombardi, Fabrizio
Document Type: Article
Journal/Publication Title: Future Generation Computer Systems
Date: December 2025
ISSN: 0167-739X
Volume: 173
Informal Keywords: Error detection, Error correction, Large-scale neural networks, Soft errors, CLIP
School: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería de Sistemas Telemáticos
UPM Research Group: Internet de Nueva Generación
Creative Commons Licenses: None

Full text

PBEDC_manuscript.pdf (PDF, 1 MB)

Abstract

Conventional error-tolerant schemes for Neural Networks (NNs) usually require either redundancy or changes to normal operation, both of which incur considerable overhead. They are therefore not feasible for large-scale Machine Learning (ML) systems, which typically employ several complex networks. This paper proposes a Perturbation-Based Error Detection and Correction (PBEDC) scheme that performs error detection and correction by reusing the inference process. Dependable performance, defined as the ability to operate correctly in the presence of errors, is the key characteristic under consideration. PBEDC employs a compact set of representative samples, selected to monitor a few check nodes that carry intermediate signals. The effectiveness of PBEDC is evaluated using Contrastive Language-Image Pre-Training (CLIP) networks as a case study. Compared with traditional schemes that use the final prediction as the check node, PBEDC achieves a superior error detection rate (> 95%) and can handle single bit-flip errors in the weights, which existing schemes cannot capture. Combined with parity codes, the proposed scheme also enables error correction. Furthermore, the analysis and simulation results show that the number of PBEDC samples required to achieve satisfactory error tolerance is very small; the complexity of the proposed scheme does not scale up with the network size, an advantage that is especially pronounced in large-scale ML systems.
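
The abstract does not spell out how the check samples and the parity codes interact, so the following is only a minimal toy sketch of one plausible reading, not the authors' implementation. It assumes a single check node computing a weighted sum, one stored parity bit per 32-bit weight, exact comparison against stored golden outputs, and a brute-force search over the 32 bits of the weight flagged by parity. All function names (`check_node`, `detect`, `locate_and_correct`) and all numeric values are hypothetical.

```python
import struct

def f32_to_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def parity(b):
    """Even/odd parity (0 or 1) of a 32-bit word."""
    return bin(b).count("1") & 1

def check_node(weight_bits, sample):
    """Toy intermediate signal (assumption): a weighted sum at one node."""
    return sum(bits_to_f32(w) * x for w, x in zip(weight_bits, sample))

def detect(weight_bits, samples, golden):
    """Flag an error if any check sample deviates from its golden output.
    Using != also flags NaN results, since NaN != g is always True."""
    return any(check_node(weight_bits, s) != g
               for s, g in zip(samples, golden))

def locate_and_correct(weight_bits, parities, samples, golden):
    """Parity mismatch narrows the search to one word; trying each of its
    32 bits against the check samples picks the flip that restores them."""
    suspects = [i for i, w in enumerate(weight_bits)
                if parity(w) != parities[i]]
    for i in suspects:
        for k in range(32):
            trial = list(weight_bits)
            trial[i] ^= 1 << k
            if not detect(trial, samples, golden):
                return trial, (i, k)
    return None, None

# Hypothetical error-free state: weights, check samples, golden outputs.
weight_bits = [f32_to_bits(w) for w in (0.5, -1.25, 2.0, 0.75)]
parities = [parity(w) for w in weight_bits]
samples = [[1.0, 2.0, -1.0, 0.5], [0.5, -1.0, 3.0, 2.0]]
golden = [check_node(weight_bits, s) for s in samples]

# Inject a single bit flip into the exponent of the third weight.
corrupted = list(weight_bits)
corrupted[2] ^= 1 << 30

error_seen = detect(corrupted, samples, golden)
fixed, where = locate_and_correct(corrupted, parities, samples, golden)
```

In this toy setting the golden outputs come from the same deterministic arithmetic, so exact comparison is enough; a real deployment would likely need a tolerance and many more weights than check samples, which is where the choice of representative samples described in the abstract matters.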

More information

Record ID: 89163
DC Identifier: https://oa.upm.es/89163/
OAI Identifier: oai:oa.upm.es:89163
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10381053
DOI: 10.1016/j.future.2025.107928
Official URL: https://www.sciencedirect.com/science/article/pii/...
Deposited by: Professor Pedro Reviriego
Deposited on: 25 May 2025 08:42
Last Modified: 15 Oct 2025 01:01