Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems

Wang, Ziheng; Reviriego Vasallo, Pedro; Liu, Shanshan; Niknia, Farzad; Tang, Xiaochen; Gao, Zhen and Lombardi, Fabrizio (2025). Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems. "Future Generation Computer Systems", v. 173; p. 107928. ISSN 0167-739X. https://doi.org/10.1016/j.future.2025.107928.

Description

Title:	Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems
Author(s):	Wang, Ziheng https://orcid.org/0000-0001-9668-7318; Reviriego Vasallo, Pedro https://orcid.org/0000-0003-2540-5234; Liu, Shanshan https://orcid.org/0000-0001-6226-2880; Niknia, Farzad https://orcid.org/0000-0002-4062-3638; Tang, Xiaochen https://orcid.org/0000-0003-2590-5810; Gao, Zhen https://orcid.org/0000-0001-9887-1418; Lombardi, Fabrizio https://orcid.org/0000-0003-3152-3245
Document type:	Article
Journal/Publication title:	Future Generation Computer Systems
Date:	December 2025
ISSN:	0167-739X
Volume:	173
Subjects:	Electronics; Computer Science; Telecommunications
SDG:	09. Industry, innovation and infrastructure
Informal keywords:	Error detection, Error correction, Large-scale neural networks, Soft errors, CLIP
School:	E.T.S.I. Telecomunicación (UPM)
Department:	Ingeniería de Sistemas Telemáticos
UPM research group:	Internet de Nueva Generación
Creative Commons licenses:	None

Abstract

Conventional error-tolerant schemes for Neural Networks (NNs) usually require either redundancy or changes in normal operation, leading to considerable overheads. They are not feasible for large-scale Machine Learning (ML) systems that typically employ several complex networks. This paper proposes a Perturbation-Based Error Detection and Correction (PBEDC) scheme designed to perform error detection and correction by reusing the inference process. Dependable performance, defined as the ability to operate correctly in the presence of errors, is a key characteristic under consideration. PBEDC employs a compact set of representative samples that are selected to monitor a few check nodes with intermediate signals. The effectiveness of PBEDC is evaluated by taking Contrastive Language-Image Pre-Training (CLIP) networks as a case study. Compared with traditional schemes that use the final prediction as the check node, PBEDC achieves a superior error detection rate (> 95%) and can handle single bit-flip errors in the weights (which cannot be captured in existing schemes). This also enables the correction of errors when the proposed scheme is combined with the use of parity codes. Furthermore, the analysis and simulation results show that the number of PBEDC samples required to achieve a satisfactory error tolerance is very small; the complexity of the proposed scheme does not scale up with the network size, and this advantage is very pronounced in large-scale ML systems.
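
The ideas named in the abstract (check samples, intermediate check nodes, single bit-flips in weights, parity codes) can be illustrated with a toy sketch. This is not the authors' implementation: the one-layer ReLU "network", the per-weight parity bit, the exact-match comparison and the trial-and-error bit search are all assumptions made for illustration, and every function name below is hypothetical.

```python
import struct

def to_bits(x):
    """32-bit IEEE 754 pattern of x, as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_float(word):
    """Decode a 32-bit pattern back into a float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

def parity(word):
    """Even/odd parity of the stored 32-bit word."""
    return bin(word).count("1") & 1

def check_node(words, sample):
    """Intermediate signal (assumed here: one linear layer followed by ReLU)."""
    return [max(0.0, sum(to_float(w) * s for w, s in zip(row, sample)))
            for row in words]

def matches(words, samples, reference):
    """True if every check sample reproduces its stored reference signal."""
    return all(check_node(words, s) == ref for s, ref in zip(samples, reference))

def detect_and_correct(words, parities, samples, reference):
    """Detect via check samples; locate via parity; repair by trial bit flips."""
    if matches(words, samples, reference):
        return "clean"
    for i, row in enumerate(words):
        for j, word in enumerate(row):
            if parity(word) == parities[i][j]:
                continue  # parity agrees, so this word is not the suspect
            for bit in range(32):
                row[j] = word ^ (1 << bit)  # candidate repair
                if matches(words, samples, reference):
                    return "corrected"
            row[j] = word  # no candidate worked; restore the corrupted word
    return "detected"

# Offline phase: store parity bits and reference check-node signals.
weights = [[0.5, -0.25], [1.5, 0.75]]
words = [[to_bits(w) for w in row] for row in weights]
parities = [[parity(w) for w in row] for row in words]
samples = [[1.0, 2.0], [3.0, -1.0]]
reference = [check_node(words, s) for s in samples]

# Online phase: inject a single bit flip, then run the check samples.
golden = [row[:] for row in words]
words[0][0] ^= 1 << 30
status = detect_and_correct(words, parities, samples, reference)
print(status, words == golden)
```

In this sketch, detection costs one inference pass per check sample, independent of how many weights the layer holds, which mirrors the abstract's claim that only a small number of samples is needed. A flip whose effect is masked (for example by a zero input or by the ReLU on every check sample) would pass unnoticed here, which is why the choice of representative samples matters.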

More information

Record ID:	89163
DC identifier:	https://oa.upm.es/89163/
OAI identifier:	oai:oa.upm.es:89163
Scientific Portal URL:	https://portalcientifico.upm.es/es/ipublic/item/10381053
DOI:	10.1016/j.future.2025.107928
Official URL:	https://www.sciencedirect.com/science/article/pii/...
Deposited by:	Professor Pedro Reviriego
Deposited on:	25 May 2025 08:42
Last modified:	15 Oct 2025 01:01