Failure detector abstractions for MapReduce-based systems

Memishi, Bunjamin and Perez Hernandez, Maria de los Santos and Antoniu, Gabriel (2017). Failure detector abstractions for MapReduce-based systems. "Information Sciences", v. 379 ; pp. 112-127. ISSN 0020-0255. https://doi.org/10.1016/j.ins.2016.08.013.

Description

Title: Failure detector abstractions for MapReduce-based systems
Author/s:
  • Memishi, Bunjamin
  • Perez Hernandez, Maria de los Santos
  • Antoniu, Gabriel
Item Type: Article
Journal/Publication Title: Information Sciences
Date: 2017
ISSN: 0020-0255
Volume: 379
Subjects:
Freetext Keywords: MapReduce; Reliability; Failure detection; Timeout; Heartbeat
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Arquitectura y Tecnología de Sistemas Informáticos
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Abstract

Omission failures represent an important source of problems in data-intensive computing systems. In these frameworks, omission failures are caused by slow tasks, known as stragglers, which can strongly jeopardize the workload performance. In the case of MapReduce-based systems, many state-of-the-art approaches have preferred to explore and extend speculative execution mechanisms. Other alternatives have based their contributions on doubling the computing resources for their tasks. Nevertheless, none of these approaches has addressed a fundamental aspect related to the detection and further solving of the omission failures, that is, the timeout service adjustment. In this paper, we have studied the omission failures in MapReduce systems, formalizing their failure detector abstraction by means of three different algorithms for defining the timeout. The first abstraction, called High Relax Failure Detector (HR-FD), acts as a static alternative to the default timeout, which is able to estimate the completion time for the user workload. The second abstraction, called Medium Relax Failure Detector (MR-FD), dynamically modifies the timeout, according to the progress score of each workload. Finally, taking into account that some of the user requests are strictly deadline-bounded, we have introduced the third abstraction, called Low Relax Failure Detector (LR-FD), which is able to merge the MapReduce dynamic timeout with an external monitoring system, in order to enforce more accurate failure detections. Whereas HR-FD shows performance improvements for most of the user requests (in particular, small workloads), MR-FD and LR-FD significantly enhance the current timeout selection for any kind of scenario, regardless of the workload type and failure injection time.
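
As a rough illustration of the contrast the abstract draws between a static timeout and one that follows the progress score, the following is a minimal sketch only. It is not the HR-FD, MR-FD, or LR-FD algorithms from the paper, whose formulas are not reproduced in this record. The function names, the slack factor, the floor value, and the linear extrapolation rule are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the algorithms from the paper.
# All names, the slack factor, the floor, and the extrapolation rule
# are assumptions introduced for this example.


def static_timeout(estimated_completion_s, slack=1.5):
    """Fixed timeout derived once from an estimated completion time."""
    return estimated_completion_s * slack


def dynamic_timeout(elapsed_s, progress, slack=1.5, floor_s=10.0):
    """Timeout for the remaining work, re-derived from a progress score in (0, 1].

    Linearly extrapolates the remaining time from the observed rate:
    remaining = elapsed * (1 - progress) / progress.
    """
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    remaining = elapsed_s * (1.0 - progress) / progress
    return max(floor_s, remaining * slack)


def is_suspected(seconds_since_last_heartbeat, timeout_s):
    """Suspect a task when no heartbeat has arrived within the timeout."""
    return seconds_since_last_heartbeat > timeout_s
```

Under these assumed defaults, a task that is half done after 60 seconds would get a 90-second timeout for its remaining work, whereas the static variant fixes its value before the job starts and never revisits it.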

Funding Projects

Type: Horizon 2020
Code: 642963
Acronym: BigStorage
Leader: Unspecified
Title: BigStorage: Storage-based Convergence between HPC and Cloud to handle Big Data

More information

Item ID: 50261
DC Identifier: http://oa.upm.es/50261/
OAI Identifier: oai:oa.upm.es:50261
DOI: 10.1016/j.ins.2016.08.013
Official URL: https://www.sciencedirect.com/science/article/pii/S0020025516305837?
Deposited by: Memoria Investigacion
Deposited on: 21 Dec 2018 12:18
Last Modified: 14 May 2019 12:38