Using Machine Learning to Optimize Parallelism in Big Data Applications

Brandon, Alvaro and Perez, Maria S. and Gupta, Smrati and Muntes-Mulero, Victor (2018). Using Machine Learning to Optimize Parallelism in Big Data Applications. "Future Generation Computer Systems", v. 86; pp. 1076-1092. ISSN 0167-739X. https://doi.org/10.1016/j.future.2017.07.003.

Description

Title: Using Machine Learning to Optimize Parallelism in Big Data Applications
Author/s:
  • Brandon, Alvaro
  • Perez, Maria S.
  • Gupta, Smrati
  • Muntes-Mulero, Victor
Item Type: Article
Journal/Publication Title: Future Generation Computer Systems
Date: 1 September 2018
ISSN: 0167-739X
Volume: 86
Subjects:
Freetext Keywords: Machine learning, Spark, Parallelism, Big data
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
UPM's Research Group: Ontology Engineering Group OEG
Creative Commons Licenses: None

Abstract

In-memory cluster computing platforms have gained momentum in recent years, due to their ability to analyse large amounts of data in parallel. These platforms are complex and difficult-to-manage environments. In addition, there is a lack of tools to better understand and optimize such platforms, which form the backbone of big data infrastructure and technologies. This directly leads to underutilization of available resources and to application failures in such environments. One of the key aspects that can address this problem is optimization of the task parallelism of applications in such environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at the system and application level, we are able to find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before launching their workloads in the cluster, avoiding possible failures, performance degradation and waste of resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid5000 testbed. We observe up to a 51% gain in performance when using the recommended parallelism settings. The model is also interpretable and can give the user insight into how different metrics and parameters affect performance.

Funding Projects

Type: Horizon 2020
Code: 642963
Acronym: BigStorage
Leader: Maria S. Perez
Title: BigStorage

More information

Item ID: 51675
DC Identifier: http://oa.upm.es/51675/
OAI Identifier: oai:oa.upm.es:51675
DOI: 10.1016/j.future.2017.07.003
Official URL: https://www.sciencedirect.com/science/article/pii/S0167739X17314668
Deposited by: Alvaro Brandon Hernandez
Deposited on: 18 Jul 2018 06:46
Last Modified: 18 Jul 2018 06:46