Montes, Jesús and Sánchez Campos, Alberto and Pérez Hernández, María de los Santos
Riding out of the storm: How to deal with the complexity of grid and cloud management.
"Journal of Grid Computing", v. 10
Over the last decade, Grid computing paved the way for a new level of large scale distributed systems. This infrastructure made it possible to securely and reliably take advantage of widely separated computational resources that are part of several different organizations. Resources can be incorporated to the Grid, building a theoretical virtual supercomputer. In time, cloud computing emerged as a new type of large scale distributed system, inheriting and expanding the expertise and knowledge that have been obtained so far. Some of the main characteristics of Grids naturally evolved into clouds, others were modified and adapted and others were simply discarded or postponed. Regardless of these technical specifics, both Grids and clouds together can be considered as one of the most important advances in large scale distributed computing of the past ten years; however, this step in distributed computing has came along with a completely new level of complexity. Grid and cloud management mechanisms play a key role, and correct analysis and understanding of the system behavior are needed. Large scale distributed systems must be able to self-manage, incorporating autonomic features capable of controlling and optimizing all resources and services. Traditional distributed computing management mechanisms analyze each resource separately and adjust specific parameters of each one of them. When trying to adapt the same procedures to Grid and cloud computing, the vast complexity of these systems can make this task extremely complicated. But large scale distributed systems complexity could only be a matter of perspective. It could be possible to understand the Grid or cloud behavior as a single entity, instead of a set of resources. This abstraction could provide a different understanding of the system, describing large scale behavior and global events that probably would not be detected analyzing each resource separately. In this work we define a theoretical framework that combines both ideas, multiple resources and single entity, to develop large scale distributed systems management techniques aimed at system performance optimization, increased dependability and Quality of Service (QoS). The resulting synergy could be the key 350 J. Montes et al. to address the most important difficulties of Grid and cloud management.