Title

Analyzing and Evaluating the Resilience of Scheduling Scientific Applications on High Performance Computing Systems using a Simulation-based Methodology

Author

Nitin Sukhija

Advisor

Banicescu, Ioana

Other Advisors or Committee Members

Luke, Edward A.; Allen, Edward B.; Horstemeyer, Mark F.; Haupt, Tomasz A.

Date of Degree

5-1-2015

Original embargo terms

MSU Only Indefinitely

Document Type

Dissertation - Open Access

Degree Name

Doctor of Philosophy

Abstract

Large-scale systems provide a powerful computing platform for executing large and complex scientific applications. However, the inherent complexity, heterogeneity, wide distribution, and dynamism of the computing environments can lead to performance degradation of the scientific applications executing on these computing systems. Load imbalance arising from a variety of sources, such as application, algorithmic, and systemic variations, is one of the major contributors to this performance degradation. In general, load balancing is achieved via scheduling. Moreover, frequently occurring resource failures drastically affect the execution of applications running on high performance computing systems. Therefore, studying how to deploy integrated scheduling and fault-tolerance mechanisms that guarantee applications deployed on computing systems are resilient to failures becomes of paramount importance. Recently, several research initiatives have started to address the issue of resilience. However, these efforts have focused more on achieving system-level resilience, with less emphasis on achieving resilience at the application level. Therefore, it is increasingly important to extend the concept of resilience to scheduling techniques at the application level in order to establish a holistic approach that addresses the performability of these applications on high performance computing systems. This can be achieved by developing a comprehensive modeling framework that can be used to evaluate the resiliency of such techniques on heterogeneous computing systems and to assess the impact of failures as well as workloads in an integrated way. This dissertation presents an experimental methodology based on discrete event simulation for the analysis and the evaluation of the resilience of scheduling scientific applications on high performance computing systems.
With the aid of the methodology, a wide class of dependencies existing between the application and the computing system is captured within a deterministic model for quantifying the performance impact expected from changes in application and system characteristics. The results obtained by employing the proposed simulation-based performance prediction framework enabled an introspective design and investigation of scheduling heuristics to reason about how best to optimize various, often antagonistic, objectives, such as minimizing application makespan and maximizing reliability.

URI

https://hdl.handle.net/11668/16635

This document is currently not available here.