Theses and Dissertations
Issuing Body
Mississippi State University
Advisor
Banicescu, Ioana
Committee Member
Luke, Edward A.
Committee Member
Allen, Edward B.
Committee Member
Horstemeyer, Mark F.
Committee Member
Haupt, Tomasz A.
Date of Degree
5-9-2015
Original embargo terms
MSU Only Indefinitely
Document Type
Dissertation - Campus Access Only
Major
Computer Science
Degree Name
Doctor of Philosophy
College
James Worth Bagley College of Engineering
Department
Department of Computer Science and Engineering
Abstract
Large-scale systems provide a powerful computing platform for running large and complex scientific applications. However, the inherent complexity, heterogeneity, wide distribution, and dynamism of these computing environments can degrade the performance of the scientific applications executing on them. Load imbalance, which arises from sources such as application, algorithmic, and systemic variations, is one of the major contributors to this performance degradation. In general, load balancing is achieved via scheduling. Moreover, frequently occurring resource failures drastically affect the execution of applications running on high performance computing systems. Therefore, studying how to deploy integrated scheduling and fault-tolerance mechanisms that guarantee applications are resilient to failures is of paramount importance. Recently, several research initiatives have started to address the issue of resilience. However, these efforts have focused more on system-level resilience than on resilience at the application level. It is therefore increasingly important to extend the concept of resilience to application-level scheduling techniques, establishing a holistic approach that addresses the performability of these applications on high performance computing systems. This can be achieved by developing a comprehensive modeling framework for evaluating the resilience of such techniques on heterogeneous computing systems, assessing the impact of failures and workloads in an integrated way. This dissertation presents an experimental methodology based on discrete event simulation for analyzing and evaluating the resilience of scheduling scientific applications on high performance computing systems.
With the aid of the methodology, a wide class of dependencies between the application and the computing system is captured within a deterministic model for quantifying the performance impact expected from changes in application and system characteristics. The results obtained with the proposed simulation-based performance prediction framework enabled an introspective design and investigation of scheduling heuristics, supporting reasoning about how best to optimize often antagonistic objectives such as minimizing application makespan and maximizing reliability.
URI
https://hdl.handle.net/11668/16635
Recommended Citation
Sukhija, Nitin, "Analyzing and Evaluating the Resilience of Scheduling Scientific Applications on High Performance Computing Systems using a Simulation-based Methodology" (2015). Theses and Dissertations. 671.
https://scholarsjunction.msstate.edu/td/671