Mississippi State University
Jones, Bryan A.
Dampier, David A.
Date of Degree
Original embargo terms
Dissertation - Open Access
Electrical and Computer Engineering
Doctor of Philosophy
James Worth Bagley College of Engineering
Department of Electrical and Computer Engineering
Large scale distributed computing systems have been extensively utilized to host critical applications in the fields of national defense, finance, scientific research, commerce, etc. However, applications in distributed systems face the risk of service outages due to inevitable faults. Without proper fault management methods, faults can lead to significant revenue loss and degradation of Quality of Service (QoS). An ideal fault management solution should guarantee fast and accurate fault diagnosis, scalability in distributed systems, portability for a variety of systems, and the versatility of recovering different types of faults. This dissertation presents a model-based fault management structure which automatically recovers computing systems from faults. This structure can recover a system from common faults while minimizing the impact on the system’s QoS. It covers all stages of fault management including fault detection, identification and recovery. It also has the flexibility to incorporate various fault diagnosis methods. When faults occur, the approach identifies fault types and intensity, and it accordingly computes the optimal recovery plan with minimum performance degradation, based on a cost function that defines performance objectives and a predictive control algorithm. The fault management approach has been verified on a centralized Web application testbed and a distributed big data processing testbed with four types of simulated faults: memory leak, network congestion, CPU hog and disk failure. The feasibility of the fault recovery control algorithm is also verified. Simulation results show that our approach enabled effective automatic recovery from faults. Performance evaluation reveals that CPU and memory overhead of the fault management process is negligible. To let domain engineers conveniently apply the proposed fault management structure on their specific systems, a component-based modeling environment is developed. The meta-model of the fault management structure is developed with Unified Modeling Language as an abstract of a general fault recovery solution for computing systems. It defines the fundamental reusable components that comprise such a system, including the connections among them, attributes of each component and constraints. The meta-model can be interpreted into a userriendly graphic modeling environment for creating application models of practical domain specific systems and generating executable codes on them.
Jia, Rui, "Towards Model-Based Fault Management for Computing Systems" (2016). Theses and Dissertations. 4762.