Author

Zimin Wang

Advisor

Abdelwahed, Sherif

Committee Member

Jones, A. Bryan

Committee Member

Follett, F. Randolph

Date of Degree

8-1-2011

Document Type

Graduate Thesis - Open Access

Degree Name

Master of Science

College

James Worth Bagley College of Engineering

Department

Department of Electrical and Computer Engineering

Abstract

Large-scale distributed computing systems such as data centers are hosted on heterogeneous, networked servers that operate in a dynamic and uncertain environment shaped by factors such as time-varying user workload and various failures. Achieving stringent quality-of-service goals is therefore a challenging task that requires a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach to fault management that integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts in order to (1) enable systems to adapt to environment variations, (2) maintain the availability and reliability of the system, and (3) facilitate system recovery from failures. This thesis focuses on memory leak errors. A characterization function is designed to detect memory leaks. An LLC is then applied to enable the computing system to adapt efficiently to variations in the workload, to recover from memory leaks, and to maintain functionality.

URI

https://hdl.handle.net/11668/15332
