An operating system architecture for networked server infrastructure
Collections of hardware components are the foundation of computation and consist of interconnections of different types of the same core elements: processors, disks, memory cards, I/O devices, and network links. Designing a system for managing collections of hardware is challenging because modern infrastructures (i) distribute resource control across multiple autonomous sites, (ii) operate diverse sets of hardware, and (iii) support a variety of programming models for developing and executing software services.
An operating system is a software layer that manages hardware by coordinating its interaction with software. This thesis defines and evaluates an architecture for a networked operating system that manages collections of hardware in infrastructures spread across networks, such as the Internet. The foundation of a networked operating system determines how software services share a common hardware platform. A fundamental property common to all forms of resource sharing is that software services, by definition, share hardware components and do not use them forever. A lease is a natural construct for restricting the use of a shared resource to a well-defined length of time.
Our architecture employs a general neutrality principle, which states that a networked operating system should be policy-neutral, since only users and site administrators, and not operating system developers, know how to manage their software and hardware. Experience building, deploying, and using a prototype has led us to view neutrality as a guiding design principle. Our hypothesis is that an operating system architecture for infrastructure resource management that focuses narrowly on leasing control of hardware provides a foundation for multi-lateral resource negotiation, arbitration, and fault tolerance. In evaluating our hypothesis we make the following contributions: (1) Introduce a set of design principles for networked operating systems. The principles adapt and extend principles from node operating system design to a networked environment. We evaluate existing systems with respect to these principles, describe how they deviate from them, and explore how these deviations limit the capabilities of higher level software. (2) Combine the idea of a reconfigurable data center with the Sharp framework for secure resource peering to demonstrate a prototype networked operating system capable of sharing aggregations of resources in infrastructures. (3) Design, implement, and deploy the architecture using a single programming abstraction—the lease—and show how the lease abstraction embodies the design principles of a networked operating system. (4) Show that leases are a foundational primitive for addressing arbitration in a networked operating system. Leasing currency defines a configurable tradeoff between proportional-share scheduling and a market economy, and also serves as a basis for implementing other forms of arbitration. (5) Show how combining the use of leases for long-term resource management with state recovery mechanisms provides robustness to transient faults and failures in a loosely coupled distributed system that coordinates resource allocation. (6) Evaluate the flexibility and performance of a prototype by managing aggregations of physical and virtual hardware present in modern data centers, and showing that the architecture could scale to manage thousands of machines. (7) Present case studies of integrating multiple software services including the PlanetLab network testbed, the Plush distributed application manager, and the GridEngine batch scheduler, and leverage the architecture to prototype and evaluate Jaws, a new light-weight batch scheduler that instantiates one or more virtual machines per task.