Using statistical monitoring to detect failures in Internet services
Since the Internet's popular emergence in the mid-1990s, Internet services such as e-mail and messaging systems, search engines, e-commerce, and news and financial sites have become an important and often mission-critical part of our society. Unfortunately, managing these systems and keeping them running is a significant challenge. Their rapid rate of change, as well as their size and complexity, means that the developers and operators of these services usually have only an incomplete idea of how the system works and even of what it is supposed to do. This results in poor fault management, as operators have a hard time diagnosing faults and an even harder time detecting them.
This dissertation argues that statistical monitoring—the use of statistical analysis and machine learning techniques to analyze live observations of a system's behavior—can be an important tool in improving the manageability of Internet services. Statistical monitoring has several important features that are well suited to managing Internet services. First, the dynamic analysis of a system's behavior in statistical monitoring means that there is no dependency on specifications or descriptions that might be stale or incorrect. Second, monitoring a live, deployed system gives insight into system behavior that cannot be achieved in QA or testing environments. Third, automatic analysis through statistical monitoring can better cope with larger and more complex systems, aiding human operators as well as automating parts of the system management process.
This dissertation presents a statistical monitoring approach to three fault management problems: detecting failures in Internet services without requiring a priori knowledge of correct application behavior; automatically inferring undocumented system structure and invariants; and localizing the potential cause of a failure given its symptoms. We describe our methodology as well as our experiments with prototype implementations. Our experience provides strong support for statistical monitoring and suggests that it may prove to be an important tool in improving the manageability and reliability of Internet services.