Using statistical monitoring to detect failures in Internet services

2005

Abstract (summary)

Since the Internet's popular emergence in the mid-1990s, Internet services such as e-mail and messaging systems, search engines, e-commerce, and news and financial sites have become an important and often mission-critical part of our society. Unfortunately, managing these systems and keeping them running is a significant challenge. Their rapid rate of change, as well as their size and complexity, means that the developers and operators of these services usually have only an incomplete idea of how the system works and even what it is supposed to do. This results in poor fault management, as operators have a hard time diagnosing faults and an even harder time detecting them.

This dissertation argues that statistical monitoring—the use of statistical analysis and machine learning techniques to analyze live observations of a system's behavior—can be an important tool in improving the manageability of Internet services. Statistical monitoring has several important features that are well suited to managing Internet services. First, the dynamic analysis of a system's behavior in statistical monitoring means that there is no dependency on specifications or descriptions that might be stale or incorrect. Second, monitoring a live, deployed system gives insight into system behavior that cannot be achieved in QA or testing environments. Third, automatic analysis through statistical monitoring can better cope with larger and more complex systems, aiding human operators as well as automating parts of the system management process.

This dissertation presents a statistical monitoring approach to three fault management problems: detecting failures in Internet services without requiring a priori knowledge of correct application behavior; automatically inferring undocumented system structure and invariants; and localizing the potential cause of a failure given its symptoms. We describe our methodology as well as our experiments with prototype implementations. Our experience provides strong support for statistical monitoring, and suggests that it may prove to be an important tool in improving the manageability and reliability of Internet services.

Indexing (details)

Subject: Computer science
Classification: 0984: Computer science
Identifier / keyword: Applied sciences; Failures; Fault detection; Internet services; Statistical monitoring
Title: Using statistical monitoring to detect failures in Internet services
Author: Kiciman, Emre
Publication year: 2005
Source: DAI-B 66/08, Dissertation Abstracts International
Place of publication: Ann Arbor
Country of publication: United States
ISBN: 9780542295218, 0542295210
Advisor: Fox, Armando
University/institution: Stanford University
University location: United States -- California
Source type: Dissertations & Theses
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.