Distributed data collection: Archiving, indexing, and analysis

2008

Abstract (summary)

As computing hardware becomes more powerful and systems grow larger, the amount of data we can collect within a system grows seemingly without bound. Such systems share a common characteristic: the volume of raw data available is far higher than the application or user can handle. The key problem is thus to identify the small amount of data that is of interest, obtain it, and present it to the user.

In this thesis we examine distributed data collection environments, and advocate data-aware and storage-centric approaches. Collected raw data is stored within the network, and only the most relevant information is presented to the application. We apply two strategies: (i) archiving, indexing and querying to retrieve discrete portions of data, and (ii) mathematical modeling to extract relationships spread more diffusely across the data. We propose mechanisms for archiving, indexing, and analysis, and describe them in the context of systems incorporating them.

First we address storage and indexing of high-speed event data in a resource-rich environment. A disk-based storage system operates on commodity hardware, yet guarantees writing at high rates to avoid loss. A signature file-based index allows indexing of high-speed data in real time, for efficient ad hoc querying. A network monitoring system based on these mechanisms is presented and evaluated.
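To make the idea of a signature file-based index concrete, the following is a minimal sketch of the general technique (superimposed coding), not the system described in the thesis. Records are appended in fixed-size blocks, each block carries a bit signature formed by OR-ing hashed bits of its keys, and a query scans only blocks whose signature contains all of the query key's bits. The class name `SignatureIndex`, the constants, and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib

SIG_BITS = 256   # signature width in bits (illustrative value)
HASHES = 3       # bits set per key (illustrative value)
BLOCK_SIZE = 4   # records per block (kept tiny for demonstration)


def key_bits(key: str) -> int:
    """Return an integer bitmask with up to HASHES bits set for this key."""
    mask = 0
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return mask


class SignatureIndex:
    """Hypothetical append-only store with one signature per block."""

    def __init__(self):
        self.blocks = []       # each block is a list of (key, payload)
        self.signatures = []   # one bitmask per block

    def append(self, key, payload):
        # Start a new block when there is none or the last one is full.
        if not self.blocks or len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])
            self.signatures.append(0)
        self.blocks[-1].append((key, payload))
        self.signatures[-1] |= key_bits(key)

    def query(self, key):
        """Return (matching payloads, number of blocks actually scanned)."""
        want = key_bits(key)
        hits, scanned = [], 0
        for block, sig in zip(self.blocks, self.signatures):
            if sig & want != want:
                continue  # key cannot be in this block; skip it
            scanned += 1
            hits.extend(p for k, p in block if k == key)
        return hits, scanned
```

A signature can produce false positives (a block is scanned but holds no match) but never false negatives, which is why the final key comparison inside the block is still needed.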

Next we leverage the resources of a few resource-rich systems within a resource-constrained environment to index and route queries to remote sensors. These sensors send summaries of stored data to more capable proxy nodes, which index them using a novel search structure, the Interval Skip Graph, in which a single imprecise key represents multiple records. We present a prototype implementation of a sensor network storage system based on these mechanisms.
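The notion of an imprecise key can be illustrated with interval summaries: a sensor reports only the range of values it holds, and a proxy uses those ranges to decide which sensors a query must reach. The sketch below is a deliberately simplified, centralized stand-in (a sorted list with a running maximum of upper bounds), not an Interval Skip Graph and not the thesis implementation. The class name `IntervalSummaryIndex` and the tuple layout `(lo, hi, sensor_id)` are assumptions.

```python
import bisect


class IntervalSummaryIndex:
    """Hypothetical proxy-side index over per-sensor value ranges."""

    def __init__(self, summaries):
        # summaries: iterable of (lo, hi, sensor_id), sorted here by lo.
        self.items = sorted(summaries)
        self.los = [lo for lo, _, _ in self.items]
        # max_hi[i] is the largest upper bound among items[0..i].
        self.max_hi = []
        running = float("-inf")
        for _, hi, _ in self.items:
            running = max(running, hi)
            self.max_hi.append(running)

    def sensors_for(self, value):
        """Return the sensors whose summary interval contains value."""
        out = []
        # Only intervals with lo <= value can contain it.
        j = bisect.bisect_right(self.los, value) - 1
        # Walk left; stop once no earlier interval can reach value.
        while j >= 0 and self.max_hi[j] >= value:
            lo, hi, sensor = self.items[j]
            if hi >= value:
                out.append(sensor)
            j -= 1
        return sorted(out)
```

For example, with summaries `(10, 20, "a")`, `(15, 30, "b")`, `(40, 50, "c")`, and `(5, 12, "d")`, a query for the value 11 is routed to sensors `a` and `d` only. Because each key is a range rather than an exact value, a routed sensor may still turn out to hold no matching record.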

Lastly we address analysis and model-building from feature-rich but poorly structured events. In our approach, statistical machine learning techniques are used to build models of application and system behavior in a data center, relying on an automated feature identification mechanism to identify model inputs from within the raw data stream. We present and evaluate Modellus, a data center monitoring and analysis system based on these mechanisms.
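As a rough illustration of automated feature identification feeding a statistical model, the sketch below treats every token seen in raw event strings as a candidate feature, counts each token per time interval, selects the token whose counts correlate most strongly with an observed metric, and fits a one-variable least-squares line to it. This is an assumed, simplified stand-in and not the method used in Modellus; the function names, the token-based features, and the single-feature model are all hypothetical.

```python
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def feature_counts(intervals):
    """Turn each interval's raw event strings into token counts."""
    counts = [Counter(tok for ev in events for tok in ev.split())
              for events in intervals]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def select_and_fit(intervals, target):
    """Pick the best-correlated token and fit target = slope * count + intercept.

    Assumes the selected token's count varies across intervals.
    """
    vocab, counts = feature_counts(intervals)
    best = max(vocab,
               key=lambda t: abs(pearson([c[t] for c in counts], target)))
    xs = [c[best] for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(target) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, target)) / sxx
    return best, slope, my - slope * mx
```

Given intervals of request log lines and a per-interval utilization measurement, `select_and_fit` returns the selected token together with the fitted slope and intercept. A practical system would consider many features jointly and validate the resulting model on held-out data.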

Indexing (details)

Subject: Computer science
Classification: 0984: Computer science
Identifier / keyword: Applied sciences; Archiving; Data collection; Indexing
Title: Distributed data collection: Archiving, indexing, and analysis
Author: Desnoyers, Peter
Publication year: 2008
Source: DAI-B 69/08, Dissertation Abstracts International
Place of publication: Ann Arbor
Country of publication: United States
Advisor: Shenoy, Prashant
Committee member: Ganesan, Deepak; Kurose, James F.; Wolf, Tilman
University/institution: University of Massachusetts Amherst
Department: Computer Science
University location: United States -- Massachusetts
Source type: Dissertations & Theses
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.