Distributed data collection: Archiving, indexing, and analysis
As computing hardware becomes more powerful and systems grow larger, the amount of data we can collect within a system grows seemingly without bound. Such systems share a common characteristic: the volume of raw data available is far higher than the application or user can handle. The key problem is thus to identify the small amount of data that is of interest, obtain it, and present it to the user.
In this thesis we examine distributed data collection environments and advocate data-aware, storage-centric approaches. Collected raw data is stored within the network, and only the most relevant information is presented to the application. We apply two strategies: (i) archiving, indexing, and querying to retrieve discrete portions of data, and (ii) mathematical modeling to extract relationships spread more diffusely across the data. We propose mechanisms for archiving, indexing, and analysis, and describe each in the context of a system that incorporates it.
First we address storage and indexing of high-speed event data in a resource-rich environment. A disk-based storage system runs on commodity hardware, yet guarantees sustained high write rates so that incoming data is not lost. A signature file-based index allows this data to be indexed in real time and supports efficient ad hoc querying. A network monitoring system based on these mechanisms is presented and evaluated.
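To make the idea of a signature file-based index concrete, the following is a minimal in-memory sketch of superimposed-coding signatures. It is an illustration under stated assumptions, not the index described in the thesis: the names SignatureIndex, term_mask, SIG_BITS, and HASHES are hypothetical, the signature width and hash count are arbitrary, and the on-disk layout that the real system would need is omitted. Each record gets a fixed-width bit signature; a query scans the compact signatures and verifies only the candidates whose signatures contain every query bit.

```python
import hashlib

SIG_BITS = 64  # assumed signature width (illustrative, not from the thesis)
HASHES = 3     # assumed number of bits set per indexed term


def term_mask(term: str) -> int:
    """Map a term to a bit mask with up to HASHES bits set."""
    mask = 0
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        bit = int.from_bytes(digest[:8], "big") % SIG_BITS
        mask |= 1 << bit
    return mask


class SignatureIndex:
    """Append-only list of records, each with a superimposed signature."""

    def __init__(self):
        self.records = []
        self.signatures = []

    def append(self, record: dict) -> None:
        # OR together the masks of every field=value term in the record.
        sig = 0
        for key, value in record.items():
            sig |= term_mask(f"{key}={value}")
        self.records.append(record)
        self.signatures.append(sig)

    def query(self, **terms) -> list:
        # Build the query signature the same way as a record signature.
        q = 0
        for key, value in terms.items():
            q |= term_mask(f"{key}={value}")
        hits = []
        for sig, rec in zip(self.signatures, self.records):
            # Signature test: cheap, may yield false positives.
            if (sig & q) == q:
                # Verification against the record removes false positives.
                if all(rec.get(k) == v for k, v in terms.items()):
                    hits.append(rec)
        return hits
```

Appending a record costs a few hash computations and one list append, which is why this style of index suits real-time insertion; the price is that a query scans every signature and must verify each candidate.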
Next we leverage a few resource-rich systems within a resource-constrained environment to index and route queries to remote sensors. These sensors send summaries of stored data to more capable proxy nodes, which index them using a novel search structure, the Interval Skip Graph, in which a single imprecise key represents multiple records. We present a prototype implementation of a sensor network storage system based on these mechanisms.
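The notion of a single imprecise key standing for multiple records can be sketched as an interval lookup at a proxy. The code below is deliberately not an Interval Skip Graph: it substitutes a plain sorted list with binary search, on a single node, to show only the query semantics. The names Summary and ProxyIndex, and the choice of a numeric range as the summary, are assumptions made for illustration.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Imprecise key: all summarized records fall within [low, high]."""
    low: float
    high: float
    sensor: str  # hypothetical identifier of the sensor holding the records


class ProxyIndex:
    """Sorted-list stand-in for an interval search structure."""

    def __init__(self):
        self._lows = []        # lower bounds, kept sorted
        self._summaries = []   # summaries in the same order as _lows

    def insert(self, summary: Summary) -> None:
        pos = bisect.bisect_right(self._lows, summary.low)
        self._lows.insert(pos, summary.low)
        self._summaries.insert(pos, summary)

    def sensors_for(self, value: float) -> list:
        """Return the sensors whose summaries may contain `value`."""
        # Only summaries with low <= value can match; filter those by high.
        end = bisect.bisect_right(self._lows, value)
        matches = {s.sensor for s in self._summaries[:end] if s.high >= value}
        return sorted(matches)
```

A match means only that the sensor may hold a relevant record, since the key is imprecise; the query must still be forwarded to that sensor for an exact answer.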
Lastly we address analysis and model-building from feature-rich but poorly structured events. In our approach, statistical machine learning techniques are used to build models of application and system behavior in a data center, relying on an automated feature identification mechanism to select model inputs from the raw data stream. We present and evaluate Modellus, a data center monitoring and analysis system based on these mechanisms.