Sketch-based Querying of Distributed Sliding-Window Data Streams

Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis
Technical University of Crete, Greece

{papapetrou, minos, adeli}

Abstract: While traditional data-management systems focus on evaluating single, ad- hoc queries over static data sets in a centralized setting, several emerg- ing applications require (possibly, continuous) answers to queries on dy- namic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is .stale., and operate solely on a sliding window of recent data arrivals (e.g., data updates occur- ring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the prob- lem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching tech- nique (termed ECM-sketch) that allows effective summarization of stream- ing data over both time-based and count-based sliding windows with prob- abilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hit- ters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individ- ual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experi- mental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.

To appear in: VLDB'12