"Statistical Synopses for Graph-Structured XML Databases"
Neoklis Polyzotis and
Proceedings of ACM SIGMOD'2002,
Madison, Wisconsin, June 2002, pp. 358-369.
Effective support for XML query languages is becoming increasingly important
with the emergence of new applications that access large volumes of XML data.
All existing proposals for querying XML (e.g., XQuery) rely on a
pattern-specification language that allows path navigation and branching
through the XML data graph in order to reach the desired data elements. Optimizing such
queries depends crucially on the existence of concise synopsis structures that enable
accurate compile-time selectivity estimates for complex path expressions over
graph-structured XML data.
In this paper, we introduce a novel approach to building and using statistical
summaries of large XML data graphs for effective path-expression selectivity
estimation. Our proposed graph-synopsis model (termed XSketch) exploits
localized graph stability
to accurately approximate (in limited space) the path and branching distribution
in the data graph. To estimate
the selectivities of complex path expressions over concise XSketch synopses,
we develop an estimation framework that relies on appropriate statistical
(uniformity and independence) assumptions to compensate for the lack of detailed
distribution information. Given our estimation framework, we demonstrate that
the problem of building an accuracy-optimal XSketch for a given amount of space
is NP-hard, and propose an efficient heuristic algorithm based on greedy forward
selection. Briefly, our algorithm constructs an XSketch synopsis by successive
refinements of the label-split graph, the coarsest summary of the XML data graph.
Our refinement operations act locally and attempt to capture important statistical
correlations between data paths.
Extensive experimental results with synthetic as well as real-life data sets verify
the effectiveness of our approach.
To the best of our knowledge, ours is the first work to address this timely
problem in the most general setting of graph-structured data and complex
(branching) path expressions.
Alkis' talk slides
Copyright © 2002, Association for Computing Machinery, Inc. (ACM).
Permission to make digital/hard copy of all or part of this material without
fee is granted provided that copies are not made or distributed for profit or
commercial advantage, the ACM copyright/server notice, the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery, Inc. (ACM).
To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.