Technical University of Crete, SoftNet lab



XStreamCluster: an Efficient Algorithm for Streaming XML data Clustering


XML clustering finds many applications, ranging from storage to query processing. However, existing clustering algorithms focus on static XML collections, whereas modern information systems frequently deal with streaming XML data that needs to be processed online. Streaming XML clustering is a challenging task because of the high computational and space efficiency requirements implicated for online approaches. In this paper we propose XStreamCluster, which addresses the two challenges using a two-layered optimization. The bottom layer employs Bloom filters to encode the XML documents, providing a space-efficient solution to memory usage. The top layer is based on Locality Sensitive Hashing and contributes to the computational efficiency. The theoretical analysis shows that the approximate solution of XStreamCluster generates similarly good clusters as the exact solution, with high probability. The experimental results demonstrate that XStreamCluster improves both memory efficiency and computational time by at least an order of magnitude without affecting clustering quality, compared to its variants and a baseline approach.

Artificial DTD files, and DTD Generator

DTD Generator is a highly configurable java program for creating DTD files. The program gets the following arguments:
  • Number of DTDs to create
  • Maximum depth per DTD
  • Number of distinct labels, in all DTDs
  • Expected number of children per node (the actual number per node is selected randomly from a Poisson distribution)
  • Probability of '+' per node (from 1 to 100)
  • Probability of '*' per node (from 1 to 100)
  • Probability of '?' per node (from 1 to 100)
  • Output folder (needs to be available)
DTD Generator will create a set of DTD files in the output folder, labeled as gen0.dtd, gen1.dtd, etc. From these files, one can use a publicly available XML generator, e.g., IBM XML Generator to create the XML files. Requirements: DTD Generator uses the colt library for generating random numbers. Although it can be easily adapted to use the standard java random method, it is not recommended, because the random number generator of the colt library has a longer period and achieves better pseudo-randomness.

Real DTD files

The following publicly available DTDs were also used in our experiments:
Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Last modified September 21 2010 20:17:13.