Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees

Odysseas Papapetrou, Wolf Siberski, Norbert Fuhr*
L3S Research Center, University of Hannover
*University of Duisburg-Essen

{papapetrou, siberski}@L3S.de, norbert.fuhr@uni-due.de

Abstract: Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, for highly distributed environments, such as peer-to-peer networks, current clustering algorithms fail to scale. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 100,000 peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.

author = {Odysseas Papapetrou and Wolf Siberski and Norbert Fuhr},
title = {Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees},
booktitle = {32nd European Conference on Information Retrieval (ECIR) 2010},
year = {2010}