Decentralized Probabilistic Text Clustering

Odysseas Papapetrou (1, 2), Wolf Siberski (2), Norbert Fuhr (3)

1: Department of Electronic and Computer Engineering, Technical University of Crete
2: L3S Research Center, University of Hannover
3: Faculty of Engineering Sciences, University of Duisburg-Essen

Abstract: Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, traditional text clustering algorithms fail to scale in highly distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents with only a few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.

TKDE, 2012

Preprint here

Final version available from TKDE

author = {Odysseas Papapetrou and Wolf Siberski and Norbert Fuhr},
title = {Decentralized Probabilistic Text Clustering},
journal = {IEEE Trans. Knowl. Data Eng.},
volume = {24},
number = {10},
year = {2012},
pages = {1848-1861},
ee = {}