Exploiting Distribution Skew for Scalable P2P Text Clustering


Odysseas Papapetrou, Wolf Siberski, Fabian Leitritz, Wolfgang Nejdl
Forschungszentrum L3S
Appelstrasse 9a
30167 Hannover

{papapetrou, siberski, leitritz, nejdl}@L3S.DE


Abstract: K-Means clustering is widely used in information retrieval and data mining. Distributed K-Means variants have already been proposed, but none of the past algorithms scales to large numbers of nodes. In this work, we describe a new P2P algorithm that significantly reduces communication costs by exploiting the distribution skew naturally found in text and other datasets. The algorithm achieves high clustering quality and requires no synchronization between peers. An extensive evaluation with up to 100,000 peers shows the algorithm's effectiveness and scalability, as well as its ability to cope with churn.

@inproceedings{conf/dbisp2p/PapapetrouSLN08,
title = {Exploiting Distribution Skew for Scalable P2P Text Clustering},
author = {Odysseas Papapetrou and Wolf Siberski and Fabian Leitritz and Wolfgang Nejdl},
booktitle = {DBISP2P},
crossref = {conf/dbisp2p/2008},
pages = {1--12},
url = {http://dblp.uni-trier.de/db/conf/dbisp2p/dbisp2p2008.html#PapapetrouSLN08},
year = {2008},
ee = {http://www.vldb.org/conf/2008/workshops/WProc_dbisp2p08/Paper_14.pdf},
keywords = {Peer-to-peer, Distributed K-Means}
}