IPMicra: An IP-address based Location Aware Distributed Web Crawler

Odysseas Papapetrou, George Samaras
Department of Computer Science, University of Cyprus

Abstract: Distributed crawling can overcome important limitations of traditional single-sourced web crawling systems. However, the optimal benefit of distributed crawling is usually limited to the sites hosting the crawlers; the rest of the URLs are by and large distributed randomly among the crawlers. In this work, we propose a location-aware method, called IPMicra, that utilizes an IP address hierarchy and allows links to be crawled in a near-optimal, location-aware manner. Our proposal outperforms earlier distributed crawling schemes, requiring one order of magnitude less time to crawl the same set of sites.
Keywords: distributed crawling, web crawling, location-aware crawling

