
Ekaterini Ioannou

Software Technology and Network Applications Laboratory

Department of Electronic & Computer Engineering
Technical University of Crete
University Campus
73100, Crete, HELLAS

ioannou AT softnet.tuc.gr
EkateriniIoannou AT acm.org

Efficient Entity Resolution for Large Heterogeneous Information Spaces

Georgios Papadakis, Ekaterini Ioannou, Claudia Niederée, and Peter Fankhauser.
In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM), Feb. 2011, Hong Kong.


We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and on schemata known a priori. In this paper, we introduce a novel approach for entity resolution that scales to large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
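As a rough illustration of what attribute-agnostic blocking can look like, the sketch below keys blocks on the tokens of attribute values and ignores attribute names entirely, so entities described with different schemata can still land in the same block. It is a minimal sketch under stated assumptions, not the paper's implementation: the tokenization rule, the function names build_token_blocks and candidate_pairs, and the toy entities are all hypothetical, and the block processing techniques mentioned in the abstract (boosting, match propagation, preemption) are not modeled.

```python
import re
from collections import defaultdict
from itertools import combinations


def build_token_blocks(entities):
    """Group entity ids by every token that occurs in any attribute value.

    Attribute names are never inspected, so no schema knowledge is needed.
    (Hypothetical helper for illustration only.)
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single entity produces no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy, made-up entities that use different attribute names for similar facts.
entities = {
    "e1": {"name": "Alice Smith", "city": "Chania"},
    "e2": {"fullName": "A. Smith", "location": "Chania, Crete"},
    "e3": {"label": "Bob Jones"},
    "e4": {"title": "Carol Jones", "based_in": "Hong Kong"},
}

blocks = build_token_blocks(entities)
pairs = candidate_pairs(blocks)

# Without blocking, every pair of entities would have to be compared.
all_pairs = len(entities) * (len(entities) - 1) // 2
print(f"{len(pairs)} candidate comparisons instead of {all_pairs}")
```

In this toy input only the pairs that share a token are kept as candidates; the pairs still need an actual matching step, since sharing a token (for example a common surname) does not imply that two descriptions refer to the same entity.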


     author = {George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Peter Fankhauser},
     title = {Efficient Entity Resolution for Large Heterogeneous Information Spaces},
     booktitle = {WSDM},
     pages = {535--544},
     year = {2011}

Last modified: April 2011