Name Disambiguation

ADANA: Actively Disambiguating Person Names with User Interaction

Xuezhi Wang, Jie Tang, Hong Cheng, Philip S. Yu, Bo Wang, Bo Gao

KEG Group, Tsinghua University, China

Introduction

Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person’s name in these systems, many documents (e.g., papers, webpages) containing that name may be returned. Which documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of information about people available on the Web.
Here we study this problem from a new perspective and propose ADANA, a method for disambiguating person names via active user interaction. In ADANA, we first introduce a pairwise factor graph (PFG) model for person name disambiguation. The model has two advantages: it automatically determines the number of distinct persons with the same name, and it can incorporate various types of features. Based on the PFG model, we propose an active name disambiguation algorithm that aims to improve disambiguation performance by maximizing the utility of the user’s corrections. Experimental results on three different genres of data sets show that with only a few user corrections, the error rate of name disambiguation can be reduced to 3.1%. A real system (the one you are now visiting) has been developed based on the proposed method and is available online.

Features

In the publication data, each paper is associated with a set of attributes: coauthor list, title, publication venue, publication year, references, paper content, and affiliations. Because our (active) name disambiguation always deals with a pair of documents rather than a single one, we take each pair of documents as the basic unit in our algorithm framework and define the following features for a document pair: Citation, CoAuthor, CoVenue, CoAffiliation, CoContent, TitleSim, CoHomepage.
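To make the pairwise view concrete, here is a minimal sketch of how a few of these features could be computed for one document pair. The exact feature definitions are not given on this page, so everything below is an assumption for illustration: the `Paper` record and its field names are hypothetical, CoAuthor is taken to be the number of shared coauthors other than the ambiguous name, and TitleSim is taken to be Jaccard overlap of title tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record layout; field names are assumptions for illustration.
    pid: str
    title: str
    authors: list
    venue: str
    references: set = field(default_factory=set)


def pair_features(a: Paper, b: Paper, target_name: str) -> dict:
    """Compute a few illustrative features for one document pair."""
    # Coauthors shared by both papers, excluding the ambiguous name itself.
    shared = (set(a.authors) & set(b.authors)) - {target_name}
    # Jaccard overlap of lowercase title tokens as a simple title similarity
    # (an assumed definition, not necessarily the one used by ADANA).
    ta, tb = set(a.title.lower().split()), set(b.title.lower().split())
    union = ta | tb
    title_sim = len(ta & tb) / len(union) if union else 0.0
    return {
        "Citation": int(b.pid in a.references or a.pid in b.references),
        "CoAuthor": len(shared),
        "CoVenue": int(a.venue == b.venue),
        "TitleSim": title_sim,
    }


# Two made-up papers that share the ambiguous author name "Jie Tang".
p1 = Paper("p1", "Mining social networks", ["Jie Tang", "Bo Wang"], "KDD", {"p2"})
p2 = Paper("p2", "Social networks analysis", ["Jie Tang", "Bo Wang", "Bo Gao"], "KDD")
features = pair_features(p1, p2, "Jie Tang")
```

The remaining features (CoAffiliation, CoContent, CoHomepage) would follow the same pattern: each takes two documents and returns one value describing how strongly the pair appears to be connected.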

Name Disambiguation via Pairwise Factor Graph Model

We formalize the problem of name disambiguation in a pairwise factor graph (PFG) model. The basic idea is to associate each pair of documents (e.g., d_i and d_j) with a hidden variable y_ij, representing whether these two documents should be assigned to the same cluster (y_ij = 1) or to different clusters (y_ij = 0). Figure 1 shows a simple example of a PFG.
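The sketch below illustrates why pairwise variables make it unnecessary to fix the number of persons in advance. It is not the PFG inference itself: it assumes the values of y_ij have already been predicted, and it simply groups documents by the transitive closure of the pairs with y_ij = 1 using a union-find structure. The function name and the toy labels are hypothetical.

```python
def clusters_from_pairs(doc_ids, y):
    """Group documents by taking the transitive closure of pairs with y_ij = 1."""
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent links to the root, compressing the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (i, j), label in y.items():
        if label == 1:
            # Merge the two groups that contain d_i and d_j.
            parent[find(i)] = find(j)

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Toy labels for four documents: d1 and d2 match, d3 and d4 match.
docs = ["d1", "d2", "d3", "d4"]
y = {("d1", "d2"): 1, ("d1", "d3"): 0, ("d2", "d3"): 0,
     ("d1", "d4"): 0, ("d2", "d4"): 0, ("d3", "d4"): 1}
clusters = clusters_from_pairs(docs, y)
```

Here the number of clusters (two) falls out of the pairwise labels rather than being supplied as a parameter. Note that a plain transitive closure merges documents even when some pairwise labels disagree (for example y_12 = 1 and y_23 = 1 but y_13 = 0); resolving such conflicts is the job of the model, which this sketch does not attempt.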

Data Sets and Experiments

We perform our experiments on three different genres of real-world data sets: Publication, CALO, and News Stories.

1) Publication[1]. The data set is from Arnetminer.org, which collected about 1,300,000 papers from DBLP, 450,223 papers from IEEE, and 1,343,442 papers and 3,687,675 citation relationships from ACM. After combining all the papers and removing those with incomplete information, we obtain a publication data set of 1,632,442 papers and 3,021,489 citation relationships. For evaluation, we manually labeled 6,730 papers for 100 author names.
2) CALO[2]. It contains a labeled data set of 1,085 Web pages for 12 person names. The data set is the email directory of one participant (i.e., Melinda Gervasio) of the CALO project. The 12 names appear in the headers of messages in the email directory. On average, each person name corresponds to about 15 different persons. The task is to associate the emails with different persons.
3) News Stories[3]. It consists of 755 ambiguous entities appearing in 20 news pages. The task is to cluster the ambiguous entities into different groups.

 

[1] The 100 author names are listed below. You can click on an author name to try our active name disambiguation.

1   Michael Smith 2   Philip J. Smith 3   Yoshio Tanaka 4   Yang Yu
5   Jose M. Garcia 6   John F. McDonald 7   Z. Wang 8   Yue Zhao
9   Lu Liu 10   Jing Zhang 11   David Cooper 12   John Collins
13   Wen Gao 14   Fan Wang 15   F. Wang 16   Keith Edwards
17   Hui Fang 18   Paul Wang 19   Alok Gupta 20   Hui Yu
21   Qiang Shen 22   Kai Tang 23   Ping Zhou 24   Yan Tang
25   Peter Phillips 26   Wei Xu 27   Michael Lang 28   Manuel Silva
29   Charles Smith 30   Thomas Zimmermann 31   Yu Zhang 32   Kuo Zhang
33   Thomas Meyer 34   William H. Hsu 35   Frank Mueller 36   Gang Chen
37   Xiaoming Wang 38   Eric Martin 39   Kai Zhang 40   Fei Su
41   Paul Brown 42   Jie Tang 43   Feng Liu 44   Robert Schreiber
45   Satoshi Kobayashi 46   Lei Jin 47   R. Balasubramanian 48   David Jensen
49   Thomas Wolf 50   Li Shen 51   Hao Wang 52   Robert Allen
53   Steve King 54   Lei Chen 55   Koichi Furukawa 56   Thomas Tran
57   Thomas Hermann 58   J. Guo 59   John Hale 60   Jie Yu
61   Yun Wang 62   Ji Zhang 63   Mark Davis 64   David Brown
65   Cheng Chang 66   Gang Luo 67   Xiaoyan Li 68   Bin Li
69   Bing Liu 70   R. Ramesh 71   Jianping Wang 72   Barry Wilkinson
73   David E. Goldberg 74   Feng Pan 75   David Nelson 76   Lei Fang
77   Rakesh Kumar 78   Thomas D. Taylor 79   Jeffrey Parsons 80   Richard Taylor
81   Jim Gray 82   Juan Carlos Lopez 83   Sanjay Jain 84   Ajay Gupta
85   David Levine 86   Shu Lin 87   Michael Siegel 88   S. Huang
89   Bin Zhu 90   Young Park 91   Yi Deng 92   Daniel Massey
93   Bob Johnson 94   Michael Wagner 95   Ning Zhang 96   David C. Wilson
97   Ke Chen 98   Yong Chen 99   Rafael Alonso 100   Bin Yu

 

[2] The data set is used in the paper: "R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW’05, pages 463–470, 2005."

[3] The data set is used in the paper: "S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP’07, pages 708–716, 2007."

Online System

Here is a snapshot of the system showing how we achieve active name disambiguation.

Papers shown in dark red on the right side are very likely to belong to the person on the left side.

Created by Xuezhi Wang. Last Updated 03/03/2011.

If you have any questions, please contact Xuezhi Wang <littlexxxx at 163.com>.