Xuezhi Wang, Jie Tang, Hong Cheng, Philip S. Yu, Bo Wang, Bo Gao
KEG Group, Tsinghua University, China
Introduction
Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, webpages) containing that name may be returned. Which documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of information about people available on the Web.
Here we study this problem from a new perspective and propose ADANA, a method for disambiguating person names via active user interactions. In ADANA, we first introduce a pairwise factor graph (PFG) model for person name disambiguation. The model has two advantages: it automatically determines the number of distinct persons with the same name, and it can incorporate various types of features. Based on the PFG model, we propose an active name disambiguation algorithm that aims to improve disambiguation performance by maximizing the utility of the user's corrections. Experimental results on three different genres of data sets show that with only a few user corrections, the error rate of name disambiguation can be reduced to 3.1%. A real system (the one you are visiting now) has been developed based on the proposed method and is available online.
Features
In the publication data, each paper is associated with a set of attributes: coauthor lists, title, publication venue, publication year, references, paper content, and affiliations. As our (active) name disambiguation always tries to deal with a pair of documents, instead of a single one, we take each pair of documents as the basic unit in our algorithm framework and define the following features for a document pair: Citation, CoAuthor, CoVenue, CoAffiliation, CoContent, TitleSim, CoHomepage.
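As a rough illustration of treating a document pair as the basic unit, the sketch below computes a few of the listed features for two papers. The `Paper` record, its field names, and the exact feature definitions (shared-coauthor count, exact venue match, Jaccard overlap of title words) are assumptions made for this example; the page does not specify how each feature is actually computed, and CoContent and CoHomepage are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record layout; field names are illustrative only.
    pid: str
    title: str
    authors: set
    venue: str
    affiliation: str = ""
    references: set = field(default_factory=set)


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(p, q, ambiguous_name):
    """Feature dict for one document pair (assumed definitions)."""
    # The ambiguous name appears in both papers by construction, so exclude it.
    shared = (p.authors & q.authors) - {ambiguous_name}
    return {
        "Citation": int(q.pid in p.references or p.pid in q.references),
        "CoAuthor": len(shared),
        "CoVenue": int(p.venue == q.venue),
        "CoAffiliation": int(bool(p.affiliation) and p.affiliation == q.affiliation),
        "TitleSim": jaccard(set(p.title.lower().split()), set(q.title.lower().split())),
    }


# Two made-up papers that both list the ambiguous name "J. Smith".
a = Paper("p1", "Active name disambiguation", {"J. Smith", "A. Lee"},
          "VenueX", "Univ A", {"p2"})
b = Paper("p2", "Name disambiguation in digital libraries", {"J. Smith", "A. Lee"},
          "VenueX", "Univ A")
features = pair_features(a, b, "J. Smith")
```

Each pair then yields one feature vector, which is the kind of input a pairwise model can attach to the corresponding document pair.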
Name Disambiguation via Pairwise Factor Graph Model
We formalize the problem of name disambiguation in a pairwise factor graph (PFG) model. The basic idea is to associate each pair of documents (e.g., di and dj) with a hidden variable yij, representing whether these two documents should be assigned to the same cluster (yij = 1) or to different clusters (yij = 0). Figure 1 shows a simple example of a PFG.
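To make the role of the pairwise variables concrete, the sketch below turns a set of already-decided yij values into clusters by merging every pair labeled 1 and counting the resulting groups. This is only an illustration of how pairwise labels can determine the number of distinct persons without fixing it in advance; taking the transitive closure with a union-find structure is an assumption of this example, not the inference procedure of the PFG model, and the function name and toy labels are hypothetical.

```python
def clusters_from_pairs(doc_ids, y):
    """Group documents by taking the transitive closure of y_ij = 1 pairs."""
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent links to the root, shortening the path on the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (i, j), label in y.items():
        if label == 1:
            parent[find(i)] = find(j)

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Toy labels: d1 and d2 belong together, as do d3 and d4.
docs = ["d1", "d2", "d3", "d4"]
y = {
    ("d1", "d2"): 1, ("d1", "d3"): 0, ("d1", "d4"): 0,
    ("d2", "d3"): 0, ("d2", "d4"): 0, ("d3", "d4"): 1,
}
clusters = clusters_from_pairs(docs, y)
num_persons = len(clusters)
```

Note that independently predicted pairwise labels can be mutually inconsistent (for example, yij = 1 and yjk = 1 but yik = 0); this sketch resolves such cases by merging, which is one simple choice among several.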

Data Sets and Experiments
We perform our experiments on three different genres of real-world data sets: Publication, CALO, and News Stories.
1) Publication[1]. The data set is from Arnetminer.org, which collected about 1,300,000 papers from DBLP, 450,223 papers from IEEE, and 1,343,442 papers and 3,687,675 citation relationships from ACM. After combining all the papers and removing those with incomplete information, we obtained a publication data set of 1,632,442 papers and 3,021,489 citation relationships. For evaluation, we manually labeled 6,730 papers for 100 author names.
2) CALO[2]. It contains a labeled data set of 1,085 Web pages for 12 person names. The data set is the email directory of one participant (i.e., Melinda Gervasio) of the CALO project. The 12 names appear in the headers of messages in the email directory. On average, each person name corresponds to about 15 different persons. The task is to associate the emails with different persons.
3) News Stories[3]. It consists of 755 ambiguous entities appearing in 20 news pages. The task is to cluster the ambiguous entities into different groups.
[1] The 100 author names are listed below. You can click on an author name to try our active name disambiguation.

[2] The data set is used in the paper: "R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW’05, pages 463–470, 2005."
[3] The data set is used in the paper: "S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In EMNLP’07, pages 708–716, 2007."
Online System
Here is a snapshot of the system showing how we achieve active name disambiguation.
Papers shown in dark red on the right side are very likely to belong to the person on the left side.
Created by Xuezhi Wang. Last Updated 03/03/2011.
If you have any questions, please contact Xuezhi Wang <littlexxxx at 163.com>.