Xuezhi Wang, Jie Tang, Hong Cheng, Philip S. Yu
KEG Group, Tsinghua University, China
Introduction
Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, webpages) containing that name may be returned. Which of these documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of information about people available on the Web.
We share related data sets and our ideas for name disambiguation on this page. If you use the data in a publication, please cite our papers (see References below).
Data Set [Simple Version Download, Readme] [Rich Version Download, Readme]
This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain the "citation", "co-affiliation-occur", and "homepage" features. Refer to our ICDM 2011 paper for the definition of these features.)
- Raw file: the raw file is formatted as an XML file. In the XML file, the author name is associated with a number of publications. An example of a publication is as follows:

    <publication>
      <title>Explanation-based Failure Recovery</title>
      <year>1987</year>
      <authors>Ajay Gupta</authors>
      <jconf>AAAI</jconf>
      <id>13048</id>
      <label>0</label>
      <organization>null</organization>
    </publication>

  where <title> denotes the title of the publication; <year> denotes the publication year; <authors> denotes the authors of the publication; <jconf> denotes the publication venue; <id> denotes the publication id; <label> denotes the labeled person, e.g., all publications with "<label>0</label>" can be considered as published by the same person; <organization> denotes the affiliation of the author(s).
- Answer file: the answer file is the ground truth. It is extracted from the raw file by treating all publications with the same <label> value as belonging to one person. The format is plain text. The following is an example:

    #Ajay Gupta
    #1:13048 388794 596099 1265282 1179332 675629 39153 258611
    #2:988870 1490190
    #3:1393934
    #4:1398544
    #5:1739014
    #6:1671104 515636 1678096
    #7:1126381 1205032 275987 277587 276300 1549674 1034401
    #8:600181 846439 149270 175996 264268 264291 299548 1384744 300057 302056 545651 1212517
    #9:1316053

  where the first line gives the author name and each following line lists the publications of one disambiguated person. For example, the line starting with "#1:" indicates that one person published 8 papers, whose IDs are 13048, 388794, 596099, 1265282, 1179332, 675629, 39153, and 258611.
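The two file formats described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the root element name (`<person>` here) and the second publication in the sample are hypothetical, since only the `<publication>` element is documented on this page, and the function names are ours for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample raw data. The <person> wrapper and the second publication are
# hypothetical; only the first <publication> comes from the page.
SAMPLE_RAW = """<person>
  <publication>
    <title>Explanation-based Failure Recovery</title>
    <year>1987</year>
    <authors>Ajay Gupta</authors>
    <jconf>AAAI</jconf>
    <id>13048</id>
    <label>0</label>
    <organization>null</organization>
  </publication>
  <publication>
    <title>Hypothetical second paper</title>
    <year>1990</year>
    <authors>Ajay Gupta</authors>
    <jconf>AAAI</jconf>
    <id>988870</id>
    <label>1</label>
    <organization>null</organization>
  </publication>
</person>"""

SAMPLE_ANSWER = """#Ajay Gupta
#1:13048 388794
#2:988870 1490190
#3:1393934
"""


def group_ids_by_label(xml_text):
    """Group publication ids by their <label> value (one label = one person)."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    # iter() finds <publication> at any depth, so the root name does not matter.
    for pub in root.iter("publication"):
        label = pub.findtext("label", default="").strip()
        pub_id = pub.findtext("id", default="").strip()
        groups[label].append(int(pub_id))
    return dict(groups)


def parse_answer(text):
    """Return (author_name, {cluster_number: [publication ids]})."""
    name = None
    clusters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue
        head, sep, rest = line[1:].partition(":")
        if sep and head.isdigit():
            # A line such as "#2:988870 1490190" lists one person's papers.
            clusters[int(head)] = [int(tok) for tok in rest.split()]
        elif name is None:
            # The first "#..." line without a numeric prefix is the author name.
            name = line[1:].strip()
    return name, clusters


if __name__ == "__main__":
    print(group_ids_by_label(SAMPLE_RAW))
    print(parse_answer(SAMPLE_ANSWER))
```

Grouping the raw file by `<label>` should yield the same partition as the answer file, which gives a simple consistency check between the two folders.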
ADANA: Name Disambiguation via Pairwise Factor Graph Model
We formalize the problem of name disambiguation in a pairwise factor graph (PFG) model. The basic idea is to associate each pair of documents (e.g., d_i and d_j) with a hidden variable y_ij, representing whether the two documents should be assigned to the same cluster (y_ij = 1) or to different clusters (y_ij = 0). Figure 1 shows a simple example of a PFG. For details of the method, please refer to our ICDM'11 paper or the TKDE'12 paper.
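To make the role of the pairwise variables concrete, the sketch below builds y_ij for every document pair from a known labeling and then recovers clusters from the pairwise decisions. It only illustrates the pairwise representation; it is not the inference procedure of the PFG model. The function names are hypothetical, and merging pairs through connected components (union-find) is our simplifying assumption.

```python
from itertools import combinations


def pairwise_variables(labels):
    """labels maps document id -> person label.

    Returns {(i, j): y_ij} for every pair with i < j, where y_ij = 1 if the
    two documents share a label (same cluster) and 0 otherwise.
    """
    docs = sorted(labels)
    return {(a, b): int(labels[a] == labels[b]) for a, b in combinations(docs, 2)}


def clusters_from_pairs(docs, y):
    """Merge documents connected by y_ij = 1 (assumption: transitive closure)."""
    parent = {d: d for d in docs}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), same in y.items():
        if same:
            parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    labels = {13048: 0, 388794: 0, 988870: 1}
    y = pairwise_variables(labels)
    print(y)
    print(clusters_from_pairs(list(labels), y))
```

With n documents there are n(n-1)/2 such variables, which is why the pair, rather than the single document, is the natural unit for defining features.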

Features
In the publication data, each paper is associated with a set of attributes: coauthor list, title, publication venue, publication year, references, paper content, and affiliations. Because our (active) name disambiguation always operates on a pair of documents rather than a single one, we take each pair of documents as the basic unit in our algorithm framework and define the following features for a document pair: Citation, CoAuthor, CoVenue, CoAffiliation, CoContent, TitleSim, CoHomepage.
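As a rough illustration of what pair-level features look like, the sketch below computes three of them for two papers. The exact definitions are given in the ICDM'11 paper; the formulas here (count of shared coauthors, exact venue match, Jaccard overlap of title words) are simplified stand-ins that we assume for illustration, and the dictionary layout and sample papers are hypothetical.

```python
import re


def title_tokens(title):
    """Lowercased alphanumeric word set of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def pair_features(p, q, target_name):
    """Compute simplified CoAuthor, CoVenue, and TitleSim for a paper pair.

    p and q are dicts with keys "title", "authors" (list), and "venue".
    target_name is the ambiguous author name, excluded from the coauthors.
    """
    coauthors_p = set(p["authors"]) - {target_name}
    coauthors_q = set(q["authors"]) - {target_name}
    shared = coauthors_p & coauthors_q

    tp, tq = title_tokens(p["title"]), title_tokens(q["title"])
    union = tp | tq
    # Jaccard similarity; defined as 0.0 when both titles are empty.
    title_sim = len(tp & tq) / len(union) if union else 0.0

    return {
        "CoAuthor": len(shared),
        "CoVenue": int(p["venue"] == q["venue"]),
        "TitleSim": title_sim,
    }


if __name__ == "__main__":
    # Hypothetical papers; only the first title and venue appear on this page.
    p = {"title": "Explanation-based Failure Recovery",
         "authors": ["Ajay Gupta", "B. Example"], "venue": "AAAI"}
    q = {"title": "Failure Recovery in Planning",
         "authors": ["Ajay Gupta", "B. Example", "C. Example"], "venue": "AAAI"}
    print(pair_features(p, q, "Ajay Gupta"))
```

Features such as Citation, CoAffiliation, and CoHomepage need the additional fields that are only present in the rich version of the data set.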
Experiments
We perform our experiments on three different genres of real-world data sets: Publication, CALO, and News Stories.
1) Publication[1]. The data set is from Arnetminer.org, which collected about 1,300,000 papers from DBLP, 450,223 papers from IEEE, and 1,343,442 papers and 3,687,675 citation relationships from ACM. After combining all the papers and removing those with incomplete information, we obtain a publication data set of 1,632,442 papers and 3,021,489 citation relationships. For evaluation, we manually labeled 6,730 papers for 100 author names.
2) CALO[2]. It contains a labeled data set of 1,085 Web pages for 12 person names. The data set is the email directory of one participant (i.e., Melinda Gervasio) of the CALO project. The 12 names appear in the headers of messages in the email directory. On average, each person name corresponds to about 15 different persons. The task is to associate the emails with different persons.
3) News Stories[3]. It consists of 755 ambiguous entities appearing in 20 news pages. The task is to cluster the ambiguous entities into different groups.
[1] The 100 author names are listed below; click on an author name to try our active name disambiguation.

[2] The data set is used in the paper: "R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW’05, pages 463–470, 2005."
[3] The data set is used in the paper: "S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP’07, pages 708–716, 2007."
References
Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 24, Issue 6, 2012, Pages 975-987. (IF = 2.236) [PDF] [URL]
Created by Xuezhi Wang and Jie Tang. Last Updated 03/07/2013.
If you have any questions, please contact Xuezhi Wang <littlexxxx at 163.com> and <jery [dot] tang [at] gmail.com>.