Name Disambiguation: Data and Experiments

Xuezhi Wang, Jie Tang, Hong Cheng, Philip S. Yu

KEG Group, Tsinghua University, China


Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, webpages) containing that name may be returned. Which of these documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially given the rapid growth of information about people available on the Web.

On this page we share related data sets and our ideas for name disambiguation. If you use the data in a publication, please cite the following papers:

    author = {Jie Tang and Alvis C.M. Fong and Bo Wang and Jing Zhang},
    title = {A Unified Probabilistic Framework for Name Disambiguation in Digital Library},
    journal ={IEEE Transactions on Knowledge and Data Engineering},
    volume = {24},
    number = {6},
    year = {2012},
@INPROCEEDINGS{ wang:adana:,
AUTHOR = "Xuezhi Wang and Jie Tang and Hong Cheng and Philip S. Yu",
TITLE = "ADANA: Active Name Disambiguation",
BOOKTITLE = "Proceedings of 2011 IEEE International Conference on Data Mining (ICDM'2011)",
PAGES = {794-803},
YEAR = {2011},
}


Data Set [Simple Version Download, Readme] [Rich Version Download, Readme]

This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. The simple version does not contain the "citation", "co-affiliation-occur", and "homepage" features; refer to our ICDM 2011 paper for their definitions.

- Raw file: the raw file is formatted as an XML file. In the XML file, the author name is associated with a number of publications. An example of a publication is as follows:
		<title>Explanation-based Failure Recovery</title>
		<authors>Ajay Gupta</authors>
where <title> denotes the title of the publication; 
<year> denotes the publication year;
<jconf> denotes the publication venue;
<id> denotes the publication id;
<label> denotes the labeled person, e.g., all publications with "<label>0</label>" can be considered as published by the same person;
<organization> denotes the affiliation of the author(s).
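
Because publications that share a <label> value belong to the same person, the ground-truth clusters can be recovered by grouping publication ids by label. The following is a minimal Python sketch of that grouping. The wrapper tags <person> and <publication> and the sample values are hypothetical; only the child tags listed above come from the data set description, so the code looks for any element that has both an <id> and a <label> child instead of relying on a wrapper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: wrapper tag names and all values are made up
# for illustration; only the child tag names are documented.
SAMPLE = """<person>
  <publication>
    <title>Paper A</title>
    <year>2001</year>
    <authors>Ajay Gupta</authors>
    <jconf>Venue X</jconf>
    <id>101</id>
    <label>0</label>
    <organization>null</organization>
  </publication>
  <publication>
    <title>Paper B</title>
    <year>2003</year>
    <authors>Ajay Gupta</authors>
    <jconf>Venue Y</jconf>
    <id>102</id>
    <label>1</label>
    <organization>null</organization>
  </publication>
  <publication>
    <title>Paper C</title>
    <year>2004</year>
    <authors>Ajay Gupta</authors>
    <jconf>Venue X</jconf>
    <id>103</id>
    <label>0</label>
    <organization>null</organization>
  </publication>
</person>"""


def group_ids_by_label(xml_text):
    """Map each <label> value to the list of publication ids carrying it."""
    root = ET.fromstring(xml_text)
    clusters = defaultdict(list)
    for elem in root.iter():
        # findtext only inspects direct children, so this matches the
        # element that wraps one publication, whatever it is called.
        pub_id = elem.findtext("id")
        label = elem.findtext("label")
        if pub_id is not None and label is not None:
            clusters[label.strip()].append(int(pub_id.strip()))
    return dict(clusters)


print(group_ids_by_label(SAMPLE))
```

Real raw files may need extra handling (for example, characters that are not valid XML), which this sketch does not attempt.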

- Answer file: the answer file is the ground truth. It is extracted from the raw file by treating all publications with the same <label> value as belonging to one person. The format is plain text. The following is an example:
#Ajay Gupta
#1:13048 388794 596099 1265282 1179332 675629 39153 258611
#2:988870 1490190
#6:1671104 515636 1678096
#7:1126381 1205032 275987 277587 276300 1549674 1034401
#8:600181 846439 149270 175996 264268 264291 299548 1384744 300057 302056 545651 1212517
where the first line denotes the author name and each of the following lines indicates a disambiguated person. For example, the line beginning with "#1:" indicates that one person published 8 papers, whose IDs are 13048, 388794, 596099, 1265282, 1179332, 675629, 39153, and 258611.
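
As an illustration, an answer file in the layout shown above can be read with a few lines of Python. This is a sketch that assumes every line starts with "#", that the first such line is the author name, and that each remaining line has the form "#<person index>:<space-separated paper ids>"; the function name parse_answer is our own and not part of the data set.

```python
EXAMPLE = """#Ajay Gupta
#1:13048 388794 596099 1265282 1179332 675629 39153 258611
#2:988870 1490190
#6:1671104 515636 1678096
"""


def parse_answer(text):
    """Return (author_name, {person_index: [paper ids]})."""
    name = None
    persons = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue  # skip blank or unexpected lines
        body = line[1:]
        head, sep, tail = body.partition(":")
        if sep and head.strip().isdigit():
            # "#1:13048 388794 ..." -> person 1 with these paper ids
            persons[int(head)] = [int(tok) for tok in tail.split()]
        elif name is None:
            # first non-numeric "#" line is the author name
            name = body.strip()
    return name, persons


name, persons = parse_answer(EXAMPLE)
print(name, {k: len(v) for k, v in persons.items()})
```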

ADANA: Name Disambiguation via Pairwise Factor Graph Model

We formalize the problem of name disambiguation in a pairwise factor graph (PFG) model. The basic idea is to associate each pair of documents (e.g., d_i and d_j) with a hidden variable y_ij, representing whether the two documents should be assigned to the same cluster (y_ij = 1) or to different clusters (y_ij = 0). Figure 1 shows a simple example of a PFG. For details of the method, please refer to our ICDM'11 paper or the TKDE'12 paper.
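
To make the role of the pairwise variables concrete, the sketch below shows how a set of pairs with y_ij = 1 induces document clusters by transitive closure, using a small union-find structure. This is only an illustration of the pairwise-to-cluster relationship; it is not the inference procedure of the PFG model, and the function name and the input format (document indices 0..n-1 plus a list of same-cluster pairs) are assumptions made for this example.

```python
def clusters_from_pairs(num_docs, same_pairs):
    """Group documents 0..num_docs-1 given pairs (i, j) with y_ij = 1."""
    parent = list(range(num_docs))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in same_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two clusters

    groups = {}
    for doc in range(num_docs):
        groups.setdefault(find(doc), []).append(doc)
    return sorted(groups.values())


# Documents 0, 1, 2 are linked through two pairs; 3 and 4 through one.
print(clusters_from_pairs(5, [(0, 1), (1, 2), (3, 4)]))
```

Pairs with y_ij = 0 are simply left out of same_pairs here; a full model would also have to resolve decisions that conflict with the transitive closure.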



In the publication data, each paper is associated with a set of attributes: coauthor list, title, publication venue, publication year, references, paper content, and affiliations. Because our (active) name disambiguation operates on pairs of documents rather than on single documents, we take each pair of documents as the basic unit in our algorithm framework and define the following features for a document pair: Citation, CoAuthor, CoVenue, CoAffiliation, CoContent, TitleSim, CoHomepage.
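
As a rough illustration of what pairwise features look like, the sketch below computes three of them for a pair of documents. The definitions used here are simplifying assumptions (CoAuthor as the number of shared coauthors other than the ambiguous name, CoVenue as an exact venue match, TitleSim as Jaccard overlap of lowercased title words); the precise definitions are given in the ICDM'11 paper and may differ. The dictionary keys and the sample documents are hypothetical.

```python
def pair_features(doc_a, doc_b, target_name):
    """Compute illustrative pairwise features for two documents.

    Each document is a dict with keys "title", "authors" (list of names)
    and "jconf" (venue); this layout is an assumption for the example.
    """
    # Shared coauthors, not counting the ambiguous name itself.
    authors_a = set(doc_a["authors"]) - {target_name}
    authors_b = set(doc_b["authors"]) - {target_name}

    # Jaccard overlap of title words as a simple title similarity.
    tokens_a = set(doc_a["title"].lower().split())
    tokens_b = set(doc_b["title"].lower().split())
    union = tokens_a | tokens_b

    return {
        "CoAuthor": len(authors_a & authors_b),
        "CoVenue": int(doc_a["jconf"] == doc_b["jconf"]),
        "TitleSim": len(tokens_a & tokens_b) / len(union) if union else 0.0,
    }


# Made-up documents for the ambiguous name "Ajay Gupta".
doc1 = {"title": "Fast Graph Search",
        "authors": ["Ajay Gupta", "X. Coauthor"], "jconf": "Venue X"}
doc2 = {"title": "Graph Search in Large Networks",
        "authors": ["Ajay Gupta", "X. Coauthor", "Y. Coauthor"],
        "jconf": "Venue Y"}
print(pair_features(doc1, doc2, "Ajay Gupta"))
```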


We perform our experiments on three different genres of real-world data sets: Publication, CALO, and News Stories.

1) Publication[1]. The data set combines about 1,300,000 papers from DBLP, 450,223 papers from IEEE, and 1,343,442 papers and 3,687,675 citation relationships from ACM. After merging all the papers and removing those with incomplete information, we obtained a publication data set of 1,632,442 papers and 3,021,489 citation relationships. For evaluation, we manually labeled 6,730 papers for 100 author names.
2) CALO[2]. It contains a labeled data set of 1,085 Web pages for 12 person names. The data set is the email directory of one participant (i.e., Melinda Gervasio) of the CALO project. The 12 names appear in the headers of messages in the email directory. On average, each person name corresponds to about 15 different persons. The task is to associate the emails with the different persons.
3) News Stories[3]. It consists of 755 ambiguous entities appearing in 20 news pages. The task is to cluster the ambiguous entities into different groups.


[1] The 100 author names are listed below.

1   Michael Smith 2   Philip J. Smith 3   Yoshio Tanaka 4   Yang Yu
5   Jose M. Garcia 6   John F. McDonald 7   Z. Wang 8   Yue Zhao
9   Lu Liu 10   Jing Zhang 11   David Cooper 12   John Collins
13   Wen Gao 14   Fan Wang 15   F. Wang 16   Keith Edwards
17   Hui Fang 18   Paul Wang 19   Alok Gupta 20   Hui Yu
21   Qiang Shen 22   Kai Tang 23   Ping Zhou 24   Yan Tang
25   Peter Phillips 26   Wei Xu 27   Michael Lang 28   Manuel Silva
29   Charles Smith 30   Thomas Zimmermann 31   Yu Zhang 32   Kuo Zhang
33   Thomas Meyer 34   William H. Hsu 35   Frank Mueller 36   Gang Chen
37   Xiaoming Wang 38   Eric Martin 39   Kai Zhang 40   Fei Su
41   Paul Brown 42   Jie Tang 43   Feng Liu 44   Robert Schreiber
45   Satoshi Kobayashi 46   Lei Jin 47   R. Balasubramanian 48   David Jensen
49   Thomas Wolf 50   Li Shen 51   Hao Wang 52   Robert Allen
53   Steve King 54   Lei Chen 55   Koichi Furukawa 56   Thomas Tran
57   Thomas Hermann 58   J. Guo 59   John Hale 60   Jie Yu
61   Yun Wang 62   Ji Zhang 63   Mark Davis 64   David Brown
65   Cheng Chang 66   Gang Luo 67   Xiaoyan Li 68   Bin Li
69   Bing Liu 70   R. Ramesh 71   Jianping Wang 72   Barry Wilkinson
73   David E. Goldberg 74   Feng Pan 75   David Nelson 76   Lei Fang
77   Rakesh Kumar 78   Thomas D. Taylor 79   Jeffrey Parsons 80   Richard Taylor
81   Jim Gray 82   Juan Carlos Lopez 83   Sanjay Jain 84   Ajay Gupta
85   David Levine 86   Shu Lin 87   Michael Siegel 88   S. Huang
89   Bin Zhu 90   Young Park 91   Yi Deng 92   Daniel Massey
93   Bob Johnson 94   Michael Wagner 95   Ning Zhang 96   David C. Wilson
97   Ke Chen 98   Yong Chen 99   Rafael Alonso 100   Bin Yu


[2] The data set is used in the paper: "R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW’05, pages 463–470, 2005."

[3] The data set is used in the paper: "S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP’07, pages 708–716, 2007."



Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 24, Issue 6, 2012, Pages 975-987. (IF = 2.236)

Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. ADANA: Active Name Disambiguation. In Proceedings of 2011 IEEE International Conference on Data Mining (ICDM'2011), Pages 794-803.

Created by Xuezhi Wang and Jie Tang. Last Updated 03/07/2013.

If you have any questions, please contact Xuezhi Wang <littlexxxx at> and <jery [dot] tang [at]>.