Explanation-based Failure Recovery

This data set is a rich version of the DISAMBIGUATION data set. It also contains 110 author names and their disambiguation results (ground truth). The difference is that this data set additionally contains rich features. For each author name, it has 15 files. The most important files are xxx.xml, xxx_answer.txt, and xxx.txt. The xxx_answer.txt file contains the ground truth, and the other two files (xxx.xml and xxx.txt) provide the information (features) used to perform the disambiguation. At a high level, the xxx.xml file includes title, venue, coauthor, and affiliation, and the xxx.txt file further contains citation, co-affiliation-occur, and homepage. Besides these 3 important files, each author name is also associated with 12 other intermediate files, which can simply be ignored.

Let us use "Ajay Gupta" as an example to explain what information is contained in each file.

- Ajay Gupta.xml: the raw file. It is formatted as an XML file, in which the author name is associated with a number of publications. An example of a publication is as follows:
"
  <publication>
		<title>Explanation-based Failure Recovery</title>
		<year>1987</year>
		<authors>Ajay Gupta</authors>
		<jconf>AAAI</jconf>
		<id>13048</id>
		<label>0</label>
		<organization>null</organization>
	</publication>
"	
where <title> denotes the title of the publication; 
<year> denotes the publication year;
<authors> denotes the author name(s) of the publication;
<jconf> denotes the publication venue;
<id> denotes the publication id;
<label> denotes the labeled person, e.g., all publications with "<label>0</label>" can be considered as published by the same person;
<organization> denotes the affiliation of the author(s).
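
The fields above can be read with a standard XML parser. The following is a minimal sketch, not part of the data set: it assumes the publications are wrapped in some root element (the <person> wrapper below is hypothetical, since only the <publication> element is shown here), and the function name group_ids_by_label is introduced purely for illustration. It groups publication ids by their <label> value.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample built from the single publication shown above.
# The <person> root element is an assumption made for this sketch.
SAMPLE = """<person>
  <publication>
    <title>Explanation-based Failure Recovery</title>
    <year>1987</year>
    <authors>Ajay Gupta</authors>
    <jconf>AAAI</jconf>
    <id>13048</id>
    <label>0</label>
    <organization>null</organization>
  </publication>
</person>"""


def group_ids_by_label(xml_text):
    """Map each <label> value to the list of publication ids carrying it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    # iter() also matches the root itself, so the wrapper name does not matter.
    for pub in root.iter("publication"):
        label = pub.findtext("label", default="").strip()
        pub_id = pub.findtext("id", default="").strip()
        groups[label].append(int(pub_id))
    return dict(groups)


print(group_ids_by_label(SAMPLE))  # {'0': [13048]}
```

To read a real file, the same function could be given the file's text, provided the file is well-formed XML.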


- Ajay Gupta(classify).txt: the answer file, i.e., the ground truth. It is extracted from the raw file by treating all publications with the same <label> value (e.g., "<label>0</label>") as belonging to one person. The format is plain text. The following is an example:
"
#Ajay Gupta
#1:13048 388794 596099 1265282 1179332 675629 39153 258611
#2:988870 1490190
#3:1393934
#4:1398544
#5:1739014
#6:1671104 515636 1678096
#7:1126381 1205032 275987 277587 276300 1549674 1034401
#8:600181 846439 149270 175996 264268 264291 299548 1384744 300057 302056 545651 1212517
#9:1316053
"
where the first line denotes the author name and each of the following lines indicates one disambiguated person. For example, the line starting with "#1:" indicates that one person published 8 papers, whose IDs are 13048, 388794, 596099, 1265282, 1179332, 675629, 39153, and 258611.
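
The answer format can be parsed in a few lines. This is a sketch under the assumption that every meaningful line starts with "#", that the first such line is the author name, and that the remaining lines have the form "#k:id id ...", as in the example above. The function name parse_answer is hypothetical.

```python
# First lines of the example above.
SAMPLE = """#Ajay Gupta
#1:13048 388794 596099 1265282 1179332 675629 39153 258611
#2:988870 1490190
#3:1393934
"""


def parse_answer(text):
    """Return (author_name, {person_index: [paper ids]})."""
    name = None
    persons = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#"):
            continue  # skip blank or unexpected lines
        body = line[1:]
        head, sep, tail = body.partition(":")
        if sep and head.strip().isdigit():
            # "#k:id id ..." -> one disambiguated person
            persons[int(head)] = [int(tok) for tok in tail.split()]
        elif name is None:
            # first "#..." line without a numeric prefix is the author name
            name = body.strip()
    return name, persons


name, persons = parse_answer(SAMPLE)
print(name, len(persons[1]))  # Ajay Gupta 8
```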


- Ajay Gupta.txt: the intermediate feature file. It contains 8 matrices, which respectively represent 8 features: co-affiliation, coauthor, citation, co-venue, google (ignored), co-affiliation-occur, titleSim, and homepage. Each matrix records the correlation between any two papers published by Ajay Gupta. In this sense, the problem of name disambiguation can basically be considered a pairwise clustering problem.
The first matrix (co-affiliation) records whether two papers contain the same affiliation. For example, the element m^0_{ij}, in the i-th row and j-th column of this matrix (index 0), denotes whether papers i and j contain the same affiliation.
The second matrix (coauthor) records the number of coauthors shared by two papers, excluding Ajay Gupta.
The third matrix (citation) records whether a paper cites another paper.
The fourth matrix (co-venue) records whether two papers are published at the same venue.
The fifth matrix (google) records whether the titles of two papers can be found on the same web page (e.g., a conference page). This matrix is not complete.
The sixth matrix (co-affiliation-occur) records whether the affiliation of an author "a" of one paper appears in the content of another paper, or vice versa.
The seventh matrix (titleSim) records the cosine similarity between the titles of any two papers.
The eighth matrix (homepage) records whether two papers appear on the same homepage.
Please note that the 5th-8th matrices cannot be extracted from the raw-data file (xxx.xml); they are generated by other programs.
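
To illustrate the pairwise-clustering view, the sketch below combines several pairwise matrices into one score per paper pair and links the pairs whose score reaches a threshold. It is only an illustration, not the method used to produce the ground truth. The matrices are small in-memory toy values (the on-disk layout of xxx.txt is not described here, so no parsing is attempted), and the weights, the threshold, and the function name cluster_by_threshold are all hypothetical.

```python
def cluster_by_threshold(matrices, weights, threshold):
    """Link papers i and j when the weighted sum of their pairwise features
    reaches `threshold`; return the connected components as clusters."""
    n = len(matrices[0])
    parent = list(range(n))  # union-find forest over paper indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            score = sum(w * m[i][j] for w, m in zip(weights, matrices))
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy 4-paper matrices (invented values, not taken from the data set).
co_affiliation = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
coauthor = [
    [0, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
]

clusters = cluster_by_threshold([co_affiliation, coauthor], [1.0, 0.5], 1.0)
print(clusters)  # [[0, 1, 2], [3]]
```

Here papers 0 and 1 are linked by a shared affiliation and papers 1 and 2 by two shared coauthors, so the three end up in one cluster while paper 3 stays alone. Connected components are the simplest possible choice; any other clustering method that works on pairwise similarities could be substituted.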