This data set is a richer version of the DISAMBIGUATION data set. It likewise contains 110 author names and their disambiguation results (ground truth); the difference is that it also contains rich features. Each author name has 15 files. The most important are xxx.xml, xxx_answer.txt, and xxx.txt, where xxx stands for the author name. The xxx_answer.txt file contains the ground truth, and the other two files (xxx.xml and xxx.txt) provide the information (features) used to perform the disambiguation. At a high level, the xxx.xml file includes title, venue, coauthor, and affiliation, and the xxx.txt file further contains citation, co-affiliation-occur, and homepage. Besides these 3 important files, each author name is also associated with 12 intermediate files, which can simply be ignored.
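As a minimal sketch of the naming scheme described above, the following Python helper builds the paths of the three important files for one author name. The function names `key_files` and `missing_files`, the dictionary keys, and the assumption that all files for an author sit directly in one directory are illustrative only and are not part of the data set.

```python
from pathlib import Path


def key_files(data_dir, author_name):
    """Return the paths of the three important files for one author name.

    Assumes (hypothetically) that the files sit directly inside data_dir
    and follow the xxx.xml / xxx_answer.txt / xxx.txt naming scheme.
    """
    base = Path(data_dir)
    return {
        "raw": base / f"{author_name}.xml",              # title, venue, coauthor, affiliation
        "answer": base / f"{author_name}_answer.txt",    # ground truth
        "features": base / f"{author_name}.txt",         # citation, co-affiliation-occur, homepage
    }


def missing_files(data_dir, author_name):
    """List the names of the important files that are not present on disk."""
    paths = key_files(data_dir, author_name)
    return [p.name for p in paths.values() if not p.exists()]
```

For example, `key_files("data", "Ajay Gupta")["answer"]` points to `data/Ajay Gupta_answer.txt`, and `missing_files` can be used to check a download before processing it. The 12 intermediate files are deliberately left out.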
Let us use "Ajay Gupta" as an example to explain what information each file contains.
- Ajay Gupta.xml. The raw file, formatted as an XML file. In the XML file, the author name is associated with a number of publications. An example of a publication is as follows:
"
Explanation-based Failure Recovery
1987
Ajay Gupta
AAAI
13048
null
"
where
"Explanation-based Failure Recovery" is the title of the publication;
1987 is the publication year;
AAAI is the publication venue;
13048 is the publication id;