Data Sets

By Jie Tang and Jing Zhang

Representative papers:
  • Jie Tang, Jing Zhang, Ruoming Jin, Zi Yang, Keke Cai, Li Zhang, and Zhong Su. Topic Level Expertise Search over Heterogeneous Networks. Machine Learning Journal, Volume 82, Issue 2 (2011), Pages 211-237. [PDF] [URL]
  • Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'08). pp. 990-998. [PDF] [Slides] [System] [API] [Citation Data] [DBLP Citation Data] [More Data] (among the top 5 cited KDD 2008 papers)
  • Jing Zhang, Jie Tang, and Juanzi Li. Expert Finding in a Social Network. (Poster) In Proceedings of the Twelfth International Conference on Database Systems for Advanced Applications (DASFAA'2007). pp. 1066-1069. [PDF]


    ·    Overview

    The data sets were used as a benchmark for search and mining in personal social networks, including expert finding and association search.

    - New People Lists (Expert Lists)

    - People Lists (Expert Lists)

    - Association Search Data Sets

     

    ·    New People Lists (Expert Lists)

    We use the method of pooled relevance judgments together with human judgments. Specifically, for each query, we first pooled the top 30 results from three academic search systems (Libra, Rexa, and ArnetMiner) into a single list. Then one faculty member and two graduate students from our lab provided human judgments. Four grade scores (3, 2, 1, and 0) were assigned, representing top expert, expert, marginal expert, and not expert, respectively. Assessments were made mainly in terms of how many publications the candidate has published, how many of those publications are related to the given query, how many top-conference papers the candidate has published, and what distinguished awards he/she has received. Finally, the judgment scores (here we only consider 3 and 2) were averaged to construct the final ground truth. The data set currently covers seven queries: intelligent agents, information extraction, semantic web, support vector machine, planning, natural language processing, and machine learning.

     You can download the new people lists here [download]
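    The pooling and grade-averaging procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the script actually used to build the lists; the file names, the tab-separated judgment format, and the "average grade of at least 2" cutoff are assumptions made for the example.

        # Sketch: build a pooled ground-truth expert list for one query.
        # Assumes three ranked result files (one candidate name per line, best
        # first) and one tab-separated judgments file per assessor with lines
        # "name<TAB>grade" (grades 0-3). All file names are hypothetical.
        from collections import defaultdict

        def pool_top_k(result_files, k=30):
            """Union of the top-k candidates returned by each system."""
            pooled = set()
            for path in result_files:
                with open(path, encoding="utf-8") as f:
                    names = [line.strip() for line in f if line.strip()]
                pooled.update(names[:k])
            return pooled

        def average_grades(judgment_files, pooled):
            """Average the per-assessor grades for every pooled candidate."""
            grades = defaultdict(list)
            for path in judgment_files:
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        name, grade = line.rstrip("\n").split("\t")
                        if name in pooled:
                            grades[name].append(int(grade))
            return {name: sum(g) / len(g) for name, g in grades.items()}

        pooled = pool_top_k(["libra.txt", "rexa.txt", "arnetminer.txt"], k=30)
        avg = average_grades(["faculty.tsv", "grad1.tsv", "grad2.tsv"], pooled)
        # Keep candidates whose averaged grade indicates expert (2) or top expert (3).
        ground_truth = sorted(name for name, g in avg.items() if g >= 2)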

     

    ·    People Lists (Expert Lists)

    We have collected topics and their related people lists from as many sources as possible. We randomly chose 13 topics and created 13 people lists. The data sets were used as the "gold standard" for expert finding; they were also used to create the test sets for association search. The following table shows the 13 topics and statistics of the people we have collected. Among the 13 topics, OA and SW are from PC members of the related conferences or workshops. DM is from a list of data mining people organized by kmining.com. IE is from a list of information extraction researchers collected by Muslea. BS and SVM are from their respective official web sites. PL, IA, ML, and NLP are from a page organized by Russell and Norvig, which links to 849 pages around the web with information on Artificial Intelligence.

    Table 1. The 13 topics, the number of experts collected, and the source of each people list

    Test   Topic                         #Expert   Source
    OA     Ontology Alignment                 57   PC members of EON 2003 & 2004, OAEI 2005 & 2006, and the OM workshop 2006
    SW     Semantic Web                      412   PC members from ISWC2001 to ISWC2006
    DM     Data Mining                       351   http://www.kmining.com/info_people.html
    IE     Information Extraction             91   http://www.isi.edu/info-agents/RISE/people.html
    BS     Boosting                           57   http://www.boosting.org/people.html
    SVM    Support Vector Machine            111   http://www.svms.org/people-frames.html
    PL     Planning                           26   http://aima.cs.berkeley.edu/ai.html#learning
    IA     Intelligent Agents                 35   http://aima.cs.berkeley.edu/ai.html#learning
    ML     Machine Learning                   76   http://aima.cs.berkeley.edu/ai.html#learning
    NLP    Natural Language Processing        54   http://aima.cs.berkeley.edu/ai.html#learning
    CRY    Cryptography                      174   http://www.swcp.com/~mccurley/cryptographers/cryptographers.html
    CV     Computer Vision                   215   http://www.cs.hmc.edu/~fleck/computer-vision-handbook/vision-people.html
    NN     Neural Networks                   122   http://dmoz.org/Computers/Artificial_Intelligence/Neural_Networks/People/

    *Compressed versions can be downloaded from here [RAR] [Zip]

    The lists were collected by Jing Zhang.

     

    To evaluate the performance of expert finding, one can use the measures: P@5, P@10, P@20, P@30, R-prec, MAP, and bpref [Buckley, 2004] [Craswell, 2005].
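    For reference, the ranking measures listed above can be computed along the following lines. This is a generic sketch of P@k, R-precision, and average precision over a single ranked list and a ground-truth set (bpref is omitted); it is not an official evaluation script, and the function names are our own.

        # Sketch: ranked-retrieval measures for one query.
        # `ranking` is a list of candidate names (best first); `relevant` is the
        # set of names marked as experts in the ground truth.

        def precision_at_k(ranking, relevant, k):
            return sum(1 for name in ranking[:k] if name in relevant) / k

        def r_precision(ranking, relevant):
            return precision_at_k(ranking, relevant, len(relevant))

        def average_precision(ranking, relevant):
            hits, total = 0, 0.0
            for i, name in enumerate(ranking, start=1):
                if name in relevant:
                    hits += 1
                    total += hits / i
            return total / len(relevant) if relevant else 0.0

        # MAP is the mean of average_precision over all queries;
        # P@5/P@10/P@20/P@30 are precision_at_k with k = 5, 10, 20, 30.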

     

    ·    Association Search Data Sets

    To evaluate the effectiveness of our proposed association search approach, we created 9 test sets. Each person pair consists of a source person (name and id) and a target person (name and id). The test sets were created as follows. We first randomly selected 1,000 person pairs from the researcher network to create the first test set.

    We then used the above people lists to create the other 8 test sets. Three of them were created by randomly selecting person pairs from SW, DM, and IE, respectively; with these three test sets we aim to test association search between persons from the same research community. The remaining five test sets were created by selecting persons from different research fields.
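    A possible way to draw such person pairs is sketched below. The sampling scheme and the data structures are assumptions for illustration; the actual test sets should be taken from the download links that follow.

        # Sketch: sample random person pairs for a same-field or cross-field
        # test set. `field_a` and `field_b` are lists of (person_id, name)
        # tuples taken from the people lists above.
        import random

        def sample_pairs(field_a, field_b, n_pairs, seed=0):
            rng = random.Random(seed)
            pairs = set()
            while len(pairs) < n_pairs:
                src = rng.choice(field_a)
                dst = rng.choice(field_b)
                if src[0] != dst[0]:          # skip self-pairs
                    pairs.add((src, dst))
            return list(pairs)

        # Same-community set (e.g., SW) vs. cross-field set (e.g., DM-SW):
        # sw_pairs = sample_pairs(sw_people, sw_people, 1000)
        # dm_sw_pairs = sample_pairs(dm_people, sw_people, 1000)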

    Table 2 shows the statistics of the 9 test sets. The columns represent the test set, the number of person pairs, and the research fields of the source persons and target persons, respectively.

    Table 2: Statistics on test sets

    Test Set   #Person pairs   Field 1   Field 2
    Random              1000   Random    Random
    SW                  1000   SW        SW
    IE                  1000   IE        IE
    DM                  1000   DM        DM
    BS-PL                369   BS        PL
    DM-SW               1000   DM        SW
    ML-IE               1000   ML        IE
    PL-DM               1000   PL        DM
    SW-OA               1000   SW        OA

    *Compressed versions can be downloaded from here [RAR] [Zip]

    The test sets were created by Jie Tang.

     

    To evaluate the performance of association search, one can use the average running time as the measure.
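    A minimal sketch of this measurement, assuming a callable find_association(src, dst) that runs one search (the name is a placeholder, not part of the data sets), could look like:

        # Sketch: average running time of an association search routine over a
        # test set of person pairs; `find_association` stands in for the
        # implementation being evaluated.
        import time

        def average_running_time(find_association, pairs):
            start = time.perf_counter()
            for src, dst in pairs:
                find_association(src, dst)
            return (time.perf_counter() - start) / len(pairs)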

     

     


    Last updated: Oct. 8, 2007, by Jie Tang.