Overview
The data sets were used as a benchmark for search and mining in personal social networks,
covering two tasks: expert finding and association search.
- New People Lists (Expert Lists)
- Association Search Data Sets
· New People Lists (Expert Lists)
We use pooled relevance judgments together with human judgments. Specifically, for
each query, we first pooled the top 30 results from three systems
(Libra, Rexa, and ArnetMiner) into a single list. Then one faculty member and two
graduate students from our lab provided human judgments. Four grades (3, 2, 1,
and 0) were assigned, representing top expert, expert, marginal expert,
and not expert, respectively. Assessments were based mainly on how many
publications the person has published, how many of those publications are related to the
given query, how many top conference papers he/she has published, and what
distinguished awards he/she has received. Finally, the judgment scores (here
we only consider 3 and 2) were averaged to construct the final ground truth.
The data set currently covers 7 queries: intelligent agents, information
extraction, semantic web, support vector machine, planning, natural language
processing, and machine learning.
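The pooling and judging procedure described above can be sketched roughly as follows. This is a minimal illustration, not the original tooling: the function names, the toy rankings, and the grades are hypothetical, and the cut-off of 2.0 on the averaged grade is only one possible reading of "here we only consider 3 and 2".

```python
from statistics import mean


def pool_results(rankings, depth=30):
    """Merge the top `depth` results of each system into one de-duplicated list."""
    pooled, seen = [], set()
    for ranking in rankings:
        for person in ranking[:depth]:
            if person not in seen:
                seen.add(person)
                pooled.append(person)
    return pooled


def build_ground_truth(judgments, threshold=2.0):
    """Average the assessors' grades per person and keep those at or above `threshold`.

    `judgments` maps person -> list of grades (3, 2, 1, or 0), one per assessor.
    The threshold of 2.0 is an assumption, not something the source specifies.
    """
    averaged = {person: mean(grades) for person, grades in judgments.items()}
    return {person: score for person, score in averaged.items() if score >= threshold}


# Toy data: made-up names standing in for the three systems' ranked results.
libra = ["alice", "bob", "carol"]
rexa = ["bob", "dave"]
arnetminer = ["carol", "erin", "alice"]
pool = pool_results([libra, rexa, arnetminer], depth=30)

# One grade per assessor (one faculty member, two graduate students).
judgments = {
    "alice": [3, 3, 2],
    "bob": [2, 2, 2],
    "carol": [1, 2, 1],
    "dave": [0, 0, 1],
    "erin": [3, 2, 2],
}
ground_truth = build_ground_truth(judgments)
```

Pooling keeps the first occurrence of each person, so the order of the pooled list depends on the order in which the systems are passed in; the assessors would judge every pooled person regardless of that order.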
You can download the new people lists here [download]
We have collected topics
and their related people lists from as many sources as possible. We randomly
chose 13 topics and created 13 people lists. The data sets were used as the
“golden metric” for expert finding. They were also used to create the test sets
for association search. The following table shows the 13 topics and statistics
of the people we have collected. Among the 13 topics, OA and SW are from PC members of
the related conferences or workshops. DM is from a list of data mining people
organized by kmining.com. IE is from a list of information extraction
researchers collected by Muslea. BS and SVM are from their respective official
web sites. PL, IA, ML, and NLP are from a page organized by
Russell and Norvig, which links to 849 pages around the web with information on
Artificial Intelligence.
Table 1. Our evaluation criteria for the 13 topics
| Test | Topic | #Expert | Source |
|  | OA | 57 | PC Members of EON2003&2004; OAEI2005&2006 |
|  | SW | 412 | PC Members from ISWC2001 to ISWC2006 |
|  | DM | 351 | http://www.kmining.com/info_people.html |
|  | IE | 91 | http://www.isi.edu/info-agents/RISE/people.html |
|  | BS | 57 | http://www.boosting.org/people.html |
|  | SVM | 111 | http://www.svms.org/people-frames.html |
|  | PL | 26 | http://aima.cs.berkeley.edu/ai.html#learning |
|  | IA | 35 | http://aima.cs.berkeley.edu/ai.html#learning |
|  | ML | 76 | http://aima.cs.berkeley.edu/ai.html#learning |
|  | NLP | 54 | http://aima.cs.berkeley.edu/ai.html#learning |
|  | CRY | 174 | http://www.swcp.com/~mccurley/cryptographers/cryptographers.html |
|  | CV | 215 | http://www.cs.hmc.edu/~fleck/computer-vision-handbook/vision-people.html |
|  | NN | 122 | http://dmoz.org/Computers/Artificial_Intelligence/Neural_Networks/People/ |
*Compressed versions can be downloaded from here [RAR] [Zip]
The lists were collected by Jing Zhang.
To evaluate the performance of expert finding, one can use the following
measures: P@5, P@10, P@20, P@30, R-prec, MAP, and bpref [Buckley, 2004]
[Craswell, 2005].
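For readers who want to compute these measures against a people list, the sketch below gives one straightforward implementation. It is an illustration under assumptions, not the evaluation script used for the benchmark: the function names and the toy ranking are hypothetical, and the bpref variant shown divides by R (the number of relevant people) and skips unjudged people, which is one common formulation; tools such as trec_eval may differ in detail.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked people that are in the relevant set."""
    return sum(1 for person in ranking[:k] if person in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant people."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


def average_precision(ranking, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant people."""
    hits, total = 0, 0.0
    for rank, person in enumerate(ranking, start=1):
        if person in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranking, relevant) pairs, one pair per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


def bpref(ranking, relevant, nonrelevant):
    """bpref over judged people only; people in neither set are skipped."""
    r = len(relevant)
    if r == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for person in ranking:
        if person in nonrelevant:
            nonrel_seen += 1
        elif person in relevant:
            # Penalize by the judged non-relevant people ranked above, capped at R.
            score += 1.0 - min(nonrel_seen, r) / r
    return score / r


# Toy query: a system ranking scored against a small hypothetical people list.
ranking = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c"}
nonrelevant = {"x", "y"}

p_at_5 = precision_at_k(ranking, relevant, 5)
r_prec = r_precision(ranking, relevant)
ap = average_precision(ranking, relevant)
bp = bpref(ranking, relevant, nonrelevant)
```

With the people lists as ground truth, `relevant` would be the set of names on a topic's list and `ranking` the system output for the corresponding query.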
· Association Search Data Sets
To evaluate the
effectiveness of our proposed association search approach, we created 9 test
sets. Each person pair in a test set contains a source person and a target
person, each identified by name and id. The test sets were created
as follows. We randomly selected 1,000 person pairs from the researcher network
to create the first test set.
We used the above people
lists to create the other 8 test sets. We created three test sets by randomly
selecting person pairs from SW, IE, and DM, respectively. With these three test
sets, we aim to test association search between persons from the same
research community. We created the other five test sets by selecting persons
from different research fields.
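A minimal sketch of how such person pairs could be drawn from the people lists is shown below. It is only an illustration: the function name `sample_pairs`, the `(id, name)` tuple layout, the fixed seed, and the choice to sample distinct ordered pairs are all assumptions, since the exact sampling procedure is not described here.

```python
import random


def sample_pairs(sources, targets, n_pairs, seed=0):
    """Sample up to n_pairs distinct (source, target) pairs with source != target.

    Enumerating all candidate pairs is fine for lists of a few hundred people;
    for a whole researcher network one would sample indices instead.
    """
    rng = random.Random(seed)
    candidates = [(s, t) for s in sources for t in targets if s != t]
    return rng.sample(candidates, min(n_pairs, len(candidates)))


# Hypothetical people lists: each person is an (id, name) tuple.
sw = [(i, f"sw_person_{i}") for i in range(50)]
dm = [(100 + i, f"dm_person_{i}") for i in range(40)]

same_field = sample_pairs(sw, sw, 1000)   # both persons from one community
cross_field = sample_pairs(dm, sw, 1000)  # source from DM, target from SW
```

When fewer candidate pairs exist than requested, the function returns all of them in random order rather than raising an error.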
Table 2 shows the statistics
of the 9 test sets. The columns respectively represent test set, number of
person pairs, and research fields of source persons and target persons.
Table 2: Statistics on test sets
| Test Set | #Person pairs | Field 1 | Field 2 |
|  | 1000 | Random |  |
|  | 1000 | SW |  |
|  | 1000 | IE |  |
|  | 1000 | DM |  |
|  | 369 | BS | PL |
|  | 1000 | DM | SW |
|  | 1000 | ML | IE |
|  | 1000 | PL | DM |
|  | 1000 | SW | OA |
*Compressed versions can be downloaded from here [RAR] [Zip]
The test sets were created by Jie Tang.
To evaluate
the performance of association search, one can use the average running time as
the measure.
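Average running time over a test set can be measured with a small harness like the one below. This is a sketch under assumptions: `search` stands for whatever association search function is being evaluated (a hypothetical callable taking a source and a target person), and wall-clock time via `time.perf_counter` is only one reasonable way to time it.

```python
import time


def average_running_time(search, pairs):
    """Return the mean wall-clock seconds `search(source, target)` takes per pair."""
    if not pairs:
        return 0.0
    start = time.perf_counter()
    for source, target in pairs:
        search(source, target)
    return (time.perf_counter() - start) / len(pairs)


def dummy_search(source, target):
    """Placeholder for a real association search; returns a trivial two-node path."""
    return [source, target]


# Hypothetical test set of (source, target) person pairs.
test_pairs = [((1, "person_a"), (2, "person_b")), ((3, "person_c"), (4, "person_d"))]
avg_seconds = average_running_time(dummy_search, test_pairs)
```

Running the harness several times and reporting the mean helps smooth out caching and scheduling noise.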
Last updated date: Oct. 8, 2007, by Jie Tang.