|
Quantifying
Link Semantics on the Web
first
draft: by jie tang, JING
ZHANG, zI YANG
OCT
3, 2008
Spec on Quantifying
Link Semantics
The work
intends to study how to quantify link semantics. Specifically, an ideal output
of link semantics analysis is to provide users with the following information:
(1) multiple topics discussed in each page; (2) semantics of a link between two
pages; and (3) the influential strength of each link. With such an analysis, a
user could easily trace the origins of an idea/technique, analyze the evolution
and impact of a topic, filter the pages by certain categories of links, as well
as zoom in and zoom out the linkage tracing graph with the degree of influence.
This
documentation describes the specs of the quantifying link semantics. All
the specs here are focusing on web-pages in English. A more detailed technique
report will be available soon.
Generally
speaking, we can define any types of categories, but so far we only consider
the three link categories: “drill down”, “similar”, and “other”, also
called “basic theory”, “comparable work”, and “other” specific for publication
data.
Some
concepts:
Definition 1. (Source Page and Target Page): Each directed link e points from a source paper dse to a target paper dte. A source page may have links to multiple target
pages and a target page might be linked by multiple source pages.
Definition 2. (Link Context): A link context we is defined by words surrounding the link position in the source page. Words
appearing in a fixed size window of a link position are extracted to represent
its link context. Users may have different intentions to create links between
pages, which is defined as link category.
Definition 3. (Link Category): A link category ce is defined as the “intention” that the author of a
source page ds e creates a link e
to the
target page dt e. We currently define three link categories: “drill
down” (link pointing from the summary information to the detailed information),
“similar” (link between two pages with similar content), and “other” (neither
“drill down” nor “similar”).
Definition 4. (Link Influence): The link influence fij
is a
numerical weight assigned to link eij , indicating the relative strength that the target
page dj influences the source page di.
For
annotating link category and link influence, we define the following
specification:
l
Drill
down
1) The source page utilizes some important
information, theory, or information introduced in the target page.
2) Some cue words exist in the
surrounding text of link position, for example:
n “The proposed approach is based
on the theory (Ng and Li, 2006).”
n our presentation is based on that
of mo at and zobel [***]<1>
n database structure we use
inverted files to index documents [13, 14, ***]<1>.
n we have implemented the snd
algorithm described in [***]<1>
l
Similar
(or Comparable work for publication data)
n 1) The source and the target
papers solve a similar problem;
n 2) The topics described in the
source and the target papers are similar to each other;
n 3) Some cue words exist in the
surrounding text, e.g., “however”, “is similar to”, “disadvantages”.
l
Other
n Neither “Drill down” nor “similar”
(or neither “Basic theory” nor “Comparable work”).
l
Example
annotations (we will mainly use paper citation as example for the explanation.)
Tag |
Type |
Feature |
0 |
Drill down/ Basic theory |
1) The source page uses the basic idea or fundamental theory in the target page and the target page mainly introduces a principle, theory, or an algorithm; e.g. papers like LDA[3], Estimate Dirichlet[18], Finding Scientific topics[10] (strong) 2) The source page uses the basic idea, theory, algorithm the target page, however, the target page may introduce an application. (middle) 3) The source page do not directly uses the basic idea, theory or algorithm of the target page, however, the target page firstly proposes a close basic idea, theory or algorithm; (weak); e.g. papers like, mcmc[1], LSI[6], PLSA[11], Author model[15], author-topic model[22][25] ( * The target page is cited by other similar papers many times ) |
1 |
Similar/ Comparable work |
1) The target page uses some other approach to solve the same problem (or part of the problem is the same) with the source page; (strong), e.g., Citation Influence model[7], Citation relationship classification[20][23] 2) The target page solves a similar problem ( the scope can be expanded ) with the source page; (middle) e.g., Citation network analysis[2][9][12], Publication topic analysis[13][17], Citation influence factor analysis[14][21], Jointly model content and links [5][19], Publication Category[8], citation sentence alignment [24] 3) The target page uses some similar approach with the source page and the problem may be a little different. (weak) e.g., Author model[15], author-topic model[22][25], Finding Scientific topics[10], Jointly model content and links [19], Mei[16] |
2 |
Other (middle, weak) |
1) The target page states the dataset e.g., Arnetminer[26] (weak) 2) The target page states the baseline method; e.g., SVM[17] (weak) 3) The target page states the evaluation metrics; e.g., AUC[4] (weak) 4) The problem and the approach of the target page are a little far away from the source page; e.g. Publication topic analysis[13][17], Jointly model content and links [5], Mei[16][17], citation sentence alignment [24] (weak) 5) The cited paper may state some fact or some applied scene (middle)) (middle) |
For the others, the breaker is
defined as:
[\n\s\,\?\!\(\)\[\]\{\}]{1})|(?:\.\s+)|(?:\s*$)
To construct an evaluation data set that conforms
to the spec above and has the flexibility of easily adapting to the potential
spec changes, we define the following format.
§ 0:drill down or basic theory
§ 1:similar or comparable work
§ 2:other
§ 0:strong
§ 1:middle
§ 2:weak
Currently, we have two data sets and are annotating
another data set of social network.
1.
Publication_data: a data set consists of publication papers chosen from
ArnetMiner.
·
original_data.rar: original papers, some contains
the whole content, others only contain the abstract.
·
annotate_data.txt: the output of the annotation
tool. The input of the annotation tool is the same with the output except the
value of “type” and "influence" is default.
please ignore the "pair index" in the file.
2. Wikipedia_data: a data set
consists of “article” pages and “smart” links crawled from Wikipedia. All the
pages were chosen from the “computer science” category.
·
original_data.rar: the data set is crawled from http://download.wikimedia.org/enwiki/20081008/.
There are two data sets: full articles, http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2,
and abstract data http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-abstract.xml.
·
enwiki.rar: the data set was processed for processing in
Matlab. There are six files:
o
docmap
§
line
number: index of the
article
§
1st
token: id
of the article in the original dataset
§
2nd
token: title
of the article
o
wordmap
§
line
number: index of the
word
§
1st
token: word
§
1st
token: index
of the article
§
2nd
token: index
of the word
§
3rd
token: 1
if the abstract of the article contains the word (always takes 1 since
doc_words is a boolean sparse matrix, 0's are ignored)
o
pair_doc
§
line
number: index of the
article-article pair
§
1st
token: index
of the link source article
§
2nd
token: index
of the link destination article
§
line
number: index of the
article-article pair
§
1st
token: label
of the pair (0,1,2, or 3)
§
1st
token: index
of the pair
§
2nd
token: index
of the word
§
3rd
token: 1
if the word is near the anchor text of the pair (always takes 1 since doc_word
is a boolean sparse matrix, 0's are ignored)
3.
Tools:
·
data_prepare.pm: the perl code to translate the orginal
papers into the input of the annotation tool
·
Citation_classifier.exe: the annotation
tool.
Link
(citation) positions were identified by using regular expression (e.g., “[1]” or
“(Smith, et al., 2007)”) and 50 words surrounding each link position were
extracted as the link context.
Based on the link semantic analysis, we generate
semantic linkage tracing graphs from the Web data and currently applied it to
linkage graph retrieval and linkage graph summarization. A demonstration
system has been developed to provide a graph based search service in
ArnetMiner.org. ArnetMiner is an academic
search system, which extracts the structured academic information from the
distributed Web and currently provides services such as expert finding,
expertise conference/publication search, association search, topic browser,
etc. The system is in operation on the internet for nearly three years and has
attracted users from 180 countries from all over the world.
The demonstration is available at http://arnetminer.org/index.jsp?search_type=graphicsearch,
for searching academic data. Given a query, the graph search returns a list of
topic-based graphs, each of which represents a topic related to the query. In
each graph, nodes represent papers and edges represent citation relationships.
Graph summarization is designed to automatically generate a concise semi-structured
summary for understanding the information encoded in a graph. We hope these two
functions can help users to quickly identify whether the returned information
by a search engine is what they need. We also hope that such graph-based search
mode could suggest a new direction for the design of search engines. Thanks to
Bo Gao and Dewei Chen for helping develop the demonstration system.
Last updated date: Nov. 15, 2008, by Jie Tang.