Quantifying Link Semantics on the Web

first draft: by jie tang, JING ZHANG, zI YANG

OCT 3, 2008

 

Representative paper:
  • Jie Tang, Jing Zhang, Jeffrey Xu Yu, Zi Yang, Keke Cai, Rui Ma, Li Zhang, and Zhong Su. Topic Distributions over Links on Web. In Proceedings of 2009 IEEE International Conference on Data Mining (ICDM'2009). pp. 1055-1060. [PDF] [Slides]


  • Spec on Quantifying Link Semantics

    Introduction

    The work intends to study how to quantify link semantics. Specifically, an ideal output of link semantics analysis is to provide users with the following information: (1) multiple topics discussed in each page; (2) semantics of a link between two pages; and (3) the influential strength of each link. With such an analysis, a user could easily trace the origins of an idea/technique, analyze the evolution and impact of a topic, filter the pages by certain categories of links, as well as zoom in and zoom out the linkage tracing graph with the degree of influence.

    This documentation describes the specs of the quantifying link semantics. All the specs here are focusing on web-pages in English. A more detailed technique report will be available soon.

    General Principle

    Generally speaking, we can define any types of categories, but so far we only consider the three link categories: “drill down”, similar”, and “other”, also called “basic theory”, “comparable work”, and “other” specific for publication data.

    Some concepts:

    Definition 1. (Source Page and Target Page): Each directed link e points from a source paper dse to a target paper dte. A source page may have links to multiple target pages and a target page might be linked by multiple source pages.

     Definition 2. (Link Context): A link context we is defined by words surrounding the link position in the source page. Words appearing in a fixed size window of a link position are extracted to represent its link context. Users may have different intentions to create links between pages, which is defined as link category.

    Definition 3. (Link Category): A link category ce is defined as the “intention” that the author of a source page ds e creates a link e to the target page dt e. We currently define three link categories: “drill down” (link pointing from the summary information to the detailed information), “similar” (link between two pages with similar content), and “other” (neither “drill down” nor “similar”).

    Definition 4. (Link Influence): The link influence fij is a numerical weight assigned to link eij , indicating the relative strength that the target page dj influences the source page di.

     

    For annotating link category and link influence, we define the following specification:

    l  Drill down

    1) The source page utilizes some important information, theory, or information introduced in the target page.

    2) Some cue words exist in the surrounding text of link position, for example:

    n  “The proposed approach is based on the theory (Ng and Li, 2006).”

    n  our presentation is based on that of mo at and zobel [***]<1>

    n  database structure we use inverted files to index documents [13, 14, ***]<1>.

    n  we have implemented the snd algorithm described in [***]<1>

    l  Similar (or Comparable work for publication data)

    n  1) The source and the target papers solve a similar problem;

    n  2) The topics described in the source and the target papers are similar to each other;

    n  3) Some cue words exist in the surrounding text, e.g., “however”, “is similar to”, “disadvantages”.

    l  Other

    n  Neither “Drill down” nor “similar” (or neither “Basic theory” nor “Comparable work”).

    l  Example annotations (we will mainly use paper citation as example for the explanation.)

    Tag

    Type

    Feature

    0

    Drill down/

    Basic theory

    1)              The source page uses the basic idea or fundamental theory in the target page and the target page mainly introduces a principle, theory, or an algorithm; e.g. papers like LDA[3],  Estimate Dirichlet[18], Finding Scientific topics[10] (strong)

     

    2)       The source page uses the basic idea, theory, algorithm the target page, however, the target page may introduce an application. (middle)

    3)       The source page do not directly uses the basic idea, theory or algorithm of the target page, however, the target page firstly proposes a close basic idea, theory or algorithm; (weak); e.g. papers like, mcmc[1], LSI[6], PLSA[11], Author model[15], author-topic model[22][25]

    ( * The target page is cited by other similar papers many times )

    1

    Similar/

    Comparable work

    1)              The target page uses some other approach to solve the same problem (or part of the problem is the same) with the source page; (strong), e.g., Citation Influence model[7], Citation relationship classification[20][23]

    2)              The target page solves a similar problem ( the scope can be expanded ) with the source page; (middle)

    e.g., Citation network analysis[2][9][12], Publication topic analysis[13][17], Citation influence factor analysis[14][21], Jointly model content and links [5][19], Publication Category[8], citation sentence alignment [24]

    3) The target page uses some similar approach with the source page and the problem may be a little different. (weak)

    e.g., Author model[15], author-topic model[22][25], Finding Scientific topics[10], Jointly model content and links [19], Mei[16]

     

    2

    Other

    (middle, weak)

    1)                        The target page states the dataset

    e.g., Arnetminer[26] (weak)

    2)                        The target page states the baseline method;

    e.g., SVM[17] (weak)

    3)                        The target page states the evaluation metrics;

    e.g., AUC[4] (weak)

    4)                        The problem and the approach of the target page are a little far away from the source page;

    e.g. Publication topic analysis[13][17], Jointly model content and links [5], Mei[16][17], citation sentence alignment [24] (weak)

    5)      The cited paper may state some fact or some applied scene (middle))     (middle)

     

     

    word braker

    For the others, the breaker is defined as:

    [\n\s\,\?\!\(\)\[\]\{\}]{1})|(?:\.\s+)|(?:\s*$)

     

    Annotation Tags

    To construct an evaluation data set that conforms to the spec above and has the flexibility of easily adapting to the potential spec changes, we define the following format.

     

    §  0:drill down or basic theory

    §  1:similar or comparable work

    §  2:other

    §  0:strong

    §  1:middle

    §  2:weak

     

     

    Data sets and tools

     

    Currently, we have two data sets and are annotating another data set of social network.

    1. Publication_data: a data set consists of publication papers chosen from ArnetMiner.

    ·         original_data.rar: original papers, some contains the whole content, others only contain the abstract.

    ·         annotate_data.txt: the output of the annotation tool. The input of the annotation tool is the same with the output except the value of “type” and "influence" is default.                       please ignore the "pair index" in the file.

    2. Wikipedia_data: a data set consists of “article” pages and “smart” links crawled from Wikipedia. All the pages were chosen from the “computer science” category.

    ·         original_data.rar: the data set is crawled from http://download.wikimedia.org/enwiki/20081008/. There are two data sets: full articles, http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2, and abstract data http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-abstract.xml.

    ·         enwiki.rar: the data set was processed for processing in Matlab. There are six files:

    o   docmap

    §  line number:    index of the article

    §  1st token:         id of the article in the original dataset

    §  2nd token:       title of the article

    o   wordmap

    §  line number:    index of the word

    §  1st token:         word

    o   doc_words

    §  1st token:         index of the article

    §  2nd token:       index of the word

    §  3rd token:        1 if the abstract of the article contains the word (always takes 1 since doc_words is a boolean sparse matrix, 0's are ignored)

    o   pair_doc

    §  line number:    index of the article-article pair

    §  1st token:         index of the link source article

    §  2nd token:       index of the link destination article

    o   pair_label

    §  line number:    index of the article-article pair

    §  1st token:         label of the pair (0,1,2, or 3)   

    o   pair_word

    §  1st token:         index of the pair

    §  2nd token:       index of the word

    §  3rd token:        1 if the word is near the anchor text of the pair (always takes 1 since doc_word is a boolean sparse matrix, 0's are ignored)

     

    3. Tools:

    ·         data_prepare.pm: the perl code to translate the orginal papers into the input of the annotation tool

    ·         Citation_classifier.exe: the annotation tool.

     

    Link (citation) positions were identified by using regular expression (e.g., “[1]” or “(Smith, et al., 2007)”) and 50 words surrounding each link position were extracted as the link context.

     

    Demonstration

    Based on the link semantic analysis, we generate semantic linkage tracing graphs from the Web data and currently applied it to linkage graph retrieval and linkage graph summarization. A demonstration system has been developed to provide a graph based search service in ArnetMiner.org. ArnetMiner is an academic search system, which extracts the structured academic information from the distributed Web and currently provides services such as expert finding, expertise conference/publication search, association search, topic browser, etc. The system is in operation on the internet for nearly three years and has attracted users from 180 countries from all over the world.

    The demonstration is available at http://arnetminer.org/index.jsp?search_type=graphicsearch, for searching academic data. Given a query, the graph search returns a list of topic-based graphs, each of which represents a topic related to the query. In each graph, nodes represent papers and edges represent citation relationships. Graph summarization is designed to automatically generate a concise semi-structured summary for understanding the information encoded in a graph. We hope these two functions can help users to quickly identify whether the returned information by a search engine is what they need. We also hope that such graph-based search mode could suggest a new direction for the design of search engines. Thanks to Bo Gao and Dewei Chen for helping develop the demonstration system.

     

     

    Last updated date: Nov. 15, 2008, by Jie Tang.