Citation Network Dataset

Each node is a paper associated with rich attribute information (e.g., abstract, title, and authors). Released by Arnetminer.org.

Overview

The data set is intended for research purposes only. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.

The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
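As a rough illustration of "finding the most influential papers", one simple proxy is the number of times a paper is cited inside the dataset (its in-degree in the citation network). The sketch below assumes the records have already been parsed into a mapping from paper index to the list of indices it references; the function name `most_cited` and the input mapping are hypothetical, and citation count is only one of many possible influence measures.

```python
from collections import Counter
from typing import Dict, List, Tuple


def most_cited(references_by_paper: Dict[int, List[int]], top_n: int = 10) -> List[Tuple[int, int]]:
    """Return (paper index, in-dataset citation count) pairs, most cited first."""
    counts: Counter = Counter()
    for refs in references_by_paper.values():
        # Use a set so a reference repeated within one paper is counted once.
        counts.update(set(refs))
    return counts.most_common(top_n)


# Toy input (made-up indices): paper 1 cites 2 and 3, paper 2 cites 3.
toy = {1: [2, 3], 2: [3], 3: []}
print(most_cited(toy, top_n=2))
```

Only citations between papers present in the dataset are counted, so the numbers will generally be lower than a paper's real-world citation count.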

A larger version will be released soon.

Citation-network V1:  629,814 papers and >632,752 citation relationships (2010-05-15).

Citation-network V2:  1,397,240 papers and >3,021,489 citation relationships (2010-09-13).

DBLP-Citation-network V3:  1,632,442 papers and >2,327,450 citation relationships (2010-10-22).

DBLP-Citation-network V4:  1,511,035 papers and 2,084,019 citation relationships (2011-01-08).

DBLP-Citation-network V5:  1,572,277 papers and 2,084,019 citation relationships (2011-02-21).

Data set                   #Papers     #Citation relationships   Comment
Citation-network V1        629,814     >632,752
Citation-network V2        1,397,240   >3,021,489
DBLP-Citation-network V3   1,632,442   >2,327,450
DBLP-Citation-network V4   1,511,035   2,084,019                 Arnetminer [2011-01-08]
DBLP-Citation-network V5   1,572,277   2,084,019                 Arnetminer [2011-02-21]

 

Data Description

The smaller dataset (V1) is organized into ~600,000 blocks, one per paper. [Download]
The larger dataset (V2) is organized into ~1,400,000 blocks, one per paper. [Download]
The DBLP version (V4) is organized into 1,511,035 blocks, one per paper. [Download]

Within a block, each line begins with a prefix that identifies which attribute of the paper it holds. Specifically:

For V1-V4

#*      --- paper title
#@      --- authors
#t      --- year
#c      --- publication venue
#index  --- index id of this paper
#%      --- id of a paper referenced by this paper (one line per reference)
#!      --- abstract

The following is an example:

#*Information geometry of U-Boost and Bregman divergence
#@Noboru Murata,Takashi Takenouchi,Takafumi Kanamori,Shinto Eguchi
#t2004
#cNeural Computation
#index436405
#%94584
#%282290
#%605546
#%620759
#%564877
#%564235
#%594837
#%479177
#%586607
#!We aim at an extension of AdaBoost to U-Boost, in the paradigm to build a stronger classification machine from a set of weak learning machines. A geometric understanding of the Bregman divergence defined by a generic convex function U leads to the U-Boost method in the framework of information geometry extended to the space of the finite measures over a label set. We propose two versions of U-Boost learning algorithms by taking account of whether the domain is restricted to the space of probability functions. In the sequential step, we observe that the two adjacent and the initial classifiers are associated with a right triangle in the scale via the Bregman divergence, called the Pythagorean relation. This leads to a mild convergence property of the U-Boost algorithm as seen in the expectation-maximization algorithm. Statistical discussions for consistency and robustness elucidate the properties of the U-Boost methods based on a stochastic assumption for training data.

The dataset can be downloaded: [V1 (smaller): here] [V2 (larger): here] [V3 (DBLP): here] [V4 (DBLP + ACM citation): tar.bz2 txt]
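The prefix scheme for V1-V4 can be read with a small line-oriented parser. The following is a minimal sketch, not an official loader: it assumes every block opens with a `#*` title line (as in the example), that authors are comma-separated, and that each attribute fits on one line. The function name `parse_v1_v4` and the dictionary keys are illustrative choices.

```python
from typing import Iterable, Iterator, Optional


def _to_int(text: str) -> Optional[int]:
    """Convert text to int, returning None when it is empty or malformed."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def parse_v1_v4(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per paper block from lines in the V1-V4 format."""
    paper = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#*"):
            # Assumption: a title line opens a new block, so emit the previous one.
            if paper is not None:
                yield paper
            paper = {"title": line[2:].strip(), "authors": [], "year": None,
                     "venue": "", "index": None, "references": [], "abstract": ""}
        elif paper is None:
            continue  # skip anything before the first title line
        elif line.startswith("#index"):
            paper["index"] = _to_int(line[len("#index"):])
        elif line.startswith("#@"):
            paper["authors"] = [a.strip() for a in line[2:].split(",") if a.strip()]
        elif line.startswith("#t"):
            paper["year"] = _to_int(line[2:])
        elif line.startswith("#c"):
            paper["venue"] = line[2:].strip()
        elif line.startswith("#%"):
            ref = _to_int(line[2:])
            if ref is not None:
                paper["references"].append(ref)
        elif line.startswith("#!"):
            paper["abstract"] = line[2:].strip()
    if paper is not None:
        yield paper


sample = [
    "#*Information geometry of U-Boost and Bregman divergence",
    "#@Noboru Murata,Takashi Takenouchi,Takafumi Kanamori,Shinto Eguchi",
    "#t2004",
    "#cNeural Computation",
    "#index436405",
    "#%94584",
    "#%282290",
]
for record in parse_v1_v4(sample):
    print(record["index"], record["year"], record["venue"], record["references"])
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which avoids loading the whole data set into memory.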

For V5

#*         --- paper title
#@         --- authors
#year      --- year
#conf      --- publication venue
#citation  --- citation count (both -1 and 0 mean none)
#index     --- index id of this paper
#arnetid   --- paper id (pid) in the Arnetminer database
#%         --- id of a paper referenced by this paper (one line per reference)
#!         --- abstract

The following is an example:

#*Spatial Data Structures.
#@Hanan Samet
#year1995
#confModern Database Systems
#citation2743
#index25
#arnetid27
#%165
#!An overview is presented of the use of spatial data structures in spatial databases. The focus is on hierarchical data structures, including a number of variants of quadtrees, which sort the data with respect to the space occupied by it. Such techniques are known as spatial indexing methods. Hierarchical data structures are based on the principle of recursive decomposition. They are attractive because they are compact and depending on the nature of the data they save space as well as time and also facilitate operations such as search. Examples are given of the use of these data structures in the representation of different data types such as regions, points, rectangles, lines, and volumes.

[V5 (DBLP + ACM citation): tar.bz2 txt]
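V5 uses longer prefixes than V1-V4, and `#conf` and `#citation` both begin with `#c`, the V1-V4 venue prefix, so a V1-V4 reader would misinterpret them. The sketch below is one possible table-driven reader for V5. It assumes each block opens with a `#*` line and each attribute sits on a single line; the name `parse_v5`, the dictionary keys, and the choice to normalize a citation count of -1 to 0 (both are described above as meaning none) are illustrative assumptions.

```python
from typing import Iterable, Iterator, Optional

# Prefix -> output key for integer-valued and text-valued V5 attributes.
INT_FIELDS = {"#year": "year", "#citation": "citations",
              "#index": "index", "#arnetid": "arnetid"}
TEXT_FIELDS = {"#conf": "venue", "#!": "abstract"}


def _to_int(text: str) -> Optional[int]:
    """Convert text to int, returning None when it is empty or malformed."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def parse_v5(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per paper block from lines in the V5 format."""
    paper = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#*"):
            # Assumption: a title line opens a new block.
            if paper is not None:
                yield paper
            paper = {"title": line[2:].strip(), "authors": [], "references": []}
            continue
        if paper is None:
            continue
        if line.startswith("#@"):
            paper["authors"] = [a.strip() for a in line[2:].split(",") if a.strip()]
        elif line.startswith("#%"):
            ref = _to_int(line[2:])
            if ref is not None:
                paper["references"].append(ref)
        else:
            for prefix, key in INT_FIELDS.items():
                if line.startswith(prefix):
                    value = _to_int(line[len(prefix):])
                    if key == "citations" and value is not None:
                        value = max(value, 0)  # treat -1 the same as 0
                    paper[key] = value
                    break
            else:
                for prefix, key in TEXT_FIELDS.items():
                    if line.startswith(prefix):
                        paper[key] = line[len(prefix):].strip()
                        break
    if paper is not None:
        yield paper


sample_v5 = ["#*Spatial Data Structures.", "#@Hanan Samet", "#year1995",
             "#confModern Database Systems", "#citation2743", "#index25",
             "#arnetid27", "#%165"]
for record in parse_v5(sample_v5):
    print(record)
```

Lines that match no known prefix are ignored, so a file whose abstracts were wrapped across several lines would need extra handling that this sketch does not attempt.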

If the file format is ".tar.bz2", please use "tar -jxvf" to extract the file. 
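Where the `tar` command is not available, the same extraction can be done with Python's standard `tarfile` module. This is a minimal sketch; the helper name `extract_archive` and the paths passed to it are hypothetical, since the actual archive file names are not given here.

```python
import tarfile
from pathlib import Path
from typing import List, Union


def extract_archive(archive_path: Union[str, Path], dest_dir: Union[str, Path]) -> List[Path]:
    """Extract a .tar.bz2 archive into dest_dir and return the files found there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens the archive for reading with bzip2 decompression,
    # which corresponds to the -j flag of "tar -jxvf".
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

A call might look like `extract_archive("citation-network.tar.bz2", "data")`, where both arguments are placeholder names. As with any archive, extract only files obtained from a source you trust.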

References

If you use this data set for research, please cite one of the following papers:

Created by Jie Tang. Last updated on October 20, 2010.