Open Data and Codes by Arnetminer


Arnetminer has been in operation on the Internet since 2006. We have collected 548,504 researcher profiles using an approach based on Conditional Random Fields (CRF), along with 2,858,504 publications, 5,042 conferences, 32,215,473 paper-paper citation relationships, 47,443,857 coauthor relationships, and 14,720,130 paper-published-at relationships from online databases including DBLP, the ACM Digital Library, CiteSeer, and others. The extracted and integrated data is stored in an academic network base. Based on this academic network, we provide services such as expertise search, Bole search, citation tracing analysis, topical graph search, and a topic browser. The system has received a large number of visits from more than 180 countries. User feedback and system logs indicate that users find the system genuinely helpful for finding and sharing information in the academic community. [More...]

Researcher Profiling (Researcher Profile Extraction)

We are developing extraction tools in ArnetMiner, a researcher social network system. The tools will be used to extract researcher profiles from Web pages and output the extracted information to a researcher database. (See our related papers [ICDM'07] [KDD'08].) [Download]

Social Influence Analysis in Large-scale Network

In large social networks, nodes (users, entities) are influenced by others for various reasons. For example, colleagues have a strong influence on one's work, while friends have a strong influence on one's daily life. How can we differentiate social influences from different angles (topics)? How can we quantify the strength of those social influences? How can we estimate the model on real large networks? In this work, we focus on measuring the strength of social influence quantitatively. (See our related paper [KDD'09].) [Download]

Link Semantic Analysis on the Web

This work studies how to quantify link semantics. Specifically, an ideal output of link semantics analysis provides users with the following information: (1) the multiple topics discussed in each page; (2) the semantics of a link between two pages; and (3) the influential strength of each link. With such an analysis, a user could easily trace the origins of an idea or technique, analyze the evolution and impact of a topic, filter pages by certain categories of links, and zoom in and out of the linkage tracing graph by degree of influence. (See our related paper [ICDM'09].) [Download]


Expert Finding and Association Search

The data set is organized for expert finding and association search. For expert finding, we chose 13 highly frequently queried keywords in, and created 13 people lists as the ground truth. Details about how we created the data set are described here; see also our KDD2008 and ICDM2008 papers. [Download]

Conference Rank

We provide a new conference rank feature.

We develop 3 algorithms for ranking conferences. [More] [An old version]

Representative Publication

Related publications can be found here.

Social Action Prediction

It is well recognized that users’ actions in a social network are influenced by various complex and subtle factors. This data set is used to learn and understand users' behavior models. It includes historical information (e.g., the tweets of each user on Twitter and their friendships), and the goal is to predict who will perform a specific social action at a specific time. [more...]

Link (Follow-back) Prediction

This is a dynamic Twitter network. The data set can be used for link prediction or follow-back prediction. To begin the collection process, we selected the most popular user on Twitter, i.e., “Lady Gaga”, and randomly collected 10,000 of her followers. We took these users as seed users and used a crawler to collect all followers of these users by traversing following edges. We continued the traversal, which produced a total of 13,442,659 users and 56,893,234 following links, with an average of 728,509 new links per day. The crawler monitored changes in the network structure from 10/12/2010 to 12/23/2010. [more...]
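The seed-and-traverse collection described above can be sketched as a breadth-first (snowball) crawl. This is a minimal illustration, not the crawler actually used: the `followers_of` callable, the `max_users` budget, and the toy graph are all hypothetical stand-ins for a real follower lookup.

```python
from collections import deque

def snowball_crawl(followers_of, seeds, max_users=None):
    """Breadth-first traversal of follower edges starting from seed users.

    followers_of: hypothetical callable, user id -> iterable of follower ids.
    max_users:    optional cap on the number of collected users (assumption).
    Returns (users, links), where links holds (follower, followee) pairs.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    users = set(seeds)
    links = set()
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for follower in followers_of(user):
            if follower not in users:
                if max_users is not None and len(users) >= max_users:
                    continue  # budget exhausted: skip this user and its link
                users.add(follower)
                queue.append(follower)
            links.add((follower, user))
    return users, links

# Toy follower graph (hypothetical data): user -> list of followers.
toy_graph = {"a": ["c"], "b": ["a", "c"], "c": []}
users, links = snowball_crawl(lambda u: toy_graph.get(u, []), ["a", "b"])
```

Re-running such a crawl at regular intervals and comparing the resulting link sets is one simple way to observe how the network changes over time.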

Classifying Social Relationships

There are 3 genres of real-world data sets: Publication (the coauthor network of Arnetminer), Email (the email network of Enron employees), and Mobile (the mobile network of the Reality Mining Project). We explore advisor-advisee, manager-subordinate, and friendship relationships in these data sets, respectively. [more...]

Inferring Social Ties

Another collection of data sets for classifying social relationships. There are six different networks: Epinions, Slashdot, MobileU, MobileD, Coauthor, and Enron.

  • Epinions is a network of product reviewers. The data set consists of 131,828 users and 841,372 relationships, of which about 85.0% are trust relationships. 80,668 users received at least one trust or distrust relationship. Our goal on this data set is to infer the trust relationships between users.
  • Slashdot is a network of friends. The data set is comprised of 77,357 users and 516,575 relationships, of which 76.7% are "friend" relationships. Our goal on this data set is to infer the "friend" relationships between users.
  • MobileU is a network of mobile users. In total, the data contains 5,436 relationships. Our goal is to infer whether two users have a friend relationship. For evaluation, all users were required to complete an online survey, in which 157 pairs of users are labeled as friends.
  • MobileD is a relatively larger enterprise mobile network, where nodes are employees in a company and relationships are formed by calls and short messages exchanged over a few months. In total, there are 232 users (50 managers and 182 ordinary employees) and 3,567 relationships (including calls and text messages) between the users. The objective here is to infer manager-subordinate relationships between users based on their mobile usage patterns.
  • Coauthor is a network of authors. The data set, crawled from, is comprised of 815,946 authors and 2,792,833 coauthor relationships. 
  • Enron is an email communication network. It consists of 136,329 emails between 151 Enron employees. Our goal on this data set is to infer manager-subordinate relationships between users. 


Collaboration Recommendation

The data set is extracted from, an academic search system, which contains 1,436,990 authors and 1,932,442 publications. The data we used in our experiments spans from 1990 to 2005. We consider the following five sub-domains: 

  • Data Mining: We use papers from the following data mining conferences as ground truth: KDD, SDM, ICDM, WSDM and PKDD, which results in a network with 6,282 authors and 22,862 co-author relationships.
  • Medical Informatics: We include the following journals: Journal of the American Medical Informatics Association, Journal of Biomedical Informatics, Artificial Intelligence in Medicine, IEEE Trans. Med. Imaging and IEEE Transactions on Information Technology in Biomedicine, from which we obtain a network of 9,150 authors and 31,851 co-author relationships.
  • Theory: We include the following conferences, i.e., STOC, FOCS and SODA, from which we get 5,449 authors and 27,712 co-author relationships.
  • Visualization: We include the following conferences and journals: CVPR, ICCV, VAST, TVCG, IEEE Visualization and Information Visualization. The resulting network is comprised of 5,268 authors and 19,261 co-author relationships.
  • Database: We include the following conferences, i.e., SIGMOD, VLDB and ICDE. From those conferences, we extract 7,590 authors and 37,592 co-author relationships.
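The per-domain networks listed above can be derived by filtering papers by venue and year and linking every pair of authors on the same paper. The following is a minimal sketch under stated assumptions: the paper record fields (`venue`, `year`, `authors`) and the sample records are hypothetical and do not reflect the actual format of the released data.

```python
from itertools import combinations

def coauthor_network(papers, venues, start_year=1990, end_year=2005):
    """Build an undirected co-author network from papers in the given venues.

    papers: iterable of dicts with hypothetical keys "venue", "year", "authors".
    Returns (authors, edges), where each edge is a sorted pair of author names.
    """
    venues = set(venues)
    authors, edges = set(), set()
    for paper in papers:
        if paper["venue"] not in venues:
            continue
        if not start_year <= paper["year"] <= end_year:
            continue
        names = sorted(set(paper["authors"]))  # sort so each pair has one form
        authors.update(names)
        edges.update(combinations(names, 2))
    return authors, edges

# Hypothetical sample records for the Data Mining sub-domain.
sample_papers = [
    {"venue": "KDD", "year": 2004, "authors": ["A", "B", "C"]},
    {"venue": "STOC", "year": 2004, "authors": ["A", "D"]},
    {"venue": "ICDM", "year": 2008, "authors": ["E", "F"]},
    {"venue": "SDM", "year": 2001, "authors": ["B", "A"]},
]
dm_authors, dm_edges = coauthor_network(
    sample_papers, ["KDD", "SDM", "ICDM", "WSDM", "PKDD"]
)
```

Storing each edge as a sorted pair means repeated collaborations between the same two authors count once, which matches a count of distinct co-author relationships.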


Heterogeneous Social Network Integration

We are studying the heterogeneous network integration problem. We have collected data from different social networking sites. The data set published on this page consists of two collections of social networks, where the networks within a collection overlap with each other (i.e., they have users corresponding to the same real-world person). [More...]

Created by Jie Tang. Last updated on Nov. 18, 2013.