Learning to Infer Social Ties in Large Networks
Wenbin Tang, Honglei Zhuang, Jie Tang
Introduction
This work focus on inferring social relationship type from online social networks. Different from most of the previous work concerning about a specific domain, this work contributes to establishment of a unified model which can be easily extend to different scenarios. In this work we propose a partially-labeled pairwise factor graph model(PFP-FGM) and apply it on three different data sets. This page contains the description of this work and the related data sets.
This work has been accepted in ECML PKDD 2011, and selected for the Best Student Paper Runner-up Award in Data Mining. If you use data sets provided or refer to the model, please cite:
@inproceedings{Tang:11PKDD,
author = {Wenbin Tang and
Honglei Zhuang and
Jie Tang},
title = {Learning to Infer Social Ties in Large Networks},
booktitle = {Proceedings of the ECML/PKDD 2011},
year = {2011},
pages = {381-397}
}
The full paper can be downloaded here[pdf].
Model
The input social network is partially labeled, i.e. some relationships are labeled, while others are left unlabeled. Our task is to incorporate the labeled relationships as well as the unlabeled network information to infer the type of unlabeled relationships. In this work, we model each relationship as a node in the graphical model. Correspondingly, the task of relationship mining becomes how to predict the semantic labels of each node in the model.

We have three basic intuitions: 1) the user-specific or link-specific attributes contain implicit information about the relationships; 2) relationships or different users may have a correlation; 3) some global constraints such as common knowledge or user-specific constraints. Based on the intuitions above, we propose a partially-labeled pairwise factor graph model (PLP-FGM). Each relationship ri1i2 in the social network is modeled as a node yi in PLP-FGM, and three factors are defined corresponding to three intuitions:
Given the partially-labeled social network G, the joint distribution over Y can be defined as:
p(Y |G)=∏f(yi, xi)g(yi, G(yi))h(yi, H(yi))
The three factors can be instantiated in different ways. In this work, all the three factor functions are defined by exponential-linear functions.
Model Learning Learning PLP-FGM is to estimate the parameter configuration θ so that the log-likelihood of observation information (labeled relationships) are maximized. To optimize the log-likelihood objective function, a gradient decent method is utilized. One challenge is that one needs to sum up the likelihood of possible states for all nodes including unlabeled nodes while calculating the normalization factor Z. Thus we infer the unlabeled nodes based on the labeled nodes. Another challenge is that the graphical structure in PLP-FGM is arbitrary and may contains circles. We utilize Loopy Belief Propagation (LBP) to approximate the marginal probabilities so as to obtain the gradient.
Inferring Unknown Social Ties Based on learned parameters θ, we can predict the label of each relationship by finding a label configuration which maximizes the joint probability.
Distributed Learning To scale up well with large networks, we develop a distributed learning method based on MPI (Message Passing Interface).
Experimental Results
The experimental results shows that our method consistently outperforms other methods on all the three data sets.

And we also perform an analysis to evaluate the contribution of different factors defined in our model.

Data Sets
Data sets used in our experiments of "Learning to Infer Social Ties in Large Networks" is publicly available.
There are 3 genres of real-world data sets: Publication (coauthor network of Arnetminer), Email (email network of Enron employees), Mobile (mobile network of Reality Mining Project). We explore the advisor-advisee relationships, manager-subordinate relationships and friendship relationships of these data sets respectively.
Publication The Arnetminer data set is available by using the open service of Arnetminer. A subset is available here.
Email The Email data set is related with http://www.infochimps.com/datasets/enron-email-data-with-manager-subordinate-relationship-metadata .
Mobile The Mobile data set is related with http://reality.media.mit.edu/download.php . You can mail to the original authors to get the coarse data.
Attributes used in our experiments are provided here for reference.

Updating... Last Updated: 2011/7/28