Learning to Infer Social Ties in Large Networks

Learning to Infer Social Ties in Large Networks

Wenbin Tang, Honglei Zhuang, Jie Tang

Introduction

This work focus on inferring social relationship type from online social networks. Different from most of the previous work concerning about a specific domain, this work contributes to establishment of a unified model which can be easily extend to different scenarios. In this work we propose a partially-labeled pairwise factor graph model(PFP-FGM) and apply it on three different data sets. This page contains the description of this work and the related data sets. 

This work has been accepted in ECML PKDD 2011, and selected for the Best Student Paper Runner-up Award in Data Mining. If you use data sets provided or refer to the model, please cite:

@inproceedings{Tang:11PKDD,
 author    = {Wenbin Tang and
               Honglei Zhuang and
               Jie Tang},
  title     = {Learning to Infer Social Ties in Large Networks},
  booktitle = {Proceedings of the ECML/PKDD 2011},
  year      = {2011},
  pages     = {381-397}
}

 

The full paper can be downloaded here[pdf].

Model

The input social network is partially labeled, i.e. some relationships are labeled, while others are left unlabeled. Our task is to incorporate the labeled relationships as well as the unlabeled network information to infer the type of unlabeled relationships. In this work, we model each relationship as a node in the graphical model. Correspondingly, the task of relationship mining becomes how to predict the semantic labels of each node in the model.

We have three basic intuitions: 1) the user-specific or link-specific attributes contain implicit information about the relationships; 2) relationships or different users may have a correlation; 3) some global constraints such as common knowledge or user-specific constraints. Based on the intuitions above, we propose a partially-labeled pairwise factor graph model (PLP-FGM). Each relationship ri1i2 in the social network is modeled as a node yi in PLP-FGM, and three factors are defined corresponding to three intuitions:

  • Attribute factor: f(yi, xi) represents the posterior probability of the relationship yi given the attribute vector xi;
  • Correlation factor: g(yi, G(yi)) denotes the correlation between the relationships, where G(yi) is the set of correlated relationships to yi;
  • Constraint factor: h(yi, H(yi)) reflects the constraints between relationships, where H(yi) is the set of relationships constrained on yi.

Given the partially-labeled social network G, the joint distribution over Y can be defined as:

p(Y |G)=f(yixi)g(yi, G(yi))h(yi, H(yi))

The three factors can be instantiated in different ways. In this work, all the three factor functions are defined by exponential-linear functions. 

Model Learning   Learning PLP-FGM is to estimate the parameter configuration θ so that the log-likelihood of observation information (labeled relationships) are maximized. To optimize the log-likelihood objective function, a gradient decent method is utilized. One challenge is that  one needs to sum up the likelihood of possible states for all nodes including unlabeled nodes while calculating the normalization factor Z. Thus we infer the unlabeled nodes based on the labeled nodes. Another challenge is that the graphical structure in PLP-FGM is arbitrary and may contains circles. We utilize Loopy Belief Propagation (LBP) to approximate the marginal probabilities so as to obtain the gradient. 

Inferring Unknown Social Ties   Based on learned parameters θ, we can predict the label of each relationship by finding a label configuration which maximizes the joint probability. 

Distributed Learning   To scale up well with large networks, we develop a distributed learning method based on MPI (Message Passing Interface). 

Experimental Results

The experimental results shows that our method consistently outperforms other methods on all the three data sets. 

And we also perform an analysis to evaluate the contribution of different factors defined in our model.

Data Sets

Data sets used in our experiments of "Learning to Infer Social Ties in Large Networks" is publicly available. 

There are 3 genres of real-world data sets: Publication (coauthor network of Arnetminer), Email (email network of Enron employees), Mobile (mobile network of Reality Mining Project). We explore the advisor-advisee relationships, manager-subordinate relationships and friendship relationships of these data sets respectively. 

Publication  The Arnetminer data set is available by using the open service of Arnetminer. A subset is available here

Email   The Email data set is related with http://www.infochimps.com/datasets/enron-email-data-with-manager-subordinate-relationship-metadata . 

Mobile   The Mobile data set is related with http://reality.media.mit.edu/download.php . You can mail to the original authors to get the coarse data. 

Attributes used in our experiments are provided here for reference.

 

 

Updating... Last Updated: 2011/7/28