@INPROCEEDINGS{Shi:11TIST,
AUTHOR = "Lixin Shi and Yuhang Zhao and Jie Tang",
TITLE = "Batch Mode Active Learning for Networked Data",
YEAR = {2011},
BOOKTITLE = "ACM Transactions on Intelligent Systems and Technology (TIST)",
}
Introduction
We study a novel problem of batch mode active learning for networked data. In this problem, data instances are connected with links and their labels are correlated with each, and the goal of our batch mode active learning is to exploit the link-based dependencies and node-specific content information to actively select a batch of instances to query the user for learning an accurate model to label the unknown instances in the network. We present three criteria (i.e., minimum redundancy, maximum uncertainty and maximum impact) to quantify the informativeness of a set of instances, and formalize the batch mode active learning problem as selecting a set of instances by maximizing an objective function which combines both link and content information. Solving the objective function is NP-hard, we present an efficient algorithm to optimize the objective function with a bounded approximation rate. To scale to real large networks, we develop a parallel implementation of the algorithm. Experimental results on both synthetic data sets and real-world data sets demonstrate the effectiveness and efficiency of our approach.
Machine learning algorithms often suffer from insufficiently labeled training data. The goal of active learning is, not only, as usual, to construct an accurate classifier, but also to minimize the number of labeled instances by actively selecting a few number of instances to query the user. Also, recently there has seen a new direction of machine learning field, that is how to learn an accurate model to classify the networked data. In this work, we aim to answer the question: how to actively select a set of instances from the networked data and query the users in a batch mode?
We refer to this problem as the Batch Mode active learning (BMAL) problem for networked data. Suppose we are given a data set with only one labeled sample (with positive label), and the samples are connected in a two dimensional space. We try to select two samples to query the user in a batch mode. If we only consider the content information, then we would tend to select the two center nodes (top-right figure); while if we only consider the link information, we would select two samples with the most links (middle-left figure). By combining the link and content information, our BMAL method suggests two different samples (middle-right figure). The bottom two figures show how a machine learning algorithm updates the classification model when the user provides labels to the queried samples (suppose one positive and one negative).

(The top-left figure plots the input networked data, where pink circles indicate instances with positive labels, gray circles indicate unknown labels, and blue circles indicate instances with negative labels. "w/o link'' stands for active learning without considering links; "w/o content'' stands for active learning without considering the content information; and ``BMAL'' stands for active learning by the proposed method, which considers both the content and link information.)
Technologies

We designed Q(S) as the objective function over S, a subset of the unlabeled set. It measures the informativeness of S, so collective active learning could be viewed as selecting:

There are three criteria to design this objective function:
-- Maximum Uncertainty: H(S) as summation of entropies
-- Maximum Impact: Graph-cut-based design of C(S), which could be seen as a summation of maximum impacts over samples in S
-- Minimum Redundancy: We theoretically proved that definition of C(S) minimizes redundancy
Experiments
We have conducted extensive experiments on both synthetic and real-world data sets. The following is a series of experiments on real-world document citation-network data sets:
| Name | Topics | Documents | Citings | Words |
| Cora | 7 | 2708 | 5429 | 1433 |
| Citeseer | 6 | 3312 | 4732 | 3703 |
| WebKB | 5 | 877 | 2868 | 1703 |
Here is the average classification accuracy result, comparing with several baselines: Random, Most uncertainty(MU), Active Learning using Gaussian Fields(GF, suggested by Zhu et al. 2003), Hybrid(suggested by Macskassy 2009), k-means(K-M, suggested by Rattigan et al. 2007):
![]() |
![]() |
![]() |
Data & Codes
Codes could be found here. It is mainly written in Matlab, mainly consists of the active selection process (see ./LCC) and auxiliary routine for learning (./SVM, ./BMAL_Sparse, ./GF), some toolkit. It also contains the opensource netkit toolkit (see http://wiki.netkit.org/index.php/Main_Page)
Data coud be found here. It includes the four data sets that we are using:
Full paper could be found here.
To edit this page, please click here. Powered by ArnetMiner.Org.