A communication efficient probabilistic algorithm for mining frequent itemsets from a peer-to-peer network

Statistical Analysis and Data Mining(2009)

引用 10|浏览40
暂无评分
摘要
Data intensive large-scale distributed systems like peer-to-peer (P2P) networks are becoming increasingly popular where centralization of data is impossible for mining and analysis. Unfortunately, most of the existing data mining algorithms work only when data can be accessed in its entirety. Finding all the network-wide frequent itemsets is computationally difficult and usually has large communication overhead in such environment. This paper focuses on developing a communication efficient algorithm for discovering frequent itemsets from a P2P network. A sampling-based approach is adopted to find approximate solution instead of an exact solution with probabilistic guarantee. The benefit of approximation technique is reflected in the low communication overhead in discovering majority of frequent itemsets with probabilistic guarantee. The main principal followed by the algorithm assumes that an independent and identically distributed (iid) sample of the entire data is available at one location to generate a set of candidate itemsets. Collecting iid sample from a P2P network is a challenging problem because of varying degrees of connectivity and sizes of data shared. The paper first addresses this issue and shows how an iid sample of nodes and data can be collected from a P2P network using random walk. It applies the proposed sampling technique to identify most of the frequent itemsets from a P2P network. Theoretical analysis shows how to decide about optimum sample size and minimize communication to compute the results. Experimental results show that the proposed algorithm discovers all of the network-wide frequent itemsets using communication that scales sublinearly with network and datasize. © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 48-69, 2009
更多
查看译文
关键词
probabilistic algorithm,sampling,data mining
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要