A parallel text clustering method using Spark and hashing

COMPUTING(2021)

Abstract
Clustering textual data has become an important task in data analytics, since many applications require automatically organizing large amounts of textual documents into homogeneous topics. The continuing growth of textual data available from the web, social networks and open platforms makes this task increasingly challenging, so it is important to design scalable clustering methods able to effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method deals simultaneously with the issue of clustering huge numbers of documents and the issue of the high dimensionality of textual data, by integrating a divide-and-conquer approach and a new document hashing strategy, respectively. Together, these two components yield a substantial improvement in scalability while closely approximating the clustering quality of non-hashed representations. Experiments performed on several large collections of documents show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
Keywords
Text clustering, Parallel computing, Spark framework, Hashing, High-dimensional data
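To make the general idea behind the abstract concrete, the following is a minimal, hedged sketch of a Spark pipeline that combines feature hashing (to bound the dimensionality of document vectors) with parallel k-means clustering. It uses only standard PySpark ML components; the column names, hash-space size, number of clusters and toy documents are illustrative assumptions and are not the authors' actual method or parameters.

```python
# Illustrative sketch only: generic Spark text clustering with feature hashing.
# Not the paper's algorithm; numFeatures, k and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("hashed-text-clustering").getOrCreate()

# Hypothetical input: one document per row in a 'text' column.
docs = spark.createDataFrame(
    [("spark makes parallel text processing simple",),
     ("hashing reduces the dimensionality of document vectors",),
     ("clustering groups documents into homogeneous topics",)],
    ["text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
# Feature hashing maps tokens into a fixed-size vector space,
# which keeps the dimensionality bounded regardless of vocabulary size.
hashing_tf = HashingTF(inputCol="tokens", outputCol="raw_features",
                       numFeatures=1 << 12)
idf = IDF(inputCol="raw_features", outputCol="features")
kmeans = KMeans(featuresCol="features", k=2, seed=42)

model = Pipeline(stages=[tokenizer, hashing_tf, idf, kmeans]).fit(docs)
model.transform(docs).select("text", "prediction").show(truncate=False)
```

Because both the hashed feature extraction and the k-means iterations are expressed as Spark transformations, the work is partitioned across executors, which is the same scalability lever the paper's divide-and-conquer design relies on.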