Distributed Evolutionary Feature Selection for Big Data Processing

Vietnam Journal of Computer Science (2022)

Abstract
Feature selection has become a powerful dimensionality reduction strategy and an effective tool for handling high-dimensional data. It aims to reduce the dimension of the feature space, and thereby to speed up the learning model and reduce its cost, by selecting the feature subset most relevant to data mining and machine learning tasks. Selecting the optimal feature subset is an optimization problem that has been proved to be NP-hard. Metaheuristics are traditionally used to deal with NP-hard problems, since they are well known for solving complex, real-world problems in a reasonable period of time. The genetic algorithm (GA) is one of the most popular metaheuristic algorithms and has proved effective for accurate feature selection. However, in the last few decades, data have become progressively larger in both the number of instances and the number of features. This paradigm is popularly termed Big Data. With the tremendous growth of dataset sizes, most current feature selection algorithms, and especially GA, become unscalable. To improve the scalability of a feature selection algorithm on big data, a distributed computing strategy is typically adopted, such as the MapReduce model and the Hadoop system. In this paper, we first review the most recent works that apply parallel genetic algorithms to large datasets. We then propose a new parallel genetic algorithm based on the coarse-grained parallelization model (island model). The parallelization of the process and the distributed partitioning of the data are performed using the Hadoop system on an Amazon cluster. The performance and scalability of the proposed method are compared, theoretically and empirically, with existing feature selection methods on large-scale datasets, and the results confirm the effectiveness of the proposed method.
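The island model mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the fitness function is a toy stand-in for a classifier-based score, the ring migration policy and all names and parameters (`fitness`, `evolve`, `migrate`, `island_ga`, `relevant`, `migrate_every`) are assumptions made for illustration, and the Hadoop/MapReduce distribution is not shown. Each island is simply evolved in turn, where a distributed version would presumably assign one island per worker.

```python
import random


def fitness(mask, relevant):
    """Toy score (an assumption): +1 per relevant feature kept, -0.5 per extra one."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in relevant)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


def evolve(pop, relevant, rng, mutation_rate=0.05):
    """One generation: elitism, binary tournament, one-point crossover, bit-flip mutation."""
    n = len(pop[0])
    best = max(pop, key=lambda m: fitness(m, relevant))
    children = [best[:]]  # keep the island's best individual unchanged
    while len(children) < len(pop):
        p1 = max(rng.sample(pop, 2), key=lambda m: fitness(m, relevant))
        p2 = max(rng.sample(pop, 2), key=lambda m: fitness(m, relevant))
        cut = rng.randrange(1, n)
        child = p1[:cut] + p2[cut:]
        child = [1 - b if rng.random() < mutation_rate else b for b in child]
        children.append(child)
    return children


def migrate(islands, relevant):
    """Ring migration: each island's worst individual is replaced by the previous island's best."""
    bests = [max(pop, key=lambda m: fitness(m, relevant))[:] for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j], relevant))
        pop[worst] = bests[(i - 1) % len(islands)]


def island_ga(n_features, relevant, n_islands=4, pop_size=20,
              generations=40, migrate_every=5, seed=0):
    """Evolve several sub-populations of feature masks with periodic migration."""
    rng = random.Random(seed)
    islands = [
        [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for gen in range(1, generations + 1):
        # Sequential here; in a distributed setting each island would run on its own worker.
        islands = [evolve(pop, relevant, rng) for pop in islands]
        if gen % migrate_every == 0:
            migrate(islands, relevant)
    return max((m for pop in islands for m in pop), key=lambda m: fitness(m, relevant))


if __name__ == "__main__":
    relevant = {0, 3, 7, 11, 15}  # hypothetical "truly relevant" feature indices
    best = island_ga(n_features=20, relevant=relevant)
    print(best, fitness(best, relevant))
```

Each individual is a binary mask over the features. Islands evolve independently and exchange individuals only every `migrate_every` generations, which is what keeps communication between workers low in a coarse-grained scheme.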
Keywords
Genetic algorithm, feature selection, island model, parallel processing