Investigating Two Approaches For Adding Feature Ranking To Sampled Ensemble Learning For Software Quality Estimation

Kehan Gao,Taghi M. Khoshgoftaar,Amri Napolitano

INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING（2015）

引用 7|浏览25

暂无评分

摘要

Defect prediction is very challenging in software development practice. Classification models are useful tools that can help for such prediction. Classification models can classify program modules into quality-based classes, e.g. fault-prone (fp) or not-fault-prone (nfp). This facilitates the allocation of limited project resources. For example, more resources are assigned to program modules that are of poor quality or likely to have a high number of faults based on the classification. However, two main problems, high dimensionality and class imbalance, affect the quality of training datasets and therefore classification models. Feature selection and data sampling are often used to overcome these problems. Feature selection is a process of choosing the most important attributes from the original dataset. Data sampling alters the dataset to change its balance level. Another technique, called boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), is found to also be effective for resolving the class imbalance problem.In this study, we investigate an approach for combining feature selection with this ensemble learning (boosting) process. We focus on two different scenarios: feature selection performed prior to the boosting process and feature selection performed inside the boosting process. Ten individual base feature ranking techniques, as well as an ensemble ranker based on the ten, are examined and compared over the two scenarios. We also employ the boosting algorithm to construct classification models without performing feature selection and use the results as the baseline for further comparison. The experimental results demonstrate that feature selection is important and needed prior to the learning process. In addition, the ensemble feature ranking method generally has better or similar performance than the average of the base ranking techniques, and more importantly, the ensemble method exhibits better robustness than most base ranking techniques. As for the two scenarios, the results show that applying feature selection inside boosting performs better than using feature selection prior to boosting.

查看译文

关键词

Software defect prediction, software metrics, feature selection, data sampling, sampled ensemble learning, boosting

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要