Improving the Performance and Interpretability on Medical Datasets using Graphical Ensemble Feature Selection

Enzo Battistella, Dina Ghiassian, Albert László Barabási

Research Square (2023)

Abstract
A major hindrance to using machine learning on medical datasets is the mismatch between a large number of variables and small sample sizes. While multiple feature selection techniques have been proposed to avoid the resulting overfitting, ensemble techniques overall offer the most robust selection. Yet current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features. Relying on four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% in Balanced Accuracy while selecting fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes, and that the uncovered biological mechanisms are more diverse. By successfully tackling the complex correlations between biological variables, we anticipate that GE will improve the medical applications of machine learning.
Keywords
feature selection, medical datasets, ensemble, interpretability