Semantic Nonnegative Matrix Factorization with Automatic Model Determination for Topic Modeling

2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), 2020

Cited by 8
Abstract
Non-negative Matrix Factorization (NMF) models the topics of a text corpus by decomposing the term frequency-inverse document frequency (TF-IDF) matrix, X, into two low-rank non-negative matrices: W, representing the topics, and H, mapping the documents onto the space of topics. One challenge, common to all topic models, is determining the number of latent topics (also known as model determination). Determining the correct number of topics is important: underestimating it results in poor topic separation (under-fitting), while overestimating it leads to noisy topics (over-fitting). Here, we introduce SeNMFk, a semantic-assisted NMF-based topic modeling method that incorporates semantic correlations into NMF via a word-context matrix and employs a procedure for determining the number of latent topics. SeNMFk first creates a random ensemble of matrices from the initial TF-IDF matrix and a word-context matrix, and then applies a coupled factorization to obtain sets of stable, coherent topics that are robust to noise. The latent dimension is determined based on the stability of these topics. We show that SeNMFk accurately determines the number of high-quality topics in benchmark text corpora, which leads to accurate document clustering.
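The factorization X ≈ WH and the stability-based choice of the number of topics can be illustrated with a short sketch. The Python example below is an assumption-laden illustration, not the authors' SeNMFk implementation: it omits the semantic word-context coupling and simply fits scikit-learn's NMF to a TF-IDF matrix for several candidate topic counts k, refits on randomly perturbed copies of the data, and selects the k whose topic vectors are most stable across this small ensemble. The helper topic_stability, the toy corpus, and the multiplicative-noise perturbation are hypothetical choices made for the example.

```python
# Sketch: choose the number of NMF topics by stability across a random ensemble.
# Not the SeNMFk algorithm; a minimal illustration of the underlying idea.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from scipy.optimize import linear_sum_assignment

def topic_stability(W_ref, W_pert):
    """Mean cosine similarity between optimally matched topic columns."""
    A = W_ref / (np.linalg.norm(W_ref, axis=0, keepdims=True) + 1e-12)
    B = W_pert / (np.linalg.norm(W_pert, axis=0, keepdims=True) + 1e-12)
    sim = A.T @ B                              # k x k cosine similarities
    row, col = linear_sum_assignment(-sim)     # best one-to-one topic matching
    return sim[row, col].mean()

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets rise and fall", "investors trade stocks daily"]
X = TfidfVectorizer().fit_transform(docs).toarray()   # documents x terms

rng = np.random.default_rng(0)
scores = {}
for k in range(2, 4):                                  # candidate topic counts
    ref = NMF(n_components=k, init="nndsvda", random_state=0).fit(X)
    W_ref = ref.components_.T                          # terms x topics ("W" of the abstract)
    sims = []
    for _ in range(10):                                # small random ensemble
        X_pert = X * rng.uniform(0.9, 1.1, size=X.shape)   # multiplicative noise
        W = NMF(n_components=k, init="nndsvda",
                random_state=0).fit(X_pert).components_.T
        sims.append(topic_stability(W_ref, W))
    scores[k] = float(np.mean(sims))

best_k = max(scores, key=scores.get)
print(scores, "selected k =", best_k)
```

In this sketch the selected k is simply the candidate with the highest mean matched-topic cosine similarity; SeNMFk's actual criterion, ensemble construction, and coupled factorization of the TF-IDF and word-context matrices are described in the paper itself.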
Keywords
NMF, Topic models, Document clustering, NLP