Topic Modeling on News Articles using Latent Dirichlet Allocation

Mykyta Kretinin, Giang Nguyen

2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES) (2022)

Topic modeling is widely used to obtain the most visible topics from a given text corpus. This work demonstrates topic modeling of the most discussed topics in articles from the Reuters news website. The articles are collected and subsequently processed with Latent Dirichlet Allocation (LDA), an unsupervised learning algorithm. The main goal is to build the model(s) that most accurately produce the most discussed topics. Such models can be used in practice to quickly obtain information about current news, to classify documents in a given dataset, and to extract dominant topics with their keywords. This helps, for example, to correlate topics with user preferences and to recommend interesting content. Related works use different models to evaluate texts and obtain statistics about them, such as the most common opinions people hold on a question, or the popular and dominant subtopics of a topic-specific dataset (e.g., medical articles). As a result of this work, we created a generic LDA model trained on Wikipedia articles. The model successfully analyzes Reuters articles and extracts their topics as keyword sets. These keyword sets can then be used to recommend content that is interesting to the target user, for example, based on the tags of the recommended content.
Topic Modeling, Latent Dirichlet Allocation, Reuters Articles, Wikipedia, Ukraine, War, Covid, NLP
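
The abstract describes fitting an LDA model and reading each topic off as a set of keywords. As a rough illustration only, the sketch below implements LDA with collapsed Gibbs sampling in plain Python on a tiny made-up corpus. It is not the authors' pipeline: the toy documents, the function names (train_lda, top_keywords, topic_distribution), and the hyperparameters alpha and beta are all assumptions chosen for the example, and the paper's Wikipedia training and Reuters evaluation are not reproduced here.

```python
import random


def train_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents (toy scale)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


def top_keywords(vocab, topic_word, n=3):
    """Return the n highest-count words of each topic as a keyword set."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        result.append([vocab[v] for v in ranked[:n]])
    return result


def topic_distribution(counts, alpha=0.1):
    """Smoothed per-document topic proportions from the sampler's counts."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Made-up toy corpus; a real run would use preprocessed news articles.
docs = [
    "vaccine virus hospital vaccine patients".split(),
    "virus patients hospital outbreak vaccine".split(),
    "market stocks bank inflation market".split(),
    "bank inflation stocks rates market".split(),
]

vocab, doc_topic, topic_word = train_lda(docs, num_topics=2)
keywords = top_keywords(vocab, topic_word, n=3)
distributions = [topic_distribution(row) for row in doc_topic]

for k, words in enumerate(keywords):
    print(f"topic {k}: {words}")
for d, dist in enumerate(distributions):
    dominant = max(range(len(dist)), key=lambda t: dist[t])
    print(f"doc {d}: dominant topic {dominant}")
```

In practice a library implementation would normally replace this hand-written sampler; the point here is only to show how topic-word counts turn into keyword sets and how document-topic counts give each document a dominant topic.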