Cross-lingual Text Clustering in a Large System

Nicole R. Schneider, Jagan Sankaranarayanan, Hanan Samet

PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2023 (2023)

Abstract

The multilingual world needs systems that can cluster text written in multiple languages into the same thread or topic. Multilingual text can be clustered by translating it into a canonical language and clustering the translations, by using multilingual embeddings to cluster articles in a shared embedding space, or by other language-independent methods. The performance and pitfalls of these methods have not been well studied in the context of real-time clustering across documents written in many languages. We address this problem by generating a large dataset of news articles using a reference architecture that continuously indexed and clustered articles spanning 17 languages over the last 15 years. Through an analysis of these documents and their clusters, we show that clustering quality depends on the normalization of proper nouns, the types of georeferences, and the overall geographic focus of the document.
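To make the shared-embedding-space approach concrete, the following is a minimal sketch of incremental (real-time) clustering, not the system described in the paper. It assumes each article has already been mapped to a vector by some multilingual embedding model; the toy three-dimensional vectors, the `OnlineClusterer` class, and the 0.8 similarity threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Hypothetical single-pass clusterer: each incoming vector joins the
    most similar existing cluster, or starts a new one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff, not taken from the paper
        self.centroids = []         # running mean vector per cluster
        self.counts = []            # number of documents per cluster

    def add(self, vec):
        # Find the closest existing centroid.
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        # Close enough: join that cluster and update its running mean.
        if best >= 0 and best_sim >= self.threshold:
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1)
                for c, v in zip(self.centroids[best], vec)
            ]
            self.counts[best] = n + 1
            return best

        # Otherwise open a new cluster seeded with this document.
        self.centroids.append(list(vec))
        self.counts.append(1)
        return len(self.centroids) - 1


# Toy stand-ins for multilingual embeddings of four articles.
stream = [
    ("en", [0.90, 0.10, 0.00]),
    ("es", [0.85, 0.15, 0.05]),
    ("de", [0.00, 0.20, 0.95]),
    ("fr", [0.05, 0.10, 0.90]),
]

clusterer = OnlineClusterer(threshold=0.8)
labels = [clusterer.add(vec) for _, vec in stream]
```

In this sketch the language tag plays no role in the assignment, which is the point of a shared embedding space: the English and Spanish vectors fall into one cluster and the German and French vectors into another purely on vector similarity. A translate-then-cluster pipeline would instead insert a translation step before computing the representation.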
