Deep Clustering for Data Cleaning and Integration

Hafiz Tayyab Rauf, Norman W. Paton, Andre Freitas

arXiv (Cornell University) (2023)

Abstract
Deep Learning (DL) techniques now constitute the state of the art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in canonical data cleaning and integration tasks, including schema inference, entity resolution, and domain discovery, which represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we also observed that the chosen embedding approaches for rows, columns, and tables significantly impacted the clustering performance.
Keywords
data cleaning