A Study on Efficient Indexing for Table Search in Data Lakes.

Ibraheem Taha, Matteo Lissandrini, Alkis Simitsis, Yannis E. Ioannidis

IEEE International Conference on Semantic Computing (2024)

Abstract

Data lakes store diverse and large volumes of datasets. One of the core challenges in data lakes is dataset discovery, which involves tasks such as finding related tables, domain discovery, and column clustering. In this paper, we focus on a popular approach for finding related tables in public or private data lakes, namely table search. Given the heterogeneity of the tables in a data lake, recent methods adopt table-representation learning and produce dense vector representations for every row, column, or even cell value. This enables advanced indexing techniques, such as HNSW, LSH, and DiskANN, which implement efficient data structures to speed up the core operation of approximate k-NN search in such vector spaces. However, while many indexing techniques have been employed so far, their practical value and effectiveness, governed by the tradeoff between accuracy and performance, have not yet been explored. In this paper, we aim to shed light on this gap. We start with an overview of state-of-the-art techniques for table search in data lakes that are based on vector-search operations. Then, we present an in-depth analysis of the performance of the k-ANN indexes and the techniques they adopt. This allows us to map, for the first time, the space of alternative implementations for these techniques when applied to data lakes, revealing the strengths and weaknesses of each option and delineating novel research directions.

Keywords

Data Exploration, Data Discovery, Data Lakes
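
To make the accuracy versus performance tradeoff described in the abstract concrete, the following is a minimal sketch that compares exact k-NN search with a random-hyperplane LSH index over unit-length vectors. It is an illustration only and not the implementation evaluated in the paper: random-hyperplane hashing is just one LSH family, and every name here (`normalize`, `exact_knn`, `HyperplaneLSH`, `recall`) is hypothetical. The vectors stand in for the column or row embeddings that a table-representation model would produce.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def exact_knn(query, vectors, k):
    """Brute-force search: score every vector, return the ids of the k most similar."""
    ranked = sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))
    return ranked[:k]


class HyperplaneLSH:
    """Single-table random-hyperplane LSH (an assumed, simplified variant)."""

    def __init__(self, dim, n_bits, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the hash key.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = {}
        self.vectors = []

    def _key(self, v):
        # The key records on which side of each hyperplane the vector falls.
        return tuple(dot(p, v) >= 0 for p in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets.setdefault(self._key(v), []).append(len(self.vectors) - 1)

    def query(self, q, k):
        # Only vectors in the query's bucket are scored, which is faster than
        # a full scan but may miss true neighbours that hashed elsewhere.
        candidates = self.buckets.get(self._key(q), [])
        ranked = sorted(candidates, key=lambda i: -dot(q, self.vectors[i]))
        return ranked[:k]


def recall(approx, exact):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)
```

In this sketch, increasing `n_bits` shrinks the buckets, so queries score fewer candidates but `recall` against `exact_knn` tends to drop; this is a small-scale instance of the accuracy versus performance tradeoff that the abstract refers to.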
