Structured Evaluation of Synthetic Tabular Data
CoRR (2024)
Abstract
Tabular data is common yet typically incomplete, small in volume, and
access-restricted due to privacy concerns. Synthetic data generation offers
potential solutions. Many metrics exist for evaluating the quality of synthetic
tabular data; however, we lack an objective, coherent interpretation of the
many metrics. To address this issue, we propose an evaluation framework with a
single, mathematical objective that posits that the synthetic data should be
drawn from the same distribution as the observed data. Through various
structural decompositions of the objective, this framework allows us to reason,
for the first time, about the completeness of any set of metrics, and it unifies
existing metrics, including those that stem from fidelity considerations,
downstream applications, and model-based approaches. Moreover, the framework
motivates model-free baselines and a new spectrum of metrics. We evaluate
structurally informed synthesizers and synthesizers powered by deep learning.
Using our structured framework, we show that synthetic data generators that
explicitly represent tabular structure outperform other methods, especially on
smaller datasets.