Towards Accurate and Efficient Document Analytics with Large Language Models
arXiv (2024)
Abstract
Unstructured data formats account for over 80%
and extracting value from such formats remains a considerable challenge. In
particular, current approaches for managing unstructured documents do not
support ad-hoc analytical queries on document collections. Moreover, Large
Language Models (LLMs) directly applied to the documents themselves, or on
portions of documents through a process of Retrieval-Augmented Generation
(RAG), fail to provide high accuracy query results, and in the LLM-only case,
additionally incur high costs. Since many unstructured documents in a
collection often follow similar templates that impart a common semantic
structure, we introduce ZenDB, a document analytics system that leverages this
semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document
collections. ZenDB efficiently extracts semantic hierarchical structures from
such templatized documents, and introduces a novel query engine that leverages
these structures for accurate and cost-effective query execution. Users can
impose a schema on their documents, and query it, all via SQL. Extensive
experiments on three real-world document collections demonstrate ZenDB's
benefits, achieving up to 30%
while maintaining or improving accuracy, and surpassing RAG-based baselines by
up to 61%