Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction
CoRR (2024)
Abstract
Document collections of various domains, e.g., legal, medical, or financial,
often share some underlying collection-wide structure, which captures
information that can aid both human users and structure-aware models. We
propose to identify the typical structure of documents within a collection,
which requires capturing recurring topics across the collection while
abstracting over arbitrary header paraphrases, and grounding each topic to its
respective document locations. These requirements pose several challenges:
headers that mark recurring topics frequently differ in phrasing, certain
section headers are unique to individual documents and do not reflect the
typical structure, and the order of topics can vary between documents.
To address these challenges, we develop an unsupervised graph-based method that
leverages both inter- and intra-document similarities to extract the underlying
collection-wide structure. Our evaluations on three diverse domains in both
English and Hebrew indicate that our method extracts meaningful collection-wide
structure, and we hope that future work will leverage our method for
multi-document applications and structure-aware models.
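
To make the task concrete, the following is a minimal illustrative sketch and not the authors' method. It assumes headers are already extracted per document, uses token-level Jaccard overlap as a crude stand-in for a real similarity measure, and links only headers from different documents (inter-document similarity), so it does not model the intra-document similarities the abstract mentions. The function names, the `threshold` and `min_docs` parameters, and the sample headers are all hypothetical.

```python
from itertools import combinations


def tokens(header):
    """Lowercased bag of words; a crude stand-in for a real header encoder."""
    return frozenset(header.lower().split())


def jaccard(a, b):
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recurring_topics(docs, threshold=0.5, min_docs=2):
    """Group similar headers across documents (hypothetical helper).

    docs maps a document id to its ordered list of headers. Returns a list of
    clusters; each cluster is a list of (doc_id, position, header) tuples, so
    every topic stays grounded to its locations in the documents.
    """
    nodes = [
        (doc_id, pos, header)
        for doc_id, headers in docs.items()
        for pos, header in enumerate(headers)
    ]
    parent = list(range(len(nodes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge between headers of different documents when they are
    # similar enough, and merge the connected components.
    for i, j in combinations(range(len(nodes)), 2):
        if nodes[i][0] == nodes[j][0]:
            continue
        if jaccard(tokens(nodes[i][2]), tokens(nodes[j][2])) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, node in enumerate(nodes):
        clusters.setdefault(find(i), []).append(node)

    # Keep only topics that recur in at least min_docs documents, which drops
    # headers that are unique to a single document.
    return [
        members
        for members in clusters.values()
        if len({doc_id for doc_id, _, _ in members}) >= min_docs
    ]


docs = {
    "a": ["Terms of Payment", "Governing Law", "Special Annex"],
    "b": ["Governing Law", "Payment Terms"],
    "c": ["Law Governing the Agreement", "Payment Terms and Fees"],
}
topics = recurring_topics(docs)
```

In this toy collection the payment and governing-law headers form two clusters despite differing phrasing and order, while "Special Annex" is dropped because it appears in only one document. Lexical overlap cannot handle true paraphrases with no shared words, which is one reason a real system would need a stronger similarity signal.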