Detection of Redacted Text in Legal Documents

Ruben van Heusden, Aron de Ruijter, Roderick Majoor, Maarten Marx

LINKING THEORY AND PRACTICE OF DIGITAL LIBRARIES, TPDL 2023 (2023)

Abstract
We present a technique for automatically detecting redacted text in legal documents. It combines Optical Character Recognition (OCR) with morphological operations from the computer vision domain, which allows it to detect a wide variety of redaction block types with little to no training data. As this is a segmentation task, we evaluate the technique using the Panoptic Quality methodology: the algorithm obtains F1 scores of 0.79, 0.86 and 0.76 on black, colored and outlined redaction blocks respectively, and an F1 score of 0.62 on gray blocks. The total running time of the algorithm is two seconds on average, measured on a thousand pages from a government supplier, with roughly 98% of this time used by Tesseract and the conversion from PDF to PNG, and 2% by the detection algorithm itself. Detecting text redaction at scale is thus feasible, allowing a more or less objective measurement of this practice. The redacted text detection code and the manually labelled dataset created for evaluation are released via GitHub.
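To illustrate why morphological operations suit this task, the following is a minimal sketch of a morphological opening (erosion followed by dilation) on a tiny binary image. The abstract does not state which operations, kernel sizes or libraries the authors use, so the choice of an opening, the 3x3 square kernel, the toy `page` grid and all function names here are assumptions made purely for illustration. The idea shown is that thin strokes, such as text, disappear under the opening while a solid block survives.

```python
from typing import List

Grid = List[List[int]]  # 1 = dark pixel, 0 = background


def erode(img: Grid, k: int) -> Grid:
    """Keep a pixel only if the k x k window starting at it is entirely dark."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            if all(img[y + dy][x + dx] for dy in range(k) for dx in range(k)):
                out[y][x] = 1
    return out


def dilate(img: Grid, k: int) -> Grid:
    """Paint a k x k window for every dark pixel (the mirror of erode)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy in range(k):
                    for dx in range(k):
                        if y + dy < h and x + dx < w:
                            out[y + dy][x + dx] = 1
    return out


def opening(img: Grid, k: int) -> Grid:
    """Erosion then dilation: removes shapes that cannot contain a k x k square."""
    return dilate(erode(img, k), k)


# Hypothetical toy page: a solid 3x5 block, a one-pixel-wide stroke, and dots.
rows = [
    "000000000000",
    "011111000100",
    "011111000100",
    "011111000100",
    "000000000000",
    "001010100000",
]
page: Grid = [[int(c) for c in row] for row in rows]

cleaned = opening(page, 3)
for row in cleaned:
    print("".join(str(v) for v in row))
```

In a real pipeline one would normally call a library routine instead, for example OpenCV's `cv2.morphologyEx` with `cv2.MORPH_OPEN`; the pure-Python version above only serves to make the mechanics visible.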
Keywords
Text Redaction, Image Segmentation, Panoptic Quality
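For readers unfamiliar with Panoptic Quality, the sketch below computes it for one page under simplifying assumptions: redaction blocks are represented as axis-aligned boxes (the paper treats the task as segmentation, so real segments need not be rectangular), boxes within each set do not overlap, and a predicted box matches a gold box when their IoU exceeds 0.5, as in the standard definition of the metric. The function names and example boxes are hypothetical. Note that the recognition quality term is algebraically the F1 score over matched segments, which is presumably the kind of F1 the abstract reports.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def panoptic_quality(pred: List[Box], gold: List[Box]) -> Dict[str, float]:
    """Return segmentation quality (SQ), recognition quality (RQ) and PQ = SQ * RQ."""
    matched_gold = set()
    matched_ious: List[float] = []
    for p in pred:
        for j, g in enumerate(gold):
            if j in matched_gold:
                continue
            value = iou(p, g)
            if value > 0.5:  # with IoU > 0.5 a match is unique for non-overlapping boxes
                matched_ious.append(value)
                matched_gold.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred) - tp   # predictions with no gold match
    fn = len(gold) - tp   # gold blocks that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / denom if denom else 0.0  # identical to F1 = 2TP / (2TP + FP + FN)
    return {"sq": sq, "rq": rq, "pq": sq * rq}


# Hypothetical page: one good detection, one spurious detection, one missed block.
predicted = [(0, 0, 10, 10), (20, 0, 30, 10)]
annotated = [(0, 0, 10, 8), (50, 50, 60, 60)]
scores = panoptic_quality(predicted, annotated)
print(scores)
```

Splitting the score this way separates how many blocks were found (RQ) from how well the found blocks overlap the annotations (SQ).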