仓库源文，站点原文

文档结构化是很暧昧的词, 它可能的意思很多, 不过本文只考虑目录抽取.

结构化文档由各级章节标题和段落等逻辑结构组成, 比如对 HTML 来说, 逻辑结构包括 <body> <h1> <p> 等标签. 文档结构化任务基本等价于目录抽取, 因为识别出标题后剩下的就是段落. 这个领域可供搜索的关键词包括 document structure recognition, document layout analysis (版面分析) 等. 意义: 便于抽取信息, 高度定制化的展示等.

<body>

<h1>Structured document</h1>
<p>A <strong class="selflink">structured document</strong> 
is an <a href="/wiki/Electronic_document" title="Electronic document">electronic document</a> 
where some method of 
<a href="/wiki/Markup_language" title="Markup language">markup</a> 
is used to identify the whole and parts of the document 
as having various meanings beyond their formatting.</p>

</body>

PDF 解析

Java 的 Apache PDFBox 可以得到 pdf 底层信息, 然后进行推断重建文档结构. 可以得到的信息包括

每个字的盒子 (bounding box): 左上角 (不确定, 总之是某个角) 横纵坐标, 盒子高度和宽度, 颜色, 字体, 方向 (页面是否旋转) 等.
线条信息, 从而可以连接组合成有线表.

纯 PDF 解析版面分析算法的例子可以参考这里, 连字成行, 连行成段.

开源工具 pdf2htmlEX 可以高质量地还原 pdf 排版, 但是很慢而且几乎无法从其生成的 html 中读出结构信息, 可定制性低, 难以后续处理, 文件大.

起一个坏名字. (2022). PDF 系列科普视频.
北京庖丁科技. (2020, Mar 24). 为什么说从 PDF 中提取文本是一件困难的事?.
北京庖丁科技. (2020, Dec 7). 电子文档全景结构识别漫谈.

Python 包

Pdfminer.six 不太能用.
pdfplumber: Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging. Works best on machine-generated, rather than scanned, PDFs. 他实现表格抽取的逻辑可以参考冰焰虫子的博客.

难点

缺乏标注数据.
排版依赖于 PDF 解析, 后者本身就是众所周知的难题.

工业界实践

好像没有直接的识别标题层次.

BMES 序列标注@达观数据: 只是段落识别, 并没有提到怎么识别标题以及标题层次
- 把段落解析任务类比为分词任务, 把每一行看成 "字", 每个段落看成 "词".
- 分词要结合每个字符上下文的语义信息做概率判断, 但是在分段的时候, 每一行的语义信息太过丰富, 在实践上不好用. 不过分段有自己的特征, 比如缩进, 结尾标点, 独立成行的标题可能会有数字开头, 这些都可以作为特征输入到模型里去.
- 参考达观数据高级技术专家分享: 如何在金融行业应用文档结构化? (让听众思考怎么解决 badcase, 但是没有说他们的方案)

CV 版面分析@百度文库: 百度文库为了多平台展示, 要把 pdf 的版式排版解析为流式排版.
- 最开始用的也是通用方法, 连字成行再成段; 对于复杂排版 (多栏, 图文排绕等) 需要 CV 分析版面, 把页面切分成多个区域, 每个区域再用通用方案.
- 百度文库里 word 文档很多, 专门方案: 先把 doc (二进制) 格式转化为 docx (ooxml) 格式, 再对后者解析.
- 参考文档内容结构化在百度文库的技术探索
预训练模型 LayoutLM@Amazon
- Although traditional text models (like RNNs, LSTMs, or even n-gram based methods) account for the order of words in the input, many are quite rigid in their approach of treating text as a linear sequence of words. In practice, documents aren't simple strings of words, but rich canvases with features like headings, paragraphs, columns, and tables. Because the input position encoding of models like BERT can incorporate this position information, we can boost performance by training models that learn not just from the content of the text, but the size and placement too.
- 用到了微软亚研出品的 LayoutLM, 简单地说就是把 BERT 的 positional encoding 替换为了基于每个字 bounding box 四个坐标的编码, 除此之外还有 image embedding.
- 参考 Bring structure to diverse documents with Amazon Textract and transformer-based models on Amazon SageMaker
- LayoutLM 已经发展了很多代版本
  - 微软亚洲研究院提出多语言通用文档理解预训练模型 LayoutXLM (包括中文)
  - Github. 模型巨大 (1.5G), 目前没有看到国内实践经验.
  - 百度开源 ERNIE-Layout (2022/10/19)
阿里达摩院的展示. 给的实例都太简单了, 案例五效果明显有问题.
同花顺 iFinD 的展示, 对于数字开头的标题, 截图中肉眼可见的有错漏.

其他

文档结构化任务数据集介绍
微信图片翻译技术优化之路
Déjean, H., & Meunier, J.-L. (2010). Reflections on the inex structure extraction competition. Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, 301–308.
PaddleOCR新发版v2.2：开源版面分析与轻量化表格识别
CCKS测评任务5 基于 OpenCV 和 Faster R-CNN 的金融财报抽取
比如 Google 的 Document AI 重点还是解析表单, 证件等.

仓库源文，站点原文

PDF 解析

Python 包

难点

相关比赛

工业界实践

其他