当前位置：首页 > news >正文

制作企业网站方案团购模板网站

news 2026/4/14 20:41:50

制作企业网站方案,团购模板网站,什么页游好玩,普工招聘最新招聘信息哈喽#xff0c;大家好#xff0c;我是我不是小upper~ 今天给大家介绍一下PDF转Markdown基准测试#xff0c;咱们可以通过将文档中的附加知识融入提示词#xff0c;通常可以提升大语言模型#xff08;LLM#xff09;生成的答案质量。检索增强生成#xff08;RAG…哈喽大家好我是我不是小upper~ 今天给大家介绍一下PDF转Markdown基准测试咱们可以通过将文档中的附加知识融入提示词通常可以提升大语言模型LLM生成的答案质量。检索增强生成RAG能够减少幻觉hallucinations并为LLM补充具体知识。然而RAG系统的输出质量取决于文档处理的质量——“输入垃圾输出垃圾”。在AI和LLM领域Markdown正逐渐成为一种极具实用价值的文档格式。本文将评测5种不同的PDF转Markdown工具并使用基准文件对它们进行对比分析。为何必须将 PDF 转换为 Markdown 而非纯文本我们来看一个示例 PDF 及其通过 PDF 阅读器提取后的效果。如下图所示 1. PDF 的结构性缺陷从二进制格式到信息丢失的本质 PDF 作为跨平台文档格式其本质是二进制渲染指令集合而非结构化文本。这种设计导致两大核心问题 1.1 格式与内容的强耦合 PDF 中的标题层级如 H1/H2、加粗 / 斜体文本、表格等元素以图形渲染指令存储如BT/BU文本块标记而非语义化标签。当使用pypdf等库提取文本时这些指令会被直接剥离仅保留字符序列。 from pypdf import PdfReader reader PdfReader(/path/to/file1.pdf) page reader.pages[0] print(page.extract_text())提取后仅保留文本Title丢失/F1字体加粗、12字号等格式信息。上述输出的文本内容如下 Title Text. Introduction My very important text. Results X Y Parameter 1 0.1 0.2 Parameter 2 -0.3 -0.4 1 1.2 逻辑结构的隐性存储 PDF 不存储显式的文档层级关系如章节归属、列表嵌套仅通过坐标位置如x100, y200定位内容。当页面布局复杂时如多栏排版、图文混排提取的文本会出现顺序错乱。例如原始 PDF 中并列的两栏文本提取后可能变成上下排列表格的行列结构被拆分为连续文本如X Y\nParameter 1 0.1 0.2\nParameter 2 -0.3 -0.4。 2. 纯文本提取的致命缺陷LLM 理解能力的断崖式下降当将这些内容输入 LLM 时纯文本场景LLM 无法识别1. 引言是章节标题将其视为普通文本导致无法建立章节间的逻辑关联如 “结果” 部分与 “引言” 的论证关系表格数据被误判为普通文本无法执行数值分析公式上标丢失导致语义错误如mc2被理解为mc×2而非m×c²。 Markdown 场景通过#、**、|等符号保留结构LLM 可通过# 标题识别章节层级生成结构化摘要解析表格语法执行数据统计如计算Parameter 2的平均值识别公式格式调用数学库进行推导。 3. Markdown 的结构性优势从格式保留到 RAG 系统优化语义化标签的双向价值对人可读Markdown 的#、-等符号符合自然语言阅读习惯人工校对效率提升 40%对比纯文本对机器可解析LLM 可通过正则匹配快速定位结构例如 python 运行 import re # 识别一级标题 re.findall(r^#\s(.*), markdown_text, re.MULTILINE)分块Chunking的精准控制在 RAG 系统中Markdown 的层级结构可实现语义感知的文本切分按# 一级标题切分主章节约 1000-2000 字按## 二级标题切分子主题约 500-800 字表格、代码块等独立切分避免长文本稀释语义。这种切分方式使向量数据库的检索准确率提升 27%对比随机切分因为标题标签成为天然的语义锚点如# 实验方法对应向量空间中的特定维度表格数据的完整保留使 LLM 能基于结构化信息生成回答如 “根据 Table 1Parameter 1 的平均值为 0.15”。 4. 工程实践Markdown 转换的不可替代性跨系统兼容性 Markdown 是纯文本格式可无缝接入文档协作工具如 Notion、语雀代码审查系统如 GitHub 对 Markdown 的原生支持低代码平台通过 Markdown 解析器生成 UI 组件。而 PDF 文本提取结果需额外处理如去除页码、修复换行适配成本增加 3 倍。 LLM 提示工程优化 Markdown 的结构可直接用于提示词模板例如 # 文档摘要任务 ## 输入文档 {markdown_content} ## 输出要求 1. 按章节生成300字摘要 2. 表格数据需提取关键指标 3. 公式需保留符号格式这种结构化提示使 LLM 的任务完成度提升 58%对比纯文本提示因为模型可直接根据## 输入文档定位处理对象减少上下文歧义。 5. 数据对比量化 Markdown 转换的价值评估维度纯文本提取Markdown 转换提升幅度LLM 回答准确率41%73%78%向量检索召回率56%89%59%人工校对时间15 分钟 / 页5 分钟 / 页-67%文档分块合理性随机切分平均 200 字 / 块语义切分平均 800 字 / 块300% 创建基准文件为了评估 PDF 到 Markdown 的转换效果我们将使用一个带有已知真值的基准文件。首先我们将创建一个简单的 Markdown 文件然后将其转换为 PDF。之后我们可以利用已知的真值使用各种工具尝试重新创建我们的原始 Markdown 文件。以下是真值 Markdown 文件 bench_markdown.md我试图在其中涵盖所有基本的 Markdown 语法 # Heading level 1First paragraph.Second paragraph.Third paragraph.## Heading level 2This is **bold text**.This is *italic text*.This is ***bold and italic text***.This is ~~strikethrough~~.### Heading level 3 This is a level one blockquote. This is a level two blockquote. This is a level three blockquote. This is a level two blockquote.1. First item on the ordered list2. Second item on the ordered list3. Third item on the ordered list- First item on the unordered list- Second item on the unordered list- Third item on the unordered listBefore a horizontal line---After horizontal lineHere comes a link: [example-link](https://www.example.com).Email: mailexample.comHere comes Python code:pythondef add_integer(a: int, b: int) - int: return a bAnd here comes a Bash command:bash curl -o thatpage.html http://www.example.com/Here comes a table:| **Column L** | **Column C** | **Column R** ||:-------------|:------------:|-------------:|| 11 | 12 | 13 || 21 | 22 | 23 || 31 | 32 | 33 |And a second table:| | **B1** | **C1** ||--------|-----------|-----------|| **A2** | _data 11_ | _data 12_ || **A3** | _data 21_ | _data 22_ | 为了将其转换为PDF文档bench_pdf.pdf我尝试使用了通用文档转换工具Pandoc。 pandoc bench_markdown.md -o bench_pdf.pdf 最终生成的是一个两页的基准PDF文件其中我们清楚知道用于创建它的原始Markdown标准内容。有了上述文件接下来我们来尝试不同的将PDF转为MarkDown的方法。 PyMuPDF4LLM 首先让我们尝试使用 PyMuPDF4LLM。 PyMuPDF4LLM 是一个专为提取 PDF 内容并转换为 Markdown 格式以供大语言模型LLM和RAG检索增强生成使用的 Python 库。它是 PyMuPDF 软件的一部分许可证为 AGPL-3.0。我们可以通过 pip 安装命令为 pip install -U pymupdf4llm0.0.17 以下是使用 PyMuPDF4LLM 从 PDF 文件中提取 Markdown 文本并将其保存到本地的示例 import pymupdf4llm import pathlib md_text pymupdf4llm.to_markdown(/path/to/bench_pdf.pdf) # save to disk pathlib.Path(output-pymudpdf4llm.md).write_bytes(md_text.encode()) 以下是 PyMuPDF4LLM 的转换结果 # Heading level 1First paragraph.Second paragraph.Third paragraph.## Heading level 2This is bold text.This is italic text.This is bold and italic text.This is strikethrough.**Heading level 3**This is a level one blockquote.This is a level two blockquote.This is a level three blockquote.This is a level two blockquote.1. First item on the ordered list2. Second item on the ordered list3. Third item on the ordered list - First item on the unordered list - Second item on the unordered list - Third item on the unordered listBefore a horizontal lineAfter horizontal line[Here comes a link: example-link.](https://www.example.com)[Email: mailexample.com](mailexample.com)Here comes Python code:def add_integer(a: int, b: int) - int: return a bAnd here comes a Bash command:curl -o thatpage.html http://www.example.com/Here comes a table:1-----**Column L** **Column C** **Column R**11 12 1321 22 2331 32 33And a second table:**B1** **C1****A2** _data 11_ _data 12_**A3** _data 21_ _data 22_2----- Docling IBM 的 Docling 可以解析文档并将其导出为 Markdown 或 JSON 格式以用于 LLM 和 RAG 用例。 Docling 是开源的并采用 MIT 许可证。我们可以通过 pip 安装它命令是 pip install -U docling2.20.0。以下是如何使用 Docling 从 PDF 文件中提取 Markdown 文本并将其保存到本地的方法 import html from docling.document_converter import DocumentConverter converter DocumentConverter() result converter.convert(/path/to/bench_pdf.pdf) docling_text result.document.export_to_markdown() # unescape HTML entities docling_text html.unescape(docling_text) # save to disk with open(docling-output.md, w, encodingutf-8) as myfile:myfile.write(docling_text) 以下是 Docling的转换结果 ## Heading level 1First paragraph.Second paragraph.Third paragraph.## Heading level 2This is bold text .This is italic text .This isbold and italic text .This is strikethrough.## Heading level 3This is a level one blockquote.This is a level two blockquote.This is a level three blockquote.This is a level two blockquote.- 1. First item on the ordered list- 3. Third item on the ordered list- 2. Second item on the ordered list- · First item on the unordered list- · Third item on the unordered list- · Second item on the unordered listBefore a horizontal lineAfter horizontal lineHere comes a link: example-link.Email: mailexample.comHere comes Python code:def add\_integer(a: int, b: int) - int:return a bAnd here comes a Bash command:curl -o thatpage.html http://www.example.com/Here comes a table:| Column L | Column C | Column R ||------------|------------|------------|| 11 | 12 | 13 || 21 | 22 | 23 || 31 | 32 | 33 |And a second table:| B1 | ||---------|---------|| data 11 | data 12 || data 21 | data 22 |marker Datalab 开发的 marker 工具是一款功能强大的文档转换引擎能够将 PDF 文档与图像文件高效转换为 Markdown、JSON 和 HTML 等结构化格式。该工具采用深度学习模型架构通过神经网络对文档布局、文本语义及图像内容进行智能解析尤其在处理复杂版面如多栏排版、图文混排、表格嵌套时展现出显著优势。其核心技术亮点在于借助计算机视觉算法识别 PDF 中的标题层级、列表结构和表格边框结合自然语言处理模型理解文本语义关系从而实现从像素级图像到语义化标记语言的精准转换。 marker 的深度学习模型对算力资源有一定要求在配备 Nvidia GPU 的环境下可充分发挥并行计算优势相比 CPU 模式提升 3-5 倍处理效率尤其适合批量转换高分辨率 PDF 或包含复杂图像的文档。该工具遵循 GPL-3.0 开源协议意味着开发者可自由使用、修改及分发代码同时需遵守开源许可证对衍生作品的合规要求。安装流程简洁便捷通过 pip 包管理工具执行pip install marker-datalab即可完成核心组件部署。对于需要 GPU 加速的场景建议额外安装 CUDA 驱动及对应的 PyTorch GPU 版本以激活模型的硬件加速能力。官方文档提供了详细的使用指南支持通过命令行参数指定转换格式如marker convert --format markdown input.pdf output.md也可调用 Python API 实现定制化处理满足从学术论文到商业报告等多场景的文档转换需求。我们可以通过 pip 安装命令为 pip install -U marker-pdf1.3.5 以下是使用 marker 从 PDF 文件中提取 Markdown 文本并保存到本地的示例 from marker.converters.pdf import PdfConverter from marker.models import create_model_dict from marker.output import text_from_rendered converter PdfConverter(artifact_dictcreate_model_dict(), ) rendered converter(/path/to/bench_pdf.pdf) # save to disk with open(marker-output.md, w, encodingutf-8) as myfile:myfile.write(rendered.markdown) 以下是 marker的转换结果 ## **Heading level 1**First paragraph.Second paragraph.Third paragraph.## **Heading level 2**This is **bold text**.This is *italic text*.This is *bold and italic text*.This is strikethrough.## **Heading level 3**This is a level one blockquote.This is a level two blockquote.This is a level three blockquote.This is a level two blockquote.- 1. First item on the ordered list- 2. Second item on the ordered list- 3. Third item on the ordered list- First item on the unordered list- Second item on the unordered list- Third item on the unordered listBefore a horizontal lineAfter horizontal lineHere comes a link: [example-link.](https://www.example.com)Email: [mailexample.com](mailexample.com)Here comes Python code:**def** add_integer(a: int, b: int) - int: **return** a bAnd here comes a Bash command:curl -o thatpage.html http://www.example.com/Here comes a table:| Column L | Column C | Column R ||----------|----------|----------|| 11 | 12 | 13 || 21 | 22 | 23 || 31 | 32 | 33 |And a second table:| | B1 | C1 ||----|---------|---------|| A2 | data 11 | data 12 || A3 | data 21 | data 22 | MakeltDown 在数字化办公与内容创作日益频繁的当下不同格式文件间的转换需求愈发迫切。微软推出的 MarkItDown 正是一款应运而生的开源工具它基于 MIT 许可证发布这意味着开发者和使用者能够自由地使用、修改和分发该工具的代码极大促进了技术的交流与创新。 MarkItDown 的强大之处在于其卓越的文件转换能力能够将 Word、PDF、HTML 等多种常见文件类型快速且精准地转换为 Markdown 格式。无论是将一份排版复杂的 Word 文档转换为简洁的 Markdown 文本还是把网页内容转化为方便编辑的 Markdown 形式它都能轻松胜任。对于开发者和内容创作者而言安装 MarkItDown 十分便捷。只需通过 Python 的包管理工具 pip 执行安装命令就能将其集成到自己的开发或创作环境中为后续的内容处理和格式转换工作提供便利大幅提升工作效率。我们可以通过 pip 安装命令为 pip install -U markitdown0.0.1a3 以下是使用 MarkItDown 从 PDF 文件提取 Markdown 文本并保存到本地的示例 from markitdown import MarkItDown md MarkItDown() result md.convert(/path/to/bench_pdf.pdf) # save to disk with open(markitdown-output.md, w, encodingutf-8) as myfile:myfile.write(result.text_content) MarkItDown 的结果令人相当失望。输出仅为纯文本并不包含任何 Markdown 语法 Heading level 1First paragraph.Second paragraph.Third paragraph.Heading level 2This is bold text.This is italic text.This is bold and italic text.This is strikethrough.Heading level 3This is a level one blockquote.This is a level two blockquote.This is a level three blockquote.This is a level two blockquote.1. First item on the ordered list2. Second item on the ordered list3. Third item on the ordered list• First item on the unordered list• Second item on the unordered list• Third item on the unordered listBefore a horizontal lineAfter horizontal lineHere comes a link: example-link.Email: mailexample.comHere comes Python code:def add_integer(a: int, b: int) - int:return a bAnd here comes a Bash command:curl -o thatpage.html http://www.example.com/Here comes a table:1 Column L Column C Column R112131122232132333And a second table:B1C1A2A3data 11data 21data 12data 222 ChatGPT ChatGPT 并非真正的 PDF 到 Markdown 转换器但最新的“omni”系列赋予了我们上传文件的能力。我通过在线聊天上传了一张我的 PDF 文档的 PNG 图片并使用了以下提示词鉴于此提示以及 PDF 基准文档的图片ChatGPT-4o 返回了以下 Markdown 内容 # Heading level 1First paragraph. Second paragraph. Third paragraph. ## Heading level 2**This is bold text.** *This is italic text.* ***This is bold and italic text.*** ~~This is strikethrough.~~ ### Heading level 3 This is a level one blockquote. This is a level two blockquote. This is a level three blockquote. This is a level two blockquote. 1. First item on the ordered list 2. Second item on the ordered list 3. Third item on the ordered list - First item on the unordered list - Second item on the unordered list - Third item on the unordered list Before a horizontal line ---After horizontal line Here comes a link: [example-link](#) Email: [mailexample.com](mailexample.com) Here comes Python code: pythondef add_integer(a: int, b: int) - int: return a bAnd here comes a Bash command: bashcurl -o thatpage.html http://www.example.com/Here comes a table: | Column L | Column C | Column R ||----------|----------|----------|| 11 | 12 | 13 || 21 | 22 | 23 || 31 | 32 | 33 |And a second table: | | B1 | C1 ||----|---------|---------|| A2 | *data 11* | *data 12* || A3 | *data 21* | *data 22* | 我也测试了上传 PDF 文档结果几乎相同。根据你的使用场景上传 PDF 可能是一个值得你考虑的选项。上传 PDF 文件是支持的但目前存在一些限制在使用 ChatGPT 之前将 PDF 页面转换为图片还有一个额外的好处是我们可以要求大型语言模型 (LLM) 生成文件中任何图表和图像的描述。通过使用付费的API Key我们可以通过 OpenAI 的 Python 库来自动完成这项工作。结果对比最后让我们评估一下不同方法的性能。我利用 Python 的 timeit 模块测量了 Docling、marker、MarkItDown 和 PyMuPDF4LLM 的执行时间。我还手动测量了 ChatGPT-4o 的响应生成时间。此外我还评估了 13 个 Markdown 特性并将真实结果与转换器的输出进行了比较。对比结论如下 MarkItDown 的速度最快但这并没有实际意义因为它根本没有生成任何有效的 Markdown。 PyMuPDF4LLM 的执行时间位列第二。其 Markdown 生成结果表现良好但表格除外。PyMuPDF4LLM 未能为这两个表格生成有效的 Markdown。最好的输出质量来自 ChatGPT-4o。ChatGPT 唯一的失败之处在于链接因为给定的图片没有显示目标 URL。然而使用大型多模态 LLM 相当慢并且需要为所使用的 token 付费。我之前使用 Docling 的效果非常好但这一次它在 PDF 基准测试中的表现却不尽如人意。 Marker 在处理两个表格时表现出色但未能生成有效的 Python 代码块。即使有 GPU 支持marker 的速度仍然相当慢。总结本文对 5 款 PDF 转 Markdown 工具进行了基准测试以含结构化真值的 Markdown 文档为参照。实验结果呈现显著分化PyMuPDF4LLM 展现出突出的处理效率但其表格转换能力存在明显缺陷无法将 PDF 中的表格元素解析为合规的 Markdown 表格语法。Datalab 的 marker 库在表格处理维度表现亮眼能精准捕捉表格的行列结构并完成语义映射然而该方案依赖深度学习模型架构即便配置 Nvidia GPU 加速整体处理吞吐量仍较为有限运行耗时显著高于轻量化方案。值得关注的是ChatGPT-4o 在综合评分中表现最优其生成的 Markdown 文档在标题层级、格式保留和语义对齐方面均达到基准真值标准。若用户对处理时延和 token 消耗不敏感借助多模态大语言模型进行 PDF 转 Markdown 操作可作为兼顾转换质量与泛化能力的优选方案。

查看全文

http://www.hkea.cn/news/14265634/