
Extracting Text from PDFs and Images for Large Language Models

Published: 2023-11-30 09:29:09


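Motivation

Large language models have taken the internet by storm, and as a result many people pay too little attention to the most important part of using these models: high-quality data! This article presents some techniques for efficiently extracting text from any type of document.

Python libraries

This article focuses on the Pytesseract, easyOCR, PyPDF2, and LangChain libraries. The experimentation data is a single-page PDF file, available at the following link: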

https://github.com/keitazoumana/Experimentation-Data/blob/main/Experimentation_file.pdf

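Because Pytesseract and easyOCR operate on images, the PDF file must be converted to images before any content is extracted. The conversion can be done with pypdfium2, a powerful library for working with PDF files, which is installed as follows: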

pip install pypdfium2

The following function takes a PDF as input and returns every page of the PDF as a list of images.

import pypdfium2 as pdfium
from io import BytesIO

def convert_pdf_to_images(file_path, scale=300/72):
    # Render every page and return a list of {page_index: jpeg_bytes} dicts
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = list(range(len(pdf_file)))

    # PdfDocument.render() is the batch-rendering API of pypdfium2 4.x;
    # other releases may require rendering page by page instead
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )

    final_images = []
    for i, image in zip(page_indices, renderer):
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        final_images.append({i: image_byte_array.getvalue()})

    return final_images
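The default scale=300/72 above follows from PDF page sizes being expressed in points (1 point = 1/72 inch): a scale of 1 corresponds to 72 DPI, so 300/72 renders at roughly 300 DPI. The helper below is a minimal sketch of that arithmetic; the name rendered_size is hypothetical and is not part of pypdfium2.

```python
def rendered_size(width_pt, height_pt, dpi=300):
    """Return the pixel size of a page rendered at `dpi`.

    PDF page sizes are given in points (1 point = 1/72 inch), so the
    scale factor handed to the renderer is dpi / 72.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points (8.5 x 11 inches)
print(rendered_size(612, 792))          # (2550, 3300)
print(rendered_size(612, 792, dpi=72))  # (612, 792)
```

Higher DPI values generally give OCR engines more pixels per character, at the cost of larger images and slower processing.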

Now we can use the display_images function to visualize every page of the PDF file.

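import matplotlib.pyplot as plt
from io import BytesIO
from PIL import Image

def display_images(list_dict_final_images):
    # Each list item is a one-entry dict {page_index: jpeg_bytes}
    all_images = [list(data.values())[0] for data in list_dict_final_images]

    for index, image_bytes in enumerate(all_images):
        image = Image.open(BytesIO(image_bytes))
        # Size the figure in proportion to the image (one inch per 100 pixels)
        plt.figure(figsize=(image.width / 100, image.height / 100))
        plt.title(f"----- Page Number {index+1} -----")
        plt.imshow(image)
        plt.axis("off")
        plt.show()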

Combining the two functions above gives the following result:

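# Note: this rebinds the name convert_pdf_to_images from the function to its
# result (the list of page images); the later snippets rely on that.
convert_pdf_to_images = convert_pdf_to_images('Experimentation_file.pdf')
display_images(convert_pdf_to_images)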

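Figure: The PDF visualized in image format

A deeper look at the text extraction process

1. Pytesseract

Pytesseract (Python-tesseract) is a Python OCR tool for extracting textual information from images. It can be installed with the following pip command: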

pip install pytesseract

The following helper function uses Pytesseract's image_to_string() function to extract text from the input images.

from io import BytesIO
from PIL import Image
from pytesseract import image_to_string

def extract_text_with_pytesseract(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for image_bytes in image_list:
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)

    # Separate the text of consecutive pages with a newline
    return "\n".join(image_content)

The text can be extracted with the extract_text_with_pytesseract function as follows:

text_with_pytesseract = extract_text_with_pytesseract(convert_pdf_to_images)
print(text_with_pytesseract)

Running the code above successfully produces the following result:

This document provides a quick summary of some of Zoumana’s article on Medium.
It can be considered as the compilation of his 80+ articles about Data Science, Machine Learning and
Machine Learning Operations.
...

Pytesseract was able to extract the content of the image. Here is how it managed to do it.

Pytesseract starts by identifying rectangular shapes within the input image from top-right to bottom-right. It then extracts the content of each of those regions, and the final result is the concatenation of the extracted contents. This approach works very well for column-based PDFs and image documents.

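2. easyOCR

easyOCR is another open-source Python library for optical character recognition, and it currently supports extracting text in more than 80 languages. easyOCR requires PyTorch and OpenCV, which can be installed with the following command: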

!pip install opencv-python-headless==4.1.2.30

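How you install the PyTorch module may differ depending on your operating system, but all the instructions can be found on the official page. Now let's install the easyOCR library: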

!pip install easyocr

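Because easyOCR supports multiple languages, the language must be specified when processing a document. Languages are set through the Reader module as a list of language codes, for example fr for French and en for English. A detailed list of supported languages is available in the easyOCR documentation.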

from easyocr import Reader

# Load model for the English language
language_reader = Reader(["en"])

The text extraction process is implemented in the extract_text_with_easyocr function:

def extract_text_with_easyocr(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for image_bytes in image_list:
        image = Image.open(BytesIO(image_bytes))
        # readtext() returns one (bounding_box, text, confidence) tuple per detection
        raw_text = language_reader.readtext(image)
        raw_text = " ".join([res[1] for res in raw_text])
        image_content.append(raw_text)

    # Separate the text of consecutive pages with a newline
    return "\n".join(image_content)

The function above can be run as follows:

text_with_easy_ocr = extract_text_with_easyocr(convert_pdf_to_images)
print(text_with_easy_ocr)

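Figure: Results from easyOCR

Compared with Pytesseract, easyOCR appears to be less effective on this document. It reads the first two paragraphs well. However, instead of treating each text block as an independent unit, it reads the page line by line. For example, the string "Data Science section covers basic to advanced" from the first text block has been combined with "overfitting when training computer vision" from the second text block. This merging completely breaks the structure of the text and distorts the final result.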
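One way to reduce this line-by-line mixing is to use the bounding boxes that readtext() returns alongside each piece of text. The sketch below is an illustration rather than part of easyOCR: the function reorder_by_columns and its column_gap threshold are hypothetical, and the sample coordinates are synthetic values shaped like readtext() output, that is (bounding_box, text, confidence) with four [x, y] corner points starting at the top-left.

```python
def reorder_by_columns(results, column_gap=50):
    """Group OCR detections into columns, then read each column top to bottom.

    A detection joins the current column when its left edge is within
    `column_gap` pixels of that column's left edge; otherwise it starts
    a new column.
    """
    by_left = sorted(results, key=lambda r: r[0][0][0])
    columns = []
    for item in by_left:
        left = item[0][0][0]
        if columns and left - columns[-1]["left"] <= column_gap:
            columns[-1]["items"].append(item)
        else:
            columns.append({"left": left, "items": [item]})

    ordered = []
    for column in columns:
        # Within a column, sort by the top edge of each bounding box
        for item in sorted(column["items"], key=lambda r: r[0][0][1]):
            ordered.append(item[1])
    return " ".join(ordered)


# Synthetic two-column layout, listed in the row-by-row order an OCR pass might use
sample = [
    ([[10, 10], [200, 10], [200, 30], [10, 30]], "Data Science section", 0.99),
    ([[300, 10], [500, 10], [500, 30], [300, 30]], "overfitting when", 0.98),
    ([[10, 40], [200, 40], [200, 60], [10, 60]], "covers basic to advanced", 0.97),
    ([[300, 40], [500, 40], [500, 60], [300, 60]], "training computer vision", 0.96),
]

line_based = " ".join(text for _, text, _ in sample)
print(line_based)
print(reorder_by_columns(sample))
```

The first print shows the two columns interleaved, while the second keeps each column together. A fixed pixel threshold is a simplification; real layouts with indented or overlapping blocks would need a more careful grouping rule.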

3. PyPDF2

PyPDF2 is a Python library dedicated to PDF processing tasks such as retrieving text and metadata, merging, cropping, and more.

!pip install PyPDF2

The extraction logic is implemented in the extract_text_with_pyPDF function:

from PyPDF2 import PdfReader

def extract_text_with_pyPDF(PDF_File):
    pdf_reader = PdfReader(PDF_File)

    raw_text = ''
    for page in pdf_reader.pages:
        # Skip pages for which no text could be extracted
        text = page.extract_text()
        if text:
            raw_text += text

    return raw_text

text_with_pyPDF = extract_text_with_pyPDF("Experimentation_file.pdf")
print(text_with_pyPDF)

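Figure: Text extraction with the PyPDF2 library

The extraction is fast and accurate, and it even preserves the original font size. The main limitation of PyPDF2 is that it cannot effectively extract text from images.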
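Given that limitation, one possible pattern is to try the PDF's text layer first and fall back to OCR only when little or no text comes back. The sketch below illustrates that idea under stated assumptions: extract_with_fallback, its min_chars threshold, and the two stand-in extractors are all hypothetical and exist only to make the example runnable without any PDF or OCR library.

```python
def extract_with_fallback(pdf_path, text_extractor, ocr_extractor, min_chars=20):
    """Try the PDF text layer first; fall back to OCR when it is nearly empty.

    `text_extractor` and `ocr_extractor` are any callables that take a path
    and return a string. `min_chars` is an arbitrary illustrative threshold.
    """
    text = text_extractor(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    return ocr_extractor(pdf_path), "ocr"


# Stand-in extractors (hypothetical) so the sketch runs on its own
def fake_text_layer(path):
    return ""  # simulates a scanned PDF with no embedded text

def fake_ocr(path):
    return "text recovered by OCR"

result, method = extract_with_fallback("scanned.pdf", fake_text_layer, fake_ocr)
print(method, "->", result)
```

In the setting of this article, text_extractor could be extract_text_with_pyPDF, and ocr_extractor could be a small wrapper that converts the PDF to images and passes them to extract_text_with_pytesseract.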

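4. LangChain

LangChain's UnstructuredImageLoader and UnstructuredFileLoader modules can be used to extract text from images and from text/PDF files, respectively. Both options are explored in this section.

First, we need to install the langchain library as follows: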

!pip install langchain

(1) Extracting text from images

from langchain.document_loaders.image import UnstructuredImageLoader

The function that extracts the text is shown below:

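import os
import tempfile

def extract_text_with_langchain_image(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for image_bytes in image_list:
        # The loader takes a file path, so write the JPEG bytes to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
            tmp.write(image_bytes)
        try:
            documents = UnstructuredImageLoader(tmp.name).load()
        finally:
            os.remove(tmp.name)
        # Keep every document returned for this page instead of indexing by page number
        image_content.append("\n".join(doc.page_content for doc in documents))

    return "\n".join(image_content)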

Now we can extract the content:

text_with_langchain_image = extract_text_with_langchain_image(convert_pdf_to_images)
print(text_with_langchain_image)

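Figure: Text extraction with langchain's UnstructuredImageLoader

The library extracted the content of the image successfully and efficiently.

(2) Extracting text from PDFs

The implementation for extracting content from a PDF is shown below: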

from langchain.document_loaders import UnstructuredFileLoader

def extract_text_with_langchain_pdf(pdf_file):
    loader = UnstructuredFileLoader(pdf_file)
    documents = loader.load()
    # Separate the loaded documents with a newline
    pdf_pages_content = '\n'.join(doc.page_content for doc in documents)
    return pdf_pages_content

text_with_langchain_files = extract_text_with_langchain_pdf("Experimentation_file.pdf")
print(text_with_langchain_files)

Like the PyPDF2 module, the langchain module produces accurate results while preserving the original font size.

Figure: Text extraction with langchain's UnstructuredFileLoader
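The comparisons in this article are made by eye. As a rough complement, the outputs of two extractors can be compared numerically. The sketch below uses difflib from the Python standard library; the function name text_similarity is hypothetical, and the score is only a coarse character-level indicator, not a measure of OCR quality.

```python
from difflib import SequenceMatcher

def text_similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] between two extracted texts.

    Whitespace is normalized first so that differences in line breaks
    between extractors do not dominate the score.
    """
    norm_a = " ".join(text_a.split())
    norm_b = " ".join(text_b.split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# Short synthetic strings standing in for two extractors' outputs
print(text_similarity("Data Science\nsection", "Data  Science section"))
print(text_similarity("Data Science section", "Data Science secti0n"))
```

With the variables from the earlier sections, a call such as text_similarity(text_with_pyPDF, text_with_pytesseract) would give a quick sense of how closely two extraction methods agree on the same file.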
