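Tutorial series | Building a PDF Search Engine with Jina, Part 2
In the previous article we looked at how to break PDF files into usable chunks so that we can build a search engine on top of them.
Encoding the sentences
Depending on the encoder, we can search image-to-image, text-to-image, image-to-text, or text-to-text.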
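To make the idea of searching across modalities concrete, here is a minimal sketch that assumes text and images have already been mapped into one shared vector space. The embeddings, the file names, and the `search` helper are all hypothetical and hand-made for illustration; a real encoder would produce the vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: keys are (modality, label), values are toy vectors.
index = {
    ("text", "a photo of a cat"): [0.9, 0.1, 0.0],
    ("image", "cat.png"): [0.8, 0.2, 0.1],
    ("text", "quarterly revenue table"): [0.0, 0.2, 0.9],
    ("image", "chart.png"): [0.1, 0.1, 0.8],
}


def search(query_vec, modality=None, top_k=2):
    """Rank indexed items by similarity, optionally restricted to one modality."""
    candidates = [
        (key, cosine(query_vec, vec))
        for key, vec in index.items()
        if modality is None or key[0] == modality
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_k]


# Pretend this vector came from encoding the text query "cat".
text_query = [0.85, 0.15, 0.05]
best_image = search(text_query, modality="image", top_k=1)[0][0]  # text-to-image
best_text = search(text_query, modality="text", top_k=1)[0][0]    # text-to-text
```

Because every item lives in the same space, the same `search` call covers text-to-text and text-to-image; only the modality filter changes.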
1. Encoders on Jina Hub that target a single modality:
Text: TransformerTorchEncoder, TransformerSentenceEncoder, SpacyTextEncoder
Images: ImageTorchEncoder
Adding the encoder Executor
1. The previous Flow
from docarray import DocumentArray
from executors import ChunkSentencizer, ChunkMerger, ImageNormalizer
from jina import Flow

# Load every PDF under data/ as a Document
docs = DocumentArray.from_files("data/*.pdf", recursive=True)

flow = (
    Flow()
    .add(uses="jinahub+sandbox://PDFSegmenter", name="segmenter")
    .add(uses=ChunkSentencizer, name="chunk_sentencizer")
    .add(uses=ChunkMerger, name="chunk_merger")
    .add(uses=ImageNormalizer, name="image_normalizer")
)

with flow:
    indexed_docs = flow.index(docs)
2. The current Flow
from docarray import DocumentArray
from executors import ChunkSentencizer, ChunkMerger, ImageNormalizer
from jina import Flow

docs = DocumentArray.from_files("data/*.pdf", recursive=True)

flow = (
    Flow()
    .add(uses="jinahub+sandbox://PDFSegmenter", name="segmenter")
    .add(uses=ChunkSentencizer, name="chunk_sentencizer")
    .add(uses=ChunkMerger, name="chunk_merger")
    .add(uses=ImageNormalizer, name="image_normalizer")
    .add(uses="jinahub+sandbox://CLIPEncoder", name="encoder")
)

with flow:
    indexed_docs = flow.index(docs)
Failure analysis
1. Overview of the failures
(1) encoder/rep-0@113300[E]:UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7fdc143e9810>')
(2) The summaries show no embedding content.
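2. Cause of the failure
The encoder appears to have been working on the top-level Documents (the raw PDFs) rather than on their chunks, so it could not read them as images and no chunk received an embedding.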
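The wording "cannot identify image file" is what Pillow reports when it is handed bytes it does not recognise as an image. One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that the encoder received a blob that was not an image at all, such as a raw PDF. The sketch below uses a hypothetical helper, `sniff_format`, to show how the leading bytes of a blob reveal its type.

```python
def sniff_format(blob: bytes) -> str:
    """Guess a file type from its leading magic bytes (hypothetical helper)."""
    if blob.startswith(b"%PDF-"):
        return "pdf"
    if blob.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if blob.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


# Toy blobs: only the first bytes matter for this check.
pdf_blob = b"%PDF-1.7\n% rest of the file"
png_blob = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

pdf_kind = sniff_format(pdf_blob)
png_kind = sniff_format(png_blob)
```

A blob that sniffs as `pdf` cannot be decoded by an image encoder, which would be consistent with both symptoms listed above.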
3. Solution
flow = (
    Flow()
    .add(uses="jinahub+sandbox://PDFSegmenter", name="segmenter")
    .add(uses=ChunkSentencizer, name="chunk_sentencizer")
    .add(uses=ChunkMerger, name="chunk_merger")
    .add(uses=ImageNormalizer, name="image_normalizer")
    .add(
        uses="jinahub+sandbox://CLIPEncoder",
        name="encoder",
        uses_with={"traversal_paths": "@c"},  # encode the chunks
    )
)
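To illustrate what the `traversal_paths` setting changes, here is a toy sketch. `ToyDoc`, `select`, and `toy_encode` are hypothetical stand-ins, not the DocArray or CLIPEncoder implementation. It assumes that "@r" selects the top-level Documents, that "@c" selects their direct chunks, and that "@r" is the default.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document with nested chunks."""
    text: str = ""
    embedding: Optional[List[float]] = None
    chunks: List["ToyDoc"] = field(default_factory=list)


def select(docs, path):
    """Toy traversal: '@r' returns the docs themselves, '@c' their direct chunks."""
    if path == "@r":
        return list(docs)
    if path == "@c":
        return [chunk for doc in docs for chunk in doc.chunks]
    raise ValueError(f"unsupported path in this sketch: {path}")


def toy_encode(docs, traversal_paths="@r"):
    """Attach a placeholder embedding to whichever level the path selects."""
    for doc in select(docs, traversal_paths):
        doc.embedding = [float(len(doc.text))]  # placeholder, not a real model


# A "PDF" whose useful content lives only in its chunks.
pdf = ToyDoc(chunks=[ToyDoc(text="first passage"), ToyDoc(text="second passage")])
toy_encode([pdf], traversal_paths="@c")
```

With "@c" the chunks receive embeddings and the root is left alone; with the assumed default "@r" it would be the other way round, which matches the empty summaries seen earlier.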
At this point, given a sentence as input, we can retrieve the matching passages.
with flow:
    indexed_docs = flow.index(docs, show_progress=True)

# See summary of indexed Documents
indexed_docs.summary()

# See summary of all the chunks of indexed Documents
indexed_docs[0].chunks.summary()
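Storing the chunks in an index
In a real-world application you could also use a Weaviate or Qdrant backend, or another, more efficient indexer from Jina Hub (such as AnnLiteIndexer).
One thing to note: we need to keep working at the chunk level, so we add one parameter: traversal_right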
flow = (
    Flow()
    .add(uses="jinahub+sandbox://PDFSegmenter", name="segmenter")
    .add(uses=ChunkSentencizer, name="chunk_sentencizer")
    .add(uses=ChunkMerger, name="chunk_merger")
    .add(uses=ImageNormalizer, name="image_normalizer")
    .add(
        uses="jinahub+sandbox://CLIPEncoder",
        name="encoder",
        uses_with={"traversal_paths": "@c"},
    )
    .add(
        uses="jinahub://SimpleIndexer",
        install_requirements=True,
        name="indexer",
        uses_with={"traversal_right": "@c"},  # index at the chunk level
    )
)
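As a rough picture of what a chunk-level index does, the sketch below stores chunk embeddings in SQLite and ranks them by brute-force cosine similarity. `ToySqliteIndex` is a hypothetical illustration only; it does not reproduce SimpleIndexer's internals, and the table layout is an assumption.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySqliteIndex:
    """Hypothetical chunk store: one row per chunk, embedding kept as JSON."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks "
            "(id TEXT PRIMARY KEY, text TEXT, embedding TEXT)"
        )

    def index(self, chunk_id, text, embedding):
        """Insert or overwrite a single chunk."""
        self.conn.execute(
            "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
            (chunk_id, text, json.dumps(embedding)),
        )
        self.conn.commit()

    def search(self, query_embedding, top_k=3):
        """Return up to top_k (id, text, score) tuples, best match first."""
        rows = self.conn.execute("SELECT id, text, embedding FROM chunks").fetchall()
        scored = [
            (cosine(query_embedding, json.loads(emb)), chunk_id, text)
            for chunk_id, text, emb in rows
        ]
        scored.sort(reverse=True)
        return [(chunk_id, text, score) for score, chunk_id, text in scored[:top_k]]


idx = ToySqliteIndex()
idx.index("c1", "how to build a Flow", [1.0, 0.0])
idx.index("c2", "splitting PDFs into chunks", [0.0, 1.0])
hits = idx.search([0.9, 0.1], top_k=1)
```

Scanning every row is fine for a handful of PDFs; for larger collections an approximate nearest-neighbour backend of the kind mentioned above is the more likely choice.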
1. Result of the run
workspace
└── SimpleIndexer
    └── 0
        └── index.db

2 directories, 1 file