✨ PDFSegmenter

PDFSegmenter is an Executor used for extracting images and text as chunks from PDF data. It stores each images and text of each page as chunks separately, with their respective mime types. It uses the pdfplumber library.

Loading data

The PDFSegmenter expects data to be found in the Document's .blob attribute. This can be loaded from a PDF file like so

from docarray import DocumentArray, Document
from jina import Flow

doc = DocumentArray([Document(uri='cats_are_awesome.pdf')]) # adjust to your own pdf
doc[0].load_uri_to_blob()
print(doc[0])

f = Flow().add(
    uses='jinahub+docker://PDFSegmenter',
)
with f:
    resp = f.post(on='/craft', inputs=doc)
    print(f'{[c.mime_type for c in resp[0].chunks]}')

>> <Document ('id', 'blob', 'mime_type', 'uri') at 9d77e00f759bf8523e86abf452ac28a0> # notice `.blob` field is set
>> ['image/*', 'image/*', 'text/plain'] # we get both images and text from a PDF

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
tests		tests
.gitignore		.gitignore
README.md		README.md
config.yml		config.yml
manifest.yml		manifest.yml
pdf_segmenter.py		pdf_segmenter.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

✨ PDFSegmenter

Loading data

About

Releases 3

Contributors 5

Languages

jina-ai/executor-pdfsegmenter

Folders and files

Latest commit

History

Repository files navigation

✨ PDFSegmenter

Loading data

About

Resources

Stars

Watchers

Forks

Releases 3

Contributors 5

Languages