Pdf librarian

A tool assembled from a few different packages to quickly search PDF files. It will:

Scan a directory for PDF, text, or image files
Keep a cache of the file contents and indexes, along with each file's hash, so that renamed files are detected and not reprocessed (duplicate files are not accounted for)
Extract plain text from files using pdf-text-reader and tesseract.js
Index the extracted text
Allow searching with the Meilisearch engine, through either:
- CLI, grep-like prompt tool
- Web interface with launched server
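
The hash-based cache described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `CacheEntry`, `TextCache`, `hashContent`, and `getText` are hypothetical names, SHA-256 is an assumed choice of hash, and `extract` is a synchronous stand-in for the real (asynchronous) pdf-text-reader / tesseract.js extraction.

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache entry: extracted text plus the last path seen for it.
interface CacheEntry {
  text: string;
  path: string;
}

// The cache is keyed by content hash rather than by file path.
type TextCache = Map<string, CacheEntry>;

// SHA-256 of the raw bytes; renaming a file does not change this value.
function hashContent(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

// Returns the cached text when the same content was seen before (even under
// a different name); otherwise runs the extractor and stores the result.
function getText(
  cache: TextCache,
  path: string,
  data: Buffer,
  extract: (data: Buffer) => string,
): string {
  const key = hashContent(data);
  const hit = cache.get(key);
  if (hit) {
    hit.path = path; // the file was renamed or moved; remember the new path
    return hit.text;
  }
  const text = extract(data);
  cache.set(key, { text, path });
  return text;
}
```

Because each hash maps to a single path, two identical files would overwrite each other's entry, which matches the duplicate-file limitation noted in the list.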

Inspiration

I wanted to search through a bunch of PDF files, quickly. I used to use pdfgrep, but it was a bit slow even with its cache, and the cache didn't last long. I thought it'd be more fun to make my own tool than to modify pdfgrep to extend the cache lifespan.

Todo:

Remove ghost keys from the index: the search should not return files that are missing from disk but still present in the cached index.
- Also, I suggest not removing them from the text cache, since they can be reused later if the file is found again (related to the first point)
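
One possible way to handle this todo is to filter ghost entries at query time rather than deleting anything. The sketch below is only an assumption about how that could look: `Hit` and `dropGhostHits` are hypothetical names, and the existence check is passed in as a parameter so it can be swapped out.

```typescript
import { existsSync } from "node:fs";

// Hypothetical shape of a search hit; only the path matters here.
interface Hit {
  path: string;
  snippet: string;
}

// Returns only the hits whose file still exists. The input array and any
// text cache are left untouched, so a file that reappears later can reuse
// its cached text.
function dropGhostHits(
  hits: Hit[],
  exists: (path: string) => boolean = existsSync,
): Hit[] {
  return hits.filter((hit) => exists(hit.path));
}
```

Calling `dropGhostHits(results)` on the raw search results checks the real filesystem by default; passing a custom `exists` function makes it easy to test without touching the disk.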

Notes

The elasticlunr index hit the maximum call stack size while being serialized. I reckon it'd hit it again eventually if I just tried to increase the limit, so that approach is deprecated.
In general, I've discovered that these simple JS search engines are much less powerful than I imagined. Some use too much memory, or are not flexible enough to be used the way I wanted, and don't allow or ease import/export. I reckon most of them could be useful for searching simple stuff like titles and small texts and/or a few rows (a few million?). Some of them are very flexible in terms of search capabilities and some are very performant (supposedly). But in the end none could do what I wanted so I'll move

U1F30C/librarian
