Vision models #150

Open
p5 opened this issue Sep 16, 2024 · 10 comments

Comments

@p5
Contributor

p5 commented Sep 16, 2024

Value Statement

As someone who wants a boring way to use AI
I would like to expose an image/PDF/document to the LLM
So that I can make requests and extract information, all within Ramalama

Notes

Various models now include vision functionality: they can ingest image data and answer questions about those images. The accuracy of LLM-based OCR text extraction can now exceed that of dedicated OCR tooling (even paid products like AWS Textract). The same vision models can also extract information from PDF documents fairly easily once the pages are converted to images, as sketched below.
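
For the PDF case, the conversion step could be as simple as rendering each page to an image first. A minimal sketch, assuming poppler-utils' pdftoppm is available (how Ramalama would actually do this internally is an open question):

$ # render each page of the PDF to a PNG (page-1.png, page-2.png, ...)
$ pdftoppm -png -r 150 ./document.pdf page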

We can use an interface similar to the planned Whisper.cpp implementation, since both are just context or data we provide to the LLMs. This has not been detailed anywhere, so below is a proposal/example of how it could look.

$ ramalama run --context-file ./document.pdf phi3.5-vision
>> When is this letter dated?
The date in the letter is `1st January 1999`

>> What is this document about?
This document is an instruction manual detailing how to use Ramalama, a cool new way to run LLMs (Large Language Models) across Linux and MacOS.  It supports text and vision-based models.

$ ramalama run --context-file ./painting.png phi3.5-vision
>> What is in the painting?
This is an abstract oil painting about something and something else.  It seems to be inspired by some artist.

The primary issue is that neither Ollama nor llama.cpp supports vision models at the moment, so this would either need a custom implementation or require adding something like vllm.
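
If the vllm route is taken, one possible shape for the backend call is the OpenAI-compatible chat endpoint that vllm exposes, with the image inlined as a data URL. This is only a sketch; the port, model name, and payload are assumptions about how a vllm-backed runtime could be driven, not anything Ramalama does today:

$ # assumes a vllm server with a vision-capable model is already listening on localhost:8000
$ IMG=$(base64 < ./painting.png | tr -d '\n')
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "microsoft/Phi-3.5-vision-instruct",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in the painting?"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
        ]
      }]
    }'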

@ericcurtin
Collaborator

ericcurtin commented Sep 17, 2024

We had intended to merge vllm support soon; we started it here:

#97

This is an outline of what we think it should look like: basically, we want to introduce a --runtime flag, kinda like the podman one that switches between crun, runc, and krun, but in this case it allows one to switch between llama.cpp, vllm, and whatever other runtimes people would like to integrate in the future.
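
As a rough illustration of the proposed flag (the syntax is speculative; none of this is implemented yet):

$ # hypothetical: keep llama.cpp as the default runtime
$ ramalama --runtime=llama.cpp run phi3.5-vision
$ # hypothetical: switch the backend to vllm instead
$ ramalama --runtime=vllm run phi3.5-vision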

The above is a key feature we want; it's one of the reasons we don't simply use Ollama.

Now that vllm v0.6.1 is out, we are ready to complete that work.

Vision models like this would be useful for sure.

Personally, I'm gonna be out a little bit in the next week or two; I have a wedding and other things I need to take some time for.

Anybody who wants to pick up --runtime, vllm support, vision model support, like you @p5 or others, be my guest.

@ericcurtin
Collaborator

@rhatdan merged the first vllm-related PR. I dunno if you want to take a stab at implementing the other things you had in mind, @p5.

@rhatdan
Member

rhatdan commented Oct 14, 2024

@p5 still interested in this?

@p5
Contributor Author

p5 commented Oct 14, 2024

Hey Dan, Eric

My free time is very limited at the minute. Starting a new job in 2 weeks and there's a lot to get in order.

I still feel vision models would be a great addition to ramalama, but I'm going to be in a Windows-only environment :sigh: so I'm unsure how much I'll be able to help out.

@rhatdan
Member

rhatdan commented Oct 14, 2024

Thanks @p5, good luck with the new job.

@ericcurtin
Collaborator

Best of luck @p5! @bmahabirbu did have success running on Windows recently:

https://github.com/containers/ramalama/tree/main/docs/readme

@p5

This comment has been minimized.

@ericcurtin
Collaborator

> FYI - Ollama is now implementing vision models, so once v0.4 is released, it might be easier to integrate here.

Indirectly, maybe: we inherit from the same backend, llama.cpp, but we don't actually use any Ollama stuff directly, even though to a user it might appear that way!

@p5
Contributor Author

p5 commented Oct 22, 2024

Oh, apologies. I thought Ramalama used both llama.cpp and ollama runtimes 🤦
Now I can see you use Ollama's registry and transport, served via the llama.cpp runtime.
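
For context, this is roughly what that split looks like from the CLI; a sketch assuming ramalama's ollama:// transport prefix, with the model name purely illustrative:

$ # pull a model from Ollama's registry over ramalama's own transport implementation
$ ramalama pull ollama://tinyllama
$ # ...but inference is served by the llama.cpp runtime, not by Ollama
$ ramalama run tinyllama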

@ericcurtin
Collaborator

And we wrote the Ollama transport from scratch, so we use zero Ollama code.

What a lot of people don't realize is that it's llama.cpp that does most of the heavy lifting for Ollama.
