Vision models #150

Open
p5 opened this issue Sep 16, 2024 · 10 comments

Comments

@p5
Contributor

p5 commented Sep 16, 2024

Value Statement

As someone who wants a boring way to use AI
I would like to expose an image/PDF/document to the LLM
So that I can make requests and extract information, all within Ramalama

Notes

Various models now include vision functionality: they can ingest image data and answer questions about those images. The accuracy of LLM-based OCR text extraction can now exceed that of dedicated OCR tooling (even paid products like AWS Textract). The same vision models can also extract information from PDF documents fairly easily once the pages are converted to images, as sketched below.
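
For the PDF case, the conversion step could be as simple as rendering each page to an image first. A minimal sketch, assuming poppler-utils' pdftoppm is available (how Ramalama would actually do this internally is an open question):

$ # render each page of the PDF to a PNG (page-1.png, page-2.png, ...)
$ pdftoppm -png -r 150 ./document.pdf page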

We can use an interface similar to the planned Whisper.cpp implementation, since both are just context or data we provide to the LLMs. This has not been detailed anywhere, so below is a proposal/example of how it could look.

$ ramalama run --context-file ./document.pdf phi3.5-vision
>> When is this letter dated?
The date in the letter is `1st January 1999`

>> What is this document about?
This document is an instruction manual detailing how to use Ramalama, a cool new way to run LLMs (Large Language Models) across Linux and MacOS.  It supports text and vision-based models.

$ ramalama run --context-file ./painting.png phi3.5-vision
>> What is in the painting?
This is an abstract oil painting about something and something else.  It seems to be inspired by some artist.

The primary issue is that neither Ollama nor llama.cpp supports vision models at the moment, so this would either need a custom implementation or require adding something like vllm.
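
If the vllm route is taken, one possible shape for the backend call is the OpenAI-compatible chat endpoint that vllm exposes, with the image inlined as a data URL. This is only a sketch; the port, model name, and payload are assumptions about how a vllm-backed runtime could be driven, not anything Ramalama does today:

$ # assumes a vllm server with a vision-capable model is already listening on localhost:8000
$ IMG=$(base64 < ./painting.png | tr -d '\n')
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "microsoft/Phi-3.5-vision-instruct",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in the painting?"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
        ]
      }]
    }'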

@ericcurtin
Collaborator

ericcurtin commented Sep 17, 2024

We had intended to merge vllm support soon; we started it here:

#97

This is an outline of what we think it should look like: basically, we want to introduce a --runtime flag, kinda like the podman one that switches between crun, runc, and krun, but in this case it allows one to switch between llama.cpp, vllm, and whatever other runtimes people would like to integrate in the future.
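
As a rough illustration of the proposed flag (the syntax is speculative; none of this is implemented yet):

$ # hypothetical: keep llama.cpp as the default runtime
$ ramalama --runtime=llama.cpp run phi3.5-vision
$ # hypothetical: switch the backend to vllm instead
$ ramalama --runtime=vllm run phi3.5-vision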

The above is a key feature we want; it's one of the reasons we don't simply use Ollama.

Now that vllm v0.6.1 is out, we are ready to complete that work.

Vision models like this would be useful for sure.

Personally, I'm gonna be out a little bit in the next week or two; I have a wedding and other things I need to take some time for.

Anybody who wants to pick up --runtime, vllm support, vision model support, like you @p5 or others, be my guest.

@ericcurtin
Collaborator

@rhatdan merged the first vllm-related PR. I dunno if you want to take a stab at implementing the other things you had in mind, @p5.

@rhatdan
Member

rhatdan commented Oct 14, 2024

@p5 still interested in this?

@p5
Contributor Author

p5 commented Oct 14, 2024

Hey Dan, Eric

My free time is very limited at the minute. Starting a new job in 2 weeks and there's a lot to get in order.

I still feel vision models would be a great addition to ramalama, but I'm going to be in a Windows-only environment :sigh: so I'm unsure how much I'll be able to help out.

@rhatdan
Member

rhatdan commented Oct 14, 2024

Thanks @p5, good luck with the new job.

@ericcurtin
Collaborator

Best of luck @p5! @bmahabirbu did have success running on Windows recently:

https://github.com/containers/ramalama/tree/main/docs/readme

@p5

This comment has been minimized.

@ericcurtin
Collaborator

> FYI - Ollama is now implementing vision models, so once v0.4 is released, it might be easier to integrate here.

Indirectly, maybe: we inherit from the same backend, llama.cpp, but we don't actually use any Ollama stuff directly, even though to a user it might appear that way!

@p5
Contributor Author

p5 commented Oct 22, 2024

Oh, apologies. I thought Ramalama used both llama.cpp and ollama runtimes 🤦
Now I can see you use Ollama's registry and transport, served via the llama.cpp runtime.
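
For context, this is roughly what that split looks like from the CLI; a sketch assuming ramalama's ollama:// transport prefix, with the model name purely illustrative:

$ # pull a model from Ollama's registry over ramalama's own transport implementation
$ ramalama pull ollama://tinyllama
$ # ...but inference is served by the llama.cpp runtime, not by Ollama
$ ramalama run tinyllama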

@ericcurtin
Collaborator

And we wrote the Ollama transport from scratch, so we use zero Ollama code.

What a lot of people don't realize is that it's llama.cpp that does most of the heavy lifting for Ollama.
