Multimodal RAG with FiftyOne, LlamaIndex, and Milvus

Introduction

Retrieval augmented generation (RAG) has grown increasingly popular as a way to improve the quality of text generated by large language models. Now that multimodal LLMs are in vogue, it's time to extend RAG to multimodal data.

When we add the ability to search and retrieve data across multiple modalities, we get a powerful tool for interacting with the most capable AI models available today. However, we also introduce brand-new layers of complexity to the process.

Some of the considerations we need to take into account include:

  • How do we chunk and index multimodal data? Do we split it into separate modalities or keep it together?
  • How do we search multimodal data? Do we search each modality separately and then combine the results, or do we search them together?
  • What new strategies can we use to improve the quality of the data we generate?

On a more practical level, here are some of the basic knobs we can turn (sketched as a configuration after the list):

  • Text embedding model: Which model do we use to embed the text?
  • Image representation: Do we embed the image with a multimodal model (like CLIP) or use captions?
  • Retrieval counts: How many image and text results do we want to retrieve?
  • Generation model: Which multimodal model do we use to generate our retrieval-augmented results?
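
For concreteness, these knobs might be collected into a single experiment configuration. The sketch below is purely illustrative; none of these names are part of the plugin's API:

# Hypothetical experiment configuration; every key and value here is
# illustrative, not part of the plugin's API
rag_config = {
    "text_embedding_model": "text-embedding-3-small",  # embeds the text chunks
    "image_representation": "clip",  # or "captions"
    "num_text_results": 5,  # text top-k at retrieval time
    "num_image_results": 2,  # image top-k at retrieval time
    "generation_model": "gpt-4o",  # multimodal model that writes the answer
}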

This project is a testbed for exploring these questions and more. It uses three open source libraries, FiftyOne, LlamaIndex, and Milvus, to make it as easy as possible to work with multimodal data, experiment with different multimodal RAG techniques, and find what works best for your use case.

⚠️ This project is a work in progress. It may be rough around the edges, and some features may not work as expected. If you run into any issues, please open an issue on this repository — or better yet, submit a pull request!

Also note that LlamaIndex frequently updates its API, which is why the versions of LlamaIndex and its associated packages are all pinned 🙃

Installation

First, install FiftyOne:

pip install fiftyone

Next, download and install the FiftyOne Multimodal RAG plugin using FiftyOne's CLI:

fiftyone plugins download https://github.com/jacobmarks/fiftyone-multimodal-rag-plugin

LlamaIndex has a verbose installation process (at least if you want to build anything multimodal). Fortunately for you, this and all other installation dependencies will be taken care of by the following command:

fiftyone plugins requirements @jacobmarks/multimodal_rag --install

Usage

Setup

To get started, launch the FiftyOne App. You can do so from the terminal by running:

fiftyone app launch

Or you can run the following Python code:

import fiftyone as fo

session = fo.launch_app()

Creating a Multimodal Dataset

Now press the backtick key (`) and type create_dataset_from_llama_documents. Press Enter to open the operator's modal. This operator gives you a UI to choose a directory containing your multimodal data (images, text files, PDFs, etc.) and create a FiftyOne dataset from it.

Once you've selected a directory, execute the operator. It will create a new dataset in your FiftyOne session. For text files, you will see an image rendering of the truncated text; for images, you will see the image itself.

💡 You can add additional directories of multimodal data with the add_llama_documents_to_dataset operator.
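
Operators can also be executed programmatically via FiftyOne's operator execution API. Here is a hedged sketch, assuming the operator URI follows the plugin namespace used above; the "directory" parameter name is a guess:

import fiftyone.operators as foo

# Programmatic execution of the plugin's operator; the "directory"
# parameter name is an assumption, so check the operator's modal for
# the actual fields it expects
foo.execute_operator(
    "@jacobmarks/multimodal_rag/create_dataset_from_llama_documents",
    ctx={"params": {"directory": "/path/to/multimodal/data"}},
)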

Indexing the Multimodal Dataset

Now that you have a multimodal dataset, you can index it with LlamaIndex and Milvus. Use the create_multimodal_rag_index operator to start this process. The operator will prompt you to name the index and give you the option to index the images via CLIP embeddings or via captions. If you choose captions, you will be prompted to select the text field to use as the caption.

💡 If you do not have captions on your dataset, you might be interested in the FiftyOne Image Captioning Plugin.

fiftyone plugins download https://github.com/jacobmarks/fiftyone-image-captioning-plugin
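
Under the hood, index creation builds on LlamaIndex's multimodal indexing over Milvus vector stores. Here is a rough sketch of that general pattern, not the plugin's exact code; the paths, collection names, and embedding dimensions are illustrative, and the pinned LlamaIndex versions may expose a slightly different API:

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

# Separate Milvus collections for text and image embeddings; the dims
# are illustrative (e.g. 1536 for OpenAI text embeddings, 512 for CLIP)
text_store = MilvusVectorStore(
    uri="milvus_demo.db", collection_name="text_collection", dim=1536
)
image_store = MilvusVectorStore(
    uri="milvus_demo.db", collection_name="image_collection", dim=512
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load a directory of mixed media and index text and images together
documents = SimpleDirectoryReader("/path/to/multimodal/data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)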

Inspect an Index

Once you have created an index, you can inspect it by running the get_multimodal_rag_index_info operator and selecting the index from the dropdown.
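
Programmatically, this follows the same execution pattern as before; a brief sketch, where "index_name" is an assumed parameter name for illustration:

import fiftyone.operators as foo

# "index_name" is an assumed parameter name, not confirmed by this README
info = foo.execute_operator(
    "@jacobmarks/multimodal_rag/get_multimodal_rag_index_info",
    ctx={"params": {"index_name": "my_index"}},
)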

Querying the Index

Finally, you can query the index with the query_multimodal_rag_index operator. This operator will prompt you to enter a query string and select an index to query.

You can also specify the multimodal model used to generate the retrieval-augmented results, as well as the number of image and text results to retrieve.
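
In LlamaIndex terms, this corresponds to multimodal retrieval over the index, with separate top-k values for text and images. A hedged sketch, continuing from the index built in the earlier snippet (the top-k values and the query are illustrative):

# Retrieve a mix of text and image results from the index built above
retriever = index.as_retriever(
    similarity_top_k=5,  # number of text results
    image_similarity_top_k=2,  # number of image results
)
nodes = retriever.retrieve("What does the architecture diagram show?")

for node in nodes:
    # Each result carries a relevance score and either a text chunk or
    # an image reference; these can be handed to a multimodal model
    print(node.score, type(node.node).__name__)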

Supported Multimodal Models
