Retrieval-augmented generation (RAG) has grown increasingly popular as a way to improve the quality of text generated by large language models. Now that multimodal LLMs are in vogue, it's time to extend RAG to multimodal data.
When we add the ability to search and retrieve data across multiple modalities, we get a powerful way to interact with the most capable AI models available today. However, we also introduce entirely new layers of complexity.
Some of the considerations we need to take into account include:
- How do we chunk and index multimodal data? Do we split it into separate modalities or keep it together?
- How do we search multimodal data? Do we search each modality separately and then combine the results, or do we search them together?
- What new strategies can we use to improve the quality of the data we generate?
On a more practical level, here are some of the basic knobs we can turn (sketched in code after the list):
- Text embedding model: Which model do we use to embed the text?
- Image representation: Do we embed the image with a multimodal model (like CLIP) or use captions?
- How many image and text results do we want to retrieve?
- Which multimodal model do we use to generate our retrieval-augmented results?
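To make these choices concrete, here is a minimal sketch of the kind of configuration these knobs boil down to. All of the names and values below are illustrative assumptions, not the plugin's actual parameters:

```python
# Hypothetical knob settings for a multimodal RAG run. These names are
# illustrative only; they just make the decision points explicit.
rag_config = {
    "text_embedding_model": "text-embedding-3-small",  # which model embeds the text
    "image_representation": "clip",  # embed images with CLIP, or use "captions"
    "caption_field": None,           # only relevant when using captions
    "num_text_results": 5,           # top-k text chunks to retrieve
    "num_image_results": 3,          # top-k images to retrieve
    "multimodal_llm": "gpt-4o",      # model that generates the final answer
}
```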
This project is a testbed for exploring these questions and more. It uses three open-source libraries (FiftyOne, LlamaIndex, and Milvus) to make working with multimodal data, experimenting with different multimodal RAG techniques, and finding what works best for your use case as easy as possible.
Also note that LlamaIndex frequently updates its API. This is why the versions of LlamaIndex and its associated packages are all pinned 🙃
First, install FiftyOne:
```shell
pip install fiftyone
```
Next, using FiftyOne's CLI syntax, download and install the FiftyOne Multimodal RAG plugin:
```shell
fiftyone plugins download https://github.com/jacobmarks/fiftyone-multimodal-rag-plugin
```
LlamaIndex has an involved installation process (at least if you want to build anything multimodal). Fortunately, this and all of the other installation dependencies will be taken care of by the following command:
```shell
fiftyone plugins requirements @jacobmarks/multimodal_rag --install
```
To get started, launch the FiftyOne App. You can do so from the terminal by running:
```shell
fiftyone app launch
```
Or you can run the following Python code:
```python
import fiftyone as fo

session = fo.launch_app()
```
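If you have already created a dataset (for example, with the operator described next), you can also load it by name and open the App directly on it. The dataset name below is just a placeholder:

```python
import fiftyone as fo

# Load an existing dataset by name (placeholder name) and open the App on it
dataset = fo.load_dataset("my_multimodal_dataset")
session = fo.launch_app(dataset)
```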
Now press the backtick key (`` ` ``) and type `create_dataset_from_llama_documents`. Press `Enter` to open the operator's modal. This operator gives you a UI to choose a directory containing your multimodal data (images, text files, PDFs, etc.) and create a FiftyOne dataset from it.

Once you've selected a directory, execute the operator. It will create a new dataset in your FiftyOne session. For text files, you will see an image rendering of the truncated text. For images, you will see the image itself.
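Conceptually, the operator does something similar to loading the directory with LlamaIndex's `SimpleDirectoryReader` and then building a FiftyOne dataset from the resulting documents (rendering text documents as images so they can be displayed). The plugin's actual implementation may differ, the import path depends on your pinned LlamaIndex version, and the directory path is a placeholder:

```python
from llama_index.core import SimpleDirectoryReader

# Load a directory of mixed-modality files (text, PDFs, images) as
# LlamaIndex documents; the path is a placeholder
documents = SimpleDirectoryReader("/path/to/your/data").load_data()

print(len(documents), "documents loaded")
print(documents[0].metadata)  # includes the source file path, file type, etc.
```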
💡 You can add additional directories of multimodal data with the `add_llama_documents_to_dataset` operator.
Now that you have a multimodal dataset, you can index it with LlamaIndex and Milvus.
Use the `create_multimodal_rag_index` operator to start this process. This operator will prompt you to name the index and will give you the option to index the images via CLIP embeddings or captions. If you choose captions, you will be prompted to select the text field to use as the caption.
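For reference, here is roughly what building such an index looks like in raw LlamaIndex with a Milvus backend. This is a hedged sketch rather than the plugin's implementation: import paths vary across LlamaIndex versions, and the URI, embedding dimensions, and collection names are placeholders.

```python
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

# Separate Milvus collections for text and image embeddings (placeholder values)
text_store = MilvusVectorStore(
    uri="milvus_demo.db", collection_name="text_collection", dim=1536, overwrite=True
)
image_store = MilvusVectorStore(
    uri="milvus_demo.db", collection_name="image_collection", dim=512, overwrite=True
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

documents = SimpleDirectoryReader("/path/to/your/data").load_data()

# By default, images are embedded with CLIP (requires the CLIP embeddings
# integration); text uses whichever text embedding model you have configured
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```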
💡 If you do not have captions on your dataset, you might be interested in the FiftyOne Image Captioning Plugin.
```shell
fiftyone plugins download https://github.com/jacobmarks/fiftyone-image-captioning-plugin
```
Once you have created an index, you can inspect it by running the `get_multimodal_rag_index_info` operator and selecting the index you want to inspect from the dropdown.
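If you want to go a level deeper than the operator's summary, you can also inspect the underlying Milvus collections directly with pymilvus. The sketch below assumes the index is backed by a local Milvus Lite database file; the URI and collection name are placeholders:

```python
from pymilvus import MilvusClient

# Connect to the local Milvus Lite database file (placeholder path)
client = MilvusClient(uri="milvus_demo.db")

print(client.list_collections())
print(client.describe_collection("text_collection"))   # schema and field info
print(client.get_collection_stats("text_collection"))  # e.g., row count
```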
Finally, you can query the index with the `query_multimodal_rag_index` operator. This operator will prompt you to enter a query string and select an index to query. You can also specify the multimodal model to use for generating the retrieval-augmented results, as well as the number of image and text results to retrieve.
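Under the hood, a query like this maps onto a multimodal retriever plus a multimodal LLM in LlamaIndex. The sketch below is a hedged approximation, not the plugin's code: `index` is the `MultiModalVectorStoreIndex` from the earlier sketch, the model name and top-k values are placeholders, and the OpenAI multimodal class requires the corresponding integration package and an API key.

```python
from llama_index.core.query_engine import SimpleMultiModalQueryEngine
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Multimodal LLM that generates the retrieval-augmented answer (placeholder model)
mm_llm = OpenAIMultiModal(model="gpt-4o", max_new_tokens=512)

# Retrieve text and image results with separate top-k knobs
retriever = index.as_retriever(similarity_top_k=5, image_similarity_top_k=3)

query_engine = SimpleMultiModalQueryEngine(retriever, multi_modal_llm=mm_llm)
response = query_engine.query("What does the architecture diagram show?")
print(response)
```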