Multimodal RAG using Granite and Docling

Retrieval-augmented generation (RAG) is a technique that connects a large language model (LLM) to an external knowledge base of information the model was not trained on, without requiring fine-tuning. Traditional RAG is limited to text-based use cases such as text summarization and chatbots.
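As a minimal illustration of the core RAG loop (not the notebook's implementation), the sketch below embeds a few text passages, retrieves the one most similar to a query, and prepends it to the prompt that would be sent to an LLM. The passages, query, and embedding model name are placeholders; the lab uses Granite models and unstructured PDF content instead.

```python
# Minimal sketch of the core RAG retrieval step (illustrative only).
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base of text passages (in the lab, these come from a PDF).
passages = [
    "Docling converts PDFs into structured text, tables, and images.",
    "Granite is IBM's family of open foundation models.",
    "RAG augments an LLM prompt with retrieved context.",
]

# Any sentence-embedding model works for the sketch; the lab uses a
# Granite embedding model instead.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_vecs = embedder.encode(passages, convert_to_tensor=True)

query = "What does Docling do?"
query_vec = embedder.encode(query, convert_to_tensor=True)

# Retrieve the most similar passage by cosine similarity.
best = util.cos_sim(query_vec, passage_vecs).argmax().item()
context = passages[best]

# The retrieved context is prepended to the prompt sent to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```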

Multimodal RAG uses multimodal LLMs (MLLMs) to process multiple types of data, so the external knowledge base is no longer limited to text. Multimodal data can include text, images, audio, video, or other forms. Popular multimodal LLMs include Google’s Gemini, Meta’s Llama 3.2, and OpenAI’s GPT-4 and GPT-4o.

For this lab, you will use IBM Granite models capable of processing different modalities. You will create an AI system to answer real-time user queries from unstructured data in a PDF.
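Before any retrieval can happen, the unstructured content of the PDF has to be extracted. The sketch below shows one way to do that with Docling's document converter; the file path is a placeholder, and the notebook's actual pipeline options and chunking may differ.

```python
# Minimal sketch: convert a PDF into markdown text with Docling so its
# content can be chunked and embedded for retrieval.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("example.pdf")  # placeholder path or URL to a PDF

# Export the parsed document (text, tables, headings) as markdown.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```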

Prerequisites

This lab is a Jupyter notebook. Please follow the instructions in pre-work to run the lab.

Lab

Multimodal RAG using Granite and Docling

To run the notebook from your command line in Jupyter using the active virtual environment from the pre-work, run:

jupyter notebook notebooks/Granite_Multimodal_RAG.ipynb

The path of the notebook file above is relative to the granite-workshop folder from the git clone in the pre-work.

Credits

This notebook is a modified version of the IBM Granite Community Multimodal RAG using Granite and Docling notebook. Refer to the IBM Granite Community for the official notebooks.