Implement RAG chunking strategies with LangChain and watsonx.ai¶
Author: Anna Gutowska
In this tutorial, you will experiment with several chunking strategies using LangChain and the latest IBM® Granite™ model now available on watsonx.ai™. The overall goal is to perform chunking to effectively implement retrieval augmented generation (RAG).
What is chunking?¶
Chunking refers to the process of breaking large pieces of text into smaller text segments, or chunks. To appreciate the importance of chunking, it helps to understand RAG. RAG is a technique in natural language processing (NLP) that combines information retrieval and large language models (LLMs), retrieving relevant information from supplemental datasets to improve the quality of the LLM’s output. To manage large documents, we can use chunking to split the text into smaller, meaningful snippets. These text chunks can then be embedded and stored in a vector database through the use of an embedding model. Finally, the RAG system uses semantic search to retrieve only the most relevant chunks. Smaller chunks tend to outperform larger ones because they are more manageable for models with smaller context windows.
Some key components of chunking include:
- Chunking strategy: Choosing the right chunking strategy for your RAG application is important as it determines the boundaries for setting chunks. We will explore some of these in the next section.
- Chunk size: Maximum number of tokens to be in each chunk. Determining the appropriate chunk size usually involves some experimenting.
- Chunk overlap: The number of tokens overlapping between chunks to preserve context. This is an optional parameter.
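To make chunk size and chunk overlap concrete, here is a minimal, purely illustrative sketch in plain Python. It splits on characters for simplicity (the LangChain splitters used later in this tutorial work with separators and tokens), and the helper function and values below are illustrative rather than taken from the notebook.
def naive_chunk(text, chunk_size, chunk_overlap):
    # Slide a window of chunk_size characters, stepping forward by
    # (chunk_size - chunk_overlap) so that neighboring chunks share some context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Retrieval augmented generation combines information retrieval with text generation. " * 10
chunks = naive_chunk(sample, chunk_size=200, chunk_overlap=50)
print(len(chunks), len(chunks[0]))
Each chunk here is at most 200 characters long and repeats the last 50 characters of its predecessor, which is exactly the context-preserving role that chunk overlap plays in the splitters used later in this tutorial.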
Choosing the right chunking strategy for your RAG application¶
There are several different chunking strategies to choose from. It is important to select the most effective chunking technique for the specific use case of your LLM application. Some commonly used chunking processes include:
- Fixed-size chunking: Splitting text based on a chunk size and optional chunk overlap. This is the most common and straightforward approach.
- Recursive chunking: Iterating over default separators until one of them produces the preferred chunk size. The default separators are ["\n\n", "\n", " ", ""]. This chunking method uses hierarchical separators so that paragraphs, then sentences and then words are kept together as much as possible.
- Semantic chunking: Splitting text in a way that groups sentences based on the semantic similarity of their embeddings. Embeddings with high semantic similarity are closer together than those with low semantic similarity. This results in context-aware chunks.
- Document-based chunking: Splitting based on document structure. This splitter can use Markdown text, images, tables and even Python classes and functions as ways of determining structure. In doing so, large documents can be chunked and processed by the LLM.
- Agentic chunking: Leveraging agentic AI by allowing the LLM to determine appropriate document splits based on semantic meaning as well as content structure, such as paragraph types, section headings and step-by-step instructions. This chunker is experimental and attempts to simulate human reasoning when processing long documents.
Steps¶
Step 1. Set up your environment¶
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud® account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community. This tutorial is also available on GitHub.
Step 2. Set up a watsonx.ai Runtime instance and API key¶
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an application programming interface (API) key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
Step 3. Install and import relevant libraries and set up your credentials¶
We need a few libraries and modules for this tutorial. Make sure to import the ones below; if any are not installed, the quick pip installation that follows resolves the problem.
Note: this tutorial was built using Python 3.11.9.
# installations
!pip install -q langchain langchain-ibm langchain_experimental langchain-text-splitters langchain_chroma transformers bs4 langchain_huggingface sentence-transformers
# imports
import getpass
from langchain_ibm import WatsonxLLM
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from transformers import AutoTokenizer
To set our credentials, we need the WATSONX_APIKEY that you generated in step 2 and the WATSONX_PROJECT_ID that you obtained in step 1. We will also set the URL serving as the API endpoint.
WATSONX_APIKEY = getpass.getpass("Please enter your watsonx.ai Runtime API key (hit enter): ")
WATSONX_PROJECT_ID = getpass.getpass("Please enter your project ID (hit enter): ")
URL = "https://us-south.ml.cloud.ibm.com"
Step 4. Initialize your LLM¶
We will use Granite 3.1 as our LLM for this tutorial. To initialize the LLM, we need to set the model parameters. To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
llm = WatsonxLLM(
    model_id="ibm/granite-3-8b-instruct",
    url=URL,
    apikey=WATSONX_APIKEY,
    project_id=WATSONX_PROJECT_ID,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 2000,
        GenParams.REPETITION_PENALTY: 1.2,
        GenParams.STOP_SEQUENCES: ["\n\n"]
    }
)
Step 5. Load your document¶
The context we are using for our RAG pipeline is the official IBM announcement for the release of Granite 3.1. We can load the blog into a Document object directly from the webpage by using LangChain's WebBaseLoader.
url = "https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more"
doc = WebBaseLoader(url).load()
Step 6. Perform text splitting¶
Let's walk through sample code for implementing each of the chunking strategies that we covered earlier in this tutorial, all of which are available through LangChain.
Fixed-size chunking¶
To implement fixed-size chunking, we can use LangChain's CharacterTextSplitter and set a chunk_size as well as a chunk_overlap. Because we build the splitter from a Hugging Face tokenizer, the chunk_size is measured in tokens rather than characters. Feel free to experiment with different values. We will also set the separator to the newline character so that we can differentiate between paragraphs. For tokenization, we use the granite-3.1-8b-instruct tokenizer, which breaks the text down into tokens that the LLM can process.
from langchain_text_splitters import CharacterTextSplitter
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    separator="\n",  # default: "\n\n"
    chunk_size=1200,
    chunk_overlap=200)
fixed_size_chunks = text_splitter.create_documents([doc[0].page_content])
We can print one of the chunks for a better understanding of their structure.
fixed_size_chunks[1]
Document(metadata={}, page_content='As always, IBM’s historical commitment to open source is reflected in the permissive and standard open source licensing for every offering discussed in this article.\n\r\n Granite 3.1 8B Instruct: raising the bar for lightweight enterprise models\r\n \nIBM’s efforts in the ongoing optimization the Granite series are most evident in the growth of its flagship 8B dense model. IBM Granite 3.1 8B Instruct now bests most open models in its weight class in average scores on the academic benchmarks evaluations included in the Hugging Face OpenLLM Leaderboard.\nThe evolution of the Granite model series has continued to prioritize excellence and efficiency in enterprise use cases, including agentic AI. This progress is most apparent in the newest 8B model’s significantly improved performance on IFEval, a dataset featuring tasks that test a model’s ability to follow detailed instructions, and Multi-step Soft Reasoning (MuSR), whose tasks measure reasoning and understanding on and of long texts.\n\r\n Expanded context length\r\n \nBolstering the performance leap from Granite 3.0 to Granite 3.1 is the expansion of all models’ context windows. Granite 3.1’s 128K token context length is on par with that of other leading open model series, including Llama 3.1–3.3 and Qwen2.5.\nThe context window (or context length) of a large language model (LLM) is the amount of text, in tokens, that an LLM can consider at any one time. A larger context window enables a model to process larger inputs, carry out longer continuous exchanges and incorporate more information into each output. Tokenization doesn’t entail any fixed token-to-word “exchange rate,” but 1.5 tokens per word is a useful estimate. 128K tokens is roughly equivalent to a 300-page book.\nAbove a threshold of about 100K tokens, impressive new possibilities emerge, including multi-document question answering, repository-level code understanding, self-reflection and LLM-powered autonomous agents.1 Granite 3.1’s expanded context length thus lends itself to a much wider range of enterprise use cases, from processing code bases and lengthy legal documents in their entirety to simultaneously reviewing thousands of financial transactions.\n\r\n Granite Guardian 3.1: detecting hallucinations in agentic workflows\nGranite Guardian 3.1 8B and Granite Guardian 3.1 2B can now detect hallucinations that might occur in an agentic workflow, affording the same accountability and trust to function calling that we already provide for RAG.\nMany steps and subprocesses occur in the space between the initial request sent to an AI agent and the output the agent eventually returns to the user. To provide oversight throughout, Granite Guardian 3.1 models monitor every function call for syntactic and semantic hallucinations.\nFor instance, if an AI agent purportedly queries an external information source, Granite Guardian 3.1 monitors for fabricated information flows. If an agentic workflow entails intermediate calculations using figures retrieved from a bank record, Granite Guardian 3.1 checks to see whether the agent pulled the correct function call along with the appropriate numbers.\nToday’s release is yet another step toward accountability and trust for any component of an LLM-based enterprise workflow. The new Granite Guardian 3.1 models are available on Hugging Face. 
They’ll also be available through Ollama later this month and on IBM watsonx.ai in January 2025.\n\r\n Granite embedding models\r\n \nEmbeddings are an integral part of the LLM ecosystem. An accurate and efficient means of representing words, queries and documents in numerical form is essential to an array of enterprise tasks including semantic search, vector search and RAG, as well as maintaining effective vector databases. An effective embedding model can significantly enhance a system’s understanding of user intent and increase the relevance of information and sources in response to a query.\nWhile the past two years have seen the proliferation of increasingly competitive open source autoregressive LLMs for tasks like text generation and summarization, open source embedding model releases from major providers are relatively few and far between.\nThe new Granite Embedding models are an enhanced evolution of the Slate family of encoder-only, RoBERTA-based language models. Trained with the same care and consideration for filtering bias, hate, abuse and profanity (“HAP”) as the rest of the Granite series, Granite Embedding is offered in four model sizes, two of which support multilingual embedding across 12 natural languages:\nGranite-Embedding-30M-EnglishGranite-Embedding-125M-EnglishGranite-Embedding-107M-MultilingualGranite-Embedding-278M-Multilingual')
We can also use the tokenizer for verifying our process and to check the number of tokens present in each chunk. This step is optional and for demonstrative purposes.
for idx, val in enumerate(fixed_size_chunks):
    token_count = len(tokenizer.encode(val.page_content))
    print(f"The chunk at index {idx} contains {token_count} tokens.")
The chunk at index 0 contains 1106 tokens.
The chunk at index 1 contains 1102 tokens.
The chunk at index 2 contains 1183 tokens.
The chunk at index 3 contains 1010 tokens.
Great! Each chunk stays within the 1,200-token limit we set, so the chunk size was applied as intended.
Recursive chunking¶
For recursive chunking, we can use LangChain's RecursiveCharacterTextSplitter. As with the fixed-size chunking example, we can experiment with different chunk and overlap sizes.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
recursive_chunks = text_splitter.create_documents([doc[0].page_content])
recursive_chunks[:5]
[Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context and more'), Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context, new embedding models and more'), Document(metadata={}, page_content='Artificial Intelligence'), Document(metadata={}, page_content='Compute and servers'), Document(metadata={}, page_content='IT automation')]
The splitter successfully chunked the text by using the default separators: ["\n\n", "\n", " ", ""].
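If you prefer to bound recursive chunks by token count rather than character count, the splitter also exposes a tokenizer-based constructor. The optional sketch below reuses the Granite tokenizer from the fixed-size example; the chunk_size and chunk_overlap values are illustrative assumptions, not settings from the original notebook.
token_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=512,   # illustrative token budget per chunk
    chunk_overlap=50)
token_chunks = token_splitter.create_documents([doc[0].page_content])
print(len(token_chunks))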
Semantic chunking¶
Semantic chunking requires an embedding or encoder model. We can use the granite-embedding-30m-english model as our embedding model. We can also print one of the chunks for a better understanding of their structure.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
text_splitter = SemanticChunker(embeddings_model)
semantic_chunks = text_splitter.create_documents([doc[0].page_content])
semantic_chunks[1]
Document(metadata={}, page_content='Our latest dense models (Granite 3.1 8B, Granite 3.1 2B), MoE models (Granite 3.1 3B-A800M, Granite 3.1 1B-A400M) and guardrail models (Granite Guardian 3.1 8B, Granite Guardian 3.1 2B) all feature a 128K token context length.We’re releasing a family of all-new embedding models. The new retrieval-optimized Granite Embedding models are offered in four sizes, ranging from 30M–278M parameters. Like their generative counterparts, they offer multilingual support across 12 different languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch and Chinese. Granite Guardian 3.1 8B and 2B feature a new function calling hallucination detection capability, allowing increased control over and observability for agents making tool calls.All Granite 3.1, Granite Guardian 3.1, and Granite Embedding models are open source under Apache 2.0 license.These latest entries in the Granite series follow IBM’s recent launch of Docling (an open source framework for prepping documents for RAG and other generative AI applications) and Bee (an open source, model agnostic framework for agentic AI).Granite TTM (TinyTimeMixers), IBM’s series of compact but highly performant timeseries models, are now available in watsonx.ai through the beta release of watsonx.ai Timeseries Forecasting API and SDK.Granite 3.1 models are now available in IBM watsonx.ai, as well as through platform partners including (in alphabetical order) Docker, Hugging Face, LM Studio, Ollama and Replicate.Granite 3.1 will also be leveraged internally by enterprise partners: Samsung is integrating select Granite models into its SDS platform; Lockheed Martin is integrating Granite 3.1 models into its AI Factory tools, used by over 10,000 developers and engineers. Today marks the release of IBM Granite 3.1, the latest update to our Granite series of open, performant, enterprise-optimized language models. This suite of improvements, additions and new capabilities focuses primarily on augmenting performance, accuracy and accountability in essential enterprise use cases like tool use, retrieval augmented generation (RAG) and scalable agentic AI workflows. Granite 3.1 builds upon the momentum of the recently launched Granite 3.0 collection. IBM will continue to release updated models and functionality for the Granite 3 series in the coming months, with new multimodal capabilities slated for release in Q1 2025. These new Granite models are not the only notable recent IBM contributions to the open source LLM ecosystem. Today’s release caps off a recent run of innovative open source launches, from a flexible framework for developing AI agents to an intuitive toolkit to unlock essential information stashed away in PDFs, slide decks and other file formats that are difficult for models to digest. Using these tools and frameworks in tandem with Granite 3.1 models offers developers evolved capabilities for RAG, AI agents and other LLM-based workflows. As always, IBM’s historical commitment to open source is reflected in the permissive and standard open source licensing for every offering discussed in this article. Granite 3.1 8B Instruct: raising the bar for lightweight enterprise models\r\n \n\n\n\nIBM’s efforts in the ongoing optimization the Granite series are most evident in the growth of its flagship 8B dense model. 
IBM Granite 3.1 8B Instruct now bests most open models in its weight class in average scores on the academic benchmarks evaluations included in the Hugging Face OpenLLM Leaderboard. The evolution of the Granite model series has continued to prioritize excellence and efficiency in enterprise use cases, including agentic AI. This progress is most apparent in the newest 8B model’s significantly improved performance on IFEval, a dataset featuring tasks that test a model’s ability to follow detailed instructions, and Multi-step Soft Reasoning (MuSR), whose tasks measure reasoning and understanding on and of long texts. Expanded context length\r\n \n\n\n\nBolstering the performance leap from Granite 3.0 to Granite 3.1 is the expansion of all models’ context windows.')
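The semantic splitter can also be tuned through its breakpoint settings, which control how aggressively it introduces split points between sentence groups. Here is a quick, optional sketch with assumed values; lower thresholds generally yield more, smaller chunks.
tuned_splitter = SemanticChunker(
    embeddings_model,
    breakpoint_threshold_type="percentile",  # alternatives: "standard_deviation", "interquartile", "gradient"
    breakpoint_threshold_amount=90)  # assumed value for illustration
tuned_chunks = tuned_splitter.create_documents([doc[0].page_content])
print(len(tuned_chunks))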
Document-based chunking¶
Documents of various file types are compatible with LangChain's document-based text splitters. For this tutorial's purposes, we will use a Markdown file. For examples of recursive JSON splitting, code splitting and HTML splitting, refer to the LangChain documentation.
An example of a Markdown file we can load is the README file for Granite 3.1 on IBM's GitHub.
url = "https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md"
markdown_doc = WebBaseLoader(url).load()
markdown_doc
[Document(metadata={'source': 'https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md'}, page_content='\n\n\n\n :books: Paper (comming soon)\xa0 | :hugs: HuggingFace Collection\xa0 | \n :speech_balloon: Discussions Page\xa0 | 📘 IBM Granite Docs\n\n\n---\n## Introduction to Granite 3.1 Language Models\nGranite 3.1 language models are lightweight, state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. The models\' data curation and training procedure were designed for enterprise usage and customization, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM\'s standard data clearance process and document quality checks.\n\nGranite 3.1 language models extend the context length of Granite 3.0 language models from 4K to 128K using a progressive training strategy by increasing the supported context length in increments while adjusting RoPE theta until the models successfully adapt to the desired length of 128K. This long-context pre-training stage was performed using approximately 500B tokens. Moreover, Granite 3.1 instruction models provide an improved developer experience for function-calling and RAG generation tasks.\n\nGranite 3.1 models come in 4 varying sizes and 2 architectures:\n- Dense Models: 2B and 8B parameter models, trained on 12 trillion tokens in total.\n- Mixture-of-Expert (MoE) Models: Sparse 1B and 3B MoE models, with 400M and 800M activated parameters respectively, trained on 10 trillion tokens in total.\n\nAccordingly, these options provide a range of models with different compute requirements to choose from, with appropriate trade-offs with their performance on downstream tasks. At each scale, we release base model — checkpoints of models after pretraining, as well as instruct checkpoints — models finetuned for dialogue, instruction-following, helpfulness, and safety.\n\nEvaluation results show that Granite-3.1-8B-Instruct outperforms models of similar parameter sizes in [Hugging Face\'s OpenLLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) (see Figure 1). \n\n\n\n\n Figure 1. Evaluation results from Granite-3.1-8B-Instruct in Hugging Face\'s OpenLLM Leaderboard.\n\n\nComprehensive evaluation results for all model variants, as well as other relevant information will be available in Granite 3.1 Language Models technical report.\n\n## How to Use our Models?\nTo use any of our models, pick an appropriate `model_path` from:\n1. `ibm-granite/granite-3.1-2b-base`\n2. `ibm-granite/granite-3.1-2b-instruct`\n3. `ibm-granite/granite-3.1-8b-base`\n4. `ibm-granite/granite-3.1-8b-instruct`\n5. `ibm-granite/granite-3.1-1b-a400m-base`\n6. `ibm-granite/granite-3.1-1b-a400m-instruct`\n7. `ibm-granite/granite-3.1-3b-a800m-base`\n8. 
`ibm-granite/granite-3.1-3b-a800m-instruct`\n\n### Inference\nThis is a simple example of how to use Granite-3.1-1B-A400M-Instruct model.\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = "auto"\nmodel_path = "ibm-granite/granite-3.1-1b-a400m-instruct"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n# change input text as desired\nchat = [\n { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },\n]\nchat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n# tokenize the text\ninput_tokens = tokenizer(chat, return_tensors="pt").to(device)\n# generate output tokens\noutput = model.generate(**input_tokens, \n max_new_tokens=100)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n# print output\nprint(output)\n```\n## How to Download our Models?\nThe model of choice (granite-3.1-1b-a400m-instruct in this example) can be cloned using:\n```shell\ngit clone https://huggingface.co/ibm-granite/granite-3.1-1b-a400m-instruct\n```\n\n## How to Contribute to this Project?\nPlese check our [Guidelines](/CONTRIBUTING.md) and [Code of Conduct](/CODE_OF_CONDUCT.md) to contribute to our project.\n\n## Model Cards\nThe model cards for each model variant are available in their respective HuggingFace repository. Please visit our collection [here](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d).\n\n## License \nAll Granite 3.0 Language Models are distributed under [Apache 2.0](./LICENSE) license.\n\n## Would you like to provide feedback?\nPlease let us know your comments about our family of language models by visiting our [collection](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d). Select the repository of the model you would like to provide feedback about. Then, go to *Community* tab, and click on *New discussion*. Alternatively, you can also post any questions/comments on our [github discussions page](https://github.com/orgs/ibm-granite/discussions).\n\n')]
Now, we can use LangChain's MarkdownHeaderTextSplitter to split the file by header type, which we set in the headers_to_split_on list. We will also print one of the chunks as an example.
# document-based chunking
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
document_based_chunks = markdown_splitter.split_text(markdown_doc[0].page_content)
document_based_chunks[3]
Document(metadata={'Header 2': 'How to Use our Models?', 'Header 3': 'Inference'}, page_content='This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model. \n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = "auto"\nmodel_path = "ibm-granite/granite-3.1-1b-a400m-instruct"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n# change input text as desired\nchat = [\n{ "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },\n]\nchat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n# tokenize the text\ninput_tokens = tokenizer(chat, return_tensors="pt").to(device)\n# generate output tokens\noutput = model.generate(**input_tokens,\nmax_new_tokens=100)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n# print output\nprint(output)\n```')
As you can see in the output, the chunking successfully split the text by header type.
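If some header-delimited sections are still too long for your model's context window, one common follow-up is to pass them through the recursive splitter imported earlier. This is an optional sketch; the chunk_size and chunk_overlap values are assumptions for illustration.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
smaller_markdown_chunks = char_splitter.split_documents(document_based_chunks)  # header metadata is preserved on each smaller chunk
print(len(smaller_markdown_chunks))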
Step 7. Create vector store¶
Now that we have experimented with various chunking strategies, let's move along with our RAG implementation. For this tutorial, we will use the chunks produced by the semantic splitter and convert them to vector embeddings. An open source vector store we can use is Chroma DB. We can easily access Chroma functionality through the langchain_chroma package.
Let's initialize our Chroma vector database, provide it with our embeddings model and add our documents produced by semantic chunking.
vector_db = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings_model,
    persist_directory="./chroma_langchain_db",  # Where to save data locally; remove if not necessary
)
vector_db.add_documents(semantic_chunks)
['84fcc1f6-45bb-4031-b12e-031139450cf8', '433da718-0fce-4ae8-a04a-e62f9aa0590d', '4bd97cd3-526a-4f70-abe3-b95b8b47661e', '342c7609-b1df-45f3-ae25-9d9833829105', '46a452f6-2f02-4120-a408-9382c240a26e']
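Before wiring up the full RAG chain, you can optionally sanity-check retrieval with a quick similarity search against the vector store; the query string and k value here are only examples.
results = vector_db.similarity_search("What is Docling?", k=2)  # example query and number of results
for res in results:
    print(res.page_content[:200], "\n---")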
Step 8. Structure the prompt template¶
Next, we can move on to creating a prompt template for our LLM. This prompt template allows us to ask multiple questions without altering the initial prompt structure. We can also provide our vector store as the retriever. This step finalizes the RAG structure.
from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
prompt_template = """<|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {input}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(llm, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)
Step 9. Prompt the RAG chain¶
Using our completed RAG workflow, let's invoke a user query. First, we can prompt the model without any additional context from the vector store we built, to test whether the model answers from its built-in knowledge or truly needs the RAG context. The Granite 3.1 announcement blog references Docling, IBM's tool for parsing various document types and converting them into Markdown or JSON. Let's ask the LLM about Docling.
output = llm.invoke("What is Docling?")
output
"\nDocling is a platform that allows users to create, share and discover interactive documents. It's like having your own personal library of dynamic content where you can add notes, highlights, bookmarks, and even collaborate with others in real-time. Think of it as the next generation of document management systems designed for modern collaboration needs."
Clearly, the model was not trained on information about Docling, and without outside tools or information it cannot provide the correct answer; instead, it hallucinates. Now, let's try providing the same query to the RAG chain we built.
rag_output = rag_chain.invoke({"input": "What is Docling?"})
rag_output['answer']
'Docling is a powerful tool developed by IBM Deep Search for parsing documents in various formats such as PDF, DOCX, images, PPTX, XLSX, HTML, and AsciiDoc, and converting them into model-friendly formats like Markdown or JSON. This enables easier access to the information within these documents for models like Granite for tasks such as RAG and other workflows. Docling is designed to integrate seamlessly with agentic frameworks like LlamaIndex, LangChain, and Bee, providing developers with the flexibility to incorporate its assistance into their preferred ecosystem. It surpasses basic optical character recognition (OCR) and text extraction methods by employing advanced contextual and element-based preprocessing techniques. Currently, Docling is open-sourced under the permissive MIT License, and the team continues to develop additional features, including equation and code extraction, as well as metadata extraction.'
Great! The Granite model used the retrieved context to give us accurate, coherent information about Docling. As the previous query showed, the same result was not possible without RAG.
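If you are curious which semantic chunks the retriever actually supplied to the model, the chain's output also includes them under its context key. A brief, optional check:
for chunk in rag_output["context"]:  # documents retrieved from the vector store for this query
    print(chunk.page_content[:150], "\n---")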
Summary¶
In this tutorial, you created a RAG pipeline and experimented with several chunking strategies to improve the system’s retrieval accuracy. Using the Granite 3.1 model, we successfully produced appropriate model responses to a user query related to the documents provided as context. The text we used for this RAG implementation was loaded from a blog on ibm.com announcing the release of Granite 3.1. The model provided us with information only accessible through the provided context since it was not part of the model's initial knowledge base.
For further reading, check out the results of a project comparing LLM performance with HTML structured chunking versus watsonx chunking.