Implement agentic chunking to optimize LLM inputs with Langchain and watsonx.ai¶
Author: Shalini Harkar
What is Agentic Chunking?¶
The way language models process and segment text is shifting from a static approach to a more responsive one. Unlike traditional fixed-size chunking, which splits large documents at fixed points, agentic chunking uses AI-based techniques to analyze content dynamically and determine the best way to segment the text.
Agentic chunking combines AI-based text-splitting, recursive chunking, and chunk overlap methods, which work together to refine segmentation, preserving links between notable ideas while optimizing context windows in real time. Each chunk is also enriched with metadata to improve retrieval accuracy and overall model efficiency. This is particularly important in retrieval-augmented generation (RAG) applications, where how data is segmented directly impacts retrieval quality and the coherence of the response. Because meaningful context is preserved in every chunk, this approach is especially valuable for chatbots, knowledge bases, and generative AI use cases.
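To make the contrast concrete, here is a minimal sketch of the traditional baseline that agentic chunking improves on: fixed-size recursive splitting with overlap, using LangChain's RecursiveCharacterTextSplitter. The sample text, chunk_size, and chunk_overlap values are illustrative assumptions, not part of this tutorial's pipeline.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative placeholder text; any long document would do
sample_text = "Machine learning is a branch of AI. " * 50

# Traditional fixed-size chunking: boundaries come from character counts,
# not from meaning, so related sentences can land in different chunks
fixed_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # hard limit on characters per chunk
    chunk_overlap=40   # characters repeated between neighboring chunks
)
fixed_chunks = fixed_splitter.split_text(sample_text)
print(f"{len(fixed_chunks)} fixed-size chunks; first chunk:\n{fixed_chunks[0]}")
Agentic chunking, by contrast, hands the boundary decision to an LLM, as you will implement in Step 6.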
Key Elements of Agentic Chunking¶
Adaptive Chunking Strategy – Dynamically choosing the best chunking method based on the type of content, the intent behind the query, and the needs for retrieval to ensure effective segmentation.
Dynamic Chunk Sizing – Modifying chunk sizes in real time by taking into account the semantic structure and context, instead of sticking to fixed token limits.
Context-Preserving Overlap – Smartly assessing the overlap between chunks to keep coherence intact and avoid losing essential information, thereby enhancing retrieval efficiency.
Advantages of Agentic Chunking Over Traditional Methods¶
Retains Context – Maintains crucial information without unnecessary breaks.
Smart Sizing – Adjusts chunk boundaries according to meaning and significance.
Query-Optimized – Continuously refines chunks to match specific queries.
Efficient Retrieval – Improves search and RAG by minimizing unnecessary fragmentation.
In this tutorial, you will experiment with an agentic chunking strategy by using the IBM Granite-3.0-8B-Instruct model, available on watsonx.ai®. The overall goal is to perform efficient chunking to effectively implement RAG.
Prerequisite¶
You need an IBM Cloud® account to create a watsonx.ai project.
Steps¶
Step 1. Set up your environment¶
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud® account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community.
Step 2. Set up a watsonx.ai Runtime instance and API key¶
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an application programming interface (API) key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
Step 3. Install and import relevant libraries and set up your credentials¶
We need a few libraries and modules for this tutorial. Make sure to import the following ones; if any are not installed, the quick pip installation below resolves the problem.
Note, this tutorial was built using Python 3.12.7.
# installations
!pip install -q langchain langchain-ibm langchain_experimental langchain-text-splitters langchain_chroma transformers bs4 langchain_huggingface sentence-transformers
import getpass
import requests
from bs4 import BeautifulSoup
from langchain_ibm import WatsonxLLM
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain.schema import SystemMessage, HumanMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.vectorstores import Chroma
from langchain.tools import tool
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from transformers import AutoTokenizer
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.schema import Document
from langchain.chains.combine_documents import create_stuff_documents_chain
To set our credentials, we need the WATSONX_PROJECT_ID from the project you created in step 1 and the WATSONX_APIKEY you generated in step 2. We will also set the URL serving as the API endpoint.
WATSONX_APIKEY = getpass.getpass("Please enter your watsonx.ai Runtime API key (hit enter): ")
WATSONX_PROJECT_ID = getpass.getpass("Please enter your project ID (hit enter): ")
URL = "https://us-south.ml.cloud.ibm.com"
Step 4. Initialize your LLM¶
We will use Granite-3.0-8B-Instruct as our LLM for this tutorial. To initialize the LLM, we need to set the model parameters. To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
llm = WatsonxLLM(
    model_id="ibm/granite-3-8b-instruct",
    url=URL,
    apikey=WATSONX_APIKEY,
    project_id=WATSONX_PROJECT_ID,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 2000,
        GenParams.REPETITION_PENALTY: 1.2,
        GenParams.STOP_SEQUENCES: ["\n\n"]
    }
)
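Optionally, you can send a quick test prompt to confirm that your credentials, project ID, and model parameters work before moving on; the prompt below is an arbitrary example.
# Optional sanity check: a short prompt to verify the watsonx.ai connection
test_response = llm.invoke("In one sentence, what is retrieval-augmented generation?")
print(test_response)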
Step 5. Load your document¶
The following function extracts the text content from IBM's explainer page on machine learning, removes unwanted HTML elements (scripts and styles), and returns clean, readable text.
# Step 5: Function to extract clean text from a web page
def get_text_from_url(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"Failed to fetch the page, status code: {response.status_code}")
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove unwanted elements (scripts, styles)
    for script in soup(["script", "style"]):
        script.decompose()
    return soup.get_text(separator="\n", strip=True)
url = "https://www.ibm.com/think/topics/machine-learning" # Example URL
web_text = get_text_from_url(url) # Fetch and clean text from URL
web_text
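Displaying web_text prints the entire page, which can be long. If you only want to confirm what was fetched, a short preview is enough; the 500-character cutoff below is arbitrary.
# Preview the scraped text instead of printing the whole page
print(f"Characters fetched: {len(web_text)}")
print(web_text[:500])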
Step 6. Implement agentic chunking with LangChain and the Granite model¶
Instead of using a fixed-length chunking method, we use agentic chunking here: the LLM itself decides where semantically meaningful boundaries fall.
# Step 6: Agentic chunking function
def agentic_chunking(text):
    """
    Dynamically splits text into meaningful chunks by using the LLM.
    """
    system_message = SystemMessage(content="You are an AI assistant helping to split text into meaningful chunks based on topics.")
    human_message = HumanMessage(content=f"Please divide the following text into semantically different, separate and meaningful chunks:\n\n{text}")
    response = llm.invoke([system_message, human_message])  # The LLM returns a single string
    return response.split("\n\n")  # Split on the blank lines separating the sections it produced
Step 7. Call the agentic chunking function on the sample text¶
chunks = agentic_chunking(web_text)
chunks
Let's print out the chunks to better understand their output structure.
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n{'-' * 40}")
Great! As you can see in the output, the chunks were successfully created by the agent.¶
Step 8. Create vector store¶
Now that we have experimented with agentic chunking on the given text, let's move along with our RAG implementation. For this tutorial, we will take the chunks produced by the agent and convert them into vector embeddings. An open source vector store we can use is Chroma DB. We can easily access Chroma functionality through the langchain_chroma package.
Let's initialize our Chroma vector database, provide it with our embeddings model and add our documents produced by Agentic chunking.
from langchain_core.documents import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
# Initialize the embedding model
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
# Create a Chroma vector database
vector_db = Chroma(
collection_name="example_collection",
embedding_function=embeddings_model
)
# Convert each text chunk into a Document object
documents = [Document(page_content=chunk) for chunk in chunks]
# Add the documents to the vector database
vector_db.add_documents(documents)
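Optionally, before wiring the vector store into a RAG chain, you can confirm that retrieval works with a direct similarity search; the query string and k value below are example choices.
# Optional check: retrieve the chunks most similar to a test query
results = vector_db.similarity_search("model optimization", k=2)
for doc in results:
    print(doc.page_content[:200], "\n---")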
Step 9. Structure the prompt template¶
Now, we can create a prompt template for our LLM. This template ensures that we can ask multiple questions while maintaining a consistent prompt structure. Additionally, we can integrate our vector store as the retriever, finalizing the RAG framework.
from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
prompt_template = """<|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {input}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(llm, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)
Step 10. Prompt the RAG chain¶
Using our agentic chunks in the RAG workflow, let's invoke a user query. First, we prompt the model without any additional context from the vector store we built, to test whether the model relies on its built-in knowledge or truly uses the RAG context. Let's ask the LLM about the model optimization process, a topic covered in IBM's machine learning explainer.
output = llm.invoke("What is Model optimization process")
output
Without outside tools or additional context, the model cannot provide us with correct, grounded information about the model optimization process, and it hallucinates. Now, let's try providing the same query to the RAG chain built on our agentic chunks.
rag_output = rag_chain.invoke({"input": "What is Model optimization process?"})
rag_output['answer']
Great! The Granite model correctly used the agentic chunks as context to return accurate information about the model optimization process while preserving semantic coherence.
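The retrieval chain's output also includes the retrieved documents, returned under the context key in current LangChain versions, so you can inspect which chunks grounded the answer.
# Inspect the chunks that were retrieved as supporting context
for doc in rag_output["context"]:
    print(doc.page_content[:200], "\n---")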
Summary¶
In this tutorial, we generated smaller pieces of relevant information using AI agents in the chunking process and constructed a retrieval-augmented generation (RAG) pipeline.
This method improves information retrieval and context window optimization by using artificial intelligence and natural language processing (NLP). It streamlines data chunks to enhance retrieval efficiency when leveraging large language models (LLMs) such as OpenAI's GPT models.