Chunking with Docling
Chunking is the process of splitting large texts into smaller, manageable segments before feeding them into a model. This step matters because models have a maximum context length; chunking ensures that the relevant information fits within that limit while preserving coherence, improving retrieval accuracy, and avoiding the loss of important content during processing.
In this lab we will explore why chunking matters and the capabilities Docling provides for creating more valuable chunks.
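For a quick preview of what the notebook walks through, here is a minimal sketch of Docling's chunking flow: convert a source document into a Docling document, then let a chunker split it along the document's structure. This is illustrative only; the sample URL and the default chunker settings are assumptions, exact APIs can vary slightly between Docling versions, and the notebook covers the details.

```python
# Minimal, illustrative sketch of Docling chunking (see Chunking.ipynb for the full lab).
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Convert a source document (PDF, DOCX, HTML, ...) into a Docling document.
# The URL below is just a sample paper used as a placeholder input.
doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document

# HybridChunker follows the document structure (sections, lists, tables) and
# then merges or splits pieces to respect a tokenizer-aware size limit.
chunker = HybridChunker()

for i, chunk in enumerate(chunker.chunk(dl_doc=doc)):
    print(f"--- chunk {i} ({len(chunk.text)} chars) ---")
    print(chunk.text[:200])
```

Because the chunks follow the document's own structure, they tend to keep headings, list items, and table content together, which is what makes them more useful for retrieval than fixed-size character splits.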
Prerequisites
This lab is a Jupyter notebook. Please follow the instructions in the prework to set up the prerequisites needed to run the lab.
Lab
Launch Jupyter Lab by running the following commands from the `opentech` directory of your cloned beeai-workshop repo.
- Create `doclingkernel`, which will have the dependencies preinstalled in our virtual environment.

  ```shell
  uv run --directory docling ipython kernel install --user --env VIRTUAL_ENV .venv --name=doclingkernel
  ```

- Use `uv` to run Jupyter Lab. The directory and allow_hidden options give us access to `.venv` modules.

  ```shell
  uv run --directory docling jupyter lab --ContentsManager.allow_hidden=True
  ```

- In Jupyter Lab in your browser, walk through the notebook:
  - Navigate to the `notebooks` folder
  - Open `Chunking.ipynb`
  - Use the play button to walk through the notebook
  - Be sure to read the text, the code, and the output
- Exit your browser tab
- Exit your Jupyter Lab server by entering CTRL-C, CTRL-C in the terminal where it is running