RAG Components¶
This page provides detailed architecture documentation for the RAG pipeline components that ai4rag optimizes.
Component Hierarchy¶
classDiagram
class BaseFoundationModel {
<<abstract>>
+client: ClientT
+model_id: str
+params: ParamsT
+system_message_text: str
+user_message_text: str
+context_template_text: str
+chat(messages)* list
}
class LSFoundationModel {
+client: LlamaStackClient
+params: LSModelParameters
+chat(messages) list
}
class OpenAIFoundationModel {
+client: OpenAI
+params: OpenAIModelParameters
+chat(messages) list
}
class BaseEmbeddingModel {
<<abstract>>
+client: ClientT
+model_id: str
+params: ParamsT
+embed_documents(texts)* list
+embed_query(query)* list
}
class LSEmbeddingModel {
+client: LlamaStackClient
+params: LSEmbeddingParams
+embed_documents(texts) list
+embed_query(query) list
}
class OpenAIEmbeddingModel {
+client: OpenAI
+params: OpenAIEmbeddingParams
+embed_documents(texts) list
+embed_query(query) list
}
class BaseVectorStore {
<<abstract>>
+embedding_model: BaseEmbeddingModel
+distance_metric: str
+collection_name: str
+search(query, k)* list
+add_documents(docs)* void
}
class LSVectorStore {
+client: LlamaStackClient
+search(query, k, search_mode, ranker_*) list
+add_documents(docs) void
}
class ChromaVectorStore {
+search(query, k) list
+window_search(query, k, window_size) list
+add_documents(docs) list
}
class BaseChunker {
<<abstract>>
+split_documents(docs)* list
+to_dict()* dict
+from_dict(d)* BaseChunker
}
class LangChainChunker {
+method: str
+chunk_size: int
+chunk_overlap: int
+split_documents(docs) list
}
class Retriever {
+vector_store: BaseVectorStore
+method: str
+number_of_chunks: int
+search_mode: str
+ranker_strategy: str
+ranker_k: int
+ranker_alpha: float
+retrieve(query) list
}
class BaseRAGTemplate {
<<abstract>>
+foundation_model: BaseFoundationModel
+retriever: Retriever
+build_index(docs)* void
+generate(question)* dict
+generate_stream(question)* iterator
}
class SimpleRAG {
+chunker: LangChainChunker
+embedding_model: BaseEmbeddingModel
+vector_store: BaseVectorStore
+build_index(docs) void
+generate(question) dict
+generate_stream(question) iterator
}
BaseFoundationModel <|-- LSFoundationModel
BaseFoundationModel <|-- OpenAIFoundationModel
BaseEmbeddingModel <|-- LSEmbeddingModel
BaseEmbeddingModel <|-- OpenAIEmbeddingModel
BaseVectorStore <|-- LSVectorStore
BaseVectorStore <|-- ChromaVectorStore
BaseChunker <|-- LangChainChunker
BaseRAGTemplate <|-- SimpleRAG
BaseVectorStore --> BaseEmbeddingModel : uses
Retriever --> BaseVectorStore : uses
BaseRAGTemplate --> BaseFoundationModel : uses
BaseRAGTemplate --> Retriever : uses
SimpleRAG --> LangChainChunker : uses
Foundation Models¶
Foundation models generate text responses given prompts and retrieved context.
BaseFoundationModel¶
Abstract base class defining the foundation model interface:
class BaseFoundationModel(Generic[ClientT, ParamsT], ABC):
def __init__(
self,
client: ClientT,
model_id: str,
params: ParamsT,
system_message_text: str | None = None,
user_message_text: str | None = None,
context_template_text: str | None = None,
):
Configurable Prompt Templates:
Foundation models support three customizable prompt templates:
1. system_message_text
The system prompt that defines the model's behavior:
# Default:
"You are a helpful, respectful and honest assistant. "
"Always answer as helpfully as possible, while being safe."
2. user_message_text
Template for formatting the user's question with retrieved context:
Placeholders:
- {reference_documents}: Formatted context from retrieval
- {question}: The user's question
3. context_template_text
Template for formatting each retrieved document:
Placeholder:
- {document}: Individual chunk's page_content
Customization Example:
foundation_model = LSFoundationModel(
model_id="ollama/llama3.2:3b",
client=client,
system_message_text="You are a technical documentation assistant specialized in software APIs.",
user_message_text="Context:\n{reference_documents}\n\nUser Question: {question}\n\nDetailed Answer:",
context_template_text="[Document {document_id}] {document}\n\n"
)
Interface Method:
@abstractmethod
def chat(self, messages: list[MessageTyped]) -> list[MessageTyped]:
"""Chat with the model based on the client capabilities."""
MessageTyped Format:
class MessageTyped(TypedDict):
role: str # "system", "user", or "assistant"
content: str # Message text
LSFoundationModel¶
Llama Stack integration for foundation models:
class LSFoundationModel(BaseFoundationModel[LlamaStackClient, LSModelParameters]):
def __init__(
self,
client: LlamaStackClient,
model_id: str,
params: dict | LSModelParameters | None = None,
system_message_text: str | None = None,
user_message_text: str | None = None,
context_template_text: str | None = None,
):
Parameters:
@dataclass
class LSModelParameters:
max_completion_tokens: int = 1024 # Max tokens in response
temperature: float = 0.1 # Sampling temperature (0.0-1.0)
Chat Implementation:
def chat(self, messages: list[MessageTyped]) -> list[MessageTyped]:
response = self.client.chat.completions.create(
model=self.model_id,
messages=messages,
max_completion_tokens=self.params.max_completion_tokens,
temperature=self.params.temperature,
)
return response.choices # List of response choices
Usage:
foundation_model = LSFoundationModel(
model_id="ollama/llama3.2:3b",
client=llama_stack_client,
params={"max_completion_tokens": 512, "temperature": 0.0}
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
]
response = foundation_model.chat(messages)
answer = response[0].message.content
OpenAIFoundationModel¶
OpenAI integration for foundation models:
class OpenAIFoundationModel(BaseFoundationModel[OpenAI, OpenAIModelParameters]):
# Similar interface to LSFoundationModel but uses OpenAI client
Supported Models:
- GPT-4, GPT-4 Turbo
- GPT-3.5 Turbo
- Any model accessible via the OpenAI API
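A minimal usage sketch, assuming the constructor mirrors LSFoundationModel (the params field names below are assumptions, not confirmed on this page):

from openai import OpenAI

client = OpenAI(api_key="...")
foundation_model = OpenAIFoundationModel(
    model_id="gpt-4",
    client=client,
    params={"max_completion_tokens": 512, "temperature": 0.0},  # assumed field names
)
response = foundation_model.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])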
Embedding Models¶
Embedding models convert text into dense vector representations for semantic search.
BaseEmbeddingModel¶
Abstract base class for embedding models:
class BaseEmbeddingModel(ABC, Generic[ClientT, ParamsT]):
def __init__(
self,
client: ClientT,
model_id: str,
params: ParamsT | None = None
):
Interface Methods:
@abstractmethod
def embed_documents(self, texts: list[str]) -> list[list[float]]:
"""Embed multiple documents (used during indexing)."""
@abstractmethod
def embed_query(self, query: str) -> list[float]:
"""Embed a single query (used during retrieval)."""
LSEmbeddingModel¶
Llama Stack integration with auto-detection of model capabilities:
class LSEmbeddingModel(BaseEmbeddingModel[LlamaStackClient, LSEmbeddingParams]):
def __init__(
self,
client: LlamaStackClient,
model_id: str,
params: dict | LSEmbeddingParams | None = None
):
Parameters:
@dataclass
class LSEmbeddingParams:
embedding_dimension: int | None = None # Auto-detected if None
context_length: int | None = None # Auto-detected if None
timeout: float | Timeout | None = None
model_type: str | None = None
provider_id: str | None = None
provider_resource_id: str | None = None
Auto-Detection:
When embedding_dimension or context_length is not provided, the model auto-detects the missing value on first use:
Embedding Dimension Detection:
def _detect_embedding_dimension(self) -> int:
"""Embed a test string and count dimensions."""
test_embedding = self._embed_text("test")[0]
return len(test_embedding) # e.g., 768 for nomic-embed-text
Context Length Detection:
def _detect_context_length(self) -> int:
"""Binary search to find max context length."""
lo, hi, best = 64, 8192, None
while hi - lo >= 64:
mid = (lo + hi) // 2
probe_text = "word " * mid # Approx. 1 word = 1 token
try:
self._embed_text(probe_text)
best = mid
lo = mid + 1
except Exception:
hi = mid - 1
return best
Performance: roughly seven API calls for context length detection (binary search over 64-8192 with 64-token granularity).
Batch Processing:
def embed_documents(self, texts: list[str]) -> list[list[float]]:
"""Process in batches of 2048 to respect API limits."""
embeddings = []
for idx in range(0, len(texts), 2048):
batch = texts[idx : idx + 2048]
batch_embeddings = self._embed_text(batch)
embeddings.extend(batch_embeddings)
return embeddings
Usage:
# Auto-detect parameters
embedding_model = LSEmbeddingModel(
model_id="ollama/nomic-embed-text:latest",
client=llama_stack_client,
)
# First call triggers detection:
# - embedding_dimension = 768 (detected)
# - context_length = 8192 (detected)
# Or explicitly provide parameters
embedding_model = LSEmbeddingModel(
model_id="ollama/nomic-embed-text:latest",
client=llama_stack_client,
params={"embedding_dimension": 768, "context_length": 8192}
)
# Embed documents
embeddings = embedding_model.embed_documents(["text 1", "text 2", ...])
# Returns: [[0.1, -0.2, ...], [0.3, 0.1, ...], ...]
# Embed query
query_embedding = embedding_model.embed_query("What is X?")
# Returns: [0.05, -0.12, ...]
OpenAIEmbeddingModel¶
OpenAI integration for embeddings:
class OpenAIEmbeddingModel(BaseEmbeddingModel[OpenAI, OpenAIEmbeddingParams]):
# Similar interface but uses OpenAI client
Supported Models:
- text-embedding-3-small
- text-embedding-3-large
- text-embedding-ada-002
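A minimal usage sketch, assuming the constructor mirrors LSEmbeddingModel (names here are illustrative):

from openai import OpenAI

embedding_model = OpenAIEmbeddingModel(
    model_id="text-embedding-3-small",
    client=OpenAI(api_key="..."),
)
vectors = embedding_model.embed_documents(["text 1", "text 2"])
query_vector = embedding_model.embed_query("What is X?")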
Vector Stores¶
Vector stores manage document storage, embedding indexing, and similarity search.
BaseVectorStore¶
Abstract base class for vector stores:
class BaseVectorStore(ABC):
def __init__(
self,
embedding_model: BaseEmbeddingModel,
distance_metric: str,
reuse_collection_name: str | None = None
):
Interface Methods:
@abstractmethod
def search(self, query: str, k: int, **kwargs) -> list[Document]:
"""Search for k most relevant chunks."""
@abstractmethod
def add_documents(self, documents: Sequence[Document]) -> None:
"""Add documents to the collection."""
@property
@abstractmethod
def collection_name(self) -> str:
"""Returns collection name (reused or newly created)."""
ChromaVectorStore¶
In-memory ChromaDB implementation for development and testing:
class ChromaVectorStore(BaseVectorStore):
def __init__(
self,
embedding_model: BaseEmbeddingModel,
reuse_collection_name: str | None = None,
distance_metric: str = "cosine",
**kwargs
):
Supported Distance Metrics:
"cosine": Cosine similarity (default)"l2": Euclidean distance
Search Methods:
1. Standard Search:
def search(
self,
query: str,
k: int = 5,
include_scores: bool = False,
**kwargs
) -> list[Document] | list[tuple[Document, float]]:
"""Vector similarity search."""
2. Window Search:
def window_search(
self,
query: str,
k: int = 5,
window_size: int = 2,
include_scores: bool = False,
**kwargs
) -> list[Document]:
"""Retrieve chunks + adjacent chunks (window) from same document."""
Window Search Details:
For each retrieved chunk:
1. Extract document_id and sequence_number from the metadata
2. Query the vector store for chunks with:
   - the same document_id
   - sequence_number in [seq - window_size, seq + window_size]
3. Sort the results by sequence_number
4. Merge them into a single document (concatenate page_content)
Example:
# Retrieved chunk: document_id="doc1", sequence_number=5
# window_size=2
# Fetches chunks with sequence_number in [3, 4, 5, 6, 7]
# Returns merged document with all 5 chunks concatenated
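The merge step can be pictured as below (a hedged sketch; ChromaVectorStore's internal helper names may differ):

from langchain_core.documents import Document

def merge_window(chunks: list[Document]) -> Document:
    """Order neighboring chunks by sequence_number and concatenate them."""
    ordered = sorted(chunks, key=lambda c: c.metadata["sequence_number"])
    return Document(
        page_content=" ".join(c.page_content for c in ordered),
        metadata=ordered[0].metadata,  # keep the first chunk's metadata
    )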
Batch Document Addition:
def add_documents(self, documents: list, max_batch_size: int = 2048) -> list[str]:
    """Add documents in batches of max_batch_size."""
    for batch_start in range(0, len(documents), max_batch_size):
        batch = documents[batch_start : batch_start + max_batch_size]
        # ids: per-chunk identifiers (derivation elided in this excerpt)
        self._vector_store.add_documents(batch, ids=ids)
Usage:
vector_store = ChromaVectorStore(
embedding_model=embedding_model,
distance_metric="cosine"
)
# Index documents
vector_store.add_documents(chunked_documents)
# Search
results = vector_store.search(query="What is X?", k=5)
# Returns: [Document(...), Document(...), ...]
# Window search
results = vector_store.window_search(query="What is X?", k=5, window_size=2)
# Returns: [merged_doc_1, merged_doc_2, ...]
LSVectorStore¶
Llama Stack integration supporting any vector store provider and hybrid search:
class LSVectorStore(BaseVectorStore):
def __init__(
self,
embedding_model: LSEmbeddingModel,
client: LlamaStackClient,
provider_id: str,
reuse_collection_name: str | None = None,
distance_metric: str | None = None,
):
Provider-Agnostic Design:
The provider_id parameter determines the backend vector store:
# Milvus
vector_store = LSVectorStore(
embedding_model=embedding_model,
client=client,
provider_id="milvus"
)
# Qdrant
vector_store = LSVectorStore(
embedding_model=embedding_model,
client=client,
provider_id="qdrant"
)
# Any provider supported by Llama Stack server
Collection Creation:
vs = client.vector_stores.create(
extra_body={
"provider_id": provider_id,
"embedding_model": embedding_model.model_id,
"embedding_dimension": embedding_model.params.embedding_dimension,
}
)
collection_name = vs.id # Unique collection identifier
Hybrid Search Support:
def search(
self,
query: str,
k: int,
search_mode: str = "vector",
ranker_strategy: str | None = None,
ranker_k: int | None = None,
ranker_alpha: float | None = None,
**kwargs
) -> list[Document]:
Search Modes:
1. Vector Mode (default):
Pure semantic search using dense embeddings.
2. Hybrid Mode:
results = vector_store.search(
query="What is X?",
k=5,
search_mode="hybrid",
ranker_strategy="rrf",
ranker_k=60
)
Combines dense vector search with sparse keyword search (e.g., BM25).
Ranker Strategies:
| Strategy | Description | Parameters |
|---|---|---|
"rrf" | Reciprocal Rank Fusion | ranker_k: smoothing constant (30-100) |
"weighted" | Weighted combination | ranker_alpha: dense weight (0.0-1.0) |
"normalized" | Score normalization | Strategy-specific |
RRF Example:
# Combines dense and sparse rankings
params = {
"mode": "hybrid",
"reranker_type": "rrf",
"reranker_params": {"impact_factor": 60} # ranker_k
}
Weighted Example:
# 70% dense (semantic), 30% sparse (keyword)
params = {
"mode": "hybrid",
"reranker_type": "weighted",
"reranker_params": {"alpha": 0.7} # ranker_alpha
}
Validation:
LSVectorStore validates hybrid search parameters:
def _validate_search_params(search_mode, ranker_strategy, ranker_k, ranker_alpha):
# When search_mode != "hybrid":
# - ranker_strategy must be None or ""
# - ranker_k must be None or 0
# - ranker_alpha must be None or 1
# When search_mode == "hybrid":
# - ranker_strategy must be non-empty ("rrf", "weighted", "normalized")
# - ranker_k > 0 only for "rrf"
# - ranker_alpha != 1 only for "weighted"
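For example, mixing ranker parameters with pure vector search is rejected (a hedged illustration; the exact exception type is not documented here):

# Invalid: ranker_strategy set while search_mode is "vector"
vector_store.search(
    query="What is X?",
    k=5,
    search_mode="vector",
    ranker_strategy="rrf",  # triggers a validation error
)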
Document Addition:
def add_documents(self, documents: list[Document], batch_size: int = 2048):
"""Add documents with embeddings to Llama Stack vector store."""
# embeddings are computed beforehand via self.embedding_model.embed_documents(...)
chunks = [
{
"content": doc.page_content,
"chunk_metadata": doc.metadata,
"chunk_id": doc.metadata["document_id"],
"embedding_model": self.embedding_model.model_id,
"embedding_dimension": self.embedding_model.params.embedding_dimension,
"embedding": embedding_vector,
}
for doc, embedding_vector in zip(documents, embeddings)
]
for idx in range(0, len(chunks), batch_size):
self.client.vector_io.insert(
vector_store_id=self.collection_name,
chunks=chunks[idx : idx + batch_size]
)
Usage:
# Create vector store
vector_store = LSVectorStore(
embedding_model=ls_embedding_model,
client=llama_stack_client,
provider_id="milvus"
)
# Index documents
vector_store.add_documents(chunked_documents)
# Vector search
results = vector_store.search(query="What is X?", k=5)
# Hybrid search with RRF
results = vector_store.search(
query="What is X?",
k=5,
search_mode="hybrid",
ranker_strategy="rrf",
ranker_k=60
)
# Hybrid search with weighted ranker
results = vector_store.search(
query="What is X?",
k=5,
search_mode="hybrid",
ranker_strategy="weighted",
ranker_alpha=0.7
)
Chunking¶
Chunkers split documents into smaller, overlapping chunks for embedding and retrieval.
BaseChunker¶
Abstract base class for chunkers:
class BaseChunker(ABC, Generic[ChunkT]):
@abstractmethod
def split_documents(self, documents: Sequence[ChunkT]) -> list[ChunkT]:
"""Split documents into smaller chunks."""
@abstractmethod
def to_dict(self) -> dict[str, Any]:
"""Serialize chunker configuration."""
@classmethod
@abstractmethod
def from_dict(cls, d: dict[str, Any]) -> "BaseChunker":
"""Deserialize chunker configuration."""
LangChainChunker¶
LangChain-based chunking with metadata management:
class LangChainChunker(BaseChunker[Document]):
def __init__(
self,
method: Literal["recursive"] = "recursive",
chunk_size: int = 2048,
chunk_overlap: int = 256,
**kwargs
):
Chunking Method:
Currently supports "recursive":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", r"(?<=\. )", "\n", " ", ""],
length_function=len,
add_start_index=True, # Adds "start_index" to metadata
)
Splitting Hierarchy:
- Double newlines ("\n\n"): paragraph boundaries
- Sentence boundaries (r"(?<=\. )"): after periods
- Single newlines ("\n"): line breaks
- Spaces (" "): word boundaries
- Characters (""): character-level splitting (last resort)
Metadata Management:
1. Document ID Assignment:
def _set_document_id_in_metadata_if_missing(documents):
for doc in documents:
if "document_id" not in doc.metadata:
doc.metadata["document_id"] = str(hash(doc.page_content))
2. Sequence Number Assignment:
def _set_sequence_number_in_metadata(chunks):
# Sort by (document_id, start_index)
sorted_chunks = sorted(chunks, key=lambda x: (
x.metadata["document_id"],
x.metadata["start_index"]
))
# Assign sequential numbers per document
document_sequence = {}
for chunk in sorted_chunks:
doc_id = chunk.metadata["document_id"]
seq_num = document_sequence.get(doc_id, 0) + 1
document_sequence[doc_id] = seq_num
chunk.metadata["sequence_number"] = seq_num
return sorted_chunks
Output Chunk Structure:
Document(
page_content="Chunk text content...",
metadata={
"document_id": "doc1",
"sequence_number": 3,
"start_index": 1024,
# ... original document metadata preserved ...
}
)
Usage:
chunker = LangChainChunker(
method="recursive",
chunk_size=512,
chunk_overlap=128
)
chunks = chunker.split_documents(documents)
# Returns: list[Document] with sequence_number and start_index metadata
Retrieval¶
The Retriever class coordinates document retrieval from vector stores.
Retriever¶
class Retriever:
def __init__(
self,
vector_store: BaseVectorStore,
number_of_chunks: int,
method: Literal["simple", "window"] = "simple",
search_mode: Literal["vector", "hybrid"] = "vector",
ranker_strategy: str | None = None,
ranker_k: int | None = None,
ranker_alpha: float | None = None,
):
Parameters:
- vector_store: Vector store instance to query
- number_of_chunks: Top-k parameter (how many chunks to retrieve)
- method: Retrieval method
  - "simple": Return top-k chunks as-is
  - "window": Expand each chunk to include adjacent chunks (ChromaDB only)
- search_mode: Search type
  - "vector": Dense semantic search only
  - "hybrid": Dense + sparse (keyword) search
- ranker_strategy: Hybrid search ranker ("rrf", "weighted", "normalized")
- ranker_k: RRF smoothing parameter
- ranker_alpha: Weighted ranker dense/sparse balance
Retrieve Method:
def retrieve(self, query: str, **kwargs) -> list[Document]:
"""Retrieve relevant documents from vector store."""
_number_of_chunks = kwargs.get("number_of_chunks", self.number_of_chunks)
return self.vector_store.search(
query,
k=_number_of_chunks,
search_mode=self.search_mode,
ranker_strategy=self.ranker_strategy,
ranker_k=self.ranker_k,
ranker_alpha=self.ranker_alpha,
)
Simple vs Window Retrieval:
The method parameter selects the retrieval strategy, but the actual behavior depends on the vector store:
- LSVectorStore: Always returns simple chunks (no window expansion)
- ChromaVectorStore:
method="simple": Returns top-k chunksmethod="window": Returns top-k chunks expanded with adjacent chunks
Usage:
# Simple vector retrieval
retriever = Retriever(
vector_store=vector_store,
number_of_chunks=5,
method="simple",
search_mode="vector"
)
docs = retriever.retrieve("What is X?")
# Returns: [Document(...), Document(...), ...] (5 chunks)
# Hybrid retrieval with RRF
retriever = Retriever(
vector_store=ls_vector_store,
number_of_chunks=5,
method="simple",
search_mode="hybrid",
ranker_strategy="rrf",
ranker_k=60
)
docs = retriever.retrieve("What is X?")
# Returns: 5 chunks re-ranked by RRF (dense + sparse)
RAG Templates¶
RAG templates combine all components into end-to-end retrieval-augmented generation pipelines.
BaseRAGTemplate¶
Abstract interface for RAG templates:
class BaseRAGTemplate(ABC):
def __init__(
self,
foundation_model: BaseFoundationModel,
retriever: Retriever,
embedding_model: BaseEmbeddingModel | None = None,
vector_store: BaseVectorStore | None = None,
):
Interface Methods:
@abstractmethod
def build_index(self, documents: list[Document], **kwargs) -> None:
"""Index documents into vector store."""
@abstractmethod
def generate(self, question: str, **kwargs) -> dict[str, Any]:
"""Generate answer for question using RAG pipeline."""
@abstractmethod
def generate_stream(self, question: str, **kwargs):
"""Generate streaming answer (for future streaming support)."""
SimpleRAG¶
Complete RAG implementation using Llama Stack and LangChain:
class SimpleRAG(BaseRAGTemplate):
def __init__(
self,
foundation_model: BaseFoundationModel,
retriever: Retriever,
chunker: LangChainChunker | None = None,
embedding_model: BaseEmbeddingModel | None = None,
vector_store: BaseVectorStore | None = None,
):
build_index() Method:
def build_index(self, documents: list[Document], **kwargs) -> None:
"""Index documents: chunk → embed → store."""
chunks = self.chunker.split_documents(documents)
self.vector_store.add_documents(chunks)
generate() Method:
def generate(self, question: str, **kwargs) -> dict[str, Any]:
"""Generate answer using RAG pipeline."""
# 1. Retrieve relevant chunks
reference_documents = self.retriever.retrieve(question, **kwargs)
# 2. Format context
context = "\n".join([
self.foundation_model.context_template_text.format(
document=doc.page_content
)
for doc in reference_documents
])
# 3. Format user message
user_message = self.foundation_model.user_message_text.format(
reference_documents=context,
question=question
)
# 4. Create messages
messages = [
{"role": "system", "content": self.foundation_model.system_message_text},
{"role": "user", "content": user_message}
]
# 5. Generate answer
chat_response = self.foundation_model.chat(messages)
# 6. Return result
return {
"answer": chat_response[0].message.content,
"reference_documents": reference_documents,
"question": question
}
generate_stream() Method:
def generate_stream(self, question: str, **kwargs):
"""Placeholder for streaming (currently non-streaming)."""
result = self.generate(question, **kwargs)
yield result["answer"]
Usage:
# Create RAG template
rag = SimpleRAG(
foundation_model=ls_foundation_model,
retriever=retriever,
chunker=chunker,
embedding_model=ls_embedding_model,
vector_store=ls_vector_store
)
# Index documents (if building index manually)
rag.build_index(documents)
# Generate answer
result = rag.generate("What is the capital of France?")
print(result["answer"])
# "Based on the provided documents, Paris is the capital of France."
print(result["reference_documents"])
# [Document(...), Document(...), ...]
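Since generate_stream() currently wraps generate(), iterating it yields the full answer once:

for text in rag.generate_stream("What is the capital of France?"):
    print(text)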
Within AI4RAGExperiment:
The experiment creates SimpleRAG instances automatically during evaluation:
rag_pattern = SimpleRAG(
foundation_model=foundation_model,
retriever=retriever
)
# Note: chunker, embedding_model, vector_store handled separately
# by experiment during indexing phase
Component Integration Example¶
Full RAG pipeline with all components:
from llama_stack_client import LlamaStackClient
from ai4rag.rag.foundation_models.llama_stack import LSFoundationModel
from ai4rag.rag.embedding.llama_stack import LSEmbeddingModel
from ai4rag.rag.vector_store.llama_stack import LSVectorStore
from ai4rag.rag.chunking.langchain_chunker import LangChainChunker
from ai4rag.rag.retrieval.retriever import Retriever
from ai4rag.rag.template.simple_rag_template import SimpleRAG
# 1. Initialize client
client = LlamaStackClient(base_url="http://localhost:8000", api_key="...")
# 2. Create foundation model
foundation_model = LSFoundationModel(
model_id="ollama/llama3.2:3b",
client=client,
params={"max_completion_tokens": 512, "temperature": 0.1}
)
# 3. Create embedding model
embedding_model = LSEmbeddingModel(
model_id="ollama/nomic-embed-text:latest",
client=client,
params={"embedding_dimension": 768, "context_length": 8192}
)
# 4. Create vector store
vector_store = LSVectorStore(
embedding_model=embedding_model,
client=client,
provider_id="milvus"
)
# 5. Create chunker
chunker = LangChainChunker(
method="recursive",
chunk_size=512,
chunk_overlap=128
)
# 6. Create retriever
retriever = Retriever(
vector_store=vector_store,
number_of_chunks=5,
method="simple",
search_mode="hybrid",
ranker_strategy="rrf",
ranker_k=60
)
# 7. Create RAG template
rag = SimpleRAG(
foundation_model=foundation_model,
retriever=retriever,
chunker=chunker,
embedding_model=embedding_model,
vector_store=vector_store
)
# 8. Index documents
rag.build_index(documents)
# 9. Generate answer
result = rag.generate("What is X?")
print(result["answer"])
Extension Points¶
All RAG components are designed for extensibility:
Custom Foundation Model¶
class CustomFoundationModel(BaseFoundationModel[MyClient, MyParams]):
def chat(self, messages: list[MessageTyped]) -> list[MessageTyped]:
# Your implementation
pass
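A toy subclass illustrating the contract (no real client; for illustration only):

class EchoFoundationModel(BaseFoundationModel[None, None]):
    """Echoes the last user message instead of calling a backend."""

    def __init__(self):
        super().__init__(client=None, model_id="toy/echo", params=None)

    def chat(self, messages: list[MessageTyped]) -> list[MessageTyped]:
        last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")
        return [{"role": "assistant", "content": f"Echo: {last_user}"}]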
Custom Embedding Model¶
class CustomEmbeddingModel(BaseEmbeddingModel[MyClient, MyParams]):
def embed_documents(self, texts: list[str]) -> list[list[float]]:
# Your implementation
pass
def embed_query(self, query: str) -> list[float]:
# Your implementation
pass
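A toy subclass illustrating the contract with deterministic hash-based pseudo-embeddings (not a useful model, just the required interface):

import hashlib

class HashEmbeddingModel(BaseEmbeddingModel[None, None]):
    """Maps text to a fixed-size vector derived from its SHA-256 digest."""

    def __init__(self, dimension: int = 8):
        super().__init__(client=None, model_id="toy/hash", params=None)
        self.dimension = dimension

    def _embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode()).digest()
        return [b / 255.0 for b in digest[: self.dimension]]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, query: str) -> list[float]:
        return self._embed(query)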
Custom Vector Store¶
class CustomVectorStore(BaseVectorStore):
def search(self, query: str, k: int, **kwargs) -> list[Document]:
# Your implementation
pass
def add_documents(self, documents: Sequence[Document]) -> None:
# Your implementation
pass
@property
def collection_name(self) -> str:
return self._collection_name
Custom RAG Template¶
class CustomRAG(BaseRAGTemplate):
def build_index(self, documents: list[Document], **kwargs) -> None:
# Your indexing logic
pass
def generate(self, question: str, **kwargs) -> dict[str, Any]:
# Your generation logic
pass
def generate_stream(self, question: str, **kwargs):
# Your streaming logic
pass
Best Practices¶
Foundation Models:
- Customize prompts for your domain (system_message_text, user_message_text)
- Use low temperature (0.0-0.2) for factual Q&A
- Adjust max_completion_tokens based on expected answer length
Embedding Models:
- Provide embedding_dimension and context_length explicitly to avoid auto-detection overhead
- Choose models matching your language (multilingual vs English-only)
- Consider embedding dimension (higher = more expressive but slower/larger)
Vector Stores:
- Use LSVectorStore with Milvus/Qdrant for production (better performance, hybrid search)
- Use ChromaVectorStore for development/testing (in-memory, simpler setup)
- Enable hybrid search for keyword-heavy domains (technical docs, legal, medical)
- Tune ranker parameters (ranker_k, ranker_alpha) via optimization
Chunking:
- Smaller chunks (256-512) for precise Q&A
- Larger chunks (1024-2048) for broader context
- Adjust chunk_overlap (25-50% of chunk_size) to maintain coherence
- Ensure chunk_size < embedding context_length
Retrieval:
- Start with simple retrieval before trying window-based
- Use hybrid search when semantic search misses exact matches
- Tune number_of_chunks (5-10 typical) via optimization
- Monitor retrieval quality via context_correctness metric
Next Steps¶
- Core Components - Experiment engine and HPO details
- Data Flow - Detailed workflow analysis
- Architecture Overview - High-level design