Building Production-Ready RAG Systems: From Basics to Advanced Techniques
Master Retrieval-Augmented Generation (RAG) systems. Learn how to build, optimize, and deploy RAG applications that combine the power of LLMs with your own data.
Sani Mridha
Senior Mobile Developer
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications. Let's explore how to create production-ready RAG systems that actually work.
What is RAG?
RAG combines the power of Large Language Models (LLMs) with external knowledge retrieval. Instead of relying solely on the model's training data, RAG systems:
1. Retrieve relevant information from a knowledge base
2. Augment the prompt with retrieved context
3. Generate responses using both the LLM and the retrieved data (see the sketch below)
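In code, the whole loop is small. Here is a minimal sketch, assuming a `vectorstore` and `llm` along the lines of the LangChain objects built later in this post:

```python
# Illustrative only: `vectorstore` and `llm` stand in for the objects built below
def rag_answer(question: str) -> str:
    # 1. Retrieve: find the chunks most relevant to the question
    docs = vectorstore.similarity_search(question, k=4)
    # 2. Augment: splice the retrieved text into the prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: let the LLM answer from the augmented prompt
    return llm.predict(prompt)
```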
Why RAG Matters
| Traditional LLM Limitations | RAG Solutions |
|-----------------------------|---------------|
| Knowledge frozen at the training cutoff | Retrieval pulls in current information at query time |
| Hallucinates when facts are missing | Answers grounded in retrieved documents |
| No access to your private or domain data | Indexes and searches your own knowledge base |
Architecture Overview
Basic RAG Pipeline
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
Components Breakdown
1. Document Ingestion: Load and process documents
2. Chunking: Split documents into manageable pieces
3. Embedding: Convert text to vector representations
4. Vector Store: Store and index embeddings
5. Retrieval: Find relevant chunks for queries
6. Generation: Create responses with LLM
Building Your First RAG System
Step 1: Document Processing
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    return chunks
```
Step 2: Create Embeddings
```python
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Convert chunks to vectors
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
```
Step 3: Vector Store Setup
```python
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="your-api-key", environment="us-west1-gcp")

vectorstore = Pinecone.from_documents(
    chunks,
    embeddings,
    index_name="my-rag-index"
)
```
Step 4: Retrieval Chain
```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# Query the system
response = qa_chain.run("What is the new architecture in React Native?")
```
Advanced Techniques
1. Hybrid Search
Combine vector search with keyword search for better results:
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5]
)
```
2. Re-ranking
Improve retrieval quality with re-ranking:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

compressor = CohereRerank(model="rerank-english-v2.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever
)
```
3. Query Transformation
Enhance queries before retrieval:
```python
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)

query_transform_prompt = PromptTemplate(
    input_variables=["question"],
    template="""Given the user question, generate 3 different versions
of the question to retrieve relevant documents:
Original: {question}
Variations:"""
)

query_transformer = LLMChain(llm=llm, prompt=query_transform_prompt)

# Generate paraphrases of a user question to broaden retrieval
variations = query_transformer.run("How do I profile a React Native app?")
```
4. Metadata Filtering
Add metadata for precise filtering:
```python
# Add metadata during ingestion
chunks_with_metadata = [
    {
        "content": chunk.page_content,
        "metadata": {
            "source": chunk.metadata["source"],
            "date": "2024-01-15",
            "category": "technical",
            "author": "Sani Mridha"
        }
    }
    for chunk in chunks
]

# Filter during retrieval
results = vectorstore.similarity_search(
    "React Native performance",
    k=5,
    filter={"category": "technical", "date": {"$gte": "2024-01-01"}}
)
```
Optimization Strategies
Chunking Strategies
Different strategies for different content:
```python
# For code documentation
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\nclass ", "\ndef ", "\n\n", "\n", " "]
)

# For narrative content
narrative_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=300,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
```
Embedding Model Selection
| Model | Dimensions | Use Case |
|-------|-----------|----------|
| text-embedding-3-small | 1536 | Fast, cost-effective |
| text-embedding-3-large | 3072 | High accuracy |
| all-MiniLM-L6-v2 | 384 | Local deployment |
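For the local-deployment row, LangChain's `HuggingFaceEmbeddings` wrapper is one way to run `all-MiniLM-L6-v2` without any API calls; a minimal sketch, assuming the `sentence-transformers` package is installed:

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Downloads the model once, then embeds entirely on your own hardware
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vector = local_embeddings.embed_query("React Native performance")  # 384 dimensions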
Prompt Engineering
Optimize your RAG prompts:
```python
rag_prompt = """Use the following context to answer the question.
If you cannot answer based on the context, say so clearly.

Context:
{context}

Question: {question}

Instructions:
1. Answer based only on the provided context
2. Cite specific sources when possible
3. If uncertain, express your level of confidence
4. Keep answers concise but complete

Answer:"""
```
Production Considerations
1. Caching
Implement caching for frequent queries:
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_retrieval(query: str):
    return vectorstore.similarity_search(query, k=4)
```
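`lru_cache` only catches exact repeats of the retrieval call. To cache LLM completions as well, LangChain exposes a global `llm_cache`; an in-memory sketch (swap in a persistent backend such as `SQLiteCache` for production):

```python
import langchain
from langchain.cache import InMemoryCache

# Identical prompts are now served from memory instead of hitting the API
langchain.llm_cache = InMemoryCache()
```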
2. Error Handling
Robust error handling is crucial:
```python
import time

def safe_rag_query(query: str, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            return qa_chain.run(query)
        except Exception:
            if attempt == max_retries - 1:
                return "I'm having trouble processing your request. Please try again."
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s...
```
3. Monitoring
Track key metrics:
```python
import time

def monitored_rag_query(query: str):
    start_time = time.time()

    # Retrieval metrics
    retrieval_start = time.time()
    docs = vectorstore.similarity_search(query, k=4)
    retrieval_time = time.time() - retrieval_start

    # Generation metrics: build the prompt from the retrieved docs, then generate
    generation_start = time.time()
    context = "\n\n".join(doc.page_content for doc in docs)
    response = llm.predict(rag_prompt.format(context=context, question=query))
    generation_time = time.time() - generation_start

    # Log metrics (log_metrics is a placeholder for your metrics sink,
    # e.g. StatsD, Prometheus, or CloudWatch)
    log_metrics({
        "total_time": time.time() - start_time,
        "retrieval_time": retrieval_time,
        "generation_time": generation_time,
        "num_docs_retrieved": len(docs)
    })
    return response
```
Real-World Use Cases
Customer Support Bot
```python
from langchain.chat_models import ChatOpenAI

# Ingest support documentation (the loader helpers are app-specific placeholders)
support_docs = load_support_documentation()
support_vectorstore = create_vectorstore(support_docs)

# Create specialized chain
support_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.3),
    retriever=support_vectorstore.as_retriever(),
    return_source_documents=True
)
```
Code Documentation Assistant
```python
# Ingest codebase
code_docs = load_code_documentation()
code_vectorstore = create_vectorstore(code_docs)

# Query with code context
code_assistant = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=code_vectorstore.as_retriever(search_kwargs={"k": 6})
)
```
Common Pitfalls
1. Poor Chunking
❌ Bad: Fixed 500-character chunks without context
✅ Good: Semantic chunking with overlap
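One way to get semantic chunks is LangChain's experimental `SemanticChunker`, which splits where embedding similarity between sentences drops rather than at a fixed character count. A sketch, assuming `langchain_experimental` is installed (`raw_text` is a placeholder for your document text):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Splits at embedding-similarity breakpoints instead of a fixed size
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.create_documents([raw_text])
```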
2. Ignoring Metadata
❌ Bad: Storing only text content
✅ Good: Rich metadata for filtering and ranking
3. No Evaluation
❌ Bad: Deploy without testing
✅ Good: Comprehensive evaluation pipeline
Evaluation Framework
```python
def evaluate_rag_system(test_cases):
    metrics = {
        "retrieval_precision": [],
        "answer_relevance": [],
        "faithfulness": []
    }

    for case in test_cases:
        # Retrieve documents
        docs = retriever.get_relevant_documents(case["query"])

        # Check whether the correct docs were retrieved
        precision = calculate_precision(docs, case["expected_docs"])
        metrics["retrieval_precision"].append(precision)

        # Generate an answer
        answer = qa_chain.run(case["query"])

        # Evaluate answer quality (the scoring helpers are placeholders;
        # libraries such as Ragas implement these metrics)
        relevance = evaluate_relevance(answer, case["expected_answer"])
        faithfulness = evaluate_faithfulness(answer, docs)
        metrics["answer_relevance"].append(relevance)
        metrics["faithfulness"].append(faithfulness)

    return metrics
```
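Each test case pairs a query with the ground truth to score against; the values below are illustrative:

```python
test_cases = [
    {
        "query": "What is the new architecture in React Native?",
        "expected_docs": ["react-native-architecture.md"],
        "expected_answer": "The new architecture replaces the bridge with JSI.",
    },
]

results = evaluate_rag_system(test_cases)
print(sum(results["faithfulness"]) / len(results["faithfulness"]))
```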
Conclusion
RAG systems are powerful but require careful design and optimization. Focus on:
1. Quality Data: Clean, well-structured documents
2. Smart Chunking: Context-aware splitting
3. Effective Retrieval: Hybrid search + re-ranking
4. Robust Generation: Well-crafted prompts
5. Continuous Evaluation: Monitor and improve
Start simple, measure everything, and iterate based on real usage patterns.
---
*Building a RAG system? I'd love to hear about your use case!*