RAG (Retrieval-Augmented Generation)
What is RAG?
RAG is an approach in which an LLM is combined with an external knowledge base to provide more accurate and relevant answers.
Problem: LLMs know nothing about your internal data (documents, databases, APIs).
Solution: Retrieve relevant information and pass it to the LLM in context.
Analogy: Exam with Cheat Sheets
Without RAG:
Student (LLM) answers from memory
→ may hallucinate or give outdated info
With RAG:
Student (LLM) + cheat sheets (external DB)
→ finds relevant info and answers based on it
RAG Algorithm
1. Chunking
Text is split into smaller parts (chunks).
Strategies:
# Fixed size
chunk_size = 500 # Tokens
overlap = 50 # Overlap between chunks
# Semantic
# Split by paragraphs, headings, logical blocks
# By structure
# Split by document sections (chapters, subsections)
Best Practices:
- Chunk size: 200-1000 tokens
- Overlap: 10-20% of chunk size
- Consider document structure
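The fixed-size strategy with overlap can be sketched as a sliding window. This is a minimal illustration, not a production chunker: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.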
2. Embeddings
Each chunk is converted into a vector (array of numbers).
from openai import OpenAI
client = OpenAI()
text = "Python is a programming language"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)
embedding = response.data[0].embedding
# embedding = [0.023, -0.015, 0.048, ...] # 1536 dimensions
Models:
- OpenAI: text-embedding-3-small (1536d), text-embedding-3-large (3072d)
- Sentence-BERT: various models for different languages
- Cohere: embed-multilingual-v3.0
3. Vector Database
Chunks and their embeddings are stored in a vector database so they can be searched by similarity.
Popular databases:
- Pinecone — fully managed cloud service
- Weaviate — open-source with GraphQL
- Chroma — lightweight, for local development
- Qdrant — fast, with filtering
- Milvus — scalable for production
Storage:
import chromadb
# Named chroma_client so it does not overwrite the OpenAI client from step 2
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")
collection.add(
    embeddings=[embedding],
    documents=[text],
    metadatas=[{"source": "doc1.pdf", "page": 1}],
    ids=["chunk_1"]
)
4. Retrieval
Find the chunks most similar to the user query.
# User query
query = "How to create a function in Python?"
# Create query embedding
# (get_embedding is a helper that wraps the embeddings call from step 2)
query_embedding = get_embedding(query)
# Search for most similar chunks
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
# results contains the 5 most relevant chunks
Similarity metrics:
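- Cosine Similarity: most commonly used; compares direction and ignores vector length
- Euclidean Distance: sensitive to vector length; gives the same ranking as cosine similarity when vectors are normalized
- Dot Product: fast, but equals cosine similarity only when vectors are normalized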
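To make the differences between the three metrics concrete, here is a small pure-Python sketch. The helper names and the example vectors are made up for illustration; real systems would use NumPy or the vector database's built-in metrics.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b (depends on vector length)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # different direction, same length as a

print(cosine_similarity(a, b), cosine_similarity(a, c))
print(euclidean_distance(a, b), euclidean_distance(a, c))
print(dot(normalize(a), normalize(c)))
```

Cosine similarity treats `b` as the closest match to `a` because it points the same way, while Euclidean distance prefers `c` because `b` is longer. After normalization, the dot product of `a` and `c` matches their cosine similarity.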
5. Generation
Pass the retrieved context and the query to the LLM.
context = "\n\n".join(results["documents"][0])
prompt = f"""
Context: {context}
Question: {query}
Answer based ONLY on the provided context.
If the answer is not in the context, say "I don't know".
"""
# client is the OpenAI client created in step 2
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
Advanced Techniques
Hybrid Search
Combination of vector and keyword search.
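One common way to combine the two result lists is reciprocal rank fusion (RRF), which only needs each document's rank in each list. The sketch below is one possible merging step, not the only option; `reciprocal_rank_fusion`, the document IDs, and `k=60` (a conventional default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists, best match first
vector_ids = ["doc_a", "doc_b", "doc_c"]
keyword_ids = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_ids, keyword_ids])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents found by only one search. A function like this could fill the `merge_and_rerank` step in the outline that follows.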
# Vector search
vector_results = vector_db.search(query_embedding)
# Keyword search (BM25)
keyword_results = bm25_search(query)
# Combine results
final_results = merge_and_rerank(vector_results, keyword_results)
Parent Document Retrieval
Search over small chunks, but return their larger parent context to the LLM.
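A minimal in-memory sketch of this idea, assuming each small chunk carries a `parent_id` that points at its source document. The dictionaries, the chunk texts, and `expand_to_parents` are hypothetical stand-ins for a document store and a vector database.

```python
# Hypothetical stores: parent documents by ID, and small chunks that reference them.
parent_docs = {
    "python_tutorial_chapter_1": "Chapter 1: full text ...",
    "python_tutorial_chapter_2": "Chapter 2: full text ...",
}
small_chunks = [
    {"text": "Python is a programming language", "parent_id": "python_tutorial_chapter_1"},
    {"text": "Python uses indentation for blocks", "parent_id": "python_tutorial_chapter_1"},
    {"text": "Functions are defined with def", "parent_id": "python_tutorial_chapter_2"},
]


def expand_to_parents(matched_chunks, parent_docs):
    """Return each matched chunk's parent document once, in match order."""
    seen = set()
    parents = []
    for chunk in matched_chunks:
        parent_id = chunk["parent_id"]
        if parent_id not in seen:  # two chunks may share one parent
            seen.add(parent_id)
            parents.append(parent_docs[parent_id])
    return parents


# Pretend the vector search matched all three small chunks.
contexts = expand_to_parents(small_chunks, parent_docs)
print(contexts)
```

Deduplicating by `parent_id` keeps the same parent from being sent to the LLM twice when several of its chunks match. The outline that follows describes the same steps in terms of a vector database.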
# Store in DB: small chunks
small_chunk = "Python is a programming language"
# But also store reference to large document
parent_doc_id = "python_tutorial_chapter_1"
# On retrieval:
# 1. Find relevant small_chunk
# 2. Retrieve entire parent_doc
# 3. Pass larger context to LLM
Reranking
Reorder retrieved documents by relevance.
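import os
from cohere import Client
# Assumes the Cohere API key is stored in the CO_API_KEY environment variable
co = Client(api_key=os.environ["CO_API_KEY"])
# Initial retrieval phase: fetch a broad candidate set
docs = vector_db.search(query, top_k=100)
# Reranking (the model name is an example; use a rerank model available to your account)
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[doc.text for doc in docs],
    top_n=5
)
# Use only top-5 after reranking
RAG Evaluation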
Metrics
1. Faithfulness: Does the LLM answer based ONLY on the context?
# Check: Are all facts in the answer present in the context?
faithfulness_score = check_all_facts_in_context(answer, context)
2. Answer Relevance: Does the LLM answer the asked question?
# Check: Does the answer address the question?
relevance_score = check_answer_relevance(question, answer)
3. Context Recall: Were all relevant documents retrieved?
# Check: Are all necessary facts in the retrieved context?
recall = relevant_docs_retrieved / total_relevant_docs
4. Context Precision: How many retrieved documents are actually relevant?
# Check: How many retrieved chunks are relevant?
precision = relevant_chunks / total_retrieved_chunks
Evaluation Frameworks
- RAGAS — automatic RAG evaluation
- TruLens — monitoring and evaluation
- LangSmith — end-to-end testing
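The context recall and context precision ratios from the Metrics list can be computed by hand when you have a labeled test set. The sketch below follows those simple ratios only; `context_precision_recall` and the chunk IDs are hypothetical, and evaluation frameworks may define these metrics differently (for example, with rank-aware variants).

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Compute (precision, recall) from retrieved and ground-truth chunk IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant  # relevant chunks that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test case: 4 chunks retrieved, 3 chunks labeled relevant
precision, recall = context_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],
    relevant_ids=["c1", "c3", "c7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved chunks are relevant, and two of the three relevant chunks were found, so low precision points at noisy retrieval while low recall points at missed documents.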
Practical Aspects
Costs
# Typical costs per request:
# 1. Query embedding: ~$0.0001
# 2. Vector search: nearly free (self-hosted) or ~$0.001 (cloud)
# 3. LLM generation: $0.001 - $0.03 (depending on model)
# Total: ~$0.001 - $0.03 per request
Caching
# Cache frequent queries
cache = {}
def rag_with_cache(query):
    if query in cache:
        return cache[query]
    result = full_rag_pipeline(query)
    cache[query] = result
    return result
Monitoring
# Track important metrics
metrics = {
    "retrieval_latency": time_to_retrieve,
    "llm_latency": time_to_generate,
    "total_tokens": input_tokens + output_tokens,
    "confidence_score": model_confidence,
    "num_chunks_used": len(context_chunks)
}
log_metrics(metrics)
Security
# Filter sensitive information
def sanitize_context(context, user_permissions):
    # Remove chunks the user doesn't have access to
    filtered = [
        chunk for chunk in context
        if user_can_access(chunk, user_permissions)
    ]
    return filtered
Common Problems and Solutions
Problem 1: Irrelevant Results
# Solution: Improve chunking strategy
# - Smaller chunks
# - Add metadata
# - Use better embeddings
Problem 2: Outdated Information
# Solution: Regular updates
def update_vector_db():
    new_docs = fetch_updated_docs()
    for doc in new_docs:
        chunks = chunk_document(doc)
        embeddings = get_embeddings(chunks)
        vector_db.upsert(chunks, embeddings)
Problem 3: Too Much Context
# Solution: Intelligent filtering
# 1. Use smaller top_k
# 2. Set similarity threshold
# 3. Use reranking
results = vector_db.search(query, top_k=10)
filtered = [r for r in results if r.score > 0.7] # Only highly relevant
Best Practices
- Find Good Chunk Size
  - Too small = lose context
  - Too large = irrelevant info
- Use Metadata
  metadata = {
      "source": "documentation.pdf",
      "date": "2024-01-15",
      "author": "John Doe",
      "section": "API Reference"
  }
- Use Hybrid Search
  - Vector search for semantic similarity
  - Keyword search for exact matches
- Automate Evaluation
  from ragas import evaluate
  from ragas.metrics import faithfulness, answer_relevancy
  scores = evaluate(
      dataset=test_questions,
      metrics=[faithfulness, answer_relevancy]
  )
- Make Production-Ready
  - Implement caching
  - Set up monitoring
  - Add error handling
  - Use rate limiting
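For the error-handling item, one common pattern is to retry transient failures (such as rate-limit errors from an embedding or LLM API) with exponential backoff. This is a minimal sketch under stated assumptions: `call_with_retries` and `flaky` are hypothetical, and real code should catch only the specific transient exceptions of the client library in use.

```python
import random
import time


def call_with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, sleep=lambda _delay: None)  # skip real waiting in the demo
print(result, calls["n"])
```

The growing delay gives a rate-limited service time to recover, and the small random jitter keeps many clients from retrying at the same instant.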
Conclusion
RAG is a bridge between LLMs and real-world data:
- Enables work with private/current data
- Reduces hallucinations
- Provides sources for facts
- Usually easier to set up and keep up to date than fine-tuning
For most applications, RAG is the best approach to connect LLMs with enterprise data.