RAG (Retrieval-Augmented Generation)

What is RAG?

RAG is an approach in which an LLM is combined with an external knowledge base so that its answers are more accurate and relevant.

Problem: An LLM knows only what was in its training data; it knows nothing about your internal data (documents, databases, APIs).

Solution: Retrieve relevant information and pass it to the LLM in context.

Analogy: Exam with Cheat Sheets

Without RAG:

Student (LLM) answers from memory
→ may hallucinate or give outdated info

With RAG:

Student (LLM) + cheat sheets (external DB)
→ finds relevant info and answers based on it

RAG Algorithm

1. Chunking

Text is split into smaller parts (chunks).

Strategies:

# Fixed size
chunk_size = 500  # Tokens
overlap = 50      # Overlap between chunks
 
# Semantic
# Split by paragraphs, headings, logical blocks
 
# By structure
# Split by document sections (chapters, subsections)

Best Practices:

Chunk size: 200-1000 tokens
Overlap: 10-20% of chunk size
Consider document structure
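The fixed-size strategy with overlap can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name chunk_tokens is hypothetical, and whitespace splitting stands in for a real tokenizer, so the counts are words rather than model tokens.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With chunk_size=4 and overlap=1, each chunk starts with the last word of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.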

2. Embeddings

Each chunk is converted into a vector (array of numbers).

from openai import OpenAI
 
client = OpenAI()
 
text = "Python is a programming language"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)
 
embedding = response.data[0].embedding
# embedding = [0.023, -0.015, 0.048, ...] # 1536 dimensions

Models:

OpenAI: text-embedding-3-small (1536d), text-embedding-3-large (3072d)
Sentence-BERT: various models for different languages
Cohere: embed-multilingual-v3.0

3. Vector Database

Chunks and their embeddings are stored in a vector database, which indexes the vectors for fast similarity search.

Popular databases:

Pinecone — fully managed cloud service
Weaviate — open-source with GraphQL
Chroma — lightweight, for local development
Qdrant — fast, with filtering
Milvus — scalable for production

Storage:

import chromadb
 
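# Use a separate name so the OpenAI `client` from the embeddings step is not overwritten
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")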
 
collection.add(
    embeddings=[embedding],
    documents=[text],
    metadatas=[{"source": "doc1.pdf", "page": 1}],
    ids=["chunk_1"]
)

4. Retrieval

Find the chunks most similar to the user query.

# User query
query = "How to create a function in Python?"
 
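# Create query embedding with the same model that embedded the chunks
# (get_embedding is assumed to wrap the embeddings.create call from step 2)
query_embedding = get_embedding(query)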
 
# Search for most similar chunks
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
 
# results contains 5 most relevant chunks

Similarity metrics:

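Cosine Similarity: most commonly used; compares vector direction and ignores vector length
Euclidean Distance: straight-line distance; sensitive to vector length, but ranks results the same as cosine similarity when vectors are normalized
Dot Product: cheapest to compute; equals cosine similarity only when vectors are normalized to unit length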
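To make the relationship between the three metrics concrete, here is a small pure-Python sketch. The helper names are illustrative, and a real system would rely on the vector database or NumPy instead of hand-written loops.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(cosine_similarity(a, b))   # direction only
print(dot(ua, ub))               # same value once both vectors are unit length
print(euclidean_distance(ua, ub) ** 2, 2 - 2 * dot(ua, ub))  # d^2 = 2 - 2*cos
```

For unit-length vectors the squared Euclidean distance is 2 - 2 * cosine, so all three metrics produce the same ranking once embeddings are normalized.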

5. Generation

Pass the retrieved context together with the query to the LLM.

# results['documents'][0] is the list of chunk texts for the first (only) query
context = "\n\n".join(results["documents"][0])
 
prompt = f"""
Context: {context}
 
Question: {query}
 
Answer based ONLY on the provided context.
If the answer is not in the context, say "I don't know".
"""
 
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
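Because the stored metadata records where each chunk came from, the prompt can also label chunks with their sources so the answer can cite them. The following is a sketch under assumptions: build_prompt is a hypothetical helper, and the two lists stand in for results["documents"][0] and results["metadatas"][0] from the retrieval step.

```python
def build_prompt(query, documents, metadatas):
    """Number each chunk and label it with its source so the answer can cite it."""
    blocks = []
    for i, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        source = meta.get("source", "unknown")
        blocks.append(f"[{i}] (source: {source})\n{doc}")
    context = "\n\n".join(blocks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based ONLY on the provided context and cite sources as [n].\n"
        'If the answer is not in the context, say "I don\'t know".'
    )


# Stand-ins for results["documents"][0] and results["metadatas"][0]
documents = ["Functions are defined with the def keyword."]
metadatas = [{"source": "doc1.pdf", "page": 1}]
prompt = build_prompt("How to create a function in Python?", documents, metadatas)
print(prompt)
```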

Advanced Techniques

Hybrid Search

Combination of vector and keyword search.

# Vector search
vector_results = vector_db.search(query_embedding)
 
# Keyword search (BM25)
keyword_results = bm25_search(query)
 
# Combine results
final_results = merge_and_rerank(vector_results, keyword_results)
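The merge_and_rerank step above is left undefined. One common way to fill it in is reciprocal rank fusion (RRF), which combines ranked lists using only each document's position, so vector and BM25 scores never need to be put on the same scale. This is one possible choice, not necessarily what the text intends; the function name and the chunk ids are illustrative, and k=60 is a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the best hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by id so the result is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_ids = ["c3", "c1", "c7"]    # ids ordered by vector similarity
keyword_ids = ["c1", "c9", "c3"]   # ids ordered by BM25 score
fused = reciprocal_rank_fusion([vector_ids, keyword_ids], top_n=3)
print(fused)
```

Documents that appear in both lists (c1 and c3 here) rise to the top, while documents found by only one method are kept but ranked lower.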

Parent Document Retrieval

Find small chunks but return larger contexts.

# Store in DB: small chunks
small_chunk = "Python is a programming language"
 
# But also store reference to large document
parent_doc_id = "python_tutorial_chapter_1"
 
# On retrieval:
# 1. Find relevant small_chunk
# 2. Retrieve entire parent_doc
# 3. Pass larger context to LLM
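The three retrieval steps can be sketched with two lookups: one from chunk id to parent id, and one from parent id to the full text. The dictionaries below are hypothetical in-memory stand-ins for a vector database (with the parent id kept in chunk metadata) and a document store.

```python
# Hypothetical stores; real systems keep parent ids in chunk metadata
parent_docs = {
    "python_tutorial_chapter_1": "Full text of chapter 1",
    "python_tutorial_chapter_2": "Full text of chapter 2",
}
chunk_to_parent = {
    "chunk_1": "python_tutorial_chapter_1",
    "chunk_2": "python_tutorial_chapter_1",
    "chunk_3": "python_tutorial_chapter_2",
}


def expand_to_parents(chunk_ids):
    """Map retrieved chunk ids to parent documents, deduplicated, order preserved."""
    seen = set()
    docs = []
    for chunk_id in chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:  # several chunks may share one parent
            seen.add(parent_id)
            docs.append(parent_docs[parent_id])
    return docs


# Chunk ids as returned by the similarity search, best match first
contexts = expand_to_parents(["chunk_2", "chunk_1", "chunk_3"])
print(contexts)
```

Deduplication matters here: without it, two small chunks from the same chapter would put the whole chapter into the prompt twice.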

Reranking

Reorder retrieved documents by relevance.

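import os
from cohere import Client
 
# Assumption: the API key is stored in the COHERE_API_KEY environment variable
co = Client(api_key=os.environ["COHERE_API_KEY"])
 
# Initial retrieval phase: fetch a broad candidate set
# (vector_db is a placeholder for your vector store client)
docs = vector_db.search(query, top_k=100)
 
# Reranking: a reranking model scores each candidate against the query
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[doc.text for doc in docs],
    top_n=5
)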
 
# Use only top-5 after reranking

RAG Evaluation

Metrics

1. Faithfulness: Does the LLM answer based ONLY on the context?

# Check: Are all facts in answer present in context?
faithfulness_score = check_all_facts_in_context(answer, context)

2. Answer Relevance: Does the LLM answer the question that was asked?

# Check: Does the answer address the question?
relevance_score = check_answer_relevance(question, answer)

3. Context Recall: Were all relevant documents retrieved?

# Check: Are all necessary facts in retrieved context?
recall = relevant_docs_retrieved / total_relevant_docs

4. Context Precision: How many retrieved documents are actually relevant?

# Check: How many retrieved chunks are relevant?
precision = relevant_chunks / total_retrieved_chunks
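The two retrieval ratios above can be computed directly when a labeled set of relevant chunk ids exists for each test question. This is a simple set-based sketch under that assumption; evaluation frameworks may define these metrics differently (for example, by using an LLM as judge), and the function names are illustrative.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)


retrieved = ["c1", "c2", "c3", "c4", "c5"]  # what the retriever returned
relevant = {"c1", "c3", "c8"}               # hand-labeled ground truth

precision = context_precision(retrieved, relevant)  # 2 of 5 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(precision, recall)
```

Raising top_k usually improves recall at the cost of precision, which is why the two are tracked together.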

Evaluation Frameworks

RAGAS — automatic RAG evaluation
TruLens — monitoring and evaluation
LangSmith — end-to-end testing

Practical Aspects

Costs

# Typical costs per request:
# 1. Query embedding: ~$0.0001
# 2. Vector search: nearly free (self-hosted) or ~$0.001 (cloud)
# 3. LLM generation: $0.001 - $0.03 (depending on model)
 
# Total: ~$0.001 - $0.03 per request

Caching

# Cache frequent queries
cache = {}
 
def rag_with_cache(query):
    if query in cache:
        return cache[query]
    
    result = full_rag_pipeline(query)
    cache[query] = result
    return result
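The dictionary cache above never expires entries and treats "What is RAG?" and "what is  rag?" as different keys. A possible refinement, sketched here with a hypothetical TTLCache class, normalizes the query before lookup and drops entries after a time-to-live, so cached answers do not outlive updates to the knowledge base.

```python
import time


class TTLCache:
    """Query cache with key normalization and time-based expiry."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so trivial variants share an entry
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # stale: drop it and report a miss
            return None
        return value

    def set(self, query, value):
        self._store[self._key(query)] = (self.clock() + self.ttl, value)


cache = TTLCache(ttl_seconds=60)
cache.set("What is  RAG?", "cached answer")
hit = cache.get("what is rag?")
print(hit)
```

Normalization only catches exact matches after cleanup; paraphrased queries would still miss, and matching those would need a similarity-based cache instead.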

Monitoring

# Track important metrics
metrics = {
    "retrieval_latency": time_to_retrieve,
    "llm_latency": time_to_generate,
    "total_tokens": input_tokens + output_tokens,
    "confidence_score": model_confidence,
    "num_chunks_used": len(context_chunks)
}
 
log_metrics(metrics)
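The latency values in the dictionary above have to be measured somewhere. One lightweight way, shown here as a sketch, is a context manager around each pipeline stage; the name timed is hypothetical, and the stage bodies are stand-ins for the real retrieval and LLM calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(metrics, name):
    """Record the wall-clock duration of the enclosed block under metrics[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are timed too
        metrics[name] = time.perf_counter() - start


metrics = {}

with timed(metrics, "retrieval_latency"):
    context_chunks = ["chunk a", "chunk b"]  # stand-in for the retrieval call

with timed(metrics, "llm_latency"):
    answer = "stub answer"  # stand-in for the LLM call

metrics["num_chunks_used"] = len(context_chunks)
print(metrics)
```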

Security

# Filter sensitive information
def sanitize_context(context, user_permissions):
    # Remove chunks user doesn't have access to
    filtered = [
        chunk for chunk in context
        if user_can_access(chunk, user_permissions)
    ]
    return filtered

Common Problems and Solutions

Problem 1: Irrelevant Results

# Solution: Improve chunking strategy
# - Smaller chunks
# - Add metadata
# - Use better embeddings

Problem 2: Outdated Information

# Solution: Regular updates
def update_vector_db():
    new_docs = fetch_updated_docs()
    for doc in new_docs:
        chunks = chunk_document(doc)
        embeddings = get_embeddings(chunks)
        vector_db.upsert(chunks, embeddings)
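Re-embedding every document on each update run costs money and time. A common refinement, sketched below under the assumption that a content hash is stored per document at indexing time, is to re-chunk and re-embed only documents whose hash has changed. The function names and file names are illustrative.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_needing_update(docs, stored_hashes):
    """Return ids of documents that are new or changed since the last indexing run."""
    changed = []
    for doc_id, text in docs.items():
        if stored_hashes.get(doc_id) != content_hash(text):
            changed.append(doc_id)
    return changed


# Hashes saved during the previous indexing run
stored = {"a.md": content_hash("old text"), "b.md": content_hash("same text")}
# Current state of the source documents
current = {"a.md": "new text", "b.md": "same text", "c.md": "brand new"}

to_update = docs_needing_update(current, stored)
print(to_update)
```

Documents that were deleted at the source are not covered here; their chunks would also need to be removed from the vector database.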

Problem 3: Too Much Context

# Solution: Intelligent filtering
# 1. Use smaller top_k
# 2. Set similarity threshold
# 3. Use reranking
 
results = vector_db.search(query, top_k=10)
filtered = [r for r in results if r.score > 0.7]  # Only highly relevant
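A score threshold alone does not bound the prompt size. The sketch below combines the threshold with a token budget. It assumes results arrive as (text, score) pairs, uses a whitespace word count as a rough token estimate, and select_within_budget is a hypothetical name.

```python
def select_within_budget(results, min_score=0.7, max_tokens=1500):
    """Keep the best-scoring chunks above min_score that fit within max_tokens.

    results: list of (text, score) pairs.
    """
    selected, used = [], 0
    for text, score in sorted(results, key=lambda r: r[1], reverse=True):
        if score <= min_score:
            break  # everything after this point scores lower still
        cost = len(text.split())  # rough estimate; a real tokenizer is more accurate
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(text)
        used += cost
    return selected


results = [("a b c", 0.9), ("d e f g", 0.8), ("h", 0.75), ("i j", 0.5)]
context_chunks = select_within_budget(results, min_score=0.7, max_tokens=4)
print(context_chunks)
```

In this example the second chunk is skipped because it would exceed the budget, while the smaller third chunk still fits.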

Best Practices

Find Good Chunk Size
- Too small = lose context
- Too large = irrelevant info

Use Metadata

metadata = {
    "source": "documentation.pdf",
    "date": "2024-01-15",
    "author": "John Doe",
    "section": "API Reference"
}

Use Hybrid Search
- Vector search for semantic similarity
- Keyword search for exact matches

Automate Evaluation

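from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
 
# test_questions: evaluation dataset with questions, generated answers, and retrieved contexts
scores = evaluate(
    dataset=test_questions,
    metrics=[faithfulness, answer_relevancy]
)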

Make Production-Ready
- Implement caching
- Set up monitoring
- Add error handling
- Use rate limiting
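For the error-handling item, calls to embedding and LLM APIs can fail transiently (timeouts, rate limits). A typical pattern is retry with exponential backoff, sketched below. The helper with_retries is hypothetical, and for brevity it catches every exception; production code should retry only errors known to be transient.

```python
import random
import time


def with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Small random jitter keeps many clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay / 10))


calls = {"n": 0}


def flaky_llm_call():
    """Stand-in for an API call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "answer"


# sleep is replaced with a no-op so the example runs instantly
result = with_retries(flaky_llm_call, sleep=lambda seconds: None)
print(result, calls["n"])
```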

Conclusion

RAG is a bridge between LLMs and real-world data:

Enables work with private/current data
Reduces hallucinations
Provides sources for facts
Easier than fine-tuning

For most applications, RAG is the most practical way to connect LLMs with enterprise data.
