
Basics of Large Language Models (LLMs)

What is an LLM?

A Large Language Model (LLM) is a neural network trained on large amounts of text data. It can generate text, work across languages, and solve a wide range of natural language processing (NLP) tasks.

Tokenization

Tokenization is the process of splitting text into smaller units (tokens).

# Example
Text: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]

Important:

  • An LLM operates on token IDs, not on raw text
  • Each token in the vocabulary has a unique numeric ID
  • API costs are calculated per token, not per character
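
The text-to-tokens-to-IDs mapping can be illustrated with a toy sketch. The regex splitter and the vocabulary built on the fly are assumptions for illustration only; production tokenizers use learned subword vocabularies (for example BPE) and will produce different IDs.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens.

    A leading space stays attached to the word that follows it,
    mimicking how many real tokenizers treat whitespace.
    """
    return re.findall(r" ?\w+|[^\w\s]", text)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token a numeric ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


tokens = toy_tokenize("Hello, world!")
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['Hello', ',', ' world', '!']
print(token_ids)  # [0, 1, 2, 3]
```

The model only ever sees the list of IDs; the vocabulary is what maps them back to text.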

Cost example:

Request: 1000 tokens
Response: 500 tokens
Total cost = (1000 × input price) + (500 × output price)
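
The formula above can be written as a small helper. This is a minimal sketch: the function name and the per-1,000-token prices are hypothetical values chosen for illustration, not any provider's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Return the cost of one request, given prices per 1,000 tokens."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices, for illustration only
cost = request_cost(
    input_tokens=1000,
    output_tokens=500,
    input_price_per_1k=0.01,
    output_price_per_1k=0.03,
)
print(f"${cost:.4f}")  # $0.0250
```

Output tokens are often priced higher than input tokens, which is why the two terms are kept separate.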

Attention Mechanism

Attention lets the model weight each part of the input by how relevant it is to the token currently being processed.

Example:

Text: "The dog buried the bone because he was hungry"
Question: "Why did the dog bury the bone?"
Attention → focuses on "was hungry"
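
Numerically, "focusing" means assigning higher weights to some tokens than to others. The sketch below implements scaled dot-product attention for a single query in plain Python. The 2-dimensional vectors are made-up values (real models learn high-dimensional queries, keys, and values), and the token labels in the comments are only illustrative.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Made-up vectors standing in for "dog", "bone", "hungry"
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]  # a query that best matches the third key

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights])
```

The third token receives the largest weight because its key has the largest dot product with the query.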

Transformer Architecture

The Transformer is the base architecture of modern LLMs.

Main components:

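  1. Self-Attention: lets each token attend to the other tokens in the sequence, capturing relationships between words
  2. Multi-Head Attention: several attention mechanisms running in parallel, each able to capture a different kind of relationship
  3. Feed-Forward Networks: transform each token's representation independently
  4. Positional Encoding: supplies information about each token's position in the sequence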

Context Window

The context window is the maximum number of tokens the model can process in a single request, counting both the input and the generated output.

Examples:

  • GPT-3.5: 4K tokens (~3000 words)
  • GPT-4: 8K / 32K tokens (GPT-4 Turbo: 128K tokens)
  • Claude 3: 200K tokens
  • Gemini 1.5: up to 1M tokens

Long context problem:

Self-attention compute = O(n²), where n = number of tokens

Going from 1K to 100K tokens multiplies n by 100, so attention compute grows by roughly 100² = 10,000×.

Training vs. Inference

Training:

  • One-time process
  • Requires massive computational resources (thousands of GPUs)
  • Takes weeks/months
  • Very expensive ($millions)

Inference:

  • Runs on every API call
  • Relatively cheap per request
  • Fast (typically seconds)
  • Scalable

Model Types

1. LLM (Standard Models)

  • Generate text directly from the input
  • Examples: GPT-4, Claude, Gemini

2. Reasoning Models

  • “Think” before answering
  • May show their thought process, or a summary of it
  • Examples: OpenAI o1, o3

Difference:

LLM: Question → Immediate Answer
Reasoning: Question → Analysis → Thought Process → Answer

3. Agents

  • Can use tools
  • Make decisions
  • Execute actions

Example:

User: "Book a flight to Berlin"
Agent: 
1. Searches flights (Tool: FlightSearch)
2. Compares prices
3. Books ticket (Tool: BookFlight)
4. Sends confirmation
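
The flow above can be sketched as a tool registry plus a fixed plan. The tool names FlightSearch and BookFlight come from the example. Everything else (the stub functions, the flight data, and the hard-coded order of steps) is a hypothetical simplification: in a real agent the model itself decides which tool to call next.

```python
from typing import Callable


def flight_search(destination: str) -> list[dict]:
    """Stub tool: returns hard-coded offers instead of calling a real API."""
    return [
        {"flight": "XY100", "destination": destination, "price": 180},
        {"flight": "XY200", "destination": destination, "price": 120},
    ]


def book_flight(flight: str) -> str:
    """Stub tool: pretends to book the flight and returns a confirmation."""
    return f"Booked {flight}"


# Registry mapping tool names to callables
TOOLS: dict[str, Callable] = {
    "FlightSearch": flight_search,
    "BookFlight": book_flight,
}


def run_agent(destination: str) -> str:
    """Run the four steps of the example with a fixed plan."""
    offers = TOOLS["FlightSearch"](destination)            # 1. search flights
    cheapest = min(offers, key=lambda o: o["price"])       # 2. compare prices
    confirmation = TOOLS["BookFlight"](cheapest["flight"])  # 3. book ticket
    return confirmation                                    # 4. confirmation


print(run_agent("Berlin"))  # Booked XY200
```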

Prompt Engineering vs. Contextual Engineering

Prompt Engineering

Optimizing the wording and structure of the request sent to the model.

# Bad
"Write code"

# Good
"Write Python code for a function that calculates
the Fibonacci sequence up to the N-th number.
Use memoization for optimization.
Add docstrings and type hints."

Contextual Engineering

Supplying the model with relevant context (project rules, documents, conversation history) inside the context window.

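context = """
Project rules:
- Use TypeScript
- Follow Clean Code principles
- Write unit tests
"""

# Example task for illustration; in practice this comes from the user
user_task = "Add input validation to the signup form"

prompt = f"{context}\n\nTask: {user_task}"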

Memory Management

Short-term Memory

  • Current context within the window
  • Limited by context window

Long-term Memory

  • Stored in external DB
  • Retrieved as needed
  • Effectively unlimited in size, but requires retrieval (RAG) to bring relevant items back into the context window

Implementation:

# Short-term
conversation_history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help?"}
]
 
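# Long-term
# (vector_db and embedding are placeholders for a vector store client
# and an embedding function; text, metadata, and query are application data)
vector_db.store(embedding(text), metadata)
relevant_context = vector_db.search(embedding(query))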
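
To make the long-term half concrete, here is a self-contained toy version of a vector store. It is a sketch under strong assumptions: the "embedding" is just a bag-of-words count vector and the store is an in-memory list, whereas real systems use a learned embedding model and a dedicated vector database. The class and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorDB:
    """In-memory store that ranks texts by similarity to a query."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, top_k: int = 1) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]


db = ToyVectorDB()
db.store("The user prefers TypeScript for frontend work")
db.store("The deployment target is a Kubernetes cluster")

found = db.search("Which language does the user prefer")
print(found)  # ['The user prefers TypeScript for frontend work']
```

The retrieved text would then be placed into the prompt, which is the retrieval step of RAG.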

Cost Optimization

Strategies:

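  1. Prompt Caching: reuse a shared context prefix across requests so that, where the provider supports it, the repeated part is billed at a reduced rate
  2. Token Reduction: remove unnecessary words from prompts and context
  3. Model Selection: use smaller models for simple tasks
  4. Batching: combine multiple questions that share the same context into one request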

Example:

Instead of:
Request 1: Context (1000 tokens) + Question 1 (50 tokens)
Request 2: Context (1000 tokens) + Question 2 (50 tokens)
Total input: 2100 tokens

Better:
Request: Context (1000 tokens) + Question 1 + Question 2 (100 tokens)
Total input: 1100 tokens

Temperature and Top-P

Temperature (0.0 - 2.0)

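  • 0.0: near-deterministic; the model almost always picks the most probable token
  • 1.0: samples from the model's unmodified probability distribution
  • 2.0: very random; output often becomes incoherent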

Top-P (0.0 - 1.0)

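  • 0.1: samples only from the most probable tokens that together cover 10% of the probability mass
  • 0.9: wider selection (tokens covering 90% of the probability mass)
  • 1.0: all tokens considered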

Usage:

# For code generation
temperature = 0.2  # Precision
top_p = 0.9
 
# For creative writing
temperature = 0.8  # Creativity
top_p = 0.95
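
How the two parameters interact can be shown with a small sampler. This is a sketch of the usual textbook formulation (divide logits by the temperature, apply softmax, then keep the smallest set of tokens whose cumulative probability reaches top_p). The logits are made-up values, and individual providers may implement the details differently. The temperature must be greater than 0 here, since it is used as a divisor.

```python
import math
import random


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: set[int] = set()
    cumulative = 0.0
    for index in order:
        kept.add(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def sample(logits: list[float], temperature: float, top_p: float,
           rng: random.Random) -> int:
    """Sample one token index using temperature scaling and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
rng = random.Random(0)

# Low temperature: nearly all probability mass lands on the first token
sharp = top_p_filter(apply_temperature(logits, 0.2), 0.9)
print([round(p, 3) for p in sharp])  # [1.0, 0.0, 0.0]
print(sample(logits, temperature=0.2, top_p=0.9, rng=rng))  # 0
```

With temperature = 0.2 the first token already exceeds the 0.9 threshold on its own, so the other two are never sampled; raising the temperature flattens the distribution and lets more tokens through the filter.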

Common Problems

1. Hallucinations

The model invents plausible-sounding facts that are not supported by its training data or by the provided context.

Solution:

  • Give clear instructions
  • Use RAG for factual data
  • Validate important answers

2. Token Limit Exceeded

The prompt plus the requested output no longer fits in the context window.

Solution:

  • Shorten context
  • Use summaries
  • Split into smaller requests
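
One simple way to shorten context is to keep only the most recent messages that fit a token budget. The sketch below assumes a crude word-based token counter as a stand-in for a real tokenizer; the function names and the budget are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the newest messages whose combined size fits within max_tokens."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        size = count_tokens(message["content"])
        if used + size > max_tokens:
            break
        kept.append(message)
        used += size
    return list(reversed(kept))  # restore chronological order


history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Summarize our project rules"},
]

trimmed = trim_history(history, max_tokens=9)
print(len(trimmed))  # 2
```

Here the oldest message is dropped because adding it would exceed the budget of 9. A production version would typically also preserve the system prompt and replace dropped messages with a summary.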

3. Inconsistent Output

The same prompt yields answers that differ in content or format between calls.

Solution:

  • Lower temperature
  • Use structured output (JSON mode)
  • Add examples (few-shot)

Production Considerations

Monitoring:

  • Track token usage
  • Measure latency
  • Monitor error rate
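
These three metrics can be collected with a thin wrapper around the model call. This is a minimal in-process sketch: the Metrics class, the wrapper, and the stub client (which returns the text together with a token count) are all hypothetical, and a production setup would export the numbers to a monitoring system rather than keep them in memory.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metrics:
    calls: int = 0
    errors: int = 0
    total_tokens: int = 0
    latencies: list[float] = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0


def monitored_call(metrics: Metrics, llm_call, prompt: str) -> str:
    """Call the model and record token usage, latency, and errors."""
    metrics.calls += 1
    start = time.perf_counter()
    try:
        text, tokens_used = llm_call(prompt)
        metrics.total_tokens += tokens_used
        return text
    except Exception:
        metrics.errors += 1
        raise
    finally:
        # Latency is recorded for both successful and failed calls
        metrics.latencies.append(time.perf_counter() - start)


def fake_llm(prompt: str) -> tuple[str, int]:
    """Stub standing in for a real API client."""
    return "ok", len(prompt.split()) + 1


metrics = Metrics()
monitored_call(metrics, fake_llm, "Hello there")
print(metrics.calls, metrics.total_tokens, metrics.error_rate)  # 1 3 0.0
```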

Security:

  • Input validation
  • Output filtering
  • Rate limiting
  • API key protection

Scaling:

  • Load balancing
  • Caching layers
  • Asynchronous processing
  • Fallback models
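
A fallback chain, the last item above, can be sketched as trying models in order until one succeeds. The model functions here are stubs and their names are hypothetical; a real implementation would usually add timeouts, retries with backoff, and logging of each failure.

```python
from typing import Callable, Optional


def call_with_fallback(
    prompt: str,
    models: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first success."""
    last_error: Optional[Exception] = None
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as error:  # in production, catch narrower error types
            last_error = error
    raise RuntimeError("all models failed") from last_error


def primary_model(prompt: str) -> str:
    """Stub that simulates an outage of the preferred model."""
    raise TimeoutError("primary model timed out")


def backup_model(prompt: str) -> str:
    """Stub that simulates a smaller, always-available model."""
    return f"backup answer to: {prompt}"


name, answer = call_with_fallback(
    "ping",
    [("primary", primary_model), ("backup", backup_model)],
)
print(name, answer)  # backup backup answer to: ping
```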

Best Practices

  1. Clear Instructions

    "Act as a Senior Python Developer.
    Write production-ready code with
    error handling and logging."
    
  2. Few-Shot Learning

    Example 1: Input → Output
    Example 2: Input → Output
    Now your task: Input → ?
    
  3. Chain-of-Thought

    "Explain step by step:
    1. Analyze the problem
    2. Identify possible solutions
    3. Choose the best option
    4. Implement"
    
  4. Validation

    # llm, validate, and improved_prompt are placeholders
    response = llm.generate(prompt)
    if not validate(response):
        response = llm.generate(improved_prompt)
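
The validation practice can be made concrete with a small retry loop. The sketch assumes the model is asked for a JSON object and that "valid" means parseable JSON containing the required keys. The stub generator, the function names, and the wording of the retry hint are all hypothetical.

```python
import json


def validate(response: str, required_keys: set[str]) -> bool:
    """Return True if response is a JSON object containing all required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def generate_validated(generate, prompt: str, required_keys: set[str],
                       max_attempts: int = 3) -> dict:
    """Call generate until the output validates, tightening the prompt on retry."""
    current_prompt = prompt
    for _ in range(max_attempts):
        response = generate(current_prompt)
        if validate(response, required_keys):
            return json.loads(response)
        # Retry with a more explicit instruction appended to the original prompt
        current_prompt = (
            f"{prompt}\n\nReturn only a JSON object with the keys: "
            f"{', '.join(sorted(required_keys))}"
        )
    raise ValueError(f"no valid response after {max_attempts} attempts")


# Stub model: the first reply is free text, the second is valid JSON
replies = iter(["Sure, here you go!", '{"answer": "42", "confidence": 0.9}'])


def fake_generate(prompt: str) -> str:
    return next(replies)


result = generate_validated(fake_generate, "What is 6 x 7?", {"answer"})
print(result)  # {'answer': '42', 'confidence': 0.9}
```

Capping the number of attempts matters because every retry costs additional tokens.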

Conclusion

LLMs are powerful tools, but using them well means you should:

  • Understand their limitations
  • Optimize costs
  • Validate outputs
  • Plan for production
  • Stay updated with developments

For production systems, it’s crucial to understand how LLMs work internally, not just how to call the API.