Basics of Large Language Models (LLM)

What is an LLM?

A Large Language Model (LLM) is a neural network trained on large amounts of text data. It can generate text, interpret natural language, and solve a wide range of natural language processing (NLP) tasks.

Tokenization

Tokenization is the process of splitting text into smaller units (tokens). A token can be a whole word, part of a word, or a punctuation mark.

# Example
Text: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]

Important:

An LLM operates on token IDs, not on raw text
Each token in the vocabulary has a unique numeric ID
Costs are calculated in tokens, not characters
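
The mapping from text to token IDs can be sketched with a toy tokenizer. This is only an illustration: the regular expression and the `toy_tokenize` and `encode` helpers are assumptions made for this example, and production tokenizers use a fixed vocabulary learned from data (for example with byte-pair encoding) rather than one that grows on the fly.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Split text into words (with their leading space) and punctuation marks."""
    return re.findall(r" ?\w+|[^\w\s]", text)

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each token to its ID, assigning the next free ID to unseen tokens."""
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
        ids.append(vocab[token])
    return ids

vocab: dict[str, int] = {}
tokens = toy_tokenize("Hello, world!")
token_ids = encode(tokens, vocab)
print(tokens)     # ['Hello', ',', ' world', '!']
print(token_ids)  # [0, 1, 2, 3]
```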

Cost example:

Request: 1000 tokens
Response: 500 tokens
Total cost = (1000 × input price per token) + (500 × output price per token)
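
The same calculation as a small function. The prices below are hypothetical placeholders, and quoting them per 1,000 tokens is an assumption of this sketch; check your provider's price list for real values and units.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the cost of one request, with prices quoted per 1,000 tokens."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Hypothetical prices, for illustration only
cost = request_cost(1000, 500, input_price_per_1k=0.50, output_price_per_1k=1.50)
print(f"{cost:.2f}")  # 1.25
```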

Attention Mechanism

Attention lets the model weigh how relevant each part of the input is while it processes a given token, so it can focus on the parts that matter.

Example:

Text: "The dog buried the bone because he was hungry"
Question: "Why did the dog bury the bone?"
Attention → focuses on "was hungry"
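
One common formulation is scaled dot-product attention: a query vector is compared with a key vector for every token, the scores are turned into weights with softmax, and the output is the weighted average of the value vectors. The sketch below handles a single query in plain Python; the 2-dimensional vectors are made up for illustration and do not come from a real model.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query: list[float], keys: list[list[float]],
              values: list[list[float]]) -> tuple[list[float], list[float]]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Made-up 2-dimensional vectors for three tokens
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([0.0, 2.0], keys, values)
print([round(w, 2) for w in weights])
```

The tokens whose keys align with the query receive the largest weights, which is the "focus" described above.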

Transformer Architecture

The Transformer is the base architecture of modern LLMs.

Main components:

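Self-Attention: lets each token weigh its relationship to every other token in the sequence
Multi-Head Attention: several attention mechanisms run in parallel, each able to capture a different kind of relationship
Feed-Forward Networks: transform each token's representation independently
Positional Encoding: adds information about each token's position, since attention alone does not encode order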
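
As one concrete instance of positional encoding, the original Transformer design used fixed sinusoidal values; many newer models use other schemes, so treat this as a sketch of one variant only. The function name `positional_encoding` is chosen for this example.

```python
import math

def positional_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(d_model):
        # Each (even, odd) index pair shares one frequency
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

The resulting vector is added to the token's embedding, so identical tokens at different positions get different inputs.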

Context Window

The context window is the maximum number of tokens the model can process in a single request, counting both the prompt and the generated response.

Examples:

GPT-3.5: 4K tokens (~3,000 words); 16K in later variants
GPT-4: 8K / 32K tokens; 128K for GPT-4 Turbo
Claude 3: 200K tokens
Gemini 1.5: up to 1M tokens

Long context problem:

Self-attention cost = O(n²), where n = number of tokens

Going from 1K to 100K tokens multiplies n by 100, so the attention computation grows by about 100² = 10,000×.
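
The quadratic scaling can be checked with a one-line calculation. The sketch covers only the pairwise attention term; other parts of the model scale linearly, and real implementations apply optimizations that this ignores.

```python
def attention_cost_ratio(long_tokens: int, short_tokens: int) -> float:
    """How many times more pairwise attention work the longer input needs."""
    return (long_tokens / short_tokens) ** 2

print(attention_cost_ratio(100_000, 1_000))  # 10000.0
```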

Training vs. Inference

Training:

Done once per model version, before deployment
Requires massive computational resources (thousands of GPUs)
Takes weeks or months
Very expensive (millions of dollars)

Inference:

Runs on every API call
Relatively cheap per request
Fast (seconds)
Scalable

Model Types

1. LLM (Base Models)

Generate text directly from the input, without a separate reasoning phase
Examples: GPT-4, Claude, Gemini

2. Reasoning Models

Work through intermediate reasoning steps (“think”) before answering
May show their thought process or a summary of it
Examples: o1, o3

Difference:

LLM: Question → Immediate Answer
Reasoning: Question → Analysis → Thought Process → Answer

3. Agents

Can use tools
Make decisions
Execute actions

Example:

User: "Book a flight to Berlin"
Agent: 
1. Searches flights (Tool: FlightSearch)
2. Compares prices
3. Books ticket (Tool: BookFlight)
4. Sends confirmation
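
The flow above can be sketched with a tool registry and a fixed plan. Everything here is hypothetical: `flight_search` and `book_flight` are stubs standing in for the `FlightSearch` and `BookFlight` tools, the flight data is invented, and in a real agent the model itself decides which tool to call next instead of following hard-coded steps.

```python
from typing import Callable

# Hypothetical tools; a real agent would call external APIs here
def flight_search(destination: str) -> list[dict]:
    return [
        {"flight": "XY100", "destination": destination, "price": 180},
        {"flight": "XY200", "destination": destination, "price": 120},
    ]

def book_flight(flight: str) -> str:
    return f"Booked {flight}"

TOOLS: dict[str, Callable] = {"FlightSearch": flight_search, "BookFlight": book_flight}

def run_agent(destination: str) -> list[str]:
    """Run the fixed plan from the example and return a log of the steps taken."""
    log = []
    flights = TOOLS["FlightSearch"](destination)          # 1. search flights
    log.append(f"Found {len(flights)} flights")
    cheapest = min(flights, key=lambda f: f["price"])     # 2. compare prices
    log.append(f"Cheapest: {cheapest['flight']} at {cheapest['price']}")
    log.append(TOOLS["BookFlight"](cheapest["flight"]))   # 3. book ticket
    log.append(f"Confirmation sent for {cheapest['flight']}")  # 4. confirm
    return log

for step in run_agent("Berlin"):
    print(step)
```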

Prompt Engineering vs. Contextual Engineering

Prompt Engineering

Optimizing the request to the model

# Bad
"Write code"

# Good
"Write Python code for a function that calculates
the Fibonacci sequence up to the N-th number.
Use memoization for optimization.
Add docstrings and type hints."

Contextual Engineering

Supplying relevant context inside the context window

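context = """
Project rules:
- Use TypeScript
- Follow Clean Code principles
- Write unit tests
"""

# Example task (hypothetical); in practice this comes from the user
user_task = "Add input validation to the signup form"

# Prepend the project rules so every request carries the same context
prompt = f"{context}\n\nTask: {user_task}"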

Memory Management

Short-term Memory

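The current conversation held in the context window
Limited by the context window size

Long-term Memory

Stored in an external database
Retrieved as needed
Effectively unlimited, but requires retrieval (RAG) to bring relevant items back into the context window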

Implementation:

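# Short-term: the message list sent with every request
conversation_history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]

# Long-term: `vector_db` and `embedding` are placeholders for a vector store
# client and an embedding function; neither is defined in this snippet
vector_db.store(embedding(text), metadata)
# Embed the query the same way as the stored text before searching
relevant_context = vector_db.search(embedding(query))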
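
To make the long-term path concrete, here is a toy in-memory store that runs as written. It is a sketch under strong assumptions: `ToyVectorStore` is invented for this example, word counts stand in for a learned embedding model, and a real system would use a dedicated vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, top_k: int = 1) -> list[str]:
        """Return the top_k stored texts most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]

store = ToyVectorStore()
store.store("The user prefers TypeScript for frontend work")
store.store("The deployment target is a Kubernetes cluster")
print(store.search("Which language does the user prefer?"))
```

The retrieved text would then be placed in the prompt, which is the retrieval step of RAG.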

Cost Optimization

Strategies:

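Prompt Caching: where the provider supports it, let a repeated context prefix be cached instead of paying full price for it on every request
Token Reduction: remove unnecessary words from prompts and context
Model Selection: use smaller, cheaper models for simple tasks
Batching: combine multiple questions that share the same context into one request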

Example:

Instead of:
Request 1: Context (1000 tokens) + Question 1 (50 tokens)
Request 2: Context (1000 tokens) + Question 2 (50 tokens)

Better:
Request: Context (1000 tokens) + Question 1 + Question 2 (100 tokens)
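
The saving in this example is simple arithmetic on input tokens. The constants mirror the numbers above; the function names are chosen for this sketch.

```python
CONTEXT_TOKENS = 1000
QUESTION_TOKENS = 50

def separate_requests(num_questions: int) -> int:
    """Input tokens when every question re-sends the full context."""
    return num_questions * (CONTEXT_TOKENS + QUESTION_TOKENS)

def combined_request(num_questions: int) -> int:
    """Input tokens when the context is sent once with all questions."""
    return CONTEXT_TOKENS + num_questions * QUESTION_TOKENS

print(separate_requests(2))  # 2100
print(combined_request(2))   # 1100
```

The gap widens with every additional question, because the context is paid for only once.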

Temperature and Top-P

Temperature (0.0 - 2.0)

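0.0: near-deterministic; the model almost always picks the most probable token
1.0: balanced; samples from the model's unmodified distribution
2.0: highly random; creative, but often incoherent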

Top-P (0.0 - 1.0)

0.1: sample only from the most probable tokens that together cover 10% of the probability mass
0.9: wider selection, covering 90% of the probability mass
1.0: all tokens considered

Usage:

# For code generation
temperature = 0.2  # Precision
top_p = 0.9
 
# For creative writing
temperature = 0.8  # Creativity
top_p = 0.95
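
How the two parameters act on the model's output can be sketched as follows: temperature rescales the raw scores (logits) before softmax, and top-p keeps only the most probable tokens until their cumulative probability reaches the threshold. The logits are made up, and this is a simplified illustration, not any provider's exact implementation.

```python
import math
import random

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def sample(logits: list[float], temperature: float, top_p: float,
           rng: random.Random) -> int:
    """Pick a token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
rng = random.Random(0)
print(sample(logits, temperature=0.2, top_p=0.9, rng=rng))  # 0
```

At temperature 0.2 the first token already holds more than 90% of the probability, so top-p removes the alternatives and the result is always index 0. Raising the temperature flattens the distribution and lets other tokens through.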

Common Problems

1. Hallucinations

The model presents invented facts as if they were true.

Solution:

Give clear instructions
Use RAG for factual data
Validate important answers

2. Token Limit Exceeded

Solution:

Shorten context
Use summaries
Split into smaller requests
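
One way to shorten context is to drop the oldest messages until the rest fits a token budget. In this sketch `count_tokens` is a rough stand-in that counts whitespace-separated words; a real implementation would use the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):
        size = count_tokens(message["content"])
        if used + size > max_tokens:
            break
        kept.append(message)
        used += size
    return list(reversed(kept))

history = [
    {"role": "user", "content": "Tell me about transformers in detail please"},
    {"role": "assistant", "content": "Transformers use self-attention"},
    {"role": "user", "content": "And context windows?"},
]
print(trim_history(history, max_tokens=8))
```

With a budget of 8, only the last two messages are kept; the first is dropped because adding it would exceed the limit.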

3. Inconsistent Output

Solution:

Lower temperature
Use structured output (JSON mode)
Add examples (few-shot)

Production Considerations

Monitoring:

Track token usage
Measure latency
Monitor error rate

Security:

Input validation
Output filtering
Rate limiting
API key protection

Scaling:

Load balancing
Caching layers
Asynchronous processing
Fallback models
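
Fallback models can be sketched as an ordered list of callables that is tried until one succeeds. The two model functions are hypothetical stubs; in production they would wrap real API clients, and the `except` clause would catch that client's specific error types.

```python
from typing import Callable

def call_with_fallback(prompt: str, models: list[Callable[[str], str]]) -> str:
    """Try each model in order and return the first successful response."""
    errors = []
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # in production, catch the client's specific errors
            errors.append(exc)
    raise RuntimeError(f"All {len(models)} models failed: {errors}")

# Hypothetical model clients, stubbed for illustration
def primary_model(prompt: str) -> str:
    raise TimeoutError("primary model timed out")

def backup_model(prompt: str) -> str:
    return f"backup answer to: {prompt}"

print(call_with_fallback("Hello", [primary_model, backup_model]))
# backup answer to: Hello
```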

Best Practices

Clear Instructions

"Act as a Senior Python Developer.
Write production-ready code with
error handling and logging."

Few-Shot Learning

Example 1: Input → Output
Example 2: Input → Output
Now your task: Input → ?
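
A few-shot prompt in this shape can be assembled programmatically. The `Input:` / `Output:` labels and the sentiment examples are assumptions of this sketch; any consistent format works.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], task_input: str) -> str:
    """Format worked examples followed by the new input for the model to complete."""
    lines = []
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {task_input}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [("great product", "positive"), ("arrived broken", "negative")]
few_shot_prompt = build_few_shot_prompt(examples, "works as expected")
print(few_shot_prompt)
```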

Chain-of-Thought

"Explain step by step:
1. Analyze the problem
2. Identify possible solutions
3. Choose the best option
4. Implement"

Validation

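# `llm`, `validate`, and `improved_prompt` are placeholders for your client,
# your checker, and a revised prompt
response = llm.generate(prompt)
if not validate(response):
    # Retry once with a prompt that addresses the validation failure
    response = llm.generate(improved_prompt)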
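
A runnable variant of this pattern, assuming the expected output is a JSON object with known keys. The `generate` argument is any function that takes a prompt and returns text; here it is a stub that fails once and then succeeds, so the retry path is exercised.

```python
import json

def validate(response: str, required_keys: set[str]) -> bool:
    """Accept only a JSON object that contains every required key."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)

def generate_validated(generate, prompt: str, required_keys: set[str],
                       max_attempts: int = 3) -> str:
    """Call generate until its output validates or the attempts run out."""
    for _ in range(max_attempts):
        response = generate(prompt)
        if validate(response, required_keys):
            return response
        # Tighten the prompt before retrying
        prompt += "\nReturn only valid JSON with keys: " + ", ".join(sorted(required_keys))
    raise ValueError(f"No valid response after {max_attempts} attempts")

# Stub model: fails on the first call, succeeds on the second
replies = iter(["Sure! Here you go", '{"name": "Ada", "age": 36}'])
result = generate_validated(lambda prompt: next(replies), "Describe the user",
                            {"name", "age"})
print(result)  # {"name": "Ada", "age": 36}
```

Capping the number of attempts keeps a persistently failing prompt from looping and running up token costs.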

Conclusion

LLMs are powerful tools, but using them well requires that you:

Understand their limitations
Optimize costs
Validate outputs
Plan for production
Stay updated with developments

For production systems, it’s crucial to understand LLM internals, not just use the API.
