Basics of Large Language Models (LLMs)
What is an LLM?
A Large Language Model (LLM) is a neural network trained on large amounts of text data. It can generate text, interpret natural language, and solve a wide range of natural language processing (NLP) tasks.
Tokenization
Tokenization is the process of splitting text into smaller units called tokens.
# Example
Text: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Important:
- An LLM operates on token IDs, not on raw text
- Each token in the vocabulary has a unique numeric ID
- Costs are calculated in tokens, not characters
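A minimal sketch of the token-to-ID mapping described above. The regex splitter and the tiny vocabulary are illustrative assumptions; real LLM tokenizers use learned subword vocabularies (for example byte-pair encoding) and will produce different splits and IDs.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Split text into word tokens (with their leading space) and punctuation."""
    # Hypothetical rule: an optional leading space plus a word, or one punctuation mark.
    return re.findall(r" ?\w+|[^\w\s]", text)

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token a numeric ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

tokens = toy_tokenize("Hello, world!")
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['Hello', ',', ' world', '!']
print(token_ids)  # [0, 1, 2, 3]
```

The model only ever sees the list of IDs; the text is recovered by mapping IDs back to tokens.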
Cost example:
Request: 1000 tokens
Response: 500 tokens
Total cost = (1000 × input price) + (500 × output price)
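The formula above can be written as a small helper. The prices used here are made-up placeholders, not any provider's real rates; the per-1,000-token unit is also an assumption, since providers quote prices in different units.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the cost of one request given per-1,000-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Hypothetical prices in USD per 1,000 tokens; check your provider's price list.
cost = request_cost(1000, 500, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"${cost:.4f}")  # $0.0250
```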
Attention Mechanism
Attention lets the model weigh the parts of the input that are most relevant to the token it is currently processing.
Example:
Text: "The dog buried the bone because he was hungry"
Question: "Why did the dog bury the bone?"
Attention → focuses on "was hungry"
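To make "focusing" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The 2-dimensional vectors are made up for illustration; in a real model the query, key, and value vectors come from learned projections.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query: list[float], keys: list[list[float]],
              values: list[list[float]]) -> tuple[list[float], list[float]]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Made-up vectors for three tokens; the third key is most similar to the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([2.0, 2.0], keys, values)
print([round(w, 3) for w in weights])
```

The token whose key best matches the query receives the largest weight, which is the sense in which the model "focuses" on it.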
Transformer Architecture
The Transformer is the base architecture of modern LLMs.
Main components:
- Self-Attention: lets the model relate each token to every other token in the input
- Multi-Head Attention: several attention mechanisms running in parallel, each able to capture a different kind of relationship
- Feed-Forward Networks: transform each token's representation independently
- Positional Encoding: adds information about each token's position in the sequence
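As one illustration of the positional encoding component, the sketch below implements the sinusoidal scheme from the original Transformer design. This is only one option and an assumption here; many current LLMs use other position schemes.

```python
import math

def positional_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Each position gets a distinct vector, which is added to the token embedding so that attention can tell "first word" from "fifth word".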
Context Window
The context window is the maximum number of tokens the model can process in a single request, counting both the input and the generated output.
Examples:
- GPT-3.5: 4K tokens (~3000 words)
- GPT-4: 8K / 32K tokens (128K for GPT-4 Turbo)
- Claude 3: 200K tokens
- Gemini 1.5: up to 1M tokens
Long context problem:
Self-attention compute = O(n²), where n = number of tokens
Going from 1K to 100K tokens multiplies n by 100, so the attention cost is ~10,000× higher.
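The arithmetic behind that figure, as a short sketch. It assumes pure quadratic scaling of the attention computation and ignores everything else that contributes to real-world cost.

```python
def relative_attention_cost(n_tokens: int, baseline_tokens: int = 1_000) -> float:
    """Attention cost relative to the baseline, assuming pure O(n^2) scaling."""
    return (n_tokens / baseline_tokens) ** 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {relative_attention_cost(n):,.0f}x")
```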
Training vs. Inference
Training:
- One-time process
- Requires massive computational resources (thousands of GPUs)
- Takes weeks/months
- Very expensive (millions of dollars)
Inference:
- Runs on every API call
- Relatively cheap
- Fast (seconds)
- Scalable
Model Types
1. LLM (General-Purpose Models)
- Generate text based on input
- Examples: GPT-4, Claude, Gemini
2. Reasoning Models
- “Think” before answering by generating intermediate reasoning steps
- Can show the thought process or a summary of it
- Examples: OpenAI o1, o3
Difference:
LLM: Question → Immediate Answer
Reasoning: Question → Analysis → Thought Process → Answer
3. Agents
- Can use tools
- Make decisions
- Execute actions
Example:
User: "Book a flight to Berlin"
Agent:
1. Searches flights (Tool: FlightSearch)
2. Compares prices
3. Books ticket (Tool: BookFlight)
4. Sends confirmation
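The four steps above can be sketched as a tiny tool-using loop. Everything here is hypothetical: the tool functions return canned data, and the plan is hard-coded, whereas in a real agent the model decides which tool to call next.

```python
from typing import Callable

# Hypothetical tools with canned data; real tools would call external APIs.
def flight_search(destination: str) -> list[dict]:
    return [
        {"flight": "XY100", "to": destination, "price": 180},
        {"flight": "XY200", "to": destination, "price": 140},
    ]

def book_flight(flight: dict) -> str:
    return f"Booked {flight['flight']} to {flight['to']} for ${flight['price']}"

TOOLS: dict[str, Callable] = {"FlightSearch": flight_search, "BookFlight": book_flight}

def run_agent(destination: str) -> list[str]:
    """Run the four steps and return a log of what the agent did."""
    log = []
    flights = TOOLS["FlightSearch"](destination)       # 1. search flights
    log.append(f"Found {len(flights)} flights")
    cheapest = min(flights, key=lambda f: f["price"])  # 2. compare prices
    log.append(f"Cheapest is {cheapest['flight']}")
    log.append(TOOLS["BookFlight"](cheapest))          # 3. book ticket
    log.append("Confirmation sent to user")            # 4. send confirmation
    return log

for line in run_agent("Berlin"):
    print(line)
```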
Prompt Engineering vs. Contextual Engineering
Prompt Engineering
Optimizing the wording and structure of the request sent to the model
# Bad
"Write code"
# Good
"Write Python code for a function that calculates
the Fibonacci sequence up to the N-th number.
Use memoization for optimization.
Add docstrings and type hints."
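For comparison, this is roughly the kind of code the more specific prompt is asking for. It is an illustrative sketch, not actual model output; the function names and the choice of `functools.lru_cache` for memoization are assumptions.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Return the n-th Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fibonacci_sequence(n: int) -> list[int]:
    """Return the Fibonacci numbers from fib(0) up to fib(n), inclusive."""
    return [fib(i) for i in range(n + 1)]

print(fibonacci_sequence(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

Every requirement in the prompt (language, algorithm, memoization, docstrings, type hints) maps to something checkable in the result, which the vague prompt cannot offer.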
Contextual Engineering
Supplying relevant context inside the context window
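# Project rules that accompany every task
context = """
Project rules:
- Use TypeScript
- Follow Clean Code principles
- Write unit tests
"""
user_task = "Implement a date-formatting helper"  # hypothetical example task
prompt = f"{context}\n\nTask: {user_task}"
Memory Management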
Short-term Memory
- Current context within the window
- Limited by context window
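Because short-term memory is bounded by the context window, older messages eventually have to be dropped or summarized. A minimal sketch of the dropping approach follows; the word-count token estimate is a crude assumption, and a real system would use the model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())

def trim_history(history: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit within max_tokens."""
    kept: list[dict] = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Explain tokenization briefly"},
]
print(trim_history(history, max_tokens=8))
```

With a budget of 8 estimated tokens, the two most recent messages fit and the oldest one is dropped.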
Long-term Memory
- Stored in external DB
- Retrieved as needed
- Effectively unlimited, but requires retrieval such as RAG (retrieval-augmented generation)
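A self-contained toy version of long-term memory retrieval is sketched below. The word-count "embedding" and the in-memory `ToyVectorStore` class are stand-ins invented for illustration; a real system would use a learned embedding model and a vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """In-memory stand-in for a vector database."""
    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, top_k: int = 1) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine(query_vec, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]

store = ToyVectorStore()
store.store("the user prefers typescript for frontend code")
store.store("the user lives in berlin")
print(store.search("which language does the user prefer for code"))
```

The retrieved text would then be placed into the prompt, which is the "retrieved as needed" step.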
Implementation:
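# Short-term: the running conversation sent with each request
conversation_history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
# Long-term: vector_db and embedding() are placeholders for your vector store and embedding model
vector_db.store(embedding(text), metadata)
relevant_context = vector_db.search(embedding(query))
Cost Optimization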
Strategies:
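- Prompt Caching: reuse a repeated context prefix instead of resending it at full price (many providers bill cached tokens at a reduced rate)
- Token Reduction: remove unnecessary words from prompts and context
- Model Selection: use smaller, cheaper models for simple tasks
- Batching: combine multiple questions into one request so that shared context is sent only once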
Example:
Instead of:
Request 1: Context (1000 tokens) + Question 1 (50 tokens)
Request 2: Context (1000 tokens) + Question 2 (50 tokens)
Better:
Request: Context (1000 tokens) + Question 1 + Question 2 (100 tokens)
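The saving in the example above, expressed as a short calculation. It counts input tokens only and uses the token figures from the example; output tokens are not affected by batching in this simple model.

```python
CONTEXT_TOKENS = 1000
QUESTION_TOKENS = 50

def separate_requests(n_questions: int) -> int:
    """Input tokens when each question is sent with its own copy of the context."""
    return n_questions * (CONTEXT_TOKENS + QUESTION_TOKENS)

def batched_request(n_questions: int) -> int:
    """Input tokens when all questions share one copy of the context."""
    return CONTEXT_TOKENS + n_questions * QUESTION_TOKENS

print(separate_requests(2))  # 2100
print(batched_request(2))    # 1100
```

The gap grows with every additional question, because the separate approach pays for the context again each time.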
Temperature and Top-P
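Temperature (0.0 - 2.0; the exact range varies by provider)
- 0.0: near-deterministic, almost always the same answer
- 1.0: balanced
- 2.0: creative but increasingly random
Top-P (0.0 - 1.0)
- 0.1: sample only from the most probable tokens that together cover 10% of the probability mass
- 0.9: wider selection
- 1.0: all tokens considered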
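How the two parameters act on the next-token distribution can be sketched in a few lines. The logits are made up, and treating temperature 0 as greedy decoding is a common convention assumed here; providers may implement the details differently.

```python
import math
import random

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize

def sample(logits: dict[str, float], temperature: float = 1.0, top_p: float = 1.0) -> str:
    """Draw one token after applying temperature and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Made-up logits for the next token.
logits = {"cat": 2.0, "dog": 1.0, "car": 0.1}
print(sample(logits, temperature=0.0))                            # cat
print(list(top_p_filter(apply_temperature(logits, 1.0), 0.5)))    # ['cat']
```

Temperature reshapes the whole distribution, while top-p cuts off its tail; the two are applied together.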
Usage:
# For code generation
temperature = 0.2  # Precision
top_p = 0.9
# For creative writing
temperature = 0.8  # Creativity
top_p = 0.95
Common Problems
1. Hallucinations
The model states invented facts as if they were true.
Solution:
- Give clear instructions
- Use RAG for factual data
- Validate important answers
2. Token Limit Exceeded
Solution:
- Shorten context
- Use summaries
- Split into smaller requests
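Splitting into smaller requests can be as simple as chunking the input by a token budget. A minimal sketch, assuming one token per whitespace-separated word as a rough proxy for a real tokenizer:

```python
def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words."""
    # Assumption: one token per whitespace-separated word (a rough proxy).
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

chunks = split_into_chunks("one two three four five six seven", max_tokens=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Each chunk is then sent as its own request, and the partial results are combined or summarized afterwards.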
3. Inconsistent Output
Solution:
- Lower temperature
- Use structured output (JSON mode)
- Add examples (few-shot)
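Structured output is easiest to rely on when it is checked. The sketch below parses the response as JSON and retries on failure; `call_model` is a hypothetical stand-in for whatever client function returns the model's text, and the required keys are made up.

```python
import json
from typing import Callable, Optional

def generate_json(call_model: Callable[[str], str], prompt: str,
                  required_keys: set[str], max_attempts: int = 3) -> Optional[dict]:
    """Call the model until it returns a JSON object containing required_keys."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, try again
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
    return None  # caller decides how to handle repeated failure

# Stand-in for a real API call: fails once, then returns valid JSON.
replies = iter(["Sure! Here is the answer", '{"sentiment": "positive", "score": 0.9}'])
result = generate_json(lambda prompt: next(replies), "Classify: 'Great product'",
                       required_keys={"sentiment", "score"})
print(result)  # {'sentiment': 'positive', 'score': 0.9}
```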
Production Considerations
Monitoring:
- Track token usage
- Measure latency
- Monitor error rate
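The three monitoring items can be collected by one thin wrapper around the model call. This is a sketch under assumptions: `fake_model` and the `(text, tokens_used)` return shape are hypothetical, and real clients report usage in their own response formats.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class UsageMonitor:
    """Collects token usage, latency, and error counts for model calls."""
    calls: int = 0
    errors: int = 0
    total_tokens: int = 0
    latencies: list[float] = field(default_factory=list)

    def record(self, call: Callable[[str], tuple[str, int]], prompt: str) -> str:
        """Run one call, assuming it returns (text, tokens_used)."""
        self.calls += 1
        start = time.perf_counter()
        try:
            text, tokens = call(prompt)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)
        self.total_tokens += tokens
        return text

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

# Hypothetical model client that reports how many tokens it used.
def fake_model(prompt: str) -> tuple[str, int]:
    return "ok", len(prompt.split()) + 1

monitor = UsageMonitor()
monitor.record(fake_model, "How many tokens is this?")
print(monitor.calls, monitor.total_tokens, monitor.error_rate)
```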
Security:
- Input validation
- Output filtering
- Rate limiting
- API key protection
Scaling:
- Load balancing
- Caching layers
- Asynchronous processing
- Fallback models
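Fallback models, the last item above, can be sketched as trying a list of clients in order. The `primary` and `fallback` functions are hypothetical stand-ins for real model clients, and catching a bare `Exception` is a simplification.

```python
from typing import Callable

def call_with_fallback(models: list[tuple[str, Callable[[str], str]]],
                       prompt: str) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))

# Hypothetical model clients: the primary times out, the fallback answers.
def primary(prompt: str) -> str:
    raise TimeoutError("primary model timed out")

def fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"

used_model, answer = call_with_fallback(
    [("primary", primary), ("fallback", fallback)], "ping")
print(used_model, "->", answer)  # fallback -> fallback answer to: ping
```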
Best Practices
- Clear Instructions
"Act as a Senior Python Developer. Write production-ready code with error handling and logging."
- Few-Shot Learning
Example 1: Input → Output
Example 2: Input → Output
Now your task: Input → ?
- Chain-of-Thought
"Explain step by step:
1. Analyze the problem
2. Identify possible solutions
3. Choose the best option
4. Implement"
- Validation
response = llm.generate(prompt)
if not validate(response):
    response = llm.generate(improved_prompt)
Conclusion
LLMs are powerful tools, but using them well means you need to:
- Understand their limitations
- Optimize costs
- Validate outputs
- Plan for production
- Stay updated with developments
For production systems, it’s crucial to understand LLM internals, not just use the API.