How LLMs Work: The Architecture Behind Every AI Answer


You type “explain consistent hashing” into ChatGPT. Two seconds later, you get a 500-word explanation with examples, analogies, and even a code snippet. It feels like the model understands your question. It does not. What actually happened is a sequence of matrix multiplications across billions of parameters, together producing a probability distribution over the next token to generate, repeated once per token. No database lookup. No retrieval from a knowledge base. Just pattern completion at a scale that produces emergent behavior that often looks indistinguishable from understanding.

If you are building products on top of LLMs, you need to know what is happening under the hood - not to become an ML researcher, but to understand why your prompts sometimes fail, why context windows matter, and why hallucinations are a feature of the architecture, not a bug.

What a large language model actually is

An LLM is a neural network trained to predict the next token in a sequence. That is the entire objective. Given “The capital of France is”, the model assigns probabilities to every token in its vocabulary. “Paris” gets a high probability. “pizza” gets a low one. The model generates text by sampling from this distribution, one token at a time, feeding each generated token back as input.
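
The generation loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `VOCAB` and `toy_logits` are hypothetical stand-ins for a real tokenizer vocabulary and a real transformer forward pass, and the scores are made up.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical five-token vocabulary; real vocabularies hold tens of thousands of tokens.
VOCAB = ["Paris", "the", "Lyon", "pizza", "."]

def toy_logits(tokens):
    """Stand-in for the model: returns one made-up score per vocabulary entry."""
    if tokens[-1] == "is":
        return [6.0, 2.5, 1.5, -3.0, 0.0]
    return [-2.0, 0.0, -2.0, -3.0, 5.0]  # otherwise favour ending the sentence

def generate(prompt_tokens, max_new_tokens, rng):
    """Sample one token at a time, feeding each sample back in as input."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

prompt = ["The", "capital", "of", "France", "is"]
probs = softmax(toy_logits(prompt))
completion = generate(prompt, max_new_tokens=2, rng=random.Random(0))
```

Because the next token is sampled rather than looked up, the same prompt can yield different completions across runs unless the random seed is fixed.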

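The “large” in LLM refers to parameter count. Frontier labs rarely publish exact sizes: GPT-4 is rumored to have over a trillion parameters, and Anthropic has not disclosed a figure for Claude 3.5 Sonnet. Open-weight models give firmer reference points, with Llama 3.1 ranging from 8B to 405B parameters. These parameters are the weights learned during training - they encode compressed patterns from the training data.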

graph TD
  A["Input: 'The capital of France is'"] --> B["Tokenizer"]
  B --> C["Token IDs: [464, 3361, 315, 4788, 374]"]
  C --> D["Embedding Layer"]
  D --> E["Transformer Blocks x N"]
  E --> F["Output Probabilities"]
  F --> G["'Paris' (0.92), 'the' (0.03), 'Lyon' (0.01), ..."]
  G --> H["Sample → 'Paris'"]

  style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style E fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style F fill:#FAEEDA,stroke:#854F0B,color:#633806
  style H fill:#EEEDFE,stroke:#534AB7,color:#3C3489

The transformer architecture

Every modern LLM is built on the transformer, introduced in the 2017 paper “Attention Is All You Need.” The key innovation is self-attention - a mechanism that lets the model weigh how much each token in the input should influence the representation of every other token.

Self-attention in one paragraph

Consider the sentence “The bank by the river was muddy.” To understand that “bank” means riverbank (not a financial institution), the model needs to relate “bank” to “river.” Self-attention does this by computing three vectors for each token - Query, Key, and Value - using learned projections. The raw attention score between two tokens is the dot product of one token’s Query with the other’s Key, scaled by the square root of the key dimension. A softmax turns each token’s scores into weights that sum to 1, so high scores mean “pay attention to this relationship.” The output for each position is a weighted sum of all Value vectors, using these softmax weights.
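
A minimal sketch of scaled dot-product attention in plain Python may make the mechanics concrete. The Query, Key, and Value vectors below are hand-picked for illustration (an assumption); in a real model they come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # one weight per token, summing to 1
        # Output is the weighted sum of all value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
tokens = ["bank", "river", "the"]
Q = [[2.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
K = [[0.5, 0.0], [2.0, 0.0], [0.0, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
bank_weights = dict(zip(tokens, weights[0]))  # how much "bank" attends to each token
```

With these made-up vectors, the Query for “bank” aligns most strongly with the Key for “river”, so “river” receives the largest weight in `bank_weights`.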

graph LR
  subgraph sa["Self-Attention Mechanism"]
      direction TB
      T1["Token: 'bank'"] --> Q1["Query"]
      T1 --> K1["Key"]
      T1 --> V1["Value"]
      T2["Token: 'river'"] --> K2["Key"]
      T2 --> V2["Value"]
      Q1 -.->|"high score"| K2
      Q1 -.->|"low score"| K3["Key: 'the'"]
  end

  style T1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style T2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style Q1 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style K2 fill:#E1F5EE,stroke:#0F6E56,color:#085041

Multi-head attention

Instead of one attention computation, transformers run multiple attention “heads” in parallel. Each head can learn different types of relationships - one head might track syntactic structure, another might track coreference, another might focus on semantic similarity. GPT-3’s largest model uses 96 attention heads per layer; GPT-4’s configuration is unpublished but is likely comparable or larger.
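
As a rough sketch, multi-head attention can be viewed as slicing each vector into equal chunks, running attention on each chunk independently, and concatenating the results. The code below makes a simplifying assumption: it omits the learned projection matrices that real implementations apply before the split and after the concatenation. The function names are illustrative.

```python
import math

def attend(q_rows, k_rows, v_rows):
    """Scaled dot-product attention for a single head."""
    d = len(k_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_rows]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, v_rows))
                    for i in range(len(v_rows[0]))])
    return out

def split_heads(rows, num_heads):
    """Slice every vector into num_heads equal chunks; returns one list of rows per head."""
    head_dim = len(rows[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in rows]
            for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    per_head = [attend(qh, kh, vh)
                for qh, kh, vh in zip(split_heads(q, num_heads),
                                      split_heads(k, num_heads),
                                      split_heads(v, num_heads))]
    # Concatenate each token's per-head outputs back into one vector.
    return [[x for head in per_head for x in head[t]] for t in range(len(q))]

# Three tokens with 4-dimensional vectors, split across 2 heads of size 2.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
mha_out = multi_head_attention(x, x, x, num_heads=2)
```

Each head only sees its own slice of the vector, which is what lets different heads specialize in different relationships.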

The full transformer block

Each transformer block contains:

  1. Multi-head self-attention - relates tokens to each other
  2. Feed-forward network - processes each token independently through two linear layers with a non-linearity
  3. Layer normalization - stabilizes training
  4. Residual connections - lets gradients flow through deep networks
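
The four components above can be wired together as follows. This is a sketch under stated assumptions: it uses the pre-norm ordering (normalize, then apply the sub-layer, then add the residual) common in GPT-style models, omits biases and the learned scale and shift of layer normalization, and takes the attention step as a caller-supplied function. `uniform_mix` is a hypothetical placeholder for real attention.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def feed_forward(vec, w1, w2):
    """Two linear layers with a ReLU in between, applied to one token."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vec, unit))) for unit in w1]
    return [sum(h * w for h, w in zip(hidden, unit)) for unit in w2]

def transformer_block(rows, attention_fn, w1, w2):
    # Sub-layer 1: attention mixes information across tokens, plus a residual.
    attended = attention_fn([layer_norm(r) for r in rows])
    rows = [[x + a for x, a in zip(r, att)] for r, att in zip(rows, attended)]
    # Sub-layer 2: the feed-forward network processes each token independently, plus a residual.
    return [[x + f for x, f in zip(r, feed_forward(layer_norm(r), w1, w2))]
            for r in rows]

def uniform_mix(rows):
    """Hypothetical stand-in for attention: every token receives the average of all tokens."""
    dim = len(rows[0])
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return [list(mean) for _ in rows]

tokens = [[1.0, 2.0], [3.0, 5.0]]
w1 = [[0.5, -0.5], [0.25, 0.25], [-1.0, 1.0]]  # 3 hidden units, each reading 2 inputs
w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]       # 2 outputs, each reading 3 hidden units
block_out = transformer_block(tokens, uniform_mix, w1, w2)
```

Note how the residual additions mean each sub-layer only contributes a correction to the running representation; if a sub-layer outputs zeros, the input passes through unchanged.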

GPT-3’s largest model stacks 96 of these blocks, and frontier models such as GPT-4 are believed to use a similar or greater depth. Each block refines the representation, building increasingly abstract features from raw token patterns to syntactic structure to semantic meaning.

Training: where the knowledge comes from

LLM training happens in stages:

Pre-training: The model processes trillions of tokens from the internet, books, code, and other text. For each position, it predicts the next token and updates its weights to reduce prediction error. This is self-supervised - the text itself provides the labels, so no human annotation is needed. The result is a model that has compressed statistical patterns of human language and knowledge into its parameters.
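
The “prediction error” in pre-training is typically measured with cross-entropy: the negative log of the probability the model assigned to the token that actually came next. A small sketch, using made-up probability distributions:

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one position: negative log-probability of the true next token."""
    return -math.log(probs[target_index])

# Two hypothetical predictions over a 4-token vocabulary; index 0 is the true next token.
confident = [0.92, 0.03, 0.01, 0.04]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = cross_entropy(confident, 0)
loss_unsure = cross_entropy(unsure, 0)
```

The loss shrinks toward zero as the model puts more probability on the correct token, and training adjusts the weights to push it in that direction across every position in the corpus.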

Fine-tuning (SFT): The pre-trained model is further trained on curated question-answer pairs to make it follow instructions rather than just complete text.

RLHF/RLAIF: Reinforcement learning from human (or AI) feedback aligns the model with human preferences - making responses helpful, harmless, and honest.

Where it breaks: understanding the failure modes

Because you now know this is next-token prediction, several failure modes become obvious:

Hallucinations are structural. The model does not “know” facts - it generates statistically likely continuations. If the most likely next tokens form a false statement, it will generate that statement with confidence. There is no internal fact-checking mechanism.

Knowledge cutoff. Knowledge is frozen at the training cutoff. The model cannot know about events that occurred after its training data was collected. This is not a bug to fix - it is a consequence of parametric memory.

Context window limits. Standard self-attention is O(n^2) in sequence length. Processing 100K tokens means scoring every pair of tokens - 100,000 x 100,000, or 10 billion scores per attention head in every layer. This is why long context windows are expensive and why every model ships with a hard limit.
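
The quadratic growth is easy to check with arithmetic. The helper names below are illustrative, and the 2-bytes-per-score figure assumes 16-bit storage of a fully materialized score matrix; optimized attention kernels avoid holding the whole matrix in memory at once.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs scored by one attention head."""
    return num_tokens * num_tokens

def score_matrix_bytes(num_tokens, bytes_per_score=2):
    """Memory needed to hold one head's full score matrix (assumes 16-bit scores)."""
    return attention_pairs(num_tokens) * bytes_per_score

pairs_100k = attention_pairs(100_000)
pairs_200k = attention_pairs(200_000)
growth = pairs_200k // pairs_100k       # doubling the context multiplies the pairs by this factor
naive_bytes = score_matrix_bytes(100_000)
```

Doubling the context from 100K to 200K tokens quadruples the number of pairs, which is the scaling behavior behind context window limits.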

Brittleness to phrasing. Because the model learned patterns from specific phrasings in training data, slightly different wordings of the same question can produce different quality answers.

Real-world systems that use this architecture

  • OpenAI GPT-4/GPT-4o - transformer-based; architecture unpublished, rumored to be a mixture-of-experts with ~1.8T total parameters
  • Anthropic Claude - transformer-based (architecture details unpublished), trained with Constitutional AI
  • Google Gemini - multimodal transformer trained on text, images, audio, video
  • Meta Llama 3 / 3.1 - open-weight dense transformers, 8B to 405B parameters
  • Mistral/Mixtral - mixture-of-experts architecture where only a subset of parameters activate per token
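
To illustrate the mixture-of-experts idea from the last entry, here is a minimal top-k routing sketch. It assumes a router that emits one score per expert, keeps the k highest-scoring experts, and normalizes their scores with a softmax; the experts are hypothetical toy functions rather than real feed-forward networks.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(token_vec, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(token_vec)
    for idx, gate in gates.items():
        expert_out = experts[idx](token_vec)  # unselected experts never execute
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

calls = []  # records which experts actually ran

def make_expert(index, scale):
    """Hypothetical expert: scales its input and logs that it was called."""
    def expert(vec):
        calls.append(index)
        return [scale * x for x in vec]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
moe_out = moe_layer([1.0, 2.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Only two of the four experts execute for this token, which is how such models keep per-token compute well below what their total parameter count would suggest.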

How to apply this knowledge

When building RAG systems: You are compensating for the fact that parametric memory is frozen and lossy. RAG injects relevant context into the prompt so the model does not need to rely on compressed training patterns.
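
A bare-bones sketch of the retrieve-then-prompt pattern follows. It ranks documents by simple word overlap, which is an assumption made to keep the example dependency-free; production systems usually rank by embedding similarity. The documents and prompt template are hypothetical.

```python
import re

def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, top_k=2):
    """Return the top_k documents sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(docs, key=lambda d: len(query_words & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, docs, top_k=2):
    """Inject the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "The office is closed on public holidays.",
    "Refunds require the original receipt.",
]
query = "How are refunds processed?"
prompt = build_prompt(query, docs)
```

The model never needs to have memorized the refund policy; the relevant text arrives in the context window at query time, and updating the documents updates the answers without retraining.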

When debugging bad outputs: Ask whether the issue is a knowledge gap (needs RAG or fine-tuning) or a reasoning gap (needs better prompting, chain-of-thought, or a more capable model).

When choosing models: Larger models have more parameters to encode patterns, but also cost more per token. If your task only requires pattern matching over well-represented domains, a smaller model works fine. Novel reasoning over unfamiliar domains needs larger models.

When designing prompts: You are crafting the input sequence to activate the right patterns in the model’s parameters. Explicit context, clear structure, and examples all reduce ambiguity about which patterns to activate.

FAQ

Q: Do LLMs actually understand language?

This is a philosophical question more than a technical one. Functionally, they produce outputs indistinguishable from understanding for many tasks. Mechanistically, they are doing pattern matching over statistical regularities. Whether that constitutes “understanding” depends on your definition. For engineering purposes, treat them as very capable pattern matchers that can fail in ways that reveal they lack genuine comprehension.

Q: Why can't we just make the context window infinite?

Self-attention has quadratic complexity - doubling the context window quadruples the compute for attention. There are linear attention variants and approaches like sliding window attention, but they trade off global context for efficiency. The fundamental constraint is that relating every token to every other token is expensive.

Q: Is a bigger model always better?

No. Bigger models have more capacity to memorize patterns, but they also cost more, have higher latency, and can be harder to control. For well-defined, narrow tasks, a fine-tuned smaller model often outperforms a general-purpose large model while being 10-100x cheaper.

Interview questions

Q: Walk me through what happens when a user sends a prompt to an LLM-based application.

Strong answers cover: tokenization, embedding lookup, passage through transformer blocks (attention + FFN), output probability distribution, sampling strategy, and detokenization. Great answers also mention KV caching for efficiency during generation.
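
For candidates who bring up KV caching, the idea can be sketched as follows: the Key and Value vectors of earlier tokens are stored once, so each generation step only computes them for the newest token and then attends over the cache. The class and method names are illustrative, and the vectors are made up.

```python
import math

def attend_one(query, keys, values):
    """Attention output for a single query over all available keys and values."""
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

class KVCache:
    """Keeps the keys and values of every token processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Store the new token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        return attend_one(query, self.keys, self.values)

# Hypothetical (query, key, value) triples for three successive tokens.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [0.5, 0.5], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in steps]
```

The final output matches what full attention would compute for the last token, but earlier keys and values were never recomputed along the way.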

Q: Why do LLMs hallucinate, and what architectural properties make this difficult to solve?

The model samples a probable next token based on learned patterns - it has no mechanism to verify factual correctness. The knowledge is distributed across billions of parameters with no discrete “fact storage.” Solutions require external verification (RAG, tool use) because the architecture itself cannot distinguish correct from plausible.

Q: You are designing a system that needs to answer questions about your company’s internal documentation. Would you fine-tune an LLM or use RAG? Justify your choice.

RAG is almost always the right starting point: it keeps knowledge updatable without retraining, provides attributable sources, and is cheaper. Fine-tuning makes sense for style/format adaptation or when the knowledge is too large for context windows. The best systems often combine both - fine-tuning for domain vocabulary and RAG for specific facts.