
What is a Transformer?

The architecture powering modern LLMs.


A transformer is a neural network architecture that uses self-attention to process sequences. Introduced in the 2017 paper "Attention Is All You Need," it is the foundation of virtually all modern LLMs, including GPT-5, Claude 4, Gemini, and Llama.

Key Components

  • Self-attention: lets each token weigh its relevance to every other token in the sequence
  • Feed-forward network: transforms each position independently
  • Layer normalization: stabilizes training in deep stacks
  • Positional encoding: injects sequence-order information the attention mechanism otherwise lacks
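The core of the list above, self-attention, can be sketched in a few lines. This is a minimal single-head, NumPy-only illustration (the weight matrices `Wq`, `Wk`, `Wv` and the helper name `self_attention` are illustrative, not from any particular library):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # project tokens to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                         # each output is a weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)
```

Real transformers run many such heads in parallel (multi-head attention) and stack the result with the feed-forward, normalization, and positional components listed above.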

Transformer Variants

  • Decoder-only: GPT, Claude, Llama (text generation)
  • Encoder-only: BERT (understanding)
  • Encoder-decoder: T5 (translation, summarization)

Why Transformers Enabled LLMs

  • Parallel processing enables massive scale
  • Attention captures long-range dependencies
  • Scales predictably with compute
  • Flexible for many modalities

Why are transformers important?

Transformers enabled the scaling that created modern LLMs. Unlike recurrent architectures such as LSTMs, which process tokens one at a time, they process entire sequences in parallel and capture long-range dependencies directly through attention. This made training on massive datasets practical.
