Tokenization & Embeddings
ok what even is a token. been ignoring this question for too long.
- 1. The core problem: models can’t read text
- 2. Tokenization: what even is a token?
- 3. BPE — Byte Pair Encoding
- 4. Embeddings — the geometry of meaning
- 5. Word2Vec — the conceptual foundation
- 6. What embeddings look like geometrically
- 7. Contextual embeddings — the transformer upgrade
- 8. The embedding matrix is massive
- 9. How representations change across layers
- 10. Positional encoding
- 11. The full pipeline — from raw text to semantic search
- References
1. The core problem: models can’t read text
Neural networks operate on real-valued vectors. They cannot read the string "hello". They can only do arithmetic — multiply matrices, add vectors, apply nonlinearities. So the entire pipeline of modern NLP begins with one question: how do you turn a word into a number in a way that preserves meaning?
The naive answer is obvious and wrong: just assign every word an integer. apple=1, banana=2, cat=3. Two problems:
- Magnitude lies. The model sees cat=3 > apple=1 and thinks cat is "more" than apple. That's nonsense.
- No structure. apple and banana are both fruit. apple and cat aren't. Integer IDs encode zero information about relationships.
The fix for (1) is one-hot encoding. The fix for (1) and (2) is learned embeddings. But before we get there, we need to talk about the step before representation: tokenization.
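A quick toy illustration (numpy; the three-word vocab is hypothetical) of why one-hot fixes the magnitude lie but not the structure problem — every word is exactly as far from every other word:

import numpy as np

vocab = {"apple": 0, "banana": 1, "cat": 2}
onehot = np.eye(len(vocab))  # each word: a unit vector on its own axis
print(onehot[vocab["apple"]] @ onehot[vocab["banana"]])  # 0.0
print(onehot[vocab["apple"]] @ onehot[vocab["cat"]])     # 0.0 — fruit vs. animal: invisible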
2. Tokenization: what even is a token?
A token is not a word. I thought it was. It isn’t.
A token is the atomic unit the model operates on. Depending on the tokenizer, "unbelievable" might be one token, or ["un", "believ", "able"], or ["u","n","b","e","l","i","e","v","a","b","l","e"].
Character-level
Split into individual characters. Vocabulary is tiny (128 for ASCII, 256 for raw bytes, on the order of 150k for every assigned Unicode character). Easy to implement. But:
- Sequences get very long. "hello world" → 11 tokens. Transformers scale as $O(n^2)$ in sequence length because of self-attention. This hurts.
- The model has to learn to compose letters into meaning from scratch. That's a lot to ask.
Word-level
Split on whitespace (roughly). Vocabulary might be 100k+ words. But:
"running"and"runs"are completely separate entries. The model has to learn their relationship from data.- Out-of-vocabulary problem: a word not seen in training gets
[UNK]. Goodbye, that information. - Morphologically rich languages (Turkish, Finnish, German compounds) basically explode.
Subword — the actual answer
The insight: split rare words into pieces, keep common words whole. "tokenization" → ["token", "ization"]. "cat" stays ["cat"]. You get:
- Bounded vocabulary size (~30k–100k typical)
- No true OOV: any word can be spelled out as subword pieces
- Shared morphological roots: "run", "running", "runner" all contain the "run" piece
The two dominant algorithms are BPE and WordPiece.
3. BPE — Byte Pair Encoding
Originally a compression algorithm (1994). Repurposed for NLP and now the backbone of GPT-2, GPT-4, LLaMA. The idea is elegant:
Algorithm:
- Start with a character-level vocabulary (every character is its own token, plus a special end-of-word marker)
- Count all adjacent pair frequencies across the corpus
- Merge the most frequent pair into a new token
- Repeat until vocabulary reaches target size
Worked example:
Say our corpus has:
"low" × 5, "lower" × 2, "newest" × 6, "widest" × 3
Initial character split (with </w> as end-of-word):
l o w </w> × 5
l o w e r </w> × 2
n e w e s t </w> × 6
w i d e s t </w> × 3
Count pairs → e s appears 9 times (6+3). Merge → es:
l o w </w> × 5
l o w e r </w> × 2
n e w es t </w> × 6
w i d es t </w> × 3
Next most frequent: es t → 9 times. Merge → est:
l o w </w> × 5
l o w e r </w> × 2
n e w est </w> × 6
w i d est </w> × 3
Keep going until vocab target is hit. What you end up with is a merge table — an ordered list of pairs to merge — that gets applied to any new text at inference.
The order matters. BPE is greedy: earlier merges take priority.
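The training loop itself is tiny. A sketch in the spirit of the Sennrich et al. reference implementation — train_bpe and the space-separated input format here are illustrative, not the paper's exact code:

import re
from collections import Counter

def train_bpe(vocab, num_merges):
    # vocab: {"l o w </w>": 5, "l o w e r </w>": 2, ...} — space-separated symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        merges.append(best)
        # merge e.g. "e s" → "es" everywhere, respecting symbol boundaries
        pat = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pat.sub("".join(best), w): f for w, f in vocab.items()}
    return merges  # the ordered merge table, applied greedily at encode time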
Encoding a new word:
"lowest" → split to chars → l o w e s t </w> → apply merges in order → l o w est </w> → three tokens.
# GPT-2-style byte-level BPE encoding — sketch made runnable; assumes
# bpe_ranks: {(left, right): rank} merge table (lower rank = earlier merge)
# and vocab: {token_bytes: id}, both loaded from the tokenizer files
def merge_pair(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

def encode(text: str) -> list[int]:
    # byte-level: works on UTF-8 bytes, not chars — any input is representable
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        pairs = set(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if best not in bpe_ranks:  # best-ranked pair isn't in the table: done
            break
        tokens = merge_pair(tokens, best)
    return [vocab[token] for token in tokens]
GPT-2 and descendants use byte-level BPE — base vocabulary is 256 bytes, not Unicode characters. This means truly no OOV: any byte sequence is representable. GPT-2’s vocabulary size is 50,257 (50,000 BPE merges + 256 byte tokens + 1 special <|endoftext|> token).
WordPiece (BERT)
Very similar to BPE but the merge criterion is different. Instead of raw frequency, WordPiece maximizes:
\[\text{score}(A, B) = \frac{\text{freq}(AB)}{\text{freq}(A) \cdot \text{freq}(B)}\]
This is basically pointwise mutual information (PMI) — it favors pairs that co-occur more than chance, not just pairs that are common. Two very frequent individual tokens won't necessarily merge unless they appear together more than expected.
Visible difference: BERT uses ## prefix for continuation pieces. "playing" → ["play", "##ing"].
SentencePiece
Language-agnostic. Doesn’t assume whitespace is a word boundary (important for Japanese, Chinese, Thai). Treats the raw text stream as input and learns both pre-tokenization and BPE/unigram jointly. Used in T5, LLaMA, Gemma.
One clever trick SentencePiece supports during training (not inference), called subword regularization: it stochastically samples different valid tokenizations of the same word. The model never overfits to one specific way to segment a word — it sees "running" as ["run", "ning"] sometimes, ["runn", "ing"] other times. Acts like data augmentation at the tokenization level.
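With the sentencepiece library that looks roughly like this (a sketch — "m.model" is a placeholder for a trained model file, flags per the library's documented encode API):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")
for _ in range(3):
    # enable_sampling draws one segmentation per call; nbest_size=-1 samples
    # from all candidates, alpha controls how peaked the distribution is
    print(sp.encode("running", out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))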
4. Embeddings — the geometry of meaning
Ok so we can turn text into a sequence of integer token IDs. Now what?
The token ID is an index into an embedding matrix $E \in \mathbb{R}^{\lvert V \rvert \times d}$ where $\lvert V \rvert$ is vocab size and $d$ is embedding dimension (e.g. 768 for BERT-base, 4096 for LLaMA-7B).
\[\mathbf{e}_i = E[i, :] \in \mathbb{R}^d\]
This is just a lookup table. Token 42 → row 42 of $E$. The magic is that $E$ is a learned parameter. During training, gradients flow back through the lookup and update the rows of $E$ so that tokens that appear in similar contexts end up with similar vectors.
graph LR
T["token ID: 42"] --> LT["E[42, :]\nlookup row 42"]
LT --> V["dense vector ∈ ℝ^768"]
V --> TF["transformer layers"]
TF --> O["output / prediction"]
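In code, that lookup is one line — a sketch with a GPT-2-sized table (randomly initialized here; training is what makes the rows meaningful):

import torch

E = torch.nn.Embedding(num_embeddings=50_257, embedding_dim=768)  # GPT-2-sized
ids = torch.tensor([42, 1000, 42])        # token IDs from the tokenizer
x = E(ids)                                # [3, 768] — rows 42, 1000, 42 of E.weight
assert torch.equal(x[0], E.weight[42])    # just a row lookup, and E.weight is trainable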
Why does similar context → similar vector? Because of how the model is trained. In the simplest case (Word2Vec skip-gram): predict surrounding words from a center word. If "dog" and "cat" both appear between "the" and "sat", their embedding vectors get pulled toward whatever linear combination predicts "the" and "sat". They converge.
5. Word2Vec — the conceptual foundation
Not used in production models directly anymore, but understanding it builds the right intuition for why embeddings work at all.
Skip-gram objective
Given a center word $w_c$ at position $t$, predict context words within a window of size $k$:
\[\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-k \le j \le k \\ j \neq 0}} \log P(w_{t+j} \mid w_t)\]
where:
\[P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^\top \mathbf{v}_{w_c})}{\sum_{w=1}^{\lvert V \rvert} \exp(\mathbf{u}_w^\top \mathbf{v}_{w_c})}\]
This is a softmax over the dot product between the output embedding $\mathbf{u}$ of the context word and the input embedding $\mathbf{v}$ of the center word.
(Two embedding matrices — $V$ for center words, $U$ for context words. They’re usually averaged or just one is kept afterward.)
The denominator — summing over all $\lvert V \rvert$ words — is what kills you computationally for large vocabularies. $\lvert V \rvert = 50000$ means 50k exponentials for every gradient step. The fix: negative sampling.
Negative sampling
Instead of computing the full softmax, reframe as binary classification: is this (center, context) pair real or noise?
\[\mathcal{L}_{\text{NS}} = \log \sigma(\mathbf{u}_{w_o}^\top \mathbf{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} \left[ \log \sigma(-\mathbf{u}_{w_i}^\top \mathbf{v}_{w_c}) \right]\]
where $P_n(w) \propto \text{freq}(w)^{3/4}$ is the noise distribution. The $3/4$ power smooths it — rare words get sampled more often than pure unigram frequency would give them, common words get sampled a bit less. $k$ is typically 5–20 negative samples.
The gradient now touches $k+1$ rows of the embedding matrix instead of $\lvert V \rvert$. Massively cheaper.
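A minimal PyTorch sketch of that loss (the names v_in/u_out are mine, mirroring $V$ and $U$ above; batching kept deliberately simple):

import torch
import torch.nn.functional as F

V, d, k = 50_000, 300, 5
v_in = torch.nn.Embedding(V, d)    # center-word embeddings ($V$ in the text)
u_out = torch.nn.Embedding(V, d)   # context-word embeddings ($U$ in the text)

def ns_loss(center, context, noise_dist):
    # center, context: [batch] token IDs; noise_dist: [V] probs ∝ freq^(3/4)
    neg = torch.multinomial(noise_dist, center.numel() * k, replacement=True).view(-1, k)
    v_c = v_in(center)                                               # [batch, d]
    pos = F.logsigmoid((u_out(context) * v_c).sum(-1))               # real pairs → pull together
    negs = F.logsigmoid(-(u_out(neg) * v_c.unsqueeze(1)).sum(-1))    # noise pairs → push apart
    return -(pos + negs.sum(-1)).mean()  # gradient touches k+1 output rows, not all V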
The king − man + woman ≈ queen thing
This is not a trick. It’s a real geometric property that emerges from training:
\[\mathbf{v}_{\text{queen}} \approx \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}\]
Why? Because "king" and "queen" appear in similar contexts ("ruled", "throne", "crowned"). The difference is the gendered contexts. The model factorizes royalty and gender into separate linear directions in the embedding space. Subtracting man removes the gender component, adding woman re-injects it. The royalty direction is preserved.
This is the core intuition of the distributional hypothesis: words that appear in similar contexts have similar meanings. Embeddings are a learned compression of context statistics.
6. What embeddings look like geometrically
$d=2$ is a toy but it’s useful to think in:
      science •                 • art
      physics •                 • music
         math •                 • dance
  engineering •

  ← technical              creative →
In high-dimensional space (768, 4096), these clusters exist but they’re not axis-aligned. The “gender” direction, the “tense” direction, the “sentiment” direction — they’re all learned linear subspaces. The geometry encodes grammar and semantics simultaneously.
Cosine similarity is the standard measure:
\[\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}\]
Why cosine and not Euclidean distance? Magnitude doesn't mean much here. A word that appears 10 million times might have a larger embedding vector norm than a rare word, not because it's "more meaningful" but because of gradient accumulation. Normalizing out magnitude focuses purely on direction, which is what encodes semantic content.
- $\cos = 1$: identical direction (synonyms)
- $\cos = 0$: orthogonal (unrelated)
- $\cos = -1$: opposite direction (antonyms — sometimes)
import torch
import torch.nn.functional as F

def analogy(E, a, b, c):
    # a - b + c ≈ ?  e.g. king - man + woman → queen
    # E: [vocab_size, d] embedding matrix; a, b, c: token IDs
    query = E[a] - E[b] + E[c]
    query = F.normalize(query.unsqueeze(0), dim=-1)    # [1, d], unit norm
    scores = F.normalize(E, dim=-1) @ query.T          # [vocab_size, 1] cosine sims
    # in practice, exclude a, b, c themselves before taking top-k
    return scores.squeeze().topk(5)
7. Contextual embeddings — the transformer upgrade
Word2Vec gives each word one static vector. "bank" (financial) and "bank" (river) get the same embedding. That’s clearly wrong.
BERT, GPT, and all modern LLMs produce contextual embeddings: the representation of a token depends on all surrounding tokens. The embedding for "bank" in "deposit money at the bank" is numerically different from "bank" in "river bank at sunset".
This happens because the transformer’s self-attention layers continuously rewrite each token’s representation based on what it attends to. The embedding matrix $E$ is just the starting position in vector space. Every subsequent layer moves each vector based on context.
flowchart TD
subgraph INPUT["Input tokens"]
direction LR
W1["the"]
W2["bank"]
W3["is"]
W4["closed"]
end
subgraph E["Embedding lookup E ∈ ℝ^(V×d)"]
direction LR
E1["e₁"]
E2["e₂ (bank — no context yet)"]
E3["e₃"]
E4["e₄"]
end
subgraph L1["Layer 1 — syntactic patterns"]
direction LR
H11["h¹₁"]
H12["h¹₂ (bank sees: 'the', 'is')"]
H13["h¹₃"]
H14["h¹₄"]
end
subgraph L2["Layer 2 — semantic context building"]
direction LR
H21["h²₁"]
H22["h²₂ (bank sees: 'closed' → financial)"]
H23["h²₃"]
H24["h²₄"]
end
subgraph LN["Layer N — final contextual representation"]
direction LR
HN1["hᴺ₁"]
HN2["hᴺ₂ ← this is 'bank' as financial institution\nnot river bank. same input, totally different vector."]
HN3["hᴺ₃"]
HN4["hᴺ₄"]
end
INPUT -->|"one vector per token"| E
E -->|"self-attention: every token attends to every other"| L1
L1 -->|"representations updated"| L2
L2 -->|"...× N layers"| LN
The final $\mathbf{h}^N_2$ for "bank" encodes the entire sentence. This is why transformer-based models demolish Word2Vec on every downstream task.
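You can check this directly. A sketch with HF transformers — bert-base-uncased is just a convenient public checkpoint, and the exact similarity value will vary:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word="bank"):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # [seq_len, 768]
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

a = vector_for("deposit money at the bank")
b = vector_for("river bank at sunset")
print(torch.cosine_similarity(a, b, dim=0))  # same input token, clearly below 1.0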
8. The embedding matrix is massive
For GPT-2: vocab = 50,257, $d$ = 768.
\[50{,}257 \times 768 = 38{,}597{,}376 \text{ parameters}\]
That's ~38.6M just for the input embedding layer. GPT-2 small has ~117M total parameters — the embedding table is roughly a third of the entire model.
LLaMA-2-7B: vocab = 32,000, $d$ = 4,096 → ~131M just for input embeddings. And a separate output (unembedding) matrix of the same shape $W_U \in \mathbb{R}^{d \times \lvert V \rvert}$ to go from hidden state back to token logits — though weights are often tied: $W_U = E^\top$.
Tied weights make geometric sense: if two tokens have similar input embeddings (they appear in similar contexts), they should also score similarly as predictions. The same geometry should hold in both directions.
Dead embeddings are a real thing too: tokens that never appear in training data get random-initialized rows in $E$ and never receive gradient updates. They’re noise sitting in the embedding space. Some vocabularies have slots that are literally never used.
This is why vocabulary size is a real tradeoff: bigger vocab = fewer tokens per sentence (faster attention, shorter sequences) but bigger embedding matrices (more memory, slower convergence).
9. How representations change across layers
The layers aren’t all doing the same thing. This is well-studied via probing classifiers — you freeze the model, take the hidden state at layer $l$, and train a small classifier on top to predict something (POS tag, NER label, sentiment, etc.).
What’s consistently found:
- Early layers (1–4): Local syntactic structure. Part-of-speech, basic morphology. The token is still mostly “itself.”
- Middle layers: Semantic content. Coreference, entity types, general meaning. This is where "bank" starts disambiguating.
- Final layers: Task-specific. In instruction-tuned models, this is where the model's "intent" lives. The representation is shaped toward the output distribution.
This is why people doing RAG and semantic search usually pull embeddings from a middle layer or a dedicated sentence-encoder model (like sentence-transformers) rather than using the last layer of a generation model.
10. Positional encoding
Embeddings alone lose sequence order. The set $\{E[\text{cat}], E[\text{sat}], E[\text{the}]\}$ is identical regardless of word order. Transformers have no recurrence (unlike RNNs), so position has to be injected explicitly.
Sinusoidal (original Transformer, Vaswani 2017):
\[PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)\]
\[PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)\]
where $pos$ is the token position, $i$ is the dimension index, $d$ is embedding dimension. This gets added to the embedding: $\mathbf{x} = \mathbf{e} + PE_{pos}$.
The key property: $PE_{pos+k}$ is always a linear function of $PE_{pos}$ (falls out of the angle-addition formula for sin/cos). So the model can learn to attend to “token $k$ positions ahead” by learning a linear transformation. That’s elegant.
Why different frequencies per dimension? Low-$i$ dimensions cycle fast — they encode fine-grained local position. High-$i$ dimensions cycle slowly — they encode coarse global position. Together they form a unique signature for each position, like a binary number but continuous.
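A direct transcription of the two formulas (a sketch; returns the table you add to the token embeddings):

import torch

def sinusoidal_pe(max_len, d):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle); d even
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # [max_len, 1]
    two_i = torch.arange(0, d, 2, dtype=torch.float32)              # even dim indices
    angles = pos / (10000.0 ** (two_i / d))                         # [max_len, d/2]
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe   # x = token_embeddings + pe[:seq_len]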
RoPE — what LLaMA uses:
Instead of adding position before attention, RoPE rotates the query and key vectors before computing dot products. The rotation angle depends on position. The dot product then naturally encodes relative position:
\[\mathbf{q}_m^\top \mathbf{k}_n = f(\mathbf{q}, \mathbf{k},\; m - n)\]
Only the difference $m - n$ matters, not the absolute values. This generalizes much better to sequences longer than seen during training — which is why LLaMA can be extended via RoPE scaling tricks post-training.
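A minimal sketch of the rotation itself — real implementations fold this into the attention computation and cache the angles, but the core is just a position-dependent 2D rotation of each (2i, 2i+1) dimension pair:

import torch

def rope(x, base=10000.0):
    # x: [seq_len, d] queries or keys, d even — rotate each (2i, 2i+1) pair
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # [seq, 1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # [d/2]
    angles = pos * freqs                                               # [seq, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # rotation by a position-dependent angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# q and k get rotated before the dot product; rope(q) @ rope(k).T then depends
# only on relative offsets between positions, not absolute ones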
11. The full pipeline — from raw text to semantic search
This is the part that clicked everything for me. Here’s the full flow from training to actually using embeddings:
Phase 1 — Training (happens once, costs a lot of compute):
massive text corpus
↓
tokenizer (BPE/WordPiece/SentencePiece)
↓
sequences of token IDs
↓
train model: minimize prediction loss
↓
learned embedding matrix E ← this is the artifact we care about
(each row = a dense vector for one token)
Phase 2 — Building an index (for semantic search / RAG):
Say you have a document collection and want to search it by meaning.
"The capital of France is Paris."
↓
tokenizer → [token IDs]
↓
embedding lookup + transformer layers
↓
one vector per token (or pooled to one vector per sentence)
↓
store in a vector database (FAISS, Pinecone, pgvector, etc.)
Repeat for every document. You now have a table: doc_id → embedding vector.
Phase 3 — Query time:
user query: "what is france's capital?"
↓
same tokenizer → [token IDs]
↓
same model → query vector q ∈ ℝ^d
↓
cosine similarity against every stored vector:
score(q, dᵢ) = (q · dᵢ) / (‖q‖ · ‖dᵢ‖)
↓
return top-k documents by score
The reason this works: the model that produced the document vectors and the query vector is the same model with the same geometry. "france's capital" and "capital of France" end up near each other in the embedding space because they appear in similar contexts in training data. Cosine similarity finds them.
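The whole pipeline, sketched with the sentence-transformers library (all-MiniLM-L6-v2 is one common encoder choice, not a requirement — the hard constraint is the one below about using the same model on both sides):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["The capital of France is Paris.", "Tokyo is Japan's capital."]
D = model.encode(docs, normalize_embeddings=True)   # [n_docs, d], unit-norm rows

q = model.encode(["what is france's capital?"], normalize_embeddings=True)
scores = D @ q.T                                    # cosine sim (vectors already normalized)
print(docs[int(np.argmax(scores))])                 # → the Paris sentence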
flowchart LR
subgraph TRAIN["Build index"]
D["document corpus"] --> TK1["tokenizer"]
TK1 --> EM1["embedding model"]
EM1 --> VDB["vector DB\n[v₁, v₂, ..., vₙ]"]
end
subgraph QUERY["Query"]
Q["user query"] --> TK2["same tokenizer"]
TK2 --> EM2["same model"]
EM2 --> QV["query vector q"]
end
QV -->|"cosine similarity\nagainst all stored vectors"| VDB
VDB -->|"top-k results"| R["retrieved documents"]
The crucial constraint: the tokenizer and model used for documents and queries must be identical. Mixing models breaks the geometry entirely.
References
- Neural Machine Translation of Rare Words with Subword Units — Sennrich et al. 2016, the BPE paper
- Efficient Estimation of Word Representations in Vector Space — Mikolov et al. 2013, Word2Vec
- Karpathy’s minBPE — cleanest BPE implementation I’ve read, worth going through line by line