Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings

4.11 — Understanding Embeddings: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md and 4.11.a–4.11.c.
  3. Practice: 4.11-Exercise-Questions.md.
  4. Polish answers: 4.11-Interview-Questions.md.

Core vocabulary

Term                  │ One-liner
──────────────────────┼───────────────────────────────────────────────────────────────────
Embedding             │ Fixed-length vector of floats that captures semantic meaning of text
Vector                │ Array of numbers representing a point in high-dimensional space
Cosine similarity     │ Measures angle between vectors; 1.0 = identical, 0 = unrelated
Euclidean distance    │ Straight-line distance between vectors; 0 = identical
Dot product           │ Sum of element-wise products; equals cosine sim for normalized vectors
Semantic search       │ Finding content by meaning, not exact keywords
Chunking              │ Splitting documents into smaller pieces for embedding
Overlap               │ Repeating text at chunk boundaries to preserve context
ANN                   │ Approximate Nearest Neighbor — fast search for large vector collections
HNSW                  │ Hierarchical Navigable Small World — most common ANN algorithm
Matryoshka embeddings │ Dimension reduction — fewer dims, smaller storage, slight quality loss
RAG                   │ Retrieval-Augmented Generation — embed + search + generate

Embedding models

Model                  │ Dimensions │ Cost/1M tokens │ Notes
───────────────────────┼────────────┼────────────────┼──────────────────────────────────
text-embedding-3-small │ 1536       │ ~$0.02         │ Best default for most apps
text-embedding-3-large │ 3072       │ ~$0.13         │ Higher quality, ~6x cost
text-embedding-ada-002 │ 1536       │ ~$0.10         │ Legacy — do not use for new projects

Embedding model vs LLM:
  Embedding model: text → vector (numbers)    $0.02/1M tokens
  LLM (GPT-4o):   text → text (generation)   $2.50/1M tokens (input)

  Embeddings are ~125x cheaper than LLM calls.

How embeddings work

Text → Tokenizer → Transformer → Pooling → Normalize → Vector

"I love JavaScript"
  → ["I", " love", " JavaScript"]           (tokenize)
  → [contextual vectors per token]            (transformer)
  → [single averaged vector]                  (mean pooling)
  → [0.023, -0.041, 0.008, ..., -0.016]     (normalize to length 1)
     (1536 dimensions)
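The pooling and normalize steps can be sketched in plain JavaScript. This is illustrative only — the real model does this internally on its contextual token vectors:

```javascript
// Mean pooling: average the per-token vectors into one document vector
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const out = new Array(dims).fill(0);
  for (const v of tokenVectors) {
    for (let i = 0; i < dims; i++) out[i] += v[i];
  }
  return out.map((x) => x / tokenVectors.length);
}

// Normalize to unit length, so dot product equals cosine similarity
function normalize(v) {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / mag);
}
```

Normalizing once at embed time is what makes the "just use dot product" shortcut below valid.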

Semantic meaning in vectors

CLOSE vectors (high cosine similarity):
  "happy" ↔ "joyful"           ~0.92
  "JavaScript" ↔ "TypeScript"  ~0.88
  "fix a bug" ↔ "debug code"   ~0.85

FAR vectors (low cosine similarity):
  "happy" ↔ "database"         ~0.12
  "JavaScript" ↔ "sourdough"   ~0.08

Similarity metrics cheat sheet

COSINE SIMILARITY (default choice):
  Range: -1 to 1 (higher = more similar)
  Formula: (A · B) / (|A| × |B|)
  For normalized vectors: just dot product
  Use: almost always

DOT PRODUCT (fast alternative):
  Range: -inf to inf (higher = more similar)
  Formula: sum(A[i] × B[i])
  For normalized vectors: EQUALS cosine similarity
  Use: when vectors are normalized (OpenAI embeddings)

EUCLIDEAN DISTANCE:
  Range: 0 to inf (LOWER = more similar)
  Formula: sqrt(sum((A[i] - B[i])²))
  Use: rarely for embeddings (sensitive to magnitude)

Quick code

// Cosine similarity
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// For normalized vectors (OpenAI) — just dot product
function dotProduct(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
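Euclidean distance in the same style, to complete the trio (remember: lower means more similar):

```javascript
// Euclidean distance — 0 means identical vectors, larger means farther apart
function euclideanDist(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}
```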

Similarity thresholds

>= 0.90   Very high  — near-duplicate / paraphrase
0.80-0.90 High       — same topic, clearly relevant
0.70-0.80 Moderate   — related, probably relevant
0.60-0.70 Low        — tangentially related
< 0.60    Noise      — probably irrelevant

Starting points by use case:
  Duplicate detection:  0.92+
  RAG / factual Q&A:   0.78-0.85
  Exploratory search:   0.60-0.70

ALWAYS calibrate on YOUR data — these are guidelines, not rules.
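Applying a calibrated threshold is one line of filtering. The `{ text, score }` result shape here is a hypothetical example, not a fixed API:

```javascript
// Keep only search results at or above a calibrated similarity threshold
function filterByThreshold(results, threshold) {
  return results.filter((r) => r.score >= threshold);
}

const results = [
  { text: 'reset your password', score: 0.84 },
  { text: 'sourdough starter',   score: 0.31 },
];
const relevant = filterByThreshold(results, 0.78); // keeps only the first result
```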

Semantic search vs keyword search

                          │ Keyword           │ Semantic
──────────────────────────┼───────────────────┼─────────────────────────────
Matches                   │ Exact words       │ Meaning
"car" finds "automobile"? │ No                │ Yes
Setup                     │ Simple text index │ Embedding model + vector DB
Cost                      │ Free              │ API calls + storage
Best for                  │ Exact terms, IDs  │ Natural language queries

Production: use HYBRID (70% semantic + 30% keyword).
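One simple way to blend the two signals is a weighted sum, mirroring the 70/30 guideline above. This sketch assumes both scores are already on a comparable 0-1 scale:

```javascript
// Weighted hybrid score: alpha * semantic + (1 - alpha) * keyword
function hybridScore(semanticScore, keywordScore, alpha = 0.7) {
  return alpha * semanticScore + (1 - alpha) * keywordScore;
}
```

Production systems often use reciprocal rank fusion instead of raw score blending, since the two scores are not naturally on the same scale.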


Chunking strategies

Strategy          │ Quality │ Complexity │ Best For
──────────────────┼─────────┼────────────┼──────────────
Fixed-size        │ Basic   │ Low        │ Prototypes, logs
Sentence-based    │ Good    │ Low        │ Articles
Paragraph-based   │ Good    │ Low        │ Structured docs, FAQ
Recursive char    │ Better  │ Medium     │ Most production systems
Semantic          │ Best    │ High       │ High-stakes (medical, legal)

Recursive character splitting (industry standard)

Separator hierarchy (try best first):
  1. "\n\n"  — paragraph break
  2. "\n"    — line break
  3. ". "    — sentence end
  4. " "     — word break
  5. ""      — character (last resort)

Chunk size guide

TOO SMALL (50-100 tokens):
  "To reset the password" — no context, vague embedding

SWEET SPOT (200-500 tokens):
  Complete paragraph with enough context for precise matching

TOO LARGE (2000+ tokens):
  Entire chapter — embedding is diluted, weak matches

Recommended by content type:
  FAQ / Q&A:        100-200 tokens, 0 overlap
  Tech docs:        200-400 tokens, 50-100 overlap
  Articles:         300-500 tokens, 50-100 overlap
  Legal/medical:    400-600 tokens, 100-150 overlap
  Code docs:        200-400 tokens, 50 overlap

Overlap

chunk_size=400, overlap=60:

  Chunk 1: [==========================|overlap|]
  Chunk 2:                    [overlap|==========================|overlap|]
  Chunk 3:                                          [overlap|================]

  Guideline: 10-15% of chunk size
  Too much (>25%): duplicate storage and results
  Too little (0%):  context lost at boundaries
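The sliding-window picture above can be sketched as a fixed-size chunker (character-based for simplicity; production code would count tokens):

```javascript
// Slide a window of chunkSize, advancing by chunkSize - overlap each step
function chunkWithOverlap(text, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new Error('overlap must be < chunkSize');
  const chunks = [];
  const step = chunkSize - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```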

Storage math

Per vector:
  1536 dims × 4 bytes = 6 KB
  3072 dims × 4 bytes = 12 KB
  256 dims  × 4 bytes = 1 KB

Scaling:
  1M documents × 1536 dims = ~6 GB
  1M documents × 256 dims  = ~1 GB
  Add 20-50% for metadata + index overhead
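The same arithmetic as a quick helper (float32 = 4 bytes per dimension; the overhead factor is an assumption you should tune to your vector DB):

```javascript
// Estimate storage: vectors * dims * 4 bytes (float32), plus index/metadata overhead
function estimateStorageBytes(numVectors, dims, overheadFactor = 1.3) {
  return numVectors * dims * 4 * overheadFactor;
}

const gb = estimateStorageBytes(1_000_000, 1536) / 1024 ** 3; // roughly 7-8 GB with overhead
```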

Embedding API quick reference

import OpenAI from 'openai';
const openai = new OpenAI();

// Single embedding
const single = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
});
const vector = single.data[0].embedding; // number[] of length 1536

// Batch embedding (faster + cheaper)
const batch = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Text 1', 'Text 2', 'Text 3'],
});
// batch.data[0].embedding, batch.data[1].embedding, ...

// Reduced dimensions (Matryoshka)
const reduced = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
  dimensions: 256,  // 256 instead of 1536
});
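A small helper for batching texts before embedding. The 100-per-call figure is this doc's guideline; check your provider's actual per-request input limit:

```javascript
// Split an array of texts into batches of at most batchSize for API calls
function toBatches(texts, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}
```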

Search performance

Documents │  Brute Force  │  ANN (HNSW)
──────────┼───────────────┼───────────────
1,000     │  < 1ms        │  < 1ms
10,000    │  ~5ms         │  < 1ms
100,000   │  ~50ms        │  ~1-2ms
1,000,000 │  ~500ms       │  ~2-5ms

Rule: brute-force OK under 50K. Use ANN (vector DB) above that.
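Brute-force top-k is simple enough to inline while your collection is small. This sketch assumes normalized vectors (so dot product = cosine similarity); the `{ id, vector }` item shape is hypothetical:

```javascript
// Brute-force top-k: score every stored vector, sort descending, take the best k
function topK(queryVec, items, k) {
  return items
    .map((it) => ({
      ...it,
      score: it.vector.reduce((s, x, i) => s + x * queryVec[i], 0), // dot product
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```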

Complete pipeline flow

1. LOAD    → Extract text from documents (PDF, MD, HTML)
2. CLEAN   → Remove boilerplate, normalize whitespace
3. CHUNK   → Recursive splitting (400 tokens, 60 overlap)
4. ENRICH  → Add metadata (source, page, section, date)
5. EMBED   → Batch embed with text-embedding-3-small
6. STORE   → Upsert vectors + metadata into vector DB
7. QUERY   → Embed user question with SAME model
8. SEARCH  → Find top-k nearest vectors (cosine sim)
9. FILTER  → Apply threshold + metadata filters
10. INJECT → Feed retrieved chunks into LLM prompt

Common gotchas

Gotcha                                       │ Why
─────────────────────────────────────────────┼───────────────────────────────────────────────
Mixing embedding models                      │ Vectors from different models are incompatible
No threshold filtering                       │ Low-similarity noise pollutes results
Embedding huge documents whole               │ Diluted vectors, weak matches
Forgetting to re-embed on model change       │ Old vectors are useless with the new model
Ignoring metadata                            │ Can't filter by source, date, or category
Chunk size too small                         │ Fragments with no context
Chunk size too large                         │ Diluted meaning, weak retrieval
Full cosine formula on unnormalized vectors  │ Works but slower — normalize once, then dot product
Not batching API calls                       │ 1000 individual calls vs 10 batches of 100
Skipping evaluation                          │ No way to know if chunking/threshold is right

Critical rules

1. SAME MODEL for indexing and querying — always
2. CHUNK before embedding — never embed whole documents
3. BATCH embedding calls — 100 texts per API call
4. ATTACH METADATA — source, page, date minimum
5. CALIBRATE THRESHOLDS — on your data, not generic defaults
6. EVALUATE — build test queries, measure recall@5
7. START SIMPLE — recursive splitting + text-embedding-3-small

End of 4.11 quick revision.