Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
4.11 — Understanding Embeddings: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md→4.11.a…4.11.c.
- Practice — 4.11-Exercise-Questions.md.
- Polish answers — 4.11-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Embedding | Fixed-length vector of floats that captures semantic meaning of text |
| Vector | Array of numbers representing a point in high-dimensional space |
| Cosine similarity | Measures angle between vectors; 1 = identical direction, 0 = unrelated, -1 = opposite |
| Euclidean distance | Straight-line distance between vectors; 0 = identical |
| Dot product | Sum of element-wise products; equals cosine sim for normalized vectors |
| Semantic search | Finding content by meaning, not exact keywords |
| Chunking | Splitting documents into smaller pieces for embedding |
| Overlap | Repeating text at chunk boundaries to preserve context |
| ANN | Approximate Nearest Neighbor — fast search for large vector collections |
| HNSW | Hierarchical Navigable Small World — most common ANN algorithm |
| Matryoshka embeddings | Embeddings whose leading dimensions carry the most information — truncate for smaller storage at a slight quality loss |
| RAG | Retrieval-Augmented Generation — embed + search + generate |
Embedding models
| Model | Dimensions | Cost/1M tokens | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | ~$0.02 | Best default for most apps |
| text-embedding-3-large | 3072 | ~$0.13 | Higher quality, ~6.5x cost |
| text-embedding-ada-002 | 1536 | ~$0.10 | Legacy — do not use for new projects |
Embedding model vs LLM:
Embedding model: text → vector (numbers) $0.02/1M tokens
LLM (GPT-4o): text → text (generation) $2.50/1M tokens (input)
Embeddings are ~125x cheaper than LLM calls.
How embeddings work
Text → Tokenizer → Transformer → Pooling → Normalize → Vector
"I love JavaScript"
→ ["I", " love", " JavaScript"] (tokenize)
→ [contextual vectors per token] (transformer)
→ [single averaged vector] (mean pooling)
→ [0.023, -0.041, 0.008, ..., -0.016] (normalize to length 1)
(1536 dimensions)
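The pooling and normalization steps can be sketched in plain JavaScript. The token vectors below are toy 3-dim values standing in for real transformer output:

```javascript
// Mean pooling: average the per-token vectors into one document vector.
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const pooled = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dims; i++) pooled[i] += vec[i];
  }
  return pooled.map((v) => v / tokenVectors.length);
}

// Normalize to unit length so dot product equals cosine similarity.
function normalize(vec) {
  const mag = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
  return vec.map((v) => v / mag);
}

// Toy "contextual token vectors" (real ones have 1536 dims).
const tokens = [
  [0.2, -0.4, 0.1],
  [0.6,  0.0, 0.3],
];
const embedding = normalize(meanPool(tokens));
```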
Semantic meaning in vectors
CLOSE vectors (high cosine similarity):
"happy" ↔ "joyful" ~0.92
"JavaScript" ↔ "TypeScript" ~0.88
"fix a bug" ↔ "debug code" ~0.85
FAR vectors (low cosine similarity):
"happy" ↔ "database" ~0.12
"JavaScript" ↔ "sourdough" ~0.08
Similarity metrics cheat sheet
COSINE SIMILARITY (default choice):
Range: -1 to 1 (higher = more similar)
Formula: (A · B) / (|A| × |B|)
For normalized vectors: just dot product
Use: almost always
DOT PRODUCT (fast alternative):
Range: -inf to inf (higher = more similar)
Formula: sum(A[i] × B[i])
For normalized vectors: EQUALS cosine similarity
Use: when vectors are normalized (OpenAI embeddings)
EUCLIDEAN DISTANCE:
Range: 0 to inf (LOWER = more similar)
Formula: sqrt(sum((A[i] - B[i])²))
Use: rarely for embeddings (sensitive to magnitude)
Quick code

```javascript
// Cosine similarity
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// For normalized vectors (OpenAI) — just dot product
function dotProduct(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```
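For completeness, the third metric from the cheat sheet, Euclidean distance, follows the same pattern:

```javascript
// Euclidean distance: straight-line distance; 0 means identical vectors.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}
```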
Similarity thresholds
>= 0.90 Very high — near-duplicate / paraphrase
0.80-0.90 High — same topic, clearly relevant
0.70-0.80 Moderate — related, probably relevant
0.60-0.70 Low — tangentially related
< 0.60 Noise — probably irrelevant
Starting points by use case:
Duplicate detection: 0.92+
RAG / factual Q&A: 0.78-0.85
Exploratory search: 0.60-0.70
ALWAYS calibrate on YOUR data — these are guidelines, not rules.
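Applying a calibrated threshold is a one-line filter. The `results` shape here is a hypothetical example, not a specific vector DB's response format:

```javascript
// Keep only results at or above the calibrated similarity threshold.
function filterByThreshold(results, threshold) {
  return results.filter((r) => r.score >= threshold);
}

const results = [
  { id: 'doc-1', score: 0.91 },
  { id: 'doc-2', score: 0.74 },
  { id: 'doc-3', score: 0.42 },
];
// A RAG-style threshold of 0.78 keeps only doc-1.
const kept = filterByThreshold(results, 0.78);
```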
Semantic search vs keyword search
| | Keyword | Semantic |
|---|---|---|
| Matches | Exact words | Meaning |
| "car" finds "automobile"? | No | Yes |
| Setup | Simple text index | Embedding model + vector DB |
| Cost | Free | API calls + storage |
| Best for | Exact terms, IDs | Natural language queries |
Production: use HYBRID search — a common starting point is ~70% semantic + 30% keyword.
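One simple way to blend the two signals is a weighted sum. The weights and score shapes below are illustrative, not a standard API:

```javascript
// Blend normalized semantic and keyword scores with a fixed weight.
function hybridScore(semanticScore, keywordScore, alpha = 0.7) {
  return alpha * semanticScore + (1 - alpha) * keywordScore;
}

// A doc that is semantically close but has no exact keyword hits
// still ranks reasonably: 0.7 * 0.9 + 0.3 * 0.0 = 0.63.
const score = hybridScore(0.9, 0.0);
```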
Chunking strategies
| Strategy | Quality | Complexity | Best For |
|---|---|---|---|
| Fixed-size | Basic | Low | Prototypes, logs |
| Sentence-based | Good | Low | Articles |
| Paragraph-based | Good | Low | Structured docs, FAQ |
| Recursive char | Better | Medium | Most production systems |
| Semantic | Best | High | High-stakes (medical, legal) |
Recursive character splitting (industry standard)
Separator hierarchy (try best first):
1. "\n\n" — paragraph break
2. "\n" — line break
3. ". " — sentence end
4. " " — word break
5. "" — character (last resort)
Chunk size guide
TOO SMALL (50-100 tokens):
"To reset the password" — no context, vague embedding
SWEET SPOT (200-500 tokens):
Complete paragraph with enough context for precise matching
TOO LARGE (2000+ tokens):
Entire chapter — embedding is diluted, weak matches
Recommended by content type:
FAQ / Q&A: 100-200 tokens, 0 overlap
Tech docs: 200-400 tokens, 50-100 overlap
Articles: 300-500 tokens, 50-100 overlap
Legal/medical: 400-600 tokens, 100-150 overlap
Code docs: 200-400 tokens, 50 overlap
Overlap
chunk_size=400, overlap=60:
Chunk 1: [==========================|overlap|]
Chunk 2: [overlap|==========================|overlap|]
Chunk 3: [overlap|================]
Guideline: 10-15% of chunk size
Too much (>25%): duplicate storage and results
Too little (0%): context lost at boundaries
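A fixed-size chunker with overlap is just a sliding window, as this sketch shows (character counts stand in for tokens):

```javascript
// Fixed-size chunking with overlap: each chunk repeats the tail of the
// previous one so sentences at boundaries keep their context.
function chunkWithOverlap(text, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const chunks = chunkWithOverlap('abcdefghij', 4, 1);
// → ['abcd', 'defg', 'ghij'] — each chunk shares one character
```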
Storage math
Per vector:
1536 dims × 4 bytes = 6 KB
3072 dims × 4 bytes = 12 KB
256 dims × 4 bytes = 1 KB
Scaling:
1M documents × 1536 dims = ~6 GB
1M documents × 256 dims = ~1 GB
Add 20-50% for metadata + index overhead
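The figures above come straight from float32 arithmetic (4 bytes per dimension):

```javascript
// Each embedding dimension is a 32-bit float (4 bytes).
const bytesPerVector = (dims) => dims * 4;

const kbPerVector = bytesPerVector(1536) / 1024;          // 6 KB
const gbForMillion = (1_000_000 * bytesPerVector(1536)) / 1e9; // ≈ 6.1 GB raw
```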
Embedding API quick reference
```javascript
import OpenAI from 'openai';

const openai = new OpenAI();

// Single embedding
const single = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
});
const vector = single.data[0].embedding; // number[], 1536 dims

// Batch embedding (faster + cheaper)
const batch = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Text 1', 'Text 2', 'Text 3'],
});
// batch.data[0].embedding, batch.data[1].embedding, ...

// Reduced dimensions (Matryoshka)
const reduced = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
  dimensions: 256, // 256 instead of 1536
});
```
Search performance
| Documents | Brute Force | ANN (HNSW) |
|---|---|---|
| 1,000 | < 1ms | < 1ms |
| 10,000 | ~5ms | < 1ms |
| 100,000 | ~50ms | ~1-2ms |
| 1,000,000 | ~500ms | ~2-5ms |
Rule: brute-force OK under 50K. Use ANN (vector DB) above that.
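Under the 50K cutoff, brute force is just a scored sort. This sketch assumes normalized vectors, so the dot product equals cosine similarity:

```javascript
// Dot product (equals cosine similarity for normalized vectors).
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Brute-force top-k: score every stored vector, sort descending, slice.
function topK(query, items, k) {
  return items
    .map((item) => ({ ...item, score: dot(query, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy 2-dim normalized vectors.
const items = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0, 1] },
  { id: 'c', vector: [Math.SQRT1_2, Math.SQRT1_2] },
];
const hits = topK([1, 0], items, 2);
// → 'a' (score 1) then 'c' (score ≈ 0.707)
```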
Complete pipeline flow
1. LOAD → Extract text from documents (PDF, MD, HTML)
2. CLEAN → Remove boilerplate, normalize whitespace
3. CHUNK → Recursive splitting (400 tokens, 60 overlap)
4. ENRICH → Add metadata (source, page, section, date)
5. EMBED → Batch embed with text-embedding-3-small
6. STORE → Upsert vectors + metadata into vector DB
7. QUERY → Embed user question with SAME model
8. SEARCH → Find top-k nearest vectors (cosine sim)
9. FILTER → Apply threshold + metadata filters
10. INJECT → Feed retrieved chunks into LLM prompt
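Glued together, the query side of the pipeline (steps 7-10) might look like this sketch. `embed`, `store.search`, and `llm.answer` are hypothetical stand-ins passed in as dependencies, not real APIs:

```javascript
// Query-side RAG sketch: embed the question, retrieve, filter, inject.
// All three dependencies are hypothetical stand-ins for illustration.
async function answerQuestion(question, { embed, store, llm }) {
  const queryVector = await embed(question);            // 7. SAME model as indexing
  const hits = await store.search(queryVector, 5);      // 8. top-k nearest
  const relevant = hits.filter((h) => h.score >= 0.78); // 9. threshold filter
  const context = relevant.map((h) => h.text).join('\n\n');
  return llm.answer(question, context);                 // 10. inject into prompt
}
```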
Common gotchas
| Gotcha | Why |
|---|---|
| Mixing embedding models | Vectors from different models are incompatible |
| No threshold filtering | Low-similarity noise pollutes results |
| Embedding huge documents whole | Diluted vectors, weak matches |
| Forgetting to re-embed on model change | Old vectors are useless with new model |
| Ignoring metadata | Can't filter by source, date, or category |
| Chunk size too small | Fragments with no context |
| Chunk size too large | Diluted meaning, weak retrieval |
| Using cosine sim on unnormalized vectors | Works but slower — normalize first |
| Not batching API calls | 1000 individual calls vs 10 batches of 100 |
| Skipping evaluation | No way to know if chunking/threshold is right |
Critical rules
1. SAME MODEL for indexing and querying — always
2. CHUNK before embedding — never embed whole documents
3. BATCH embedding calls — 100 texts per API call
4. ATTACH METADATA — source, page, date minimum
5. CALIBRATE THRESHOLDS — on your data, not generic defaults
6. EVALUATE — build test queries, measure recall@5
7. START SIMPLE — recursive splitting + text-embedding-3-small
End of 4.11 quick revision.