Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings

4.11 — Understanding Embeddings: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md and 4.11.a–4.11.c.
  3. Practice: 4.11-Exercise-Questions.md.
  4. Polish answers: 4.11-Interview-Questions.md.

Core vocabulary

Term                  │ One-liner
──────────────────────┼───────────────────────────────────────────────────────────────────
Embedding             │ Fixed-length vector of floats that captures semantic meaning of text
Vector                │ Array of numbers representing a point in high-dimensional space
Cosine similarity     │ Measures angle between vectors; 1.0 = identical, 0 = unrelated
Euclidean distance    │ Straight-line distance between vectors; 0 = identical
Dot product           │ Sum of element-wise products; equals cosine sim for normalized vectors
Semantic search       │ Finding content by meaning, not exact keywords
Chunking              │ Splitting documents into smaller pieces for embedding
Overlap               │ Repeating text at chunk boundaries to preserve context
ANN                   │ Approximate Nearest Neighbor — fast search for large vector collections
HNSW                  │ Hierarchical Navigable Small World — most common ANN algorithm
Matryoshka embeddings │ Dimension reduction — fewer dims, smaller storage, slight quality loss
RAG                   │ Retrieval-Augmented Generation — embed + search + generate

Embedding models

Model                  │ Dimensions │ Cost/1M tokens │ Notes
───────────────────────┼────────────┼────────────────┼──────────────────────────────────
text-embedding-3-small │ 1536       │ ~$0.02         │ Best default for most apps
text-embedding-3-large │ 3072       │ ~$0.13         │ Higher quality, ~6x cost
text-embedding-ada-002 │ 1536       │ ~$0.10         │ Legacy — do not use for new projects

Embedding model vs LLM:
  Embedding model: text → vector (numbers)    $0.02/1M tokens
  LLM (GPT-4o):   text → text (generation)   $2.50/1M tokens (input)

  Embeddings are ~125x cheaper than LLM calls.

How embeddings work

Text → Tokenizer → Transformer → Pooling → Normalize → Vector

"I love JavaScript"
  → ["I", " love", " JavaScript"]           (tokenize)
  → [contextual vectors per token]            (transformer)
  → [single averaged vector]                  (mean pooling)
  → [0.023, -0.041, 0.008, ..., -0.016]     (normalize to length 1)
     (1536 dimensions)
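The pooling and normalize steps can be sketched in plain JavaScript. This is illustrative only — the real model does this internally on its contextual token vectors:

```javascript
// Mean pooling: average the per-token vectors into one document vector
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const out = new Array(dims).fill(0);
  for (const v of tokenVectors) {
    for (let i = 0; i < dims; i++) out[i] += v[i];
  }
  return out.map((x) => x / tokenVectors.length);
}

// Normalize to unit length, so dot product equals cosine similarity
function normalize(v) {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / mag);
}
```

Normalizing once at embed time is what makes the "just use dot product" shortcut below valid.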

Semantic meaning in vectors

CLOSE vectors (high cosine similarity):
  "happy" ↔ "joyful"           ~0.92
  "JavaScript" ↔ "TypeScript"  ~0.88
  "fix a bug" ↔ "debug code"   ~0.85

FAR vectors (low cosine similarity):
  "happy" ↔ "database"         ~0.12
  "JavaScript" ↔ "sourdough"   ~0.08

Similarity metrics cheat sheet

COSINE SIMILARITY (default choice):
  Range: -1 to 1 (higher = more similar)
  Formula: (A · B) / (|A| × |B|)
  For normalized vectors: just dot product
  Use: almost always

DOT PRODUCT (fast alternative):
  Range: -inf to inf (higher = more similar)
  Formula: sum(A[i] × B[i])
  For normalized vectors: EQUALS cosine similarity
  Use: when vectors are normalized (OpenAI embeddings)

EUCLIDEAN DISTANCE:
  Range: 0 to inf (LOWER = more similar)
  Formula: sqrt(sum((A[i] - B[i])²))
  Use: rarely for embeddings (sensitive to magnitude)

Quick code

// Cosine similarity
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// For normalized vectors (OpenAI) — just dot product
function dotProduct(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
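Euclidean distance in the same style, to complete the trio (remember: lower means more similar):

```javascript
// Euclidean distance — 0 means identical vectors, larger means farther apart
function euclideanDist(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}
```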

Similarity thresholds

>= 0.90   Very high  — near-duplicate / paraphrase
0.80-0.90 High       — same topic, clearly relevant
0.70-0.80 Moderate   — related, probably relevant
0.60-0.70 Low        — tangentially related
< 0.60    Noise      — probably irrelevant

Starting points by use case:
  Duplicate detection:  0.92+
  RAG / factual Q&A:   0.78-0.85
  Exploratory search:   0.60-0.70

ALWAYS calibrate on YOUR data — these are guidelines, not rules.
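Applying a calibrated threshold is one line of filtering. The `{ text, score }` result shape here is a hypothetical example, not a fixed API:

```javascript
// Keep only search results at or above a calibrated similarity threshold
function filterByThreshold(results, threshold) {
  return results.filter((r) => r.score >= threshold);
}

const results = [
  { text: 'reset your password', score: 0.84 },
  { text: 'sourdough starter',   score: 0.31 },
];
const relevant = filterByThreshold(results, 0.78); // keeps only the first result
```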

Semantic search vs keyword search

                          │ Keyword           │ Semantic
──────────────────────────┼───────────────────┼─────────────────────────────
Matches                   │ Exact words       │ Meaning
"car" finds "automobile"? │ No                │ Yes
Setup                     │ Simple text index │ Embedding model + vector DB
Cost                      │ Free              │ API calls + storage
Best for                  │ Exact terms, IDs  │ Natural language queries

Production: use HYBRID (70% semantic + 30% keyword).
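One simple way to blend the two signals is a weighted sum, mirroring the 70/30 guideline above. This sketch assumes both scores are already on a comparable 0-1 scale:

```javascript
// Weighted hybrid score: alpha * semantic + (1 - alpha) * keyword
function hybridScore(semanticScore, keywordScore, alpha = 0.7) {
  return alpha * semanticScore + (1 - alpha) * keywordScore;
}
```

Production systems often use reciprocal rank fusion instead of raw score blending, since the two scores are not naturally on the same scale.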


Chunking strategies

Strategy          │ Quality │ Complexity │ Best For
──────────────────┼─────────┼────────────┼──────────────
Fixed-size        │ Basic   │ Low        │ Prototypes, logs
Sentence-based    │ Good    │ Low        │ Articles
Paragraph-based   │ Good    │ Low        │ Structured docs, FAQ
Recursive char    │ Better  │ Medium     │ Most production systems
Semantic          │ Best    │ High       │ High-stakes (medical, legal)

Recursive character splitting (industry standard)

Separator hierarchy (try best first):
  1. "\n\n"  — paragraph break
  2. "\n"    — line break
  3. ". "    — sentence end
  4. " "     — word break
  5. ""      — character (last resort)

Chunk size guide

TOO SMALL (50-100 tokens):
  "To reset the password" — no context, vague embedding

SWEET SPOT (200-500 tokens):
  Complete paragraph with enough context for precise matching

TOO LARGE (2000+ tokens):
  Entire chapter — embedding is diluted, weak matches

Recommended by content type:
  FAQ / Q&A:        100-200 tokens, 0 overlap
  Tech docs:        200-400 tokens, 50-100 overlap
  Articles:         300-500 tokens, 50-100 overlap
  Legal/medical:    400-600 tokens, 100-150 overlap
  Code docs:        200-400 tokens, 50 overlap

Overlap

chunk_size=400, overlap=60:

  Chunk 1: [==========================|overlap|]
  Chunk 2:                    [overlap|==========================|overlap|]
  Chunk 3:                                          [overlap|================]

  Guideline: 10-15% of chunk size
  Too much (>25%): duplicate storage and results
  Too little (0%):  context lost at boundaries
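The sliding-window picture above can be sketched as a fixed-size chunker (character-based for simplicity; production code would count tokens):

```javascript
// Slide a window of chunkSize, advancing by chunkSize - overlap each step
function chunkWithOverlap(text, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new Error('overlap must be < chunkSize');
  const chunks = [];
  const step = chunkSize - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```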

Storage math

Per vector:
  1536 dims × 4 bytes = 6 KB
  3072 dims × 4 bytes = 12 KB
  256 dims  × 4 bytes = 1 KB

Scaling:
  1M documents × 1536 dims = ~6 GB
  1M documents × 256 dims  = ~1 GB
  Add 20-50% for metadata + index overhead
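The same arithmetic as a quick helper (float32 = 4 bytes per dimension; the overhead factor is an assumption you should tune to your vector DB):

```javascript
// Estimate storage: vectors * dims * 4 bytes (float32), plus index/metadata overhead
function estimateStorageBytes(numVectors, dims, overheadFactor = 1.3) {
  return numVectors * dims * 4 * overheadFactor;
}

const gb = estimateStorageBytes(1_000_000, 1536) / 1024 ** 3; // roughly 7-8 GB with overhead
```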

Embedding API quick reference

import OpenAI from 'openai';
const openai = new OpenAI();

// Single embedding
const single = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
});
const vector = single.data[0].embedding; // number[] of length 1536

// Batch embedding (faster + cheaper)
const batch = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Text 1', 'Text 2', 'Text 3'],
});
// batch.data[0].embedding, batch.data[1].embedding, ...

// Reduced dimensions (Matryoshka)
const reduced = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Your text here',
  dimensions: 256,  // 256 instead of 1536
});
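A small helper for batching texts before embedding. The 100-per-call figure is this doc's guideline; check your provider's actual per-request input limit:

```javascript
// Split an array of texts into batches of at most batchSize for API calls
function toBatches(texts, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}
```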

Search performance

Documents │  Brute Force  │  ANN (HNSW)
──────────┼───────────────┼───────────────
1,000     │  < 1ms        │  < 1ms
10,000    │  ~5ms         │  < 1ms
100,000   │  ~50ms        │  ~1-2ms
1,000,000 │  ~500ms       │  ~2-5ms

Rule: brute-force OK under 50K. Use ANN (vector DB) above that.
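Brute-force top-k is simple enough to inline while your collection is small. This sketch assumes normalized vectors (so dot product = cosine similarity); the `{ id, vector }` item shape is hypothetical:

```javascript
// Brute-force top-k: score every stored vector, sort descending, take the best k
function topK(queryVec, items, k) {
  return items
    .map((it) => ({
      ...it,
      score: it.vector.reduce((s, x, i) => s + x * queryVec[i], 0), // dot product
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```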

Complete pipeline flow

1. LOAD    → Extract text from documents (PDF, MD, HTML)
2. CLEAN   → Remove boilerplate, normalize whitespace
3. CHUNK   → Recursive splitting (400 tokens, 60 overlap)
4. ENRICH  → Add metadata (source, page, section, date)
5. EMBED   → Batch embed with text-embedding-3-small
6. STORE   → Upsert vectors + metadata into vector DB
7. QUERY   → Embed user question with SAME model
8. SEARCH  → Find top-k nearest vectors (cosine sim)
9. FILTER  → Apply threshold + metadata filters
10. INJECT → Feed retrieved chunks into LLM prompt

Common gotchas

Gotcha                                       │ Why
─────────────────────────────────────────────┼───────────────────────────────────────────────
Mixing embedding models                      │ Vectors from different models are incompatible
No threshold filtering                       │ Low-similarity noise pollutes results
Embedding huge documents whole               │ Diluted vectors, weak matches
Forgetting to re-embed on model change       │ Old vectors are useless with the new model
Ignoring metadata                            │ Can't filter by source, date, or category
Chunk size too small                         │ Fragments with no context
Chunk size too large                         │ Diluted meaning, weak retrieval
Full cosine formula on unnormalized vectors  │ Works but slower — normalize once, then dot product
Not batching API calls                       │ 1000 individual calls vs 10 batches of 100
Skipping evaluation                          │ No way to know if chunking/threshold is right

Critical rules

1. SAME MODEL for indexing and querying — always
2. CHUNK before embedding — never embed whole documents
3. BATCH embedding calls — 100 texts per API call
4. ATTACH METADATA — source, page, date minimum
5. CALIBRATE THRESHOLDS — on your data, not generic defaults
6. EVALUATE — build test queries, measure recall@5
7. START SIMPLE — recursive splitting + text-embedding-3-small

End of 4.11 quick revision.