Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
4.11.a — What Embeddings Represent
In one sentence: An embedding is a fixed-length array of floating-point numbers (a vector) that captures the semantic meaning of text — words and sentences with similar meaning produce similar vectors, enabling machines to "understand" language mathematically.
Navigation: ← 4.11 Overview · 4.11.b — Similarity Search →
1. What Is an Embedding?
An embedding is a numerical representation of text. When you pass a sentence to an embedding model, it returns an array of numbers — typically 1536 or 3072 floating-point values. This array is called a vector.
Input: "JavaScript is a popular programming language"
Output: [0.0231, -0.0412, 0.0078, ..., -0.0156]
↑ ↑
dimension 1 dimension 1536
Each number in the vector represents some learned aspect of the text's meaning. You don't get to choose what each dimension means — the model learns these representations during training by processing billions of text examples.
Key distinction: Unlike an LLM that generates text, an embedding model converts text into numbers. It doesn't produce words — it produces coordinates in a mathematical space.
LLM (Generation Model)
Input: "What is JS?" → Output: "JavaScript is a programming language..."
(text in, text out)
Embedding Model
Input: "What is JS?" → Output: [0.023, -0.041, 0.008, ..., -0.016]
(text in, numbers out)
2. The Vector Space Concept
Think of embeddings as coordinates in a very high-dimensional space. Just like a point on a 2D map has (x, y) coordinates, an embedding has coordinates in 1536-dimensional or 3072-dimensional space.
Simplified 2D visualization (real embeddings have 1536+ dimensions):
Programming
▲
│
"Python" ● │ ● "JavaScript"
│
"Java" ● │ ● "TypeScript"
│
────────────────────┼────────────────────► Technology
│
│ ● "happy"
"sad" ● │
│ ● "joyful"
"angry" ● │
│
│
Notice: Programming languages cluster together.
Emotion words cluster together.
The two clusters are far apart.
In real embedding space, this happens across thousands of dimensions simultaneously. The model learns that:
- "king" and "queen" are close (both royalty)
- "king" and "banana" are far apart (unrelated concepts)
- "JavaScript" and "TypeScript" are very close (similar languages)
- "JavaScript" and "sadness" are far apart (different domains entirely)
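To make "close" and "far" concrete, here is a toy cosine-similarity check. The 2D coordinates are made up to mirror the diagram above (real embeddings have 1536+ dimensions), but the formula is the real one used throughout this episode:

```javascript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical 2D coordinates, like the diagram above
const javascript = [0.9, 0.8];
const typescript = [0.85, 0.75];
const sad = [-0.7, -0.6];

console.log(cosineSimilarity(javascript, typescript)); // close to 1 (same cluster)
console.log(cosineSimilarity(javascript, sad));        // close to -1 (far apart)
```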
3. How Text Becomes a Vector
When you send text to an embedding model, the model processes it through a transformer neural network (similar architecture to GPT, but trained differently):
Step-by-step: Text → Vector
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Tokenization │
│ "I love JavaScript" → ["I", " love", " JavaScript"] │
│ │
│ Step 2: Token Embeddings (lookup table) │
│ Each token → initial vector from vocabulary table │
│ "I" → [0.1, 0.2, -0.1, ...] │
│ " love" → [0.3, -0.1, 0.4, ...] │
│ " JavaScript"→ [0.2, 0.5, 0.1, ...] │
│ │
│ Step 3: Transformer layers process all tokens together │
│ Attention mechanism lets each token "look at" every other │
│ token, building contextual understanding │
│ "love" in "I love JavaScript" ≠ "love" in "love letter" │
│ │
│ Step 4: Pooling — combine all token vectors into ONE vector │
│ Method: typically mean pooling (average all token vectors) │
│ or use [CLS] token representation │
│ │
│ Step 5: Normalize — scale the vector to unit length │
│ Final: [0.023, -0.041, 0.008, ..., -0.016] │
│ (1536 dimensions, length = 1.0) │
└─────────────────────────────────────────────────────────────┘
Why normalization matters: Normalized vectors (length = 1.0) make similarity calculations simpler and more consistent. When all vectors have the same length, the angle between them is the only thing that differs — and that angle represents semantic distance.
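Steps 4 and 5 can be sketched in a few lines. The token vectors below are made-up 4-dimensional values standing in for real model output:

```javascript
// Step 4 sketch: mean pooling averages all token vectors into one vector
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const pooled = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dims; i++) pooled[i] += vec[i];
  }
  return pooled.map(x => x / tokenVectors.length);
}

// Step 5 sketch: scale the vector so its length is exactly 1.0
function normalize(vector) {
  const length = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return vector.map(x => x / length);
}

const tokenVectors = [
  [0.1, 0.2, -0.1, 0.4],  // "I"
  [0.3, -0.1, 0.4, 0.0],  // " love"
  [0.2, 0.5, 0.1, -0.2],  // " JavaScript"
];
const sentenceVector = normalize(meanPool(tokenVectors));
const len = Math.sqrt(sentenceVector.reduce((s, x) => s + x * x, 0));
console.log(len); // unit length, so only the direction carries meaning
```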
4. Dimensionality: 1536 vs 3072
Different embedding models produce vectors of different sizes. More dimensions can capture finer distinctions in meaning, but cost more to store and compute.
| Model | Dimensions | Relative Quality | Cost per 1M tokens | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Good | ~$0.02 | Most applications, cost-sensitive |
| text-embedding-3-large | 3072 | Better | ~$0.13 | High-accuracy retrieval, nuanced tasks |
| text-embedding-ada-002 | 1536 | Legacy | ~$0.10 | Existing systems (not recommended for new projects) |
What do the dimensions represent?
Each dimension captures some abstract feature of the text. Unlike hand-crafted features, these are learned automatically — you can't point at dimension 742 and say "this measures formality." The model discovers patterns like:
Conceptual (what the model might learn — simplified):
Dimension 1: something related to "technical vs casual"
Dimension 2: something related to "positive vs negative sentiment"
Dimension 3: something related to "abstract vs concrete"
...
Dimension 1536: something related to "question vs statement"
In reality: each dimension captures a complex, non-human-interpretable
combination of features. No single dimension has a clean label.
Reducing dimensions (Matryoshka embeddings)
OpenAI's text-embedding-3-* models support dimension reduction. You can request fewer dimensions to save storage while keeping most of the quality:
import OpenAI from 'openai';
const openai = new OpenAI();
// Full dimensions (1536)
const fullResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: 'JavaScript is a popular language',
});
console.log(fullResponse.data[0].embedding.length); // 1536
// Reduced dimensions (256) — saves 83% storage
const reducedResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: 'JavaScript is a popular language',
dimensions: 256,
});
console.log(reducedResponse.data[0].embedding.length); // 256
Dimension vs quality trade-off (text-embedding-3-small):
Dimensions │ Relative Quality │ Storage per vector
────────────┼────────────────────┼────────────────────
1536 │ 100% (baseline) │ 6 KB
1024 │ ~99% │ 4 KB
512 │ ~96% │ 2 KB
256 │ ~92% │ 1 KB
Rule of thumb: 512 dimensions is usually the sweet spot for
storage-constrained applications. Below 256, quality drops fast.
5. Semantic Meaning in Vector Space
The most powerful property of embeddings is that similar meaning produces similar vectors. This happens automatically — the model learns it from billions of examples.
Synonyms are close
"happy" → [0.234, -0.112, 0.056, ...] ─┐
"joyful" → [0.229, -0.108, 0.061, ...] ├── Very close (similarity ~0.92)
"cheerful" → [0.241, -0.105, 0.049, ...] ─┘
"sad" → [-0.198, 0.231, -0.087, ...] ─┐
"unhappy" → [-0.201, 0.225, -0.091, ...] ├── Very close (similarity ~0.90)
"miserable" → [-0.189, 0.240, -0.079, ...] ─┘
"happy" vs "sad" → far apart (similarity ~0.35)
Concepts with similar meaning but different words
This is where embeddings shine over keyword search:
Query: "How do I fix a bug in my code?"
Document: "Debugging techniques for software errors"
Keyword match: 0 words in common → keyword search FAILS
Embedding match: high similarity (~0.85) → semantic search SUCCEEDS
The embedding model "understands" that:
"fix a bug" ≈ "debugging techniques"
"code" ≈ "software"
"bug" ≈ "errors"
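A minimal sketch of why the keyword side fails, using a naive word-overlap scorer with an assumed stop-word list (real keyword engines do stemming and ranking, but the zero-overlap problem is the same):

```javascript
// Naive keyword matcher: count content words shared by query and document
const STOP_WORDS = new Set(['how', 'do', 'i', 'a', 'in', 'my', 'for', 'the']);

function contentWords(text) {
  return text.toLowerCase().match(/[a-z]+/g).filter(w => !STOP_WORDS.has(w));
}

function keywordOverlap(query, doc) {
  const docWords = new Set(contentWords(doc));
  return contentWords(query).filter(w => docWords.has(w)).length;
}

const query = 'How do I fix a bug in my code?';
const doc = 'Debugging techniques for software errors';
console.log(keywordOverlap(query, doc)); // 0 — no shared content words at all
```

An embedding-based comparison of the same pair would score high, because "fix a bug" and "debugging techniques" land near each other in vector space.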
Analogies emerge naturally
Classic example: embedding arithmetic reveals learned relationships.
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
This means the model learned:
king is to man as queen is to woman
Similarly:
vector("Paris") - vector("France") + vector("Japan") ≈ vector("Tokyo")
(capital-country relationship)
vector("walked") - vector("walk") + vector("swim") ≈ vector("swam")
(past-tense relationship)
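The arithmetic can be demonstrated with hand-picked toy vectors. The 2D values below are chosen purely so the relationship is visible; real models learn it across hundreds of dimensions:

```javascript
// Toy vocabulary: "royalty" on one axis pattern, "gender" on the other.
// These are illustrative values, not real embeddings.
const vectors = {
  man:   [1, 0],
  woman: [0, 1],
  king:  [1, 1],
  queen: [0, 2],
};

function vecAdd(a, b) { return a.map((x, i) => x + b[i]); }
function vecSub(a, b) { return a.map((x, i) => x - b[i]); }

// king - man + woman
const result = vecAdd(vecSub(vectors.king, vectors.man), vectors.woman);

// Find the vocabulary word closest to the result (Euclidean distance)
function nearestWord(target) {
  let best = null, bestDist = Infinity;
  for (const [word, vec] of Object.entries(vectors)) {
    const dist = Math.hypot(...vec.map((x, i) => x - target[i]));
    if (dist < bestDist) { bestDist = dist; best = word; }
  }
  return best;
}

console.log(nearestWord(result)); // "queen"
```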
6. Embedding Models vs Generation Models
These are fundamentally different tools serving different purposes:
| Feature | Embedding Model | Generation Model (LLM) |
|---|---|---|
| Input | Text | Text (prompt) |
| Output | Vector of numbers | Text (completion) |
| Purpose | Represent meaning numerically | Generate new text |
| Task | Search, similarity, classification | Conversation, writing, reasoning |
| Output size | Fixed per model (e.g. 1536 or 3072) | Variable (depends on response) |
| Cost | Very cheap (~$0.02/1M tokens) | Expensive (~$2.50-$15/1M tokens) |
| Speed | Very fast (milliseconds) | Slower (seconds for long responses) |
| Determinism | Same input → same vector | Same input → may differ (temperature) |
| Context window | 8191 tokens (text-embedding-3) | 128K-200K+ tokens |
| Examples | text-embedding-3-small/large | GPT-4o, Claude 4, Llama 3 |
When to use each:
Use EMBEDDING MODEL when you need to:
✓ Search for similar documents
✓ Build a recommendation system
✓ Classify text into categories
✓ Detect duplicate content
✓ Cluster documents by topic
✓ Feed a RAG pipeline's retrieval step
Use GENERATION MODEL when you need to:
✓ Answer questions in natural language
✓ Summarize text
✓ Write or edit content
✓ Extract structured data
✓ Have a conversation
✓ Feed a RAG pipeline's generation step
In a typical RAG pipeline, you use both: the embedding model retrieves relevant documents, and the generation model produces the answer.
7. Creating Embeddings with the OpenAI API
Basic: embed a single string
import OpenAI from 'openai';
const openai = new OpenAI(); // Uses OPENAI_API_KEY env var
async function getEmbedding(text) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
});
const embedding = response.data[0].embedding;
console.log(`Text: "${text}"`);
console.log(`Dimensions: ${embedding.length}`); // 1536
console.log(`First 5 values: [${embedding.slice(0, 5).join(', ')}]`);
console.log(`Token usage: ${response.usage.total_tokens}`);
return embedding;
}
const vector = await getEmbedding('JavaScript is a popular programming language');
// Text: "JavaScript is a popular programming language"
// Dimensions: 1536
// First 5 values: [0.02319, -0.04118, 0.00782, -0.01205, 0.03891]
// Token usage: 6
Batch: embed multiple strings at once
The API accepts an array of strings and returns all embeddings in a single request. This is much faster than making one request per string — the per-token price is the same, but you avoid a separate HTTP round trip for every document.
async function getEmbeddings(texts) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts, // Array of strings
});
// Results are in the same order as input
return response.data.map((item, index) => ({
text: texts[index],
embedding: item.embedding,
}));
}
const documents = [
'JavaScript was created in 1995 by Brendan Eich',
'Python is known for its simple syntax',
'TypeScript adds static types to JavaScript',
'React is a popular frontend framework',
'Machine learning requires large datasets',
];
const results = await getEmbeddings(documents);
console.log(`Embedded ${results.length} documents`);
console.log(`Each vector has ${results[0].embedding.length} dimensions`);
// results[0].text = "JavaScript was created in 1995 by Brendan Eich"
// results[0].embedding = [0.023, -0.041, ...] (1536 numbers)
With reduced dimensions
async function getCompactEmbedding(text, dims = 512) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: dims, // Request fewer dimensions
});
return response.data[0].embedding;
}
const full = await getEmbedding('Hello world'); // 1536 dims, ~6 KB
const compact = await getCompactEmbedding('Hello world', 256); // 256 dims, ~1 KB
console.log(`Full: ${full.length} dimensions`); // 1536
console.log(`Compact: ${compact.length} dimensions`); // 256
Error handling and rate limits
async function getEmbeddingSafe(text, retries = 1) {
  // Input validation
  if (!text || typeof text !== 'string') {
    throw new Error('Input must be a non-empty string');
  }
  // The embedding model has an 8191-token limit
  // Rough check: 8191 tokens ≈ 32,000 characters
  if (text.length > 32000) {
    console.warn('Text may exceed token limit, consider chunking');
  }
  try {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  } catch (error) {
    if (error.status === 429 && retries > 0) {
      // Rate limited — wait and retry (bounded so we can't loop forever)
      console.log('Rate limited, waiting 1 second...');
      await new Promise(resolve => setTimeout(resolve, 1000));
      return getEmbeddingSafe(text, retries - 1);
    }
    if (error.status === 400) {
      console.error('Invalid input — text may be too long');
    }
    throw error;
  }
}
Embedding many documents efficiently
// Process documents in batches to respect rate limits
async function embedDocuments(documents, batchSize = 100) {
const allEmbeddings = [];
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
console.log(`Processing batch ${Math.floor(i / batchSize) + 1} ` +
`(${batch.length} documents)...`);
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: batch,
});
const embeddings = response.data.map((item, index) => ({
text: batch[index],
embedding: item.embedding,
index: i + index,
}));
allEmbeddings.push(...embeddings);
// Small delay between batches to avoid rate limits
if (i + batchSize < documents.length) {
await new Promise(resolve => setTimeout(resolve, 200));
}
}
console.log(`Embedded ${allEmbeddings.length} documents total`);
return allEmbeddings;
}
// Usage
const corpus = [
'First document text...',
'Second document text...',
// ... potentially thousands of documents
];
const embedded = await embedDocuments(corpus);
8. What Makes a Good Embedding?
Not all text produces equally useful embeddings. Understanding what works well (and what doesn't) helps you design better systems.
GOOD embeddings (high information density):
✓ "React is a JavaScript library for building user interfaces"
→ Clear, specific, rich in semantic content
✓ "The patient presented with acute chest pain and shortness of breath"
→ Domain-specific, descriptive, contextual
BAD embeddings (low information density):
✗ "This is a document"
→ Too vague, no semantic content
✗ "Click here for more info"
→ Navigational text, not meaningful content
✗ "asdfghjkl"
→ Nonsense text
✗ "................"
→ No information at all
SURPRISING embeddings (context matters):
"bank" alone → ambiguous (financial bank? river bank?)
"bank account" → clearly financial
"river bank" → clearly geographical
→ Embedding models handle this through CONTEXT
Text preparation tips
// BEFORE embedding: clean and prepare text
function prepareForEmbedding(text) {
  return text
    .replace(/\s+/g, ' ') // Collapse all whitespace, including newlines
    .trim()               // Trim edges
    .slice(0, 32000);     // Rough guard: 8191 tokens ≈ 32,000 characters
}
// Add metadata context for better embeddings
function enrichText(text, metadata) {
// Prepending metadata helps the embedding model understand context
const prefix = metadata.title ? `Title: ${metadata.title}. ` : '';
const category = metadata.category ? `Category: ${metadata.category}. ` : '';
return `${prefix}${category}${text}`;
}
// Example
const raw = "Click here to learn about closures and how they work";
const enriched = enrichText(raw, {
title: 'JavaScript Closures',
category: 'Programming Tutorials'
});
// "Title: JavaScript Closures. Category: Programming Tutorials. Click here to learn about closures and how they work"
// → Much better embedding because the model has richer context
9. Embedding Costs and Performance
Embeddings are extremely cheap compared to generation. This makes them practical for large-scale applications.
Cost comparison (approximate, per 1M tokens):
text-embedding-3-small: $0.02 ← 125x cheaper than GPT-4o input
text-embedding-3-large: $0.13 ← 19x cheaper than GPT-4o input
GPT-4o input: $2.50
GPT-4o output: $10.00
Practical example:
Embed 1 million documents (average 200 tokens each) = 200M tokens
text-embedding-3-small: 200 × $0.02 = $4.00 total
text-embedding-3-large: 200 × $0.13 = $26.00 total
That's your entire knowledge base embedded for under $30.
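The cost arithmetic above is simple enough to sketch as a helper, using the approximate per-million-token prices from the table:

```javascript
// Embedding cost scales linearly with token count
function embeddingCost(docCount, avgTokensPerDoc, pricePerMillionTokens) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

const small = embeddingCost(1_000_000, 200, 0.02);
const large = embeddingCost(1_000_000, 200, 0.13);
console.log(`text-embedding-3-small: $${small.toFixed(2)}`); // $4.00
console.log(`text-embedding-3-large: $${large.toFixed(2)}`); // $26.00
```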
Speed:
Single embedding: ~50-100ms
Batch of 100: ~200-500ms
1 million documents: ~30-60 minutes (with batching)
Storage requirements
Storage per vector:
1536 dimensions × 4 bytes (float32) = 6,144 bytes ≈ 6 KB
3072 dimensions × 4 bytes (float32) = 12,288 bytes ≈ 12 KB
1 million documents at 1536 dims = ~6 GB
1 million documents at 3072 dims = ~12 GB
1 million documents at 256 dims = ~1 GB (reduced)
Metadata + index overhead typically adds 20-50% more.
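The storage math can be sketched the same way (raw float32 vectors only; metadata and index overhead come on top):

```javascript
// float32 = 4 bytes per dimension
function vectorStorageBytes(dimensions) {
  return dimensions * 4;
}

// Corpus total in decimal gigabytes (1 GB = 10^9 bytes, as in the figures above)
function corpusStorageGB(docCount, dimensions) {
  return (docCount * vectorStorageBytes(dimensions)) / 1e9;
}

console.log(vectorStorageBytes(1536));           // 6144 bytes, ~6 KB per vector
console.log(corpusStorageGB(1_000_000, 1536));   // ~6 GB for 1M documents
```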
10. Visualizing Embeddings (Dimensionality Reduction)
You can't visualize 1536 dimensions directly, but you can project them down to 2D or 3D using techniques like t-SNE or UMAP. This helps you verify that semantically similar documents cluster together.
After embedding 20 documents and projecting to 2D:
▲ y
│
│ ● "React hooks guide"
│ ● "Vue composition API" Programming
│ ● "Angular components" cluster
│
│ ● "chocolate cake recipe"
│ ● "pasta carbonara" Cooking
│ ● "grilled salmon" cluster
│
│ ● "JavaScript closures"
│ ● "TypeScript generics"
│ ● "Python decorators" Programming
│ cluster
│
│ ● "how to train for marathon"
│ ● "best running shoes" Fitness
│ ● "yoga for beginners" cluster
│
└──────────────────────────────────────────────────────────► x
Documents about similar topics naturally cluster together,
even though they use completely different words.
11. Common Misconceptions
| Misconception | Reality |
|---|---|
| "Embeddings understand text" | Embeddings capture statistical patterns of meaning, not understanding. They are mathematical representations. |
| "More dimensions = always better" | Diminishing returns after a point. 1536 is sufficient for most use cases. |
| "Same text = same embedding across models" | Different models produce completely different vectors. You cannot mix embeddings from different models. |
| "Embeddings are just bag-of-words" | Modern embeddings capture word order, context, and nuance. "Dog bites man" and "Man bites dog" produce different embeddings. |
| "You can embed infinite text" | Embedding models have token limits (8191 for OpenAI). Longer text must be chunked. |
| "Embedding once is enough" | If you change the model or update the model version, you must re-embed everything. |
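The last two misconceptions both come back to the token limit. A minimal character-based chunker looks like this (a sketch only: production systems usually split on sentence or token boundaries instead, and the size and overlap values here are arbitrary):

```javascript
// Split long text into fixed-size chunks; overlap preserves context
// across chunk boundaries so no sentence is cut off from its neighbors.
function chunkText(text, chunkSize = 1000, overlap = 100) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}

const longText = 'x'.repeat(2500);
const chunks = chunkText(longText);
console.log(chunks.length);             // 3
console.log(chunks.map(c => c.length)); // [ 1000, 1000, 700 ]
```

Each chunk can then be embedded separately (see the batch example in section 7), with each vector stored alongside a reference to its source document.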
12. Key Takeaways
- An embedding is a fixed-length vector of floating-point numbers that represents the semantic meaning of text.
- Similar meaning produces similar vectors — "happy" and "joyful" are close, "happy" and "database" are far apart.
- text-embedding-3-small (1536 dims) is the best starting point — cheap, fast, and good enough for most applications.
- Embedding models are different from generation models — they convert text to numbers, not text to text.
- Batch embedding is critical for performance — always embed multiple documents in a single API call.
- Text quality affects embedding quality — clean, context-rich text produces better vectors than vague or noisy text.
- You cannot mix embeddings from different models — always use the same model for both indexing and querying.
Explain-It Challenge
- A colleague asks "why can't we just use keyword search — why do we need embeddings?" Explain with a concrete example where keyword search fails.
- Your vector database has 10 million documents embedded with text-embedding-ada-002. A new model, text-embedding-3-small, is released with better quality. Can you just start querying with the new model? Why or why not?
- Why is "bank" by itself a worse embedding than "the bank of the river had eroded after the flood"?