Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings

4.11.a — What Embeddings Represent

In one sentence: An embedding is a fixed-length array of floating-point numbers (a vector) that captures the semantic meaning of text — words and sentences with similar meaning produce similar vectors, enabling machines to "understand" language mathematically.

Navigation: ← 4.11 Overview · 4.11.b — Similarity Search →


1. What Is an Embedding?

An embedding is a numerical representation of text. When you pass a sentence to an embedding model, it returns an array of numbers — typically 1536 or 3072 floating-point values. This array is called a vector.

Input:  "JavaScript is a popular programming language"
Output: [0.0231, -0.0412, 0.0078, ..., -0.0156]
         ↑                                    ↑
         dimension 1              dimension 1536

Each number in the vector represents some learned aspect of the text's meaning. You don't get to choose what each dimension means — the model learns these representations during training by processing billions of text examples.

Key distinction: Unlike an LLM that generates text, an embedding model converts text into numbers. It doesn't produce words — it produces coordinates in a mathematical space.

                    LLM (Generation Model)
Input: "What is JS?"  →  Output: "JavaScript is a programming language..."
                           (text in, text out)

                    Embedding Model
Input: "What is JS?"  →  Output: [0.023, -0.041, 0.008, ..., -0.016]
                           (text in, numbers out)

2. The Vector Space Concept

Think of embeddings as coordinates in a very high-dimensional space. Just like a point on a 2D map has (x, y) coordinates, an embedding has coordinates in 1536-dimensional or 3072-dimensional space.

Simplified 2D visualization (real embeddings have 1536+ dimensions):

                    Programming
                        ▲
                        │
         "Python" ●     │     ● "JavaScript"
                        │
         "Java" ●       │   ● "TypeScript"
                        │
    ────────────────────┼────────────────────► Technology
                        │
                        │     ● "happy"
         "sad" ●        │
                        │  ● "joyful"
         "angry" ●      │
                        │
                        │

    Notice: Programming languages cluster together.
            Emotion words cluster together.
            The two clusters are far apart.

In real embedding space, this happens across thousands of dimensions simultaneously. The model learns that:

  • "king" and "queen" are close (both royalty)
  • "king" and "banana" are far apart (unrelated concepts)
  • "JavaScript" and "TypeScript" are very close (similar languages)
  • "JavaScript" and "sadness" are far apart (different domains entirely)
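To make "close" and "far" concrete, here is a toy sketch using cosine similarity on hand-made 3-dimensional vectors. The numbers are invented for illustration; real embeddings have 1536+ dimensions:

```javascript
// Toy 3-D vectors, invented for illustration (real embeddings have 1536+ dims)
const king   = [0.9, 0.8, 0.1];
const queen  = [0.85, 0.75, 0.2];
const banana = [0.1, -0.2, 0.9];

// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(king, queen));  // high (close to 1)
console.log(cosineSimilarity(king, banana)); // low (close to 0)
```

The same function works unchanged on real 1536-dimensional vectors; only the loop length grows.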

3. How Text Becomes a Vector

When you send text to an embedding model, the model processes it through a transformer neural network (similar architecture to GPT, but trained differently):

Step-by-step: Text → Vector

┌─────────────────────────────────────────────────────────────┐
│  Step 1: Tokenization                                        │
│  "I love JavaScript" → ["I", " love", " JavaScript"]        │
│                                                              │
│  Step 2: Token Embeddings (lookup table)                     │
│  Each token → initial vector from vocabulary table            │
│  "I"          → [0.1, 0.2, -0.1, ...]                       │
│  " love"      → [0.3, -0.1, 0.4, ...]                       │
│  " JavaScript"→ [0.2, 0.5, 0.1, ...]                        │
│                                                              │
│  Step 3: Transformer layers process all tokens together      │
│  Attention mechanism lets each token "look at" every other   │
│  token, building contextual understanding                    │
│  "love" in "I love JavaScript" ≠ "love" in "love letter"    │
│                                                              │
│  Step 4: Pooling — combine all token vectors into ONE vector │
│  Method: typically mean pooling (average all token vectors)  │
│  or use [CLS] token representation                           │
│                                                              │
│  Step 5: Normalize — scale the vector to unit length         │
│  Final: [0.023, -0.041, 0.008, ..., -0.016]                 │
│         (1536 dimensions, length = 1.0)                      │
└─────────────────────────────────────────────────────────────┘

Why normalization matters: Normalized vectors (length = 1.0) make similarity calculations simpler and more consistent. When all vectors have the same length, the angle between them is the only thing that differs — and that angle represents semantic distance.
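Steps 4-5 and the normalization property can be sketched in a few lines. This is a simplified illustration, not the model's actual internals:

```javascript
// L2 normalization: scale a vector so its length is exactly 1
function normalize(vector) {
  const length = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return vector.map(v => v / length);
}

// For unit vectors, the dot product IS the cosine similarity
function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

const unit = normalize([3, 4]); // original length 5 → [0.6, 0.8], length 1
console.log(dot(unit, unit));   // ≈ 1 (a unit vector compared with itself)
```

This is why vector databases often store only normalized vectors: similarity reduces to a single dot product per comparison.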


4. Dimensionality: 1536 vs 3072

Different embedding models produce vectors of different sizes. More dimensions can capture finer distinctions in meaning, but cost more to store and compute.

  Model                  │ Dimensions │ Relative Quality │ Cost per 1M tokens │ Best For
  ───────────────────────┼────────────┼──────────────────┼────────────────────┼────────────────────────────────────────────
  text-embedding-3-small │ 1536       │ Good             │ ~$0.02             │ Most applications, cost-sensitive
  text-embedding-3-large │ 3072       │ Better           │ ~$0.13             │ High-accuracy retrieval, nuanced tasks
  text-embedding-ada-002 │ 1536       │ Legacy           │ ~$0.10             │ Existing systems (not recommended for new)

What do the dimensions represent?

Each dimension captures some abstract feature of the text. Unlike hand-crafted features, these are learned automatically — you can't point at dimension 742 and say "this measures formality." The model discovers patterns like:

Conceptual (what the model might learn — simplified):

  Dimension 1:   something related to "technical vs casual"
  Dimension 2:   something related to "positive vs negative sentiment"
  Dimension 3:   something related to "abstract vs concrete"
  ...
  Dimension 1536: something related to "question vs statement"

  In reality: each dimension captures a complex, non-human-interpretable
  combination of features. No single dimension has a clean label.

Reducing dimensions (Matryoshka embeddings)

OpenAI's text-embedding-3-* models support dimension reduction. You can request fewer dimensions to save storage while keeping most of the quality:

import OpenAI from 'openai';
const openai = new OpenAI();

// Full dimensions (1536)
const fullResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'JavaScript is a popular language',
});
console.log(fullResponse.data[0].embedding.length); // 1536

// Reduced dimensions (256) — saves 83% storage
const reducedResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'JavaScript is a popular language',
  dimensions: 256,
});
console.log(reducedResponse.data[0].embedding.length); // 256

Dimension vs quality trade-off (text-embedding-3-small):

  Dimensions  │  Relative Quality  │  Storage per vector
  ────────────┼────────────────────┼────────────────────
  1536        │  100% (baseline)   │  6 KB
  1024        │  ~99%              │  4 KB
  512         │  ~96%              │  2 KB
  256         │  ~92%              │  1 KB

  Rule of thumb: 512 dimensions is usually the sweet spot for
  storage-constrained applications. Below 256, quality drops fast.

5. Semantic Meaning in Vector Space

The most powerful property of embeddings is that similar meaning produces similar vectors. This happens automatically — the model learns it from billions of examples.

Synonyms are close

"happy"      → [0.234, -0.112, 0.056, ...]  ─┐
"joyful"     → [0.229, -0.108, 0.061, ...]   ├── Very close (similarity ~0.92)
"cheerful"   → [0.241, -0.105, 0.049, ...]  ─┘

"sad"        → [-0.198, 0.231, -0.087, ...]  ─┐
"unhappy"    → [-0.201, 0.225, -0.091, ...]   ├── Very close (similarity ~0.90)
"miserable"  → [-0.189, 0.240, -0.079, ...]  ─┘

"happy" vs "sad" → far apart (similarity ~0.35)

Concepts with similar meaning but different words

This is where embeddings shine over keyword search:

Query:    "How do I fix a bug in my code?"
Document: "Debugging techniques for software errors"

Keyword match:  0 words in common → keyword search FAILS
Embedding match: high similarity (~0.85) → semantic search SUCCEEDS

The embedding model "understands" that:
  "fix a bug" ≈ "debugging techniques"
  "code"      ≈ "software"
  "bug"       ≈ "errors"

Analogies emerge naturally

Classic example: embedding arithmetic reveals learned relationships.

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This means the model learned:
  king is to man as queen is to woman

Similarly:
  vector("Paris") - vector("France") + vector("Japan") ≈ vector("Tokyo")
  (capital-country relationship)

  vector("walked") - vector("walk") + vector("swim") ≈ vector("swam")
  (past-tense relationship)
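The arithmetic itself is just element-wise vector operations. A toy sketch with hand-picked 2-D vectors (the axes and values are invented for illustration; real models learn thousands of entangled dimensions):

```javascript
// Hand-picked 2-D vectors on invented axes: [royalty, maleness]
// (illustration only; real models learn thousands of entangled dimensions)
const vec = {
  king:  [1,  1],
  queen: [1, -1],
  man:   [0,  1],
  woman: [0, -1],
};

const add      = (a, b) => a.map((v, i) => v + b[i]);
const subtract = (a, b) => a.map((v, i) => v - b[i]);

// king - man + woman
const result = add(subtract(vec.king, vec.man), vec.woman);
console.log(result); // [1, -1], the same point as vec.queen
```

With real embeddings the result only lands *near* the target vector, so you take the nearest neighbor rather than an exact match.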

6. Embedding Models vs Generation Models

These are fundamentally different tools serving different purposes:

  Feature        │ Embedding Model                      │ Generation Model (LLM)
  ───────────────┼──────────────────────────────────────┼───────────────────────────────────────
  Input          │ Text                                 │ Text (prompt)
  Output         │ Vector of numbers                    │ Text (completion)
  Purpose        │ Represent meaning numerically        │ Generate new text
  Task           │ Search, similarity, classification   │ Conversation, writing, reasoning
  Output size    │ Fixed per model (e.g. 1536 or 3072)  │ Variable (depends on response)
  Cost           │ Very cheap (~$0.02/1M tokens)        │ Expensive (~$2.50-$15/1M tokens)
  Speed          │ Very fast (milliseconds)             │ Slower (seconds for long responses)
  Determinism    │ Same input → always same vector      │ Same input → may differ (temperature)
  Context window │ 8191 tokens (text-embedding-3)       │ 128K-200K+ tokens
  Examples       │ text-embedding-3-small/large         │ GPT-4o, Claude 4, Llama 3

When to use each:

Use EMBEDDING MODEL when you need to:
  ✓ Search for similar documents
  ✓ Build a recommendation system
  ✓ Classify text into categories
  ✓ Detect duplicate content
  ✓ Cluster documents by topic
  ✓ Feed a RAG pipeline's retrieval step

Use GENERATION MODEL when you need to:
  ✓ Answer questions in natural language
  ✓ Summarize text
  ✓ Write or edit content
  ✓ Extract structured data
  ✓ Have a conversation
  ✓ Feed a RAG pipeline's generation step

In a typical RAG pipeline, you use both: the embedding model retrieves relevant documents, and the generation model produces the answer.


7. Creating Embeddings with the OpenAI API

Basic: embed a single string

import OpenAI from 'openai';

const openai = new OpenAI(); // Uses OPENAI_API_KEY env var

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });

  const embedding = response.data[0].embedding;
  console.log(`Text: "${text}"`);
  console.log(`Dimensions: ${embedding.length}`);           // 1536
  console.log(`First 5 values: [${embedding.slice(0, 5).join(', ')}]`);
  console.log(`Token usage: ${response.usage.total_tokens}`);
  
  return embedding;
}

const vector = await getEmbedding('JavaScript is a popular programming language');
// Text: "JavaScript is a popular programming language"
// Dimensions: 1536
// First 5 values: [0.02319, -0.04118, 0.00782, -0.01205, 0.03891]
// Token usage: 6

Batch: embed multiple strings at once

The API accepts an array of strings and returns all embeddings in a single request. This is much faster than one call per string: the token cost is the same, but you avoid per-request overhead and rate-limit pressure.

async function getEmbeddings(texts) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,  // Array of strings
  });

  // Results are in the same order as input
  return response.data.map((item, index) => ({
    text: texts[index],
    embedding: item.embedding,
  }));
}

const documents = [
  'JavaScript was created in 1995 by Brendan Eich',
  'Python is known for its simple syntax',
  'TypeScript adds static types to JavaScript',
  'React is a popular frontend framework',
  'Machine learning requires large datasets',
];

const results = await getEmbeddings(documents);
console.log(`Embedded ${results.length} documents`);
console.log(`Each vector has ${results[0].embedding.length} dimensions`);

// results[0].text = "JavaScript was created in 1995 by Brendan Eich"
// results[0].embedding = [0.023, -0.041, ...] (1536 numbers)

With reduced dimensions

async function getCompactEmbedding(text, dims = 512) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
    dimensions: dims,  // Request fewer dimensions
  });

  return response.data[0].embedding;
}

const full = await getEmbedding('Hello world');       // 1536 dims, ~6 KB
const compact = await getCompactEmbedding('Hello world', 256); // 256 dims, ~1 KB

console.log(`Full: ${full.length} dimensions`);       // 1536
console.log(`Compact: ${compact.length} dimensions`); // 256

Error handling and rate limits

async function getEmbeddingSafe(text, retried = false) {
  // Input validation
  if (!text || typeof text !== 'string') {
    throw new Error('Input must be a non-empty string');
  }

  // The embedding model has an 8191 token limit
  // Rough check: 8191 tokens ≈ 32,000 characters
  if (text.length > 32000) {
    console.warn('Text may exceed token limit, consider chunking');
  }

  try {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  } catch (error) {
    if (error.status === 429 && !retried) {
      // Rate limited — wait and retry once, then give up
      console.log('Rate limited, waiting 1 second...');
      await new Promise(resolve => setTimeout(resolve, 1000));
      return getEmbeddingSafe(text, true);
    }
    if (error.status === 400) {
      console.error('Invalid input — text may be too long');
    }
    throw error;
  }
}

Embedding many documents efficiently

// Process documents in batches to respect rate limits
async function embedDocuments(documents, batchSize = 100) {
  const allEmbeddings = [];

  for (let i = 0; i < documents.length; i += batchSize) {
    const batch = documents.slice(i, i + batchSize);

    console.log(`Processing batch ${Math.floor(i / batchSize) + 1} ` +
                `(${batch.length} documents)...`);

    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch,
    });

    const embeddings = response.data.map((item, index) => ({
      text: batch[index],
      embedding: item.embedding,
      index: i + index,
    }));

    allEmbeddings.push(...embeddings);

    // Small delay between batches to avoid rate limits
    if (i + batchSize < documents.length) {
      await new Promise(resolve => setTimeout(resolve, 200));
    }
  }

  console.log(`Embedded ${allEmbeddings.length} documents total`);
  return allEmbeddings;
}

// Usage
const corpus = [
  'First document text...',
  'Second document text...',
  // ... potentially thousands of documents
];

const embedded = await embedDocuments(corpus);

8. What Makes a Good Embedding?

Not all text produces equally useful embeddings. Understanding what works well (and what doesn't) helps you design better systems.

GOOD embeddings (high information density):
  ✓ "React is a JavaScript library for building user interfaces"
    → Clear, specific, rich in semantic content

  ✓ "The patient presented with acute chest pain and shortness of breath"
    → Domain-specific, descriptive, contextual

BAD embeddings (low information density):
  ✗ "This is a document"
    → Too vague, no semantic content

  ✗ "Click here for more info"
    → Navigational text, not meaningful content

  ✗ "asdfghjkl"
    → Nonsense text

  ✗ "................"
    → No information at all

SURPRISING embeddings (context matters):
  "bank" alone    → ambiguous (financial bank? river bank?)
  "bank account"  → clearly financial
  "river bank"    → clearly geographical
  → Embedding models handle this through CONTEXT

Text preparation tips

// BEFORE embedding: clean and prepare text

function prepareForEmbedding(text) {
  return text
    .replace(/\s+/g, ' ')   // Collapse all whitespace (newlines included)
    .trim()                 // Trim edges
    .slice(0, 32000);       // Stay under the 8191-token limit (≈ 32,000 chars)
}

// Add metadata context for better embeddings
function enrichText(text, metadata) {
  // Prepending metadata helps the embedding model understand context
  const prefix = metadata.title ? `Title: ${metadata.title}. ` : '';
  const category = metadata.category ? `Category: ${metadata.category}. ` : '';
  return `${prefix}${category}${text}`;
}

// Example
const raw = "Click here to learn about closures and how they work";
const enriched = enrichText(raw, {
  title: 'JavaScript Closures',
  category: 'Programming Tutorials'
});
// "Title: JavaScript Closures. Category: Programming Tutorials. Click here to learn about closures and how they work"
// → Much better embedding because the model has richer context

9. Embedding Costs and Performance

Embeddings are extremely cheap compared to generation. This makes them practical for large-scale applications.

Cost comparison (approximate, per 1M tokens):

  text-embedding-3-small:  $0.02     ← 125x cheaper than GPT-4o input
  text-embedding-3-large:  $0.13     ← 19x cheaper than GPT-4o input
  GPT-4o input:            $2.50
  GPT-4o output:           $10.00

Practical example:
  Embed 1 million documents (average 200 tokens each) = 200M tokens
  text-embedding-3-small: 200 × $0.02 = $4.00 total
  text-embedding-3-large: 200 × $0.13 = $26.00 total

  That's your entire knowledge base embedded for under $30.

Speed:
  Single embedding:    ~50-100ms
  Batch of 100:        ~200-500ms
  1 million documents: ~30-60 minutes (with batching)
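The cost arithmetic above is easy to wrap in a helper. A sketch using the approximate prices quoted here (verify current pricing before budgeting):

```javascript
// Rough cost estimator; prices are the approximate figures quoted above,
// so verify current pricing before budgeting
function estimateEmbeddingCost(docCount, avgTokensPerDoc, pricePerMillionTokens) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 1M documents × 200 tokens each, text-embedding-3-small at ~$0.02/1M tokens
console.log(estimateEmbeddingCost(1_000_000, 200, 0.02)); // 4 (dollars)
```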

Storage requirements

Storage per vector:

  1536 dimensions × 4 bytes (float32) = 6,144 bytes ≈ 6 KB
  3072 dimensions × 4 bytes (float32) = 12,288 bytes ≈ 12 KB

  1 million documents at 1536 dims = ~6 GB
  1 million documents at 3072 dims = ~12 GB
  1 million documents at 256 dims  = ~1 GB (reduced)

  Metadata + index overhead typically adds 20-50% more.
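The storage math can be sketched the same way. Each dimension is a 4-byte float32; the overhead factor is a rough assumption, not a measured figure:

```javascript
// Storage estimate: dimensions × 4 bytes (float32) per vector;
// the overhead factor is a rough assumption for index + metadata
function storageBytes(docCount, dimensions, overheadFactor = 1.0) {
  return docCount * dimensions * 4 * overheadFactor;
}

const toGB = bytes => (bytes / 1e9).toFixed(1);

console.log(toGB(storageBytes(1_000_000, 1536)));       // "6.1" (raw vectors)
console.log(toGB(storageBytes(1_000_000, 1536, 1.35))); // "8.3" (with ~35% overhead)
```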

10. Visualizing Embeddings (Dimensionality Reduction)

You can't visualize 1536 dimensions directly, but you can project them down to 2D or 3D using techniques like t-SNE or UMAP. This helps you verify that semantically similar documents cluster together.

After embedding 20 documents and projecting to 2D:

    ▲ y
    │
    │   ● "React hooks guide"
    │     ● "Vue composition API"              Programming
    │       ● "Angular components"             cluster
    │
    │                         ● "chocolate cake recipe"
    │                           ● "pasta carbonara"        Cooking
    │                         ● "grilled salmon"           cluster
    │
    │  ● "JavaScript closures"
    │    ● "TypeScript generics"
    │      ● "Python decorators"               Programming
    │                                          cluster
    │
    │                                    ● "how to train for marathon"
    │                                      ● "best running shoes"    Fitness
    │                                    ● "yoga for beginners"      cluster
    │
    └──────────────────────────────────────────────────────────► x

    Documents about similar topics naturally cluster together,
    even though they use completely different words.

11. Common Misconceptions

  Misconception: "Embeddings understand text"
  Reality:       Embeddings capture statistical patterns of meaning, not true
                 understanding. They are mathematical representations.

  Misconception: "More dimensions = always better"
  Reality:       Diminishing returns set in quickly; 1536 dimensions is
                 sufficient for most use cases.

  Misconception: "Same text = same embedding across models"
  Reality:       Different models produce completely different vectors. You
                 cannot mix embeddings from different models.

  Misconception: "Embeddings are just bag-of-words"
  Reality:       Modern embeddings capture word order, context, and nuance.
                 "Dog bites man" and "Man bites dog" produce different embeddings.

  Misconception: "You can embed infinite text"
  Reality:       Embedding models have token limits (8191 for OpenAI). Longer
                 text must be chunked.

  Misconception: "Embedding once is enough"
  Reality:       If you switch models or the model version changes, you must
                 re-embed everything.
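On the chunking point: a minimal character-based chunker might look like this. It is a sketch only; production systems usually chunk by tokens or semantic boundaries:

```javascript
// Naive fixed-size chunker with overlap, character-based for simplicity
// (production systems usually chunk by tokens or semantic boundaries)
function chunkText(text, chunkSize = 1000, overlap = 100) {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

const chunks = chunkText('a'.repeat(2500), 1000, 100);
console.log(chunks.length); // 3 (chars 0-1000, 900-1900, 1800-2500)
```

The overlap keeps a sentence that straddles a boundary fully present in at least one chunk.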

12. Key Takeaways

  1. An embedding is a fixed-length vector of floating-point numbers that represents the semantic meaning of text.
  2. Similar meaning produces similar vectors — "happy" and "joyful" are close, "happy" and "database" are far apart.
  3. text-embedding-3-small (1536 dims) is the best starting point — cheap, fast, and good enough for most applications.
  4. Embedding models are different from generation models — they convert text to numbers, not text to text.
  5. Batch embedding is critical for performance — always embed multiple documents in a single API call.
  6. Text quality affects embedding quality — clean, context-rich text produces better vectors than vague or noisy text.
  7. You cannot mix embeddings from different models — always use the same model for both indexing and querying.

Explain-It Challenge

  1. A colleague asks "why can't we just use keyword search — why do we need embeddings?" Explain with a concrete example where keyword search fails.
  2. Your vector database has 10 million documents embedded with text-embedding-ada-002. A new model text-embedding-3-small is released with better quality. Can you just start querying with the new model? Why or why not?
  3. Why is "bank" by itself a worse embedding than "the bank of the river had eroded after the flood"?

Navigation: ← 4.11 Overview · 4.11.b — Similarity Search →