Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
4.11.c — Document Chunking Strategies
In one sentence: Chunking is the process of splitting large documents into smaller, meaningful pieces before embedding them — because embedding models have token limits, and smaller chunks produce more precise and relevant search results than embedding an entire document as one giant vector.
Navigation: ← 4.11.b — Similarity Search · 4.11 Overview
1. Why You Must Chunk Documents
Embedding models have a token limit (8191 tokens for OpenAI's text-embedding-3 models, roughly 32,000 characters). But even if you could embed unlimited text, you shouldn't embed large documents as a single vector.
The Chunking Problem:
SCENARIO: You have a 50-page technical manual.
QUERY: "How do I reset the admin password?"
The answer is in paragraph 3 of page 27.
APPROACH 1: Embed the entire document as ONE vector
┌─────────────────────────────────────────────────────────┐
│ 50 pages of text → [single vector of 1536 numbers] │
│ │
│ Problem: The vector represents the AVERAGE meaning of │
│ all 50 pages. The password reset info is DILUTED by │
│ 48 pages of unrelated content. The vector is "about │
│ everything" which means it's "about nothing specific." │
│ │
│ Similarity to "reset admin password": ~0.45 (weak) │
└─────────────────────────────────────────────────────────┘
APPROACH 2: Chunk the document into paragraphs, embed each
┌─────────────────────────────────────────────────────────┐
│ Page 27, para 3: "To reset the admin password, │
│ navigate to Settings > Security > Password Reset..." │
│ → [vector that is SPECIFICALLY about password reset] │
│ │
│ Similarity to "reset admin password": ~0.91 (strong!) │
└─────────────────────────────────────────────────────────┘
Three reasons to chunk
1. TOKEN LIMITS
Embedding models cap at 8191 tokens (~32K characters).
Most real documents exceed this.
You MUST split to fit the model's input limit.
2. PRECISION
Smaller chunks = more focused vectors = better search results.
A 200-word chunk about password resets will match "how to
reset password" much better than a 10,000-word document
that mentions passwords once.
3. CONTEXT INJECTION
In RAG, you inject retrieved chunks into the LLM prompt.
The LLM's context window is limited. Injecting 5 focused
200-word chunks (1000 words) is far better than injecting
one 5000-word document where 90% is irrelevant.
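The token-limit reason can be made concrete with a quick back-of-the-envelope check. This sketch uses the rough heuristic of ~4 characters per token for English text (the constant names here are illustrative); for exact counts you'd use a real tokenizer such as tiktoken.

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic — use a real tokenizer (e.g. tiktoken) for exact counts.
const CHARS_PER_TOKEN = 4;
const EMBEDDING_TOKEN_LIMIT = 8191; // text-embedding-3 input limit

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function needsChunking(text, tokenLimit = EMBEDDING_TOKEN_LIMIT) {
  return estimateTokens(text) > tokenLimit;
}

// A 50-page manual at ~3000 characters per page:
const manual = 'x'.repeat(50 * 3000);
console.log(estimateTokens(manual)); // 37500 — far over the 8191 limit
console.log(needsChunking(manual));  // true
```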
2. Chunking Methods Overview
Chunking Strategy Comparison:
Strategy │ Complexity │ Quality │ Best For
──────────────────┼────────────┼─────────┼──────────────────────────
Fixed-size │ Low │ Basic │ Quick prototypes, logs
Sentence-based │ Low │ Good │ Articles, documentation
Paragraph-based │ Low │ Good │ Well-structured documents
Recursive char │ Medium │ Better │ Mixed content (LangChain)
Semantic │ High │ Best │ High-quality RAG systems
3. Fixed-Size Chunking
The simplest approach: split text into chunks of N characters (or N tokens), regardless of content boundaries.
Fixed-size chunking (chunk_size=200 characters):
Original text (600 characters):
"JavaScript was created in 1995 by Brendan Eich at Netscape.
It was originally called Mocha, then LiveScript, before being
renamed to JavaScript. Despite the name, JavaScript has no
direct relation to Java. The language was standardized as
ECMAScript. Today, JavaScript runs in browsers, servers
(Node.js), mobile apps, and even desktop applications. It
is one of the most popular programming languages in the world."
Chunk 1 (chars 0-199):
"JavaScript was created in 1995 by Brendan Eich at Netscape.
It was originally called Mocha, then LiveScript, before being
renamed to JavaScript. Despite the name, JavaScript has no di"
^^
Word cut in half!
Chunk 2 (chars 200-399):
"rect relation to Java. The language was standardized as
ECMAScript. Today, JavaScript runs in browsers, servers
(Node.js), mobile apps, and even desktop applications. It is"
Chunk 3 (chars 400-599):
" one of the most popular programming languages in the world."
Implementation
function fixedSizeChunk(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      text: text.slice(start, end),
      startIndex: start,
      endIndex: end,
    });
    if (end === text.length) break; // Avoid a trailing chunk that is pure overlap
    start += chunkSize - overlap;   // Move forward by chunkSize - overlap
  }
return chunks;
}
// Example
const text = 'A'.repeat(500) + 'B'.repeat(500) + 'C'.repeat(500);
const chunks = fixedSizeChunk(text, 400, 100);
console.log(`${chunks.length} chunks created`); // "5 chunks created"
// Chunks are 400 chars (the last is 300), each overlapping the previous by 100
Pros and cons
PROS:
✓ Dead simple to implement
✓ Predictable chunk sizes (good for token budgeting)
✓ Works on any text format
CONS:
✗ Cuts words and sentences in the middle
✗ Chunks may lack coherent meaning
✗ Important context can be split across chunks
✗ "JavaScript has no di" / "rect relation" — meaningless fragments
4. Sentence-Based Chunking
Split text at sentence boundaries, then group sentences into chunks of a target size.
Sentence-based chunking:
Original text split into sentences:
S1: "JavaScript was created in 1995 by Brendan Eich at Netscape."
S2: "It was originally called Mocha, then LiveScript."
S3: "Despite the name, JavaScript has no direct relation to Java."
S4: "The language was standardized as ECMAScript."
S5: "Today, JavaScript runs in browsers, servers, and mobile apps."
S6: "It is one of the most popular programming languages."
Group into chunks of ~3 sentences:
Chunk 1: S1 + S2 + S3 (coherent: history and naming)
Chunk 2: S4 + S5 + S6 (coherent: standardization and usage)
Each chunk is a complete, meaningful unit of text.
Implementation
function sentenceChunk(text, sentencesPerChunk = 3, overlapSentences = 1) {
// Split into sentences (simple regex — production systems use NLP libraries)
const sentences = text
.replace(/\n+/g, ' ')
.split(/(?<=[.!?])\s+/)
.filter(s => s.trim().length > 0);
  const chunks = [];
  // Guard: the step must be at least 1 or the loop never advances
  const step = Math.max(1, sentencesPerChunk - overlapSentences);
  for (let i = 0; i < sentences.length; i += step) {
const chunkSentences = sentences.slice(i, i + sentencesPerChunk);
if (chunkSentences.length === 0) break;
chunks.push({
text: chunkSentences.join(' '),
sentenceStart: i,
sentenceEnd: i + chunkSentences.length - 1,
sentenceCount: chunkSentences.length,
});
// Stop if we've reached the end
if (i + sentencesPerChunk >= sentences.length) break;
}
return chunks;
}
// Example
const article = `JavaScript was created in 1995 by Brendan Eich. It was built in just 10 days. The language was originally called Mocha. It was later renamed to LiveScript. Finally it became JavaScript. Today it powers most of the web.`;
const chunks = sentenceChunk(article, 3, 1);
chunks.forEach((c, i) => {
console.log(`Chunk ${i + 1} (${c.sentenceCount} sentences): "${c.text}"\n`);
});
// Chunk 1: "JavaScript was created in 1995 by Brendan Eich. It was built in just 10 days. The language was originally called Mocha."
// Chunk 2: "The language was originally called Mocha. It was later renamed to LiveScript. Finally it became JavaScript."
// Chunk 3: "Finally it became JavaScript. Today it powers most of the web."
Pros and cons
PROS:
✓ Never cuts mid-sentence
✓ Chunks are readable and coherent
✓ Respects natural language boundaries
CONS:
✗ Sentences vary wildly in length — chunk sizes are unpredictable
✗ Sentence detection is imperfect (abbreviations like "U.S." cause splits)
✗ Long sentences can exceed desired chunk size
✗ Doesn't respect paragraph or section boundaries
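The abbreviation problem in the cons list is easy to demonstrate with the same naive regex used in the implementation above: any "." followed by whitespace counts as a sentence boundary, so abbreviations produce false splits.

```javascript
// The naive sentence splitter from the implementation above:
// split after any ".", "!", or "?" followed by whitespace.
const splitSentences = s => s.split(/(?<=[.!?])\s+/);

console.log(splitSentences('The U.S. government funds NASA. It was founded in 1958.'));
// → ['The U.S.', 'government funds NASA.', 'It was founded in 1958.']
//    ^^^^^^^^^^ false boundary after the abbreviation "U.S."
```

Production systems sidestep this with NLP sentence tokenizers that know about abbreviations, initials, and decimal numbers.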
5. Paragraph-Based Chunking
Split text at paragraph boundaries (double newlines). Each paragraph is a natural unit of thought.
Paragraph-based chunking:
Original document:
────────────────────
"# Introduction ← Paragraph 1
JavaScript is the most popular...
# History ← Paragraph 2
Created in 1995, JavaScript was...
# Modern Usage ← Paragraph 3
Today JavaScript runs on servers..."
Chunks:
Chunk 1: "Introduction\nJavaScript is the most popular..."
Chunk 2: "History\nCreated in 1995, JavaScript was..."
Chunk 3: "Modern Usage\nToday JavaScript runs on servers..."
Implementation
function paragraphChunk(text, maxChunkSize = 2000, overlap = 0) {
// Split on double newlines (paragraph breaks)
const paragraphs = text
.split(/\n\s*\n/)
.map(p => p.trim())
.filter(p => p.length > 0);
const chunks = [];
let currentChunk = '';
for (const paragraph of paragraphs) {
// If adding this paragraph exceeds max size, save current and start new
if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({ text: currentChunk.trim() });
      // Start new chunk, optionally seeded with overlap from the previous one.
      // The '\n\n' separator is appended below, so don't add it here.
      currentChunk = overlap > 0 ? currentChunk.slice(-overlap) : '';
}
currentChunk += (currentChunk ? '\n\n' : '') + paragraph;
}
// Don't forget the last chunk
if (currentChunk.trim().length > 0) {
chunks.push({ text: currentChunk.trim() });
}
return chunks;
}
// Example
const doc = `JavaScript was created in 1995 by Brendan Eich at Netscape Communications. It was designed to add interactivity to web pages.

The language was originally called Mocha during development. It was briefly renamed to LiveScript before receiving its final name, JavaScript, as part of a marketing agreement with Sun Microsystems.

Today, JavaScript is one of the most popular programming languages in the world. It runs in browsers, on servers via Node.js, in mobile apps, and even in desktop applications.`;
const chunks = paragraphChunk(doc, 300);
chunks.forEach((c, i) => {
console.log(`Chunk ${i + 1} (${c.text.length} chars):\n"${c.text}"\n`);
});
Pros and cons
PROS:
✓ Respects natural document structure
✓ Each chunk is a coherent unit of thought
✓ Simple to implement
✓ Works well for articles, docs, and structured text
CONS:
✗ Paragraph sizes vary enormously (1 sentence to 50 sentences)
✗ Very short paragraphs produce low-information chunks
✗ Very long paragraphs may exceed embedding model limits
✗ Not all text has clear paragraph structure (code, logs, chat)
6. Recursive Character Splitting
The most popular strategy (used by LangChain). It tries a hierarchy of separators, splitting on the best available boundary.
Recursive Character Splitting — How It Works:
Separators (tried in order):
1. "\n\n" — paragraph break (best boundary)
2. "\n" — line break
3. ". " — sentence end
4. " " — word break
5. "" — character break (last resort)
Algorithm:
1. Try to split on "\n\n" (paragraphs)
2. If any chunk is still too large, split those on "\n"
3. If still too large, split on ". "
4. If still too large, split on " "
5. Never cut mid-character
This ensures the BEST possible split boundary is always used.
Implementation
function recursiveCharacterSplit(text, {
chunkSize = 1000,
chunkOverlap = 200,
separators = ['\n\n', '\n', '. ', ' ', ''],
} = {}) {
const chunks = [];
function splitRecursive(text, separatorIndex) {
// Base case: text fits in one chunk
if (text.length <= chunkSize) {
if (text.trim().length > 0) {
chunks.push(text.trim());
}
return;
}
// Try current separator
const separator = separators[separatorIndex];
if (separatorIndex >= separators.length - 1) {
// Last resort: hard cut at chunkSize
chunks.push(text.slice(0, chunkSize).trim());
const remaining = text.slice(chunkSize - chunkOverlap);
if (remaining.trim().length > 0) {
splitRecursive(remaining, 0); // Restart separator hierarchy
}
return;
}
const parts = text.split(separator).filter(p => p.trim().length > 0);
if (parts.length <= 1) {
// This separator didn't help — try the next one
splitRecursive(text, separatorIndex + 1);
return;
}
// Merge parts into chunks that fit within chunkSize
let currentChunk = '';
for (const part of parts) {
const combined = currentChunk
? currentChunk + separator + part
: part;
if (combined.length <= chunkSize) {
currentChunk = combined;
} else {
// Save current chunk
if (currentChunk.trim().length > 0) {
chunks.push(currentChunk.trim());
}
// If this single part is too large, recursively split it
if (part.length > chunkSize) {
splitRecursive(part, separatorIndex + 1);
currentChunk = '';
} else {
currentChunk = part;
}
}
}
// Don't forget the last chunk
if (currentChunk.trim().length > 0) {
chunks.push(currentChunk.trim());
}
}
splitRecursive(text, 0);
// Add overlap between chunks
if (chunkOverlap > 0 && chunks.length > 1) {
return addOverlap(chunks, chunkOverlap);
}
return chunks;
}
function addOverlap(chunks, overlapSize) {
const result = [chunks[0]];
for (let i = 1; i < chunks.length; i++) {
const prevChunk = chunks[i - 1];
const overlapText = prevChunk.slice(-overlapSize);
result.push(overlapText + ' ' + chunks[i]);
}
return result;
}
// Example
const document = `# Getting Started with React

React is a JavaScript library for building user interfaces. It was created by Facebook and is now maintained by Meta.

## Components

Components are the building blocks of React applications. Each component is a JavaScript function that returns JSX.

function Greeting({ name }) {
  return <h1>Hello, {name}!</h1>;
}

## State Management

State is data that changes over time. React provides the useState hook for managing state in functional components.

const [count, setCount] = useState(0);

This allows components to be interactive and respond to user actions.`;
const chunks = recursiveCharacterSplit(document, {
chunkSize: 300,
chunkOverlap: 50,
});
chunks.forEach((c, i) => {
console.log(`--- Chunk ${i + 1} (${c.length} chars) ---`);
console.log(c);
console.log();
});
Pros and cons
PROS:
✓ Respects natural boundaries (paragraphs > lines > sentences > words)
✓ Handles mixed content well (prose, code, headers)
✓ Predictable chunk sizes
✓ Industry standard (used by LangChain, LlamaIndex)
CONS:
✗ More complex implementation
✗ Still purely structural — doesn't understand content meaning
✗ Code blocks may get split at arbitrary points
✗ Separator hierarchy is a heuristic, not always optimal
7. Semantic Chunking
The most sophisticated approach: split text where the meaning changes, not where structural boundaries happen. Uses embeddings themselves to detect topic shifts.
Semantic Chunking — How It Works:
1. Split text into sentences
2. Embed each sentence
3. Compare consecutive sentence embeddings
4. Where similarity DROPS, insert a chunk boundary
Sentences: S1 S2 S3 S4 S5 S6 S7 S8
Similarity: 0.92 0.88 0.34 0.91 0.85 0.31 0.89
^^^^ ^^^^
Topic change! Topic change!
Chunks: [S1, S2, S3] | [S4, S5, S6] | [S7, S8]
Topic A Topic B Topic C
Implementation
import OpenAI from 'openai';
const openai = new OpenAI();
async function semanticChunk(text, {
similarityThreshold = 0.75,
minChunkSize = 100,
maxChunkSize = 2000,
} = {}) {
// Step 1: Split into sentences
const sentences = text
.replace(/\n+/g, ' ')
.split(/(?<=[.!?])\s+/)
.filter(s => s.trim().length > 10); // Filter out tiny fragments
if (sentences.length <= 1) {
return [text]; // Nothing to chunk
}
// Step 2: Embed all sentences in one batch
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: sentences,
});
const embeddings = response.data.map(d => d.embedding);
// Step 3: Calculate similarity between consecutive sentences
const similarities = [];
for (let i = 0; i < embeddings.length - 1; i++) {
similarities.push({
index: i,
similarity: cosineSimilarity(embeddings[i], embeddings[i + 1]),
});
}
// Step 4: Find breakpoints where similarity drops below threshold
const breakpoints = similarities
.filter(s => s.similarity < similarityThreshold)
.map(s => s.index + 1); // Break AFTER the low-similarity pair
// Step 5: Build chunks from breakpoints
const chunks = [];
let start = 0;
for (const breakpoint of breakpoints) {
const chunkText = sentences.slice(start, breakpoint).join(' ');
// Respect min/max chunk size
if (chunkText.length >= minChunkSize) {
chunks.push(chunkText);
start = breakpoint;
}
}
// Add remaining sentences as the last chunk
const lastChunk = sentences.slice(start).join(' ');
if (lastChunk.length >= minChunkSize) {
chunks.push(lastChunk);
} else if (chunks.length > 0) {
// Merge tiny last chunk with previous
chunks[chunks.length - 1] += ' ' + lastChunk;
} else {
chunks.push(lastChunk);
}
return chunks;
}
function cosineSimilarity(a, b) {
let dot = 0;
for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // OpenAI embeddings are unit-length, so the dot product equals cosine similarity
}
// Usage
const text = `
React is a JavaScript library for building user interfaces. It uses a virtual DOM for efficient updates. Components are the building blocks of React apps. Each component manages its own state and lifecycle.
Machine learning is a subset of artificial intelligence. It involves training models on data to make predictions. Neural networks are a popular machine learning technique. Deep learning uses many layers of neural networks.
PostgreSQL is a powerful relational database. It supports complex queries and ACID transactions. Indexes improve query performance significantly. Foreign keys maintain referential integrity between tables.
`;
const chunks = await semanticChunk(text, { similarityThreshold: 0.7 });
chunks.forEach((c, i) => {
console.log(`--- Semantic Chunk ${i + 1} ---`);
console.log(c);
console.log();
});
// Chunk 1: React content (sentences about React cluster together)
// Chunk 2: ML content (sentences about ML cluster together)
// Chunk 3: Database content (sentences about PostgreSQL cluster together)
Pros and cons
PROS:
✓ Chunks are semantically coherent (each chunk = one topic)
✓ Produces the highest-quality chunks for RAG
✓ Adapts to content — no fixed rules about where to split
✓ Works well for documents that mix topics
CONS:
✗ Expensive — requires embedding every sentence (API calls)
✗ Slower — can't chunk locally without a model
✗ Complex to implement and tune
✗ Similarity threshold needs calibration per domain
✗ Overkill for well-structured documents (headers/paragraphs work fine)
8. Chunk Size: The Critical Trade-off
Chunk size is the single most important decision in your chunking strategy. Too small and you lose context. Too large and you dilute relevance.
The Chunk Size Spectrum:
TOO SMALL (50-100 tokens)
┌──────────────────────────────────────────────┐
│ "To reset the password" │
│ │
│ Problem: No context! Reset WHICH password? │
│ For which system? What are the steps? │
│ The embedding is too vague to be useful. │
└──────────────────────────────────────────────┘
JUST RIGHT (200-500 tokens)
┌──────────────────────────────────────────────┐
│ "To reset the admin password in the │
│ dashboard, navigate to Settings > Security │
│ > Password Reset. Enter your current │
│ password, then your new password twice. │
│ Passwords must be at least 12 characters │
│ and include a number and special character." │
│ │
│ Perfect: Enough context to be specific, │
│ focused enough to match relevant queries. │
└──────────────────────────────────────────────┘
TOO LARGE (2000+ tokens)
┌──────────────────────────────────────────────┐
│ [Entire 'Security Settings' chapter] │
│ Password reset + 2FA setup + API keys + │
│ session management + audit logs + ... │
│ │
│ Problem: The embedding averages ALL these │
│ topics. A query about "password reset" │
│ gets a weak match because the vector is │
│ diluted by unrelated security content. │
└──────────────────────────────────────────────┘
Guidelines by content type
| Content Type | Recommended Chunk Size | Overlap | Why |
|---|---|---|---|
| FAQ / Q&A pairs | 100-200 tokens | 0 | Each Q&A is self-contained |
| Technical docs | 200-400 tokens | 50-100 | Need enough context for procedures |
| Articles / blog posts | 300-500 tokens | 50-100 | Balanced detail and focus |
| Legal / medical docs | 400-600 tokens | 100-150 | Precision matters, need full context |
| Code documentation | 200-400 tokens | 50 | Function-level chunks work well |
| Chat logs | 100-300 tokens | 0 | Per-message or per-exchange |
| Books / long reports | 500-1000 tokens | 100-200 | Section-level chunks |
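The table above gives sizes in tokens, while the splitters in this section measure chunk size in characters. A small bridge, reusing the rough ~4 characters-per-token heuristic (the function name and example values are illustrative, not from a standard API):

```javascript
// Convert a token budget into a character budget for the
// character-based splitters in this section (~4 chars/token heuristic).
function tokensToChars(tokens, charsPerToken = 4) {
  return tokens * charsPerToken;
}

// Example: the "Technical docs" row — 200-400 tokens with 50-100 token overlap
const config = {
  chunkSize: tokensToChars(400),   // 1600 characters
  chunkOverlap: tokensToChars(50), // 200 characters
};
console.log(config); // { chunkSize: 1600, chunkOverlap: 200 }
```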
Measuring the impact of chunk size
// Experiment: test different chunk sizes on your data
async function evaluateChunkSizes(document, testQueries, chunkSizes) {
const results = {};
for (const size of chunkSizes) {
// Chunk the document
const chunks = recursiveCharacterSplit(document, {
chunkSize: size,
chunkOverlap: Math.floor(size * 0.15), // 15% overlap
});
// Embed all chunks
const embeddedChunks = await embedDocuments(chunks);
// Test each query
let totalRelevanceScore = 0;
for (const { query, expectedContent } of testQueries) {
const queryEmb = await getEmbedding(query);
const topChunk = embeddedChunks
.map(c => ({
text: c.text,
sim: cosineSimilarity(queryEmb, c.embedding),
}))
.sort((a, b) => b.sim - a.sim)[0];
// Check if the top result contains the expected content
const containsExpected = topChunk.text.includes(expectedContent);
totalRelevanceScore += containsExpected ? 1 : 0;
}
results[size] = {
chunkCount: chunks.length,
avgChunkLength: Math.round(chunks.reduce((a, c) => a + c.length, 0) / chunks.length),
relevanceScore: totalRelevanceScore / testQueries.length,
};
}
console.table(results);
return results;
}
// Run the experiment
await evaluateChunkSizes(myDocument, myTestQueries, [100, 200, 400, 800, 1600]);
// Typical result: 200-400 wins for most document types
9. Overlap Between Chunks
Overlap means that the end of one chunk is repeated at the beginning of the next chunk. This prevents important context from being lost at chunk boundaries.
Without overlap (chunk_size=200):
Chunk 1: "...React uses a virtual DOM for efficient rendering."
Chunk 2: "It batches updates and minimizes real DOM operations..."
If a query asks about "virtual DOM efficiency," chunk 1 has
"virtual DOM" and chunk 2 has "efficient" context. Neither
chunk alone has the full picture.
With overlap (chunk_size=200, overlap=50):
Chunk 1: "...React uses a virtual DOM for efficient rendering."
Chunk 2: "for efficient rendering. It batches updates and..."
^^^^^^^^^^^^^^^^^^^^^^^^
Overlapping text — appears in BOTH chunks
Now chunk 2 also contains "efficient rendering," preserving
context across the boundary.
How much overlap?
Overlap Guidelines:
0% overlap:
✓ No duplicate content in database
✓ Lower storage cost
✗ Context lost at boundaries
10-15% overlap (recommended starting point):
✓ Good balance of context preservation and storage
✓ Works well for most content types
Example: chunk_size=400, overlap=50
20-30% overlap:
✓ Maximum context preservation
✗ Significant duplicate content
✗ Higher storage cost
✗ May return near-duplicate results
Example: chunk_size=400, overlap=100
>30% overlap:
✗ Diminishing returns
✗ Lots of wasted storage
✗ Search results will have many near-duplicates
Implementation with deduplication
function chunkWithOverlap(text, chunkSize = 400, overlap = 60) {
const chunks = [];
let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        startChar: start,
        endChar: end,
        chunkIndex: chunks.length,
      });
    }
    if (end === text.length) break; // Avoid a trailing chunk that is pure overlap
    // Safety: ensure we're making progress before advancing
    if (chunkSize - overlap <= 0) break;
    start += chunkSize - overlap;   // Move forward by (chunkSize - overlap)
  }
return chunks;
}
// When searching, deduplicate overlapping results
function deduplicateResults(results, overlapThreshold = 0.95) {
const unique = [];
for (const result of results) {
const isDuplicate = unique.some(
u => cosineSimilarity(u.embedding, result.embedding) > overlapThreshold
);
if (!isDuplicate) {
unique.push(result);
}
}
return unique;
}
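A quick sanity check of the deduplication logic with toy 2-D unit vectors (hypothetical data — real embeddings have 1536 dimensions, but the math is identical). The helpers are restated here so the snippet runs on its own:

```javascript
// Dot product of unit-length vectors = cosine similarity
function cosineSimilarity(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

function deduplicateResults(results, overlapThreshold = 0.95) {
  const unique = [];
  for (const result of results) {
    const isDuplicate = unique.some(
      u => cosineSimilarity(u.embedding, result.embedding) > overlapThreshold
    );
    if (!isDuplicate) unique.push(result);
  }
  return unique;
}

// Two near-identical vectors (from overlapping chunks) and one unrelated one
const results = [
  { text: 'chunk A',      embedding: [1, 0] },
  { text: 'chunk A copy', embedding: [0.999, 0.0447] }, // cos ≈ 0.999 → dropped
  { text: 'chunk B',      embedding: [0, 1] },          // cos = 0     → kept
];
console.log(deduplicateResults(results).map(r => r.text));
// → ['chunk A', 'chunk B']
```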
10. Metadata: What to Attach to Each Chunk
Raw text chunks are not enough. You need metadata to filter, sort, and provide context when chunks are retrieved.
Chunk with metadata:
{
text: "To reset the admin password, navigate to Settings > Security...",
metadata: {
source: "admin-guide.pdf", // Where it came from
page: 27, // Page number
section: "Security Settings", // Section/chapter
chunkIndex: 42, // Position in document
totalChunks: 156, // Total chunks in source
createdAt: "2025-03-15T10:00:00Z", // When embedded
documentVersion: "2.1", // Document version
category: "admin", // Content category
language: "en", // Language
tokenCount: 312, // Exact token count
}
}
Why metadata matters
Scenario: User asks "How do I reset the admin password?"
Without metadata:
→ Returns chunk: "To reset the password, click Settings > Security..."
→ But WHICH product? WHICH version? Is this outdated?
With metadata:
→ Returns chunk: "To reset the admin password..."
→ Metadata: source="admin-guide-v2.1.pdf", page=27, updatedAt="2025-03"
→ You can show: "From Admin Guide v2.1, page 27 (updated March 2025)"
→ You can filter: only search documents where category="admin"
Implementation
function chunkDocumentWithMetadata(document) {
const { text, filename, category, version, createdAt } = document;
// Chunk the text
const rawChunks = recursiveCharacterSplit(text, {
chunkSize: 400,
chunkOverlap: 60,
});
// Attach metadata to each chunk
return rawChunks.map((chunkText, index) => ({
text: chunkText,
embedding: null, // Will be filled by embedding step
metadata: {
source: filename,
category: category,
version: version,
chunkIndex: index,
totalChunks: rawChunks.length,
charCount: chunkText.length,
createdAt: createdAt || new Date().toISOString(),
// Helpful for RAG: prepend context to help the LLM
        contextHeader: `[Source: ${filename}, Section ${index + 1}/${rawChunks.length}]`,
},
}));
}
// When retrieved, inject metadata as context for the LLM
function formatChunkForPrompt(chunk) {
return `${chunk.metadata.contextHeader}\n${chunk.text}`;
}
// In your RAG pipeline:
const relevantChunks = await vectorSearch(query, topK);
const context = relevantChunks
.map(formatChunkForPrompt)
.join('\n\n---\n\n');
const prompt = `Based on the following context, answer the question.
Context:
${context}
Question: ${query}`;
Metadata-filtered search
// Search with metadata filters (pre-filter before similarity)
async function filteredSearch(query, filters = {}, topK = 5) {
const queryEmb = await getEmbedding(query);
let candidates = allDocuments;
// Apply metadata filters BEFORE similarity search
if (filters.category) {
candidates = candidates.filter(d => d.metadata.category === filters.category);
}
if (filters.source) {
candidates = candidates.filter(d => d.metadata.source === filters.source);
}
if (filters.after) {
candidates = candidates.filter(d => d.metadata.createdAt > filters.after);
}
// Now do similarity search on the filtered set
const results = candidates
.map(doc => ({
...doc,
similarity: cosineSimilarity(queryEmb, doc.embedding),
}))
.sort((a, b) => b.similarity - a.similarity)
.slice(0, topK);
return results;
}
// Usage: Only search admin docs from 2025
const results = await filteredSearch('reset password', {
category: 'admin',
after: '2025-01-01',
}, 3);
11. Complete Chunking Pipeline
Here's a full pipeline that puts it all together:
import OpenAI from 'openai';
const openai = new OpenAI();
// ── Step 1: Load and clean the document ──────────────────────────
function cleanText(text) {
return text
.replace(/\r\n/g, '\n') // Normalize line endings
.replace(/\t/g, ' ') // Tabs to spaces
.replace(/ +/g, ' ') // Collapse multiple spaces
.replace(/\n{3,}/g, '\n\n') // Max 2 consecutive newlines
.trim();
}
// ── Step 2: Chunk the document ───────────────────────────────────
function chunkDocument(text, strategy = 'recursive', options = {}) {
const cleaned = cleanText(text);
switch (strategy) {
case 'fixed':
return fixedSizeChunk(cleaned, options.chunkSize || 1000, options.overlap || 200);
case 'sentence':
return sentenceChunk(cleaned, options.sentencesPerChunk || 5, options.overlapSentences || 1);
case 'paragraph':
return paragraphChunk(cleaned, options.maxChunkSize || 1500, options.overlap || 0);
case 'recursive':
default:
return recursiveCharacterSplit(cleaned, {
chunkSize: options.chunkSize || 800,
chunkOverlap: options.overlap || 120,
}).map(text => ({ text }));
}
}
// ── Step 3: Embed all chunks ─────────────────────────────────────
async function embedChunks(chunks, batchSize = 100) {
const results = [];
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const texts = batch.map(c => c.text);
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts,
});
for (let j = 0; j < batch.length; j++) {
results.push({
...batch[j],
embedding: response.data[j].embedding,
});
}
}
return results;
}
// ── Step 4: Full pipeline ────────────────────────────────────────
async function processDocument(document) {
console.log(`Processing: ${document.filename}`);
// Chunk
const chunks = chunkDocument(document.text, 'recursive', {
chunkSize: 600,
overlap: 90,
});
console.log(` Created ${chunks.length} chunks`);
// Add metadata
const chunksWithMeta = chunks.map((chunk, i) => ({
...chunk,
metadata: {
source: document.filename,
chunkIndex: i,
totalChunks: chunks.length,
processedAt: new Date().toISOString(),
},
}));
// Embed
const embedded = await embedChunks(chunksWithMeta);
console.log(` Embedded ${embedded.length} chunks (${embedded[0].embedding.length} dims)`);
return embedded;
}
// Usage
const document = {
filename: 'react-docs.md',
text: '... (your document content) ...',
};
const processed = await processDocument(document);
// Now store `processed` in your vector database
12. Choosing the Right Strategy
Decision Flowchart:
What kind of content do you have?
│
├── Well-structured (headers, paragraphs, Markdown)
│ └── Use RECURSIVE CHARACTER SPLITTING
│ (chunkSize: 400-800, overlap: 10-15%)
│
├── Q&A pairs or FAQ
│ └── Use PARAGRAPH-BASED (each Q&A = one chunk)
│ No overlap needed
│
├── Long-form prose (articles, reports)
│ └── Use SENTENCE-BASED or RECURSIVE
│ (chunkSize: 300-500, overlap: 50-100)
│
├── Mixed content (code + prose)
│ └── Use RECURSIVE with code-aware separators
│ Add "\n```" and "\nfunction " to separator list
│
├── Critical accuracy needed (medical, legal)
│ └── Use SEMANTIC CHUNKING
│ Worth the extra cost for quality
│
└── Prototyping / just getting started
└── Use FIXED-SIZE (chunkSize: 500, overlap: 100)
Upgrade later when you have eval data
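The flowchart above can be encoded as a small lookup table. This is a sketch: `chunkingConfigFor` is a hypothetical helper, and the numbers are copied from this section's guidance, not a standard API.

```javascript
// Map content type → chunking config, following the decision flowchart above.
function chunkingConfigFor(contentType) {
  const configs = {
    structured: { strategy: 'recursive', chunkSize: 600, overlap: 80 },  // ~13% overlap
    faq:        { strategy: 'paragraph', chunkSize: 200, overlap: 0 },   // one Q&A per chunk
    prose:      { strategy: 'sentence',  chunkSize: 400, overlap: 80 },
    mixed:      {
      strategy: 'recursive', chunkSize: 600, overlap: 80,
      // Code-aware separators: break before fenced code blocks and functions
      separators: ['\n```', '\nfunction ', '\n\n', '\n', '. ', ' ', ''],
    },
    critical:   { strategy: 'semantic',  chunkSize: 500, overlap: 0 },
    prototype:  { strategy: 'fixed',     chunkSize: 500, overlap: 100 },
  };
  return configs[contentType] || configs.prototype; // Default: start simple
}

console.log(chunkingConfigFor('mixed').strategy);   // 'recursive'
console.log(chunkingConfigFor('unknown').strategy); // 'fixed'
```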
13. Key Takeaways
- Always chunk documents before embedding — one vector per chunk, not one per document.
- Recursive character splitting is the industry default — it respects natural boundaries and handles mixed content.
- Chunk size sweet spot is 200-500 tokens for most applications — too small loses context, too large dilutes relevance.
- 10-15% overlap between chunks preserves context at boundaries without excessive duplication.
- Attach metadata to every chunk — source, page, section, timestamp. You'll need it for filtering and citation.
- Semantic chunking produces the best results but costs more — use it when accuracy justifies the cost.
- Always evaluate your chunking strategy against real queries — the right chunk size depends on your specific data and use case.
Explain-It Challenge
- A user says "I embedded my entire 200-page manual as one vector and search results are terrible." Explain why, and describe how chunking fixes it.
- You're building a RAG system for legal contracts. Each contract is 50 pages. What chunking strategy and chunk size would you choose, and why?
- Two adjacent chunks cover related information. A user's query needs context from both chunks. How does overlap help, and what happens without it?
Navigation: ← 4.11.b — Similarity Search · 4.11 Overview