Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
4.11.c — Document Chunking Strategies
In one sentence: Chunking is the process of splitting large documents into smaller, meaningful pieces before embedding them — because embedding models have token limits, and smaller chunks produce more precise and relevant search results than embedding an entire document as one giant vector.
Navigation: ← 4.11.b — Similarity Search · 4.11 Overview
1. Why You Must Chunk Documents
Embedding models have a token limit (8191 tokens for OpenAI's text-embedding-3 models, roughly 32,000 characters). But even if you could embed unlimited text, you shouldn't embed large documents as a single vector.
The Chunking Problem:
SCENARIO: You have a 50-page technical manual.
QUERY: "How do I reset the admin password?"
The answer is in paragraph 3 of page 27.
APPROACH 1: Embed the entire document as ONE vector
┌─────────────────────────────────────────────────────────┐
│ 50 pages of text → [single vector of 1536 numbers] │
│ │
│ Problem: The vector represents the AVERAGE meaning of │
│ all 50 pages. The password reset info is DILUTED by │
│ 48 pages of unrelated content. The vector is "about │
│ everything" which means it's "about nothing specific." │
│ │
│ Similarity to "reset admin password": ~0.45 (weak) │
└─────────────────────────────────────────────────────────┘
APPROACH 2: Chunk the document into paragraphs, embed each
┌─────────────────────────────────────────────────────────┐
│ Page 27, para 3: "To reset the admin password, │
│ navigate to Settings > Security > Password Reset..." │
│ → [vector that is SPECIFICALLY about password reset] │
│ │
│ Similarity to "reset admin password": ~0.91 (strong!) │
└─────────────────────────────────────────────────────────┘
Three reasons to chunk
1. TOKEN LIMITS
Embedding models cap at 8191 tokens (~32K characters).
Most real documents exceed this.
You MUST split to fit the model's input limit.
2. PRECISION
Smaller chunks = more focused vectors = better search results.
A 200-word chunk about password resets will match "how to
reset password" much better than a 10,000-word document
that mentions passwords once.
3. CONTEXT INJECTION
In RAG, you inject retrieved chunks into the LLM prompt.
The LLM's context window is limited. Injecting 5 focused
200-word chunks (1000 words) is far better than injecting
one 5000-word document where 90% is irrelevant.
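The token-limit reason can be made concrete with a quick back-of-the-envelope check. This sketch uses the rough heuristic of ~4 characters per token for English text (the constant names here are illustrative); for exact counts you'd use a real tokenizer such as tiktoken.

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic — use a real tokenizer (e.g. tiktoken) for exact counts.
const CHARS_PER_TOKEN = 4;
const EMBEDDING_TOKEN_LIMIT = 8191; // text-embedding-3 input limit

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function needsChunking(text, tokenLimit = EMBEDDING_TOKEN_LIMIT) {
  return estimateTokens(text) > tokenLimit;
}

// A 50-page manual at ~3000 characters per page:
const manual = 'x'.repeat(50 * 3000);
console.log(estimateTokens(manual)); // 37500 — far over the 8191 limit
console.log(needsChunking(manual));  // true
```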
2. Chunking Methods Overview
Chunking Strategy Comparison:
Strategy │ Complexity │ Quality │ Best For
──────────────────┼────────────┼─────────┼──────────────────────────
Fixed-size │ Low │ Basic │ Quick prototypes, logs
Sentence-based │ Low │ Good │ Articles, documentation
Paragraph-based │ Low │ Good │ Well-structured documents
Recursive char │ Medium │ Better │ Mixed content (LangChain)
Semantic │ High │ Best │ High-quality RAG systems
3. Fixed-Size Chunking
The simplest approach: split text into chunks of N characters (or N tokens), regardless of content boundaries.
Fixed-size chunking (chunk_size=200 characters):
Original text (600 characters):
"JavaScript was created in 1995 by Brendan Eich at Netscape.
It was originally called Mocha, then LiveScript, before being
renamed to JavaScript. Despite the name, JavaScript has no
direct relation to Java. The language was standardized as
ECMAScript. Today, JavaScript runs in browsers, servers
(Node.js), mobile apps, and even desktop applications. It
is one of the most popular programming languages in the world."
Chunk 1 (chars 0-199):
"JavaScript was created in 1995 by Brendan Eich at Netscape.
It was originally called Mocha, then LiveScript, before being
renamed to JavaScript. Despite the name, JavaScript has no di"
^^
Word cut in half!
Chunk 2 (chars 200-399):
"rect relation to Java. The language was standardized as
ECMAScript. Today, JavaScript runs in browsers, servers
(Node.js), mobile apps, and even desktop applications. It is"
Chunk 3 (chars 400-599):
" one of the most popular programming languages in the world."
Implementation
function fixedSizeChunk(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      text: text.slice(start, end),
      startIndex: start,
      endIndex: end,
    });
    if (end === text.length) break; // Avoid a trailing chunk that is pure overlap
    start += chunkSize - overlap;   // Move forward by chunkSize - overlap
  }
return chunks;
}
// Example
const text = 'A'.repeat(500) + 'B'.repeat(500) + 'C'.repeat(500);
const chunks = fixedSizeChunk(text, 400, 100);
console.log(`${chunks.length} chunks created`); // "5 chunks created"
// Chunks are 400 chars (the last is 300), each overlapping the previous by 100
Pros and cons
PROS:
✓ Dead simple to implement
✓ Predictable chunk sizes (good for token budgeting)
✓ Works on any text format
CONS:
✗ Cuts words and sentences in the middle
✗ Chunks may lack coherent meaning
✗ Important context can be split across chunks
✗ "JavaScript has no di" / "rect relation" — meaningless fragments
4. Sentence-Based Chunking
Split text at sentence boundaries, then group sentences into chunks of a target size.
Sentence-based chunking:
Original text split into sentences:
S1: "JavaScript was created in 1995 by Brendan Eich at Netscape."
S2: "It was originally called Mocha, then LiveScript."
S3: "Despite the name, JavaScript has no direct relation to Java."
S4: "The language was standardized as ECMAScript."
S5: "Today, JavaScript runs in browsers, servers, and mobile apps."
S6: "It is one of the most popular programming languages."
Group into chunks of ~3 sentences:
Chunk 1: S1 + S2 + S3 (coherent: history and naming)
Chunk 2: S4 + S5 + S6 (coherent: standardization and usage)
Each chunk is a complete, meaningful unit of text.
Implementation
function sentenceChunk(text, sentencesPerChunk = 3, overlapSentences = 1) {
// Split into sentences (simple regex — production systems use NLP libraries)
const sentences = text
.replace(/\n+/g, ' ')
.split(/(?<=[.!?])\s+/)
.filter(s => s.trim().length > 0);
  const chunks = [];
  // Guard: the step must be at least 1 or the loop never advances
  const step = Math.max(1, sentencesPerChunk - overlapSentences);
  for (let i = 0; i < sentences.length; i += step) {
const chunkSentences = sentences.slice(i, i + sentencesPerChunk);
if (chunkSentences.length === 0) break;
chunks.push({
text: chunkSentences.join(' '),
sentenceStart: i,
sentenceEnd: i + chunkSentences.length - 1,
sentenceCount: chunkSentences.length,
});
// Stop if we've reached the end
if (i + sentencesPerChunk >= sentences.length) break;
}
return chunks;
}
// Example
const article = `JavaScript was created in 1995 by Brendan Eich. It was built in just 10 days. The language was originally called Mocha. It was later renamed to LiveScript. Finally it became JavaScript. Today it powers most of the web.`;
const chunks = sentenceChunk(article, 3, 1);
chunks.forEach((c, i) => {
console.log(`Chunk ${i + 1} (${c.sentenceCount} sentences): "${c.text}"\n`);
});
// Chunk 1: "JavaScript was created in 1995 by Brendan Eich. It was built in just 10 days. The language was originally called Mocha."
// Chunk 2: "The language was originally called Mocha. It was later renamed to LiveScript. Finally it became JavaScript."
// Chunk 3: "Finally it became JavaScript. Today it powers most of the web."
Pros and cons
PROS:
✓ Never cuts mid-sentence
✓ Chunks are readable and coherent
✓ Respects natural language boundaries
CONS:
✗ Sentences vary wildly in length — chunk sizes are unpredictable
✗ Sentence detection is imperfect (abbreviations like "U.S." cause splits)
✗ Long sentences can exceed desired chunk size
✗ Doesn't respect paragraph or section boundaries
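The abbreviation problem in the cons list is easy to demonstrate with the same naive regex used in the implementation above: any "." followed by whitespace counts as a sentence boundary, so abbreviations produce false splits.

```javascript
// The naive sentence splitter from the implementation above:
// split after any ".", "!", or "?" followed by whitespace.
const splitSentences = s => s.split(/(?<=[.!?])\s+/);

console.log(splitSentences('The U.S. government funds NASA. It was founded in 1958.'));
// → ['The U.S.', 'government funds NASA.', 'It was founded in 1958.']
//    ^^^^^^^^^^ false boundary after the abbreviation "U.S."
```

Production systems sidestep this with NLP sentence tokenizers that know about abbreviations, initials, and decimal numbers.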
5. Paragraph-Based Chunking
Split text at paragraph boundaries (double newlines). Each paragraph is a natural unit of thought.
Paragraph-based chunking:
Original document:
────────────────────
"# Introduction ← Paragraph 1
JavaScript is the most popular...
# History ← Paragraph 2
Created in 1995, JavaScript was...
# Modern Usage ← Paragraph 3
Today JavaScript runs on servers..."
Chunks:
Chunk 1: "Introduction\nJavaScript is the most popular..."
Chunk 2: "History\nCreated in 1995, JavaScript was..."
Chunk 3: "Modern Usage\nToday JavaScript runs on servers..."
Implementation
function paragraphChunk(text, maxChunkSize = 2000, overlap = 0) {
// Split on double newlines (paragraph breaks)
const paragraphs = text
.split(/\n\s*\n/)
.map(p => p.trim())
.filter(p => p.length > 0);
const chunks = [];
let currentChunk = '';
for (const paragraph of paragraphs) {
// If adding this paragraph exceeds max size, save current and start new
if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({ text: currentChunk.trim() });
      // Start new chunk, optionally seeded with overlap from the previous one.
      // The '\n\n' separator is appended below, so don't add it here.
      currentChunk = overlap > 0 ? currentChunk.slice(-overlap) : '';
}
currentChunk += (currentChunk ? '\n\n' : '') + paragraph;
}
// Don't forget the last chunk
if (currentChunk.trim().length > 0) {
chunks.push({ text: currentChunk.trim() });
}
return chunks;
}
// Example
const doc = `JavaScript was created in 1995 by Brendan Eich at Netscape Communications. It was designed to add interactivity to web pages.

The language was originally called Mocha during development. It was briefly renamed to LiveScript before receiving its final name, JavaScript, as part of a marketing agreement with Sun Microsystems.

Today, JavaScript is one of the most popular programming languages in the world. It runs in browsers, on servers via Node.js, in mobile apps, and even in desktop applications.`;
const chunks = paragraphChunk(doc, 300);
chunks.forEach((c, i) => {
console.log(`Chunk ${i + 1} (${c.text.length} chars):\n"${c.text}"\n`);
});
Pros and cons
PROS:
✓ Respects natural document structure
✓ Each chunk is a coherent unit of thought
✓ Simple to implement
✓ Works well for articles, docs, and structured text
CONS:
✗ Paragraph sizes vary enormously (1 sentence to 50 sentences)
✗ Very short paragraphs produce low-information chunks
✗ Very long paragraphs may exceed embedding model limits
✗ Not all text has clear paragraph structure (code, logs, chat)
6. Recursive Character Splitting
The most popular strategy (used by LangChain). It tries a hierarchy of separators, splitting on the best available boundary.
Recursive Character Splitting — How It Works:
Separators (tried in order):
1. "\n\n" — paragraph break (best boundary)
2. "\n" — line break
3. ". " — sentence end
4. " " — word break
5. "" — character break (last resort)
Algorithm:
1. Try to split on "\n\n" (paragraphs)
2. If any chunk is still too large, split those on "\n"
3. If still too large, split on ". "
4. If still too large, split on " "
5. Never cut mid-character
This ensures the BEST possible split boundary is always used.
Implementation
function recursiveCharacterSplit(text, {
chunkSize = 1000,
chunkOverlap = 200,
separators = ['\n\n', '\n', '. ', ' ', ''],
} = {}) {
const chunks = [];
function splitRecursive(text, separatorIndex) {
// Base case: text fits in one chunk
if (text.length <= chunkSize) {
if (text.trim().length > 0) {
chunks.push(text.trim());
}
return;
}
// Try current separator
const separator = separators[separatorIndex];
if (separatorIndex >= separators.length - 1) {
// Last resort: hard cut at chunkSize
chunks.push(text.slice(0, chunkSize).trim());
const remaining = text.slice(chunkSize - chunkOverlap);
if (remaining.trim().length > 0) {
splitRecursive(remaining, 0); // Restart separator hierarchy
}
return;
}
const parts = text.split(separator).filter(p => p.trim().length > 0);
if (parts.length <= 1) {
// This separator didn't help — try the next one
splitRecursive(text, separatorIndex + 1);
return;
}
// Merge parts into chunks that fit within chunkSize
let currentChunk = '';
for (const part of parts) {
const combined = currentChunk
? currentChunk + separator + part
: part;
if (combined.length <= chunkSize) {
currentChunk = combined;
} else {
// Save current chunk
if (currentChunk.trim().length > 0) {
chunks.push(currentChunk.trim());
}
// If this single part is too large, recursively split it
if (part.length > chunkSize) {
splitRecursive(part, separatorIndex + 1);
currentChunk = '';
} else {
currentChunk = part;
}
}
}
// Don't forget the last chunk
if (currentChunk.trim().length > 0) {
chunks.push(currentChunk.trim());
}
}
splitRecursive(text, 0);
// Add overlap between chunks
if (chunkOverlap > 0 && chunks.length > 1) {
return addOverlap(chunks, chunkOverlap);
}
return chunks;
}
function addOverlap(chunks, overlapSize) {
const result = [chunks[0]];
for (let i = 1; i < chunks.length; i++) {
const prevChunk = chunks[i - 1];
const overlapText = prevChunk.slice(-overlapSize);
result.push(overlapText + ' ' + chunks[i]);
}
return result;
}
// Example
const document = `# Getting Started with React

React is a JavaScript library for building user interfaces. It was created by Facebook and is now maintained by Meta.

## Components

Components are the building blocks of React applications. Each component is a JavaScript function that returns JSX.

function Greeting({ name }) {
  return <h1>Hello, {name}!</h1>;
}

## State Management

State is data that changes over time. React provides the useState hook for managing state in functional components.

const [count, setCount] = useState(0);

This allows components to be interactive and respond to user actions.`;
const chunks = recursiveCharacterSplit(document, {
chunkSize: 300,
chunkOverlap: 50,
});
chunks.forEach((c, i) => {
console.log(`--- Chunk ${i + 1} (${c.length} chars) ---`);
console.log(c);
console.log();
});
Pros and cons
PROS:
✓ Respects natural boundaries (paragraphs > lines > sentences > words)
✓ Handles mixed content well (prose, code, headers)
✓ Predictable chunk sizes
✓ Industry standard (used by LangChain, LlamaIndex)
CONS:
✗ More complex implementation
✗ Still purely structural — doesn't understand content meaning
✗ Code blocks may get split at arbitrary points
✗ Separator hierarchy is a heuristic, not always optimal
7. Semantic Chunking
The most sophisticated approach: split text where the meaning changes, not where structural boundaries happen. Uses embeddings themselves to detect topic shifts.
Semantic Chunking — How It Works:
1. Split text into sentences
2. Embed each sentence
3. Compare consecutive sentence embeddings
4. Where similarity DROPS, insert a chunk boundary
Sentences: S1 S2 S3 S4 S5 S6 S7 S8
Similarity: 0.92 0.88 0.34 0.91 0.85 0.31 0.89
^^^^ ^^^^
Topic change! Topic change!
Chunks: [S1, S2, S3] | [S4, S5, S6] | [S7, S8]
Topic A Topic B Topic C
Implementation
import OpenAI from 'openai';
const openai = new OpenAI();
async function semanticChunk(text, {
similarityThreshold = 0.75,
minChunkSize = 100,
maxChunkSize = 2000,
} = {}) {
// Step 1: Split into sentences
const sentences = text
.replace(/\n+/g, ' ')
.split(/(?<=[.!?])\s+/)
.filter(s => s.trim().length > 10); // Filter out tiny fragments
if (sentences.length <= 1) {
return [text]; // Nothing to chunk
}
// Step 2: Embed all sentences in one batch
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: sentences,
});
const embeddings = response.data.map(d => d.embedding);
// Step 3: Calculate similarity between consecutive sentences
const similarities = [];
for (let i = 0; i < embeddings.length - 1; i++) {
similarities.push({
index: i,
similarity: cosineSimilarity(embeddings[i], embeddings[i + 1]),
});
}
// Step 4: Find breakpoints where similarity drops below threshold
const breakpoints = similarities
.filter(s => s.similarity < similarityThreshold)
.map(s => s.index + 1); // Break AFTER the low-similarity pair
// Step 5: Build chunks from breakpoints
const chunks = [];
let start = 0;
for (const breakpoint of breakpoints) {
const chunkText = sentences.slice(start, breakpoint).join(' ');
// Respect min/max chunk size
if (chunkText.length >= minChunkSize) {
chunks.push(chunkText);
start = breakpoint;
}
}
// Add remaining sentences as the last chunk
const lastChunk = sentences.slice(start).join(' ');
if (lastChunk.length >= minChunkSize) {
chunks.push(lastChunk);
} else if (chunks.length > 0) {
// Merge tiny last chunk with previous
chunks[chunks.length - 1] += ' ' + lastChunk;
} else {
chunks.push(lastChunk);
}
return chunks;
}
function cosineSimilarity(a, b) {
let dot = 0;
for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // OpenAI embeddings are unit-length, so the dot product equals cosine similarity
}
// Usage
const text = `
React is a JavaScript library for building user interfaces. It uses a virtual DOM for efficient updates. Components are the building blocks of React apps. Each component manages its own state and lifecycle.
Machine learning is a subset of artificial intelligence. It involves training models on data to make predictions. Neural networks are a popular machine learning technique. Deep learning uses many layers of neural networks.
PostgreSQL is a powerful relational database. It supports complex queries and ACID transactions. Indexes improve query performance significantly. Foreign keys maintain referential integrity between tables.
`;
const chunks = await semanticChunk(text, { similarityThreshold: 0.7 });
chunks.forEach((c, i) => {
console.log(`--- Semantic Chunk ${i + 1} ---`);
console.log(c);
console.log();
});
// Chunk 1: React content (sentences about React cluster together)
// Chunk 2: ML content (sentences about ML cluster together)
// Chunk 3: Database content (sentences about PostgreSQL cluster together)
Pros and cons
PROS:
✓ Chunks are semantically coherent (each chunk = one topic)
✓ Produces the highest-quality chunks for RAG
✓ Adapts to content — no fixed rules about where to split
✓ Works well for documents that mix topics
CONS:
✗ Expensive — requires embedding every sentence (API calls)
✗ Slower — can't chunk locally without a model
✗ Complex to implement and tune
✗ Similarity threshold needs calibration per domain
✗ Overkill for well-structured documents (headers/paragraphs work fine)
8. Chunk Size: The Critical Trade-off
Chunk size is the single most important decision in your chunking strategy. Too small and you lose context. Too large and you dilute relevance.
The Chunk Size Spectrum:
TOO SMALL (50-100 tokens)
┌──────────────────────────────────────────────┐
│ "To reset the password" │
│ │
│ Problem: No context! Reset WHICH password? │
│ For which system? What are the steps? │
│ The embedding is too vague to be useful. │
└──────────────────────────────────────────────┘
JUST RIGHT (200-500 tokens)
┌──────────────────────────────────────────────┐
│ "To reset the admin password in the │
│ dashboard, navigate to Settings > Security │
│ > Password Reset. Enter your current │
│ password, then your new password twice. │
│ Passwords must be at least 12 characters │
│ and include a number and special character." │
│ │
│ Perfect: Enough context to be specific, │
│ focused enough to match relevant queries. │
└──────────────────────────────────────────────┘
TOO LARGE (2000+ tokens)
┌──────────────────────────────────────────────┐
│ [Entire 'Security Settings' chapter] │
│ Password reset + 2FA setup + API keys + │
│ session management + audit logs + ... │
│ │
│ Problem: The embedding averages ALL these │
│ topics. A query about "password reset" │
│ gets a weak match because the vector is │
│ diluted by unrelated security content. │
└──────────────────────────────────────────────┘
Guidelines by content type
| Content Type | Recommended Chunk Size | Overlap | Why |
|---|---|---|---|
| FAQ / Q&A pairs | 100-200 tokens | 0 | Each Q&A is self-contained |
| Technical docs | 200-400 tokens | 50-100 | Need enough context for procedures |
| Articles / blog posts | 300-500 tokens | 50-100 | Balanced detail and focus |
| Legal / medical docs | 400-600 tokens | 100-150 | Precision matters, need full context |
| Code documentation | 200-400 tokens | 50 | Function-level chunks work well |
| Chat logs | 100-300 tokens | 0 | Per-message or per-exchange |
| Books / long reports | 500-1000 tokens | 100-200 | Section-level chunks |
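The table above gives sizes in tokens, while the splitters in this section measure chunk size in characters. A small bridge, reusing the rough ~4 characters-per-token heuristic (the function name and example values are illustrative, not from a standard API):

```javascript
// Convert a token budget into a character budget for the
// character-based splitters in this section (~4 chars/token heuristic).
function tokensToChars(tokens, charsPerToken = 4) {
  return tokens * charsPerToken;
}

// Example: the "Technical docs" row — 200-400 tokens with 50-100 token overlap
const config = {
  chunkSize: tokensToChars(400),   // 1600 characters
  chunkOverlap: tokensToChars(50), // 200 characters
};
console.log(config); // { chunkSize: 1600, chunkOverlap: 200 }
```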
Measuring the impact of chunk size
// Experiment: test different chunk sizes on your data
async function evaluateChunkSizes(document, testQueries, chunkSizes) {
const results = {};
for (const size of chunkSizes) {
// Chunk the document
const chunks = recursiveCharacterSplit(document, {
chunkSize: size,
chunkOverlap: Math.floor(size * 0.15), // 15% overlap
});
// Embed all chunks
const embeddedChunks = await embedDocuments(chunks);
// Test each query
let totalRelevanceScore = 0;
for (const { query, expectedContent } of testQueries) {
const queryEmb = await getEmbedding(query);
const topChunk = embeddedChunks
.map(c => ({
text: c.text,
sim: cosineSimilarity(queryEmb, c.embedding),
}))
.sort((a, b) => b.sim - a.sim)[0];
// Check if the top result contains the expected content
const containsExpected = topChunk.text.includes(expectedContent);
totalRelevanceScore += containsExpected ? 1 : 0;
}
results[size] = {
chunkCount: chunks.length,
avgChunkLength: Math.round(chunks.reduce((a, c) => a + c.length, 0) / chunks.length),
relevanceScore: totalRelevanceScore / testQueries.length,
};
}
console.table(results);
return results;
}
// Run the experiment
await evaluateChunkSizes(myDocument, myTestQueries, [100, 200, 400, 800, 1600]);
// Typical result: 200-400 wins for most document types
9. Overlap Between Chunks
Overlap means that the end of one chunk is repeated at the beginning of the next chunk. This prevents important context from being lost at chunk boundaries.
Without overlap (chunk_size=200):
Chunk 1: "...React uses a virtual DOM for efficient rendering."
Chunk 2: "It batches updates and minimizes real DOM operations..."
If a query asks about "virtual DOM efficiency," chunk 1 has
"virtual DOM" and chunk 2 has "efficient" context. Neither
chunk alone has the full picture.
With overlap (chunk_size=200, overlap=50):
Chunk 1: "...React uses a virtual DOM for efficient rendering."
Chunk 2: "for efficient rendering. It batches updates and..."
^^^^^^^^^^^^^^^^^^^^^^^^
Overlapping text — appears in BOTH chunks
Now chunk 2 also contains "efficient rendering," preserving
context across the boundary.
How much overlap?
Overlap Guidelines:
0% overlap:
✓ No duplicate content in database
✓ Lower storage cost
✗ Context lost at boundaries
10-15% overlap (recommended starting point):
✓ Good balance of context preservation and storage
✓ Works well for most content types
Example: chunk_size=400, overlap=50
20-30% overlap:
✓ Maximum context preservation
✗ Significant duplicate content
✗ Higher storage cost
✗ May return near-duplicate results
Example: chunk_size=400, overlap=100
>30% overlap:
✗ Diminishing returns
✗ Lots of wasted storage
✗ Search results will have many near-duplicates
Implementation with deduplication
function chunkWithOverlap(text, chunkSize = 400, overlap = 60) {
const chunks = [];
let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        startChar: start,
        endChar: end,
        chunkIndex: chunks.length,
      });
    }
    if (end === text.length) break; // Avoid a trailing chunk that is pure overlap
    // Safety: ensure we're making progress before advancing
    if (chunkSize - overlap <= 0) break;
    start += chunkSize - overlap;   // Move forward by (chunkSize - overlap)
  }
return chunks;
}
// When searching, deduplicate overlapping results
function deduplicateResults(results, overlapThreshold = 0.95) {
const unique = [];
for (const result of results) {
const isDuplicate = unique.some(
u => cosineSimilarity(u.embedding, result.embedding) > overlapThreshold
);
if (!isDuplicate) {
unique.push(result);
}
}
return unique;
}
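A quick sanity check of the deduplication logic with toy 2-D unit vectors (hypothetical data — real embeddings have 1536 dimensions, but the math is identical). The helpers are restated here so the snippet runs on its own:

```javascript
// Dot product of unit-length vectors = cosine similarity
function cosineSimilarity(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

function deduplicateResults(results, overlapThreshold = 0.95) {
  const unique = [];
  for (const result of results) {
    const isDuplicate = unique.some(
      u => cosineSimilarity(u.embedding, result.embedding) > overlapThreshold
    );
    if (!isDuplicate) unique.push(result);
  }
  return unique;
}

// Two near-identical vectors (from overlapping chunks) and one unrelated one
const results = [
  { text: 'chunk A',      embedding: [1, 0] },
  { text: 'chunk A copy', embedding: [0.999, 0.0447] }, // cos ≈ 0.999 → dropped
  { text: 'chunk B',      embedding: [0, 1] },          // cos = 0     → kept
];
console.log(deduplicateResults(results).map(r => r.text));
// → ['chunk A', 'chunk B']
```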
10. Metadata: What to Attach to Each Chunk
Raw text chunks are not enough. You need metadata to filter, sort, and provide context when chunks are retrieved.
Chunk with metadata:
{
text: "To reset the admin password, navigate to Settings > Security...",
metadata: {
source: "admin-guide.pdf", // Where it came from
page: 27, // Page number
section: "Security Settings", // Section/chapter
chunkIndex: 42, // Position in document
totalChunks: 156, // Total chunks in source
createdAt: "2025-03-15T10:00:00Z", // When embedded
documentVersion: "2.1", // Document version
category: "admin", // Content category
language: "en", // Language
tokenCount: 312, // Exact token count
}
}
Why metadata matters
Scenario: User asks "How do I reset the admin password?"
Without metadata:
→ Returns chunk: "To reset the password, click Settings > Security..."
→ But WHICH product? WHICH version? Is this outdated?
With metadata:
→ Returns chunk: "To reset the admin password..."
→ Metadata: source="admin-guide-v2.1.pdf", page=27, updatedAt="2025-03"
→ You can show: "From Admin Guide v2.1, page 27 (updated March 2025)"
→ You can filter: only search documents where category="admin"
Implementation
function chunkDocumentWithMetadata(document) {
const { text, filename, category, version, createdAt } = document;
// Chunk the text
const rawChunks = recursiveCharacterSplit(text, {
chunkSize: 400,
chunkOverlap: 60,
});
// Attach metadata to each chunk
return rawChunks.map((chunkText, index) => ({
text: chunkText,
embedding: null, // Will be filled by embedding step
metadata: {
source: filename,
category: category,
version: version,
chunkIndex: index,
totalChunks: rawChunks.length,
charCount: chunkText.length,
createdAt: createdAt || new Date().toISOString(),
// Helpful for RAG: prepend context to help the LLM
        contextHeader: `[Source: ${filename}, Section ${index + 1}/${rawChunks.length}]`,
},
}));
}
// When retrieved, inject metadata as context for the LLM
function formatChunkForPrompt(chunk) {
return `${chunk.metadata.contextHeader}\n${chunk.text}`;
}
// In your RAG pipeline:
const relevantChunks = await vectorSearch(query, topK);
const context = relevantChunks
.map(formatChunkForPrompt)
.join('\n\n---\n\n');
const prompt = `Based on the following context, answer the question.
Context:
${context}
Question: ${query}`;
Metadata-filtered search
// Search with metadata filters (pre-filter before similarity)
async function filteredSearch(query, filters = {}, topK = 5) {
const queryEmb = await getEmbedding(query);
let candidates = allDocuments;
// Apply metadata filters BEFORE similarity search
if (filters.category) {
candidates = candidates.filter(d => d.metadata.category === filters.category);
}
if (filters.source) {
candidates = candidates.filter(d => d.metadata.source === filters.source);
}
if (filters.after) {
candidates = candidates.filter(d => d.metadata.createdAt > filters.after);
}
// Now do similarity search on the filtered set
const results = candidates
.map(doc => ({
...doc,
similarity: cosineSimilarity(queryEmb, doc.embedding),
}))
.sort((a, b) => b.similarity - a.similarity)
.slice(0, topK);
return results;
}
// Usage: Only search admin docs from 2025
const results = await filteredSearch('reset password', {
category: 'admin',
after: '2025-01-01',
}, 3);
11. Complete Chunking Pipeline
Here's a full pipeline that puts it all together:
import OpenAI from 'openai';
const openai = new OpenAI();
// ── Step 1: Load and clean the document ──────────────────────────
function cleanText(text) {
return text
.replace(/\r\n/g, '\n') // Normalize line endings
.replace(/\t/g, ' ') // Tabs to spaces
.replace(/ +/g, ' ') // Collapse multiple spaces
.replace(/\n{3,}/g, '\n\n') // Max 2 consecutive newlines
.trim();
}
// ── Step 2: Chunk the document ───────────────────────────────────
function chunkDocument(text, strategy = 'recursive', options = {}) {
const cleaned = cleanText(text);
switch (strategy) {
case 'fixed':
return fixedSizeChunk(cleaned, options.chunkSize || 1000, options.overlap || 200);
case 'sentence':
return sentenceChunk(cleaned, options.sentencesPerChunk || 5, options.overlapSentences || 1);
case 'paragraph':
return paragraphChunk(cleaned, options.maxChunkSize || 1500, options.overlap || 0);
case 'recursive':
default:
return recursiveCharacterSplit(cleaned, {
chunkSize: options.chunkSize || 800,
chunkOverlap: options.overlap || 120,
}).map(text => ({ text }));
}
}
// ── Step 3: Embed all chunks ─────────────────────────────────────
async function embedChunks(chunks, batchSize = 100) {
const results = [];
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const texts = batch.map(c => c.text);
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts,
});
for (let j = 0; j < batch.length; j++) {
results.push({
...batch[j],
embedding: response.data[j].embedding,
});
}
}
return results;
}
// ── Step 4: Full pipeline ────────────────────────────────────────
async function processDocument(document) {
console.log(`Processing: ${document.filename}`);
// Chunk
const chunks = chunkDocument(document.text, 'recursive', {
chunkSize: 600,
overlap: 90,
});
console.log(` Created ${chunks.length} chunks`);
// Add metadata
const chunksWithMeta = chunks.map((chunk, i) => ({
...chunk,
metadata: {
source: document.filename,
chunkIndex: i,
totalChunks: chunks.length,
processedAt: new Date().toISOString(),
},
}));
// Embed
const embedded = await embedChunks(chunksWithMeta);
console.log(` Embedded ${embedded.length} chunks (${embedded[0].embedding.length} dims)`);
return embedded;
}
// Usage
const document = {
filename: 'react-docs.md',
text: '... (your document content) ...',
};
const processed = await processDocument(document);
// Now store `processed` in your vector database
12. Choosing the Right Strategy
Decision Flowchart:
What kind of content do you have?
│
├── Well-structured (headers, paragraphs, Markdown)
│ └── Use RECURSIVE CHARACTER SPLITTING
│ (chunkSize: 400-800, overlap: 10-15%)
│
├── Q&A pairs or FAQ
│ └── Use PARAGRAPH-BASED (each Q&A = one chunk)
│ No overlap needed
│
├── Long-form prose (articles, reports)
│ └── Use SENTENCE-BASED or RECURSIVE
│ (chunkSize: 300-500, overlap: 50-100)
│
├── Mixed content (code + prose)
│ └── Use RECURSIVE with code-aware separators
│ Add "\n```" and "\nfunction " to separator list
│
├── Critical accuracy needed (medical, legal)
│ └── Use SEMANTIC CHUNKING
│ Worth the extra cost for quality
│
└── Prototyping / just getting started
└── Use FIXED-SIZE (chunkSize: 500, overlap: 100)
Upgrade later when you have eval data
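The flowchart above can be encoded as a small lookup table. This is a sketch: `chunkingConfigFor` is a hypothetical helper, and the numbers are copied from this section's guidance, not a standard API.

```javascript
// Map content type → chunking config, following the decision flowchart above.
function chunkingConfigFor(contentType) {
  const configs = {
    structured: { strategy: 'recursive', chunkSize: 600, overlap: 80 },  // ~13% overlap
    faq:        { strategy: 'paragraph', chunkSize: 200, overlap: 0 },   // one Q&A per chunk
    prose:      { strategy: 'sentence',  chunkSize: 400, overlap: 80 },
    mixed:      {
      strategy: 'recursive', chunkSize: 600, overlap: 80,
      // Code-aware separators: break before fenced code blocks and functions
      separators: ['\n```', '\nfunction ', '\n\n', '\n', '. ', ' ', ''],
    },
    critical:   { strategy: 'semantic',  chunkSize: 500, overlap: 0 },
    prototype:  { strategy: 'fixed',     chunkSize: 500, overlap: 100 },
  };
  return configs[contentType] || configs.prototype; // Default: start simple
}

console.log(chunkingConfigFor('mixed').strategy);   // 'recursive'
console.log(chunkingConfigFor('unknown').strategy); // 'fixed'
```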
13. Key Takeaways
- Always chunk documents before embedding — one vector per chunk, not one per document.
- Recursive character splitting is the industry default — it respects natural boundaries and handles mixed content.
- Chunk size sweet spot is 200-500 tokens for most applications — too small loses context, too large dilutes relevance.
- 10-15% overlap between chunks preserves context at boundaries without excessive duplication.
- Attach metadata to every chunk — source, page, section, timestamp. You'll need it for filtering and citation.
- Semantic chunking produces the best results but costs more — use it when accuracy justifies the cost.
- Always evaluate your chunking strategy against real queries — the right chunk size depends on your specific data and use case.
Explain-It Challenge
- A user says "I embedded my entire 200-page manual as one vector and search results are terrible." Explain why, and describe how chunking fixes it.
- You're building a RAG system for legal contracts. Each contract is 50 pages. What chunking strategy and chunk size would you choose, and why?
- Two adjacent chunks cover related information. A user's query needs context from both chunks. How does overlap help, and what happens without it?
Navigation: ← 4.11.b — Similarity Search · 4.11 Overview