Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
Interview Questions: Understanding Embeddings
Model answers for embeddings, similarity search, and document chunking — the foundational concepts behind RAG pipelines and semantic search.
How to use this material (instructions)
- Read lessons in order — README.md, then 4.11.a → 4.11.c.
- Practice out loud — definition → example → pitfall.
- Pair with exercises — 4.11-Exercise-Questions.md.
- Quick review — 4.11-Quick-Revision.md.
Beginner (Q1–Q4)
Q1. What is an embedding and why do we need them?
Why interviewers ask: Tests whether you understand the bridge between text and mathematical operations — the foundation of all vector search and RAG systems.
Model answer:
An embedding is a fixed-length array of floating-point numbers (a vector) that represents the semantic meaning of text. When you pass a sentence to an embedding model like OpenAI's text-embedding-3-small, it returns a vector of 1536 numbers. These numbers are coordinates in a high-dimensional space where similar meanings land close together.
We need embeddings because computers can't directly compare the "meaning" of two sentences. With embeddings, comparing meaning becomes a math problem: calculate the cosine similarity between two vectors. A similarity of 0.92 means the texts are about the same thing; 0.15 means they're unrelated. This enables semantic search (finding documents by meaning, not keywords), RAG pipelines (retrieving relevant context for LLMs), duplicate detection, and recommendation systems.
The key distinction: embedding models convert text to numbers, while LLMs (generation models) convert text to more text. Embeddings are ~100x cheaper than LLM calls and orders of magnitude faster.
Q2. What is cosine similarity and why is it the standard for embedding search?
Why interviewers ask: Every production vector search system uses cosine similarity — understanding why tests your grasp of the underlying math.
Model answer:
Cosine similarity measures the angle between two vectors, returning a value between -1 and 1. It answers: "are these two vectors pointing in the same direction?" — regardless of their length.
It's the standard for embedding search for three reasons:
1. Scale-invariant: A short tweet and a long article about the same topic produce vectors of the same normalized length (OpenAI embeddings are unit-normalized), so cosine similarity only compares the direction (meaning), not the magnitude. This is exactly what we want.
2. Bounded range: The -1 to 1 range makes it easy to set thresholds. In practice, most embeddings fall between 0 and 1, with 0.8+ indicating strong similarity.
3. Computational efficiency: For normalized vectors (length = 1), cosine similarity simplifies to the dot product — just multiply corresponding dimensions and sum. This is the fastest possible comparison, which matters when searching millions of vectors.
The formula: cos(A, B) = (A · B) / (|A| × |B|). For normalized vectors: cos(A, B) = A · B.
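The formula can be checked in a few lines of plain Python; the toy 2-D vectors below stand in for real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A · B) / (|A| × |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction, twice the length
print(cosine_similarity(a, b))  # 1.0: scale-invariant

# For unit-normalized vectors (as OpenAI embeddings are),
# cosine similarity reduces to a plain dot product.
a_unit = [x / 5.0 for x in a]   # |a| = 5
b_unit = [x / 10.0 for x in b]  # |b| = 10
print(sum(x * y for x, y in zip(a_unit, b_unit)))  # ≈ 1.0
```

Note how doubling a vector's length leaves the similarity unchanged: only direction matters.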
Q3. Why do you need to chunk documents before embedding them?
Why interviewers ask: Chunking is arguably the most impactful design decision in a RAG pipeline — getting it wrong cripples retrieval quality.
Model answer:
Chunking is necessary for three practical reasons:
1. Token limits: Embedding models have input limits (8191 tokens for OpenAI). A 50-page document exceeds this — you must split it.
2. Precision: Embedding an entire document produces a single vector that represents the average meaning of everything in it. If page 27 discusses password resets and the other 49 pages discuss unrelated topics, the vector is "about everything" and matches nothing precisely. Chunking into smaller pieces (200-500 tokens) produces focused vectors that match specific queries accurately.
3. RAG efficiency: In RAG, retrieved chunks are injected into the LLM's prompt. The LLM's context window is limited. Five focused 200-word chunks give the LLM relevant, targeted information. One 5000-word document wastes context on irrelevant content.
The standard approach is recursive character splitting (used by LangChain) — it tries paragraph breaks first, then line breaks, then sentences, then words, ensuring the best natural boundary for each split. Typical chunk size is 200-500 tokens with 10-15% overlap to preserve context across boundaries.
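The recursive idea can be sketched in a few lines. This is a simplified illustration, not LangChain's actual implementation, and character counts stand in for tokens:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Simplified sketch of recursive character splitting:
    try the coarsest separator first, fall back to finer ones."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate  # greedily pack pieces into one chunk
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece that is still too long.
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, separators)]
    # No separator worked: hard-cut as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Splitting at paragraph boundaries first means most chunks end at a natural break; the hard cut only fires on text with no separators at all.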
Q4. Explain the difference between semantic search and keyword search.
Why interviewers ask: This is the "why embeddings matter" question — it tests whether you understand the practical value proposition.
Model answer:
Keyword search matches exact words. The query "how to fix a bug" searches for documents containing the words "fix," "bug," etc. If a document says "debugging techniques for software errors," keyword search misses it entirely — despite being exactly what the user wants — because no words overlap.
Semantic search matches meaning. Both the query and documents are converted to embedding vectors. The query "how to fix a bug" and the document "debugging techniques for software errors" produce similar vectors (cosine similarity ~0.85) because the embedding model learned that "fix a bug" and "debugging software errors" mean the same thing.
In practice, most production systems use hybrid search — combining both approaches. Keyword search provides precision on exact terms (product names, error codes, IDs), while semantic search provides recall on natural language queries. A typical weighting is 70% semantic + 30% keyword, though this varies by domain.
Semantic search requires more infrastructure (embedding model, vector database) and has ongoing costs (API calls for embedding), but it dramatically improves search quality for natural language queries.
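A minimal sketch of the hybrid blending step, assuming both scores have already been normalized to [0, 1] (the function name and weights are illustrative):

```python
def hybrid_score(semantic, keyword, semantic_weight=0.7):
    """Blend semantic and keyword scores (both assumed normalized to [0, 1])
    with the typical 70/30 weighting."""
    return semantic_weight * semantic + (1 - semantic_weight) * keyword

# A document matching by meaning but sharing no exact words still ranks well:
print(hybrid_score(semantic=0.85, keyword=0.0))  # ≈ 0.595
# An exact product-code match with weak semantic overlap survives too:
print(hybrid_score(semantic=0.30, keyword=1.0))  # ≈ 0.51
```

Production systems often use rank fusion instead of raw score blending, but the weighted-sum version shows the trade-off clearly.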
Intermediate (Q5–Q8)
Q5. How do you choose the right embedding model and dimensionality?
Why interviewers ask: Shows you can make practical engineering trade-offs — not just "use the biggest model."
Model answer:
The decision involves four trade-offs: quality, cost, speed, and storage.
Model selection:
- text-embedding-3-small (1536 dims, $0.02/1M tokens): Best default for most applications. Good quality, very cheap, fast.
- text-embedding-3-large (3072 dims, $0.13/1M tokens): Use when accuracy is critical and budget allows — legal, medical, high-stakes search.
- Open-source models (e.g., sentence-transformers): Use when you need to run locally (privacy requirements, no API calls, offline systems).
Dimensionality: OpenAI's v3 models support Matryoshka embeddings — you can request fewer dimensions. Going from 1536 to 512 dimensions saves 67% storage with ~4% quality loss. Going to 256 saves 83% with ~8% quality loss. The sweet spot for storage-constrained systems is usually 512 dimensions.
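Client-side, Matryoshka truncation is just slicing plus re-normalization. A plain-Python sketch (OpenAI's API can also do this server-side via its `dimensions` parameter; the function name here is illustrative):

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` values,
    then re-normalize to unit length so dot-product search still works."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# A toy 8-dim "embedding" truncated to 4 dims stays unit length:
v = truncate_embedding([0.5] * 8, dims=4)
print(len(v), sum(x * x for x in v))  # 4 1.0
```

Re-normalizing is the step people forget: without it, similarity scores from truncated vectors are no longer comparable.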
Decision framework: Start with text-embedding-3-small at full 1536 dimensions. Build an evaluation set (50+ queries with known relevant documents). Measure retrieval quality. Only switch models or reduce dimensions if you have data showing the current setup is insufficient or if storage/cost constraints force it.
Critical rule: Once you pick a model, you're locked in. All documents and all queries must use the same model. Switching means re-embedding everything.
Q6. Walk me through designing a chunking strategy for a production RAG system.
Why interviewers ask: Tests your ability to make end-to-end design decisions with real-world constraints.
Model answer:
I follow a five-step process:
1. Analyze the content: What are the documents? Well-structured docs with headers (use recursive splitting). Q&A pairs (use paragraph-based). Dense legal text (use larger chunks with more overlap). Mixed code + prose (add code-aware separators).
2. Choose chunk size: Start with 400 tokens as a baseline. The trade-off is precision vs context. Smaller chunks (200) give more precise matches but may lack context. Larger chunks (800+) provide more context but dilute the vector's specificity. The right size depends on query patterns — if users ask broad questions, go larger; if they ask specific questions, go smaller.
3. Set overlap: Start with 10-15% of chunk size (e.g., 60 tokens overlap for 400-token chunks). Overlap prevents information loss at chunk boundaries. Too much overlap (>25%) wastes storage and can cause duplicate results.
4. Attach metadata: Every chunk gets: source document, page/section, chunk index, timestamp, category, and any domain-specific fields. Metadata enables filtered search (e.g., "only search product docs") and citations in LLM responses.
5. Evaluate and iterate: Build a test set of 50+ queries with known relevant documents. Measure recall@5 (is the answer in the top 5 chunks?). Adjust chunk size, overlap, and strategy based on results. I've seen chunk size changes improve recall from 60% to 90%.
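Steps 2 and 3 amount to a sliding window. A sketch where list items stand in for tokens (the function name is illustrative):

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=60):
    """Sliding-window chunking: each chunk shares `overlap` tokens
    with the previous one so boundary context isn't lost."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 1000 tokens, 400-token chunks, 60-token overlap → 3 chunks
chunks = chunk_with_overlap(list(range(1000)))
print([len(c) for c in chunks])  # [400, 400, 320]
```

The last 60 tokens of each chunk reappear as the first 60 of the next, which is exactly the overlap described above.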
Q7. How do similarity thresholds work and how do you calibrate them?
Why interviewers ask: Practical knowledge — setting the wrong threshold either floods users with irrelevant results or returns nothing.
Model answer:
A similarity threshold is the minimum cosine similarity score a result must have to be returned. It acts as a quality gate.
The trade-off: High threshold (0.9+) = high precision, low recall — you only return very relevant results but miss some good ones. Low threshold (0.5) = low precision, high recall — you return everything remotely related but include noise.
Starting points by use case:
- Duplicate detection: 0.92+ (near-identical content)
- Factual Q&A / RAG: 0.78-0.85 (need relevant, accurate results)
- Exploratory search: 0.60-0.70 (cast a wide net)
Calibration process:
- Create a labeled evaluation set: 50+ queries, each with documents manually marked as relevant or irrelevant.
- Embed all queries and documents. Calculate similarity scores.
- Find the threshold that maximizes separation between relevant and irrelevant scores.
- If average relevant similarity is 0.82 and average irrelevant is 0.45, a threshold of ~0.65 cleanly separates them.
- Monitor in production — sample results and track user feedback. If users report irrelevant results, raise the threshold. If they report "no results found," lower it.
Important: Thresholds vary by embedding model, domain, and document type. A threshold calibrated for one model is meaningless for another.
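The midpoint heuristic from the calibration process above, as a toy sketch (the rule and the scores are illustrative; real calibration should sweep thresholds against the labeled set):

```python
def calibrate_threshold(relevant_scores, irrelevant_scores):
    """Pick the midpoint between the average relevant and average
    irrelevant similarity: a simple separation heuristic."""
    avg_rel = sum(relevant_scores) / len(relevant_scores)
    avg_irr = sum(irrelevant_scores) / len(irrelevant_scores)
    return (avg_rel + avg_irr) / 2

# Relevant averages ≈ 0.82, irrelevant ≈ 0.45 → threshold ≈ 0.635
print(calibrate_threshold([0.80, 0.84], [0.43, 0.47]))  # ≈ 0.635
```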
Q8. Explain approximate nearest neighbor (ANN) search and when you need it.
Why interviewers ask: Tests understanding of scalability — the difference between a prototype and a production system.
Model answer:
Brute-force search compares the query vector to every document vector. For 1,000 documents, this takes milliseconds — no problem. For 10 million documents with 1536 dimensions, it takes seconds — unacceptable for real-time search.
ANN algorithms build an index structure that enables finding the approximately nearest vectors without comparing to all of them. The key trade-off: they might miss the true #1 closest vector, but they'll find it in the top 5 with >95% probability, in milliseconds instead of seconds.
HNSW (Hierarchical Navigable Small World) is the most popular algorithm, used by Pinecone, Qdrant, Weaviate, and pgvector. It builds a multi-layered graph where each vector is connected to its nearest neighbors. Searching navigates this graph from coarse to fine resolution.
When to switch from brute-force to ANN:
- Under 50K documents: brute-force is fine (< 50ms)
- 50K-500K: consider ANN if latency matters
- Over 500K: ANN is mandatory for real-time applications
The practical answer: Use a vector database (Pinecone, Qdrant, Weaviate, pgvector). They handle ANN indexing, scaling, and optimization for you. Don't implement ANN from scratch unless you have a very specific reason.
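For scale, brute-force search is just a full scan over every vector. A plain-Python sketch (assumes unit-normalized vectors, so dot product equals cosine similarity):

```python
import heapq

def top_k(query, doc_vectors, k=5):
    """Brute-force nearest-neighbor search: score every document.
    Fine below ~50K vectors; beyond that, use an ANN index (e.g. HNSW)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = ((dot(query, v), i) for i, v in enumerate(doc_vectors))
    return heapq.nlargest(k, scored)  # [(score, doc_index), ...]

docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(top_k([1.0, 0.0], docs, k=2))  # [(1.0, 0), (0.6, 2)]
```

This O(n) scan is the baseline every ANN index is trying to beat while keeping recall high.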
Advanced (Q9–Q11)
Q9. Design a complete embedding pipeline for a company's knowledge base with 100,000 documents.
Why interviewers ask: Tests system design thinking end-to-end — not just individual concepts but how they fit together.
Model answer:
Architecture:
Document Sources (Confluence, PDFs, Notion, Slack) → Ingestion Pipeline (chunk, embed, store) → Vector Database (Pinecone/pgvector) → Search API (query, filter) → RAG (inject retrieved chunks into the LLM prompt)
Ingestion pipeline:
- Document loader: Extract text from PDFs, HTML, Markdown, etc. Handle encoding, tables, images (OCR if needed).
- Cleaning: Remove boilerplate (headers, footers, navigation), normalize whitespace, handle special characters.
- Chunking: Recursive character splitting, 400-token chunks, 60-token overlap. Add metadata: source URL, title, section, last-modified date, department.
- Embedding: Batch embed with text-embedding-3-small. 100K docs × 10 chunks avg = 1M chunks. At $0.02/1M tokens and ~200 tokens/chunk: $4 total. Process in batches of 100 with rate-limit handling.
- Storage: Upsert into vector database with vectors + metadata.
Search API:
- Accept natural language query + optional metadata filters (department, date range, source)
- Embed query with same model
- Search with top-k=10, similarity threshold=0.75
- Return chunks with metadata for citation
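The search steps above can be sketched as one function. This is a hypothetical in-memory version; the index layout, field names, and defaults are assumptions, and a real system would delegate scoring to the vector database:

```python
def search(query_vec, index, top_k=10, threshold=0.75, department=None):
    """Sketch of the Search API: optional metadata filter, similarity
    scoring, then the threshold quality gate and top-k cut.
    `index` is a list of dicts: {"vector": [...], "meta": {...}}."""
    def dot(a, b):  # vectors assumed unit-normalized
        return sum(x * y for x, y in zip(a, b))
    candidates = [
        item for item in index
        if department is None or item["meta"].get("department") == department
    ]
    scored = [(dot(query_vec, item["vector"]), item["meta"]) for item in candidates]
    scored = [(s, m) for s, m in scored if s >= threshold]  # quality gate
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]  # metadata travels with each hit for citations
```

Filtering on metadata before scoring is what makes queries like "only search product docs" cheap.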
Operational concerns:
- Re-indexing: When documents change, re-chunk and re-embed only changed documents. Track document versions.
- Monitoring: Log query latency, similarity score distribution, "no results" rate. Alert if average similarity drops.
- Model migration: If you change embedding models, you must re-embed everything. Plan for this — it takes ~30 minutes for 1M chunks.
Q10. How do you evaluate the quality of an embedding-based retrieval system?
Why interviewers ask: The hardest part of building RAG isn't building it — it's knowing if it's working well.
Model answer:
I measure retrieval quality with a three-layer evaluation framework:
Layer 1: Retrieval metrics (does the right chunk get retrieved?)
- Recall@k: What percentage of test queries have the correct document in the top-k results? Target: >90% at k=5.
- MRR (Mean Reciprocal Rank): Average of 1/rank-of-first-relevant-result. MRR=1.0 means the correct result is always #1.
- nDCG (Normalized Discounted Cumulative Gain): Measures whether higher-ranked results are more relevant than lower-ranked ones.
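Recall@k and MRR are each a few lines to compute. A sketch using hypothetical document IDs:

```python
def recall_at_k(results, relevant_id, k=5):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in results[:k])

def mrr(all_results, relevant_ids):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for results, rel in zip(all_results, relevant_ids):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant_ids)

# Two test queries: relevant doc ranked #1 and #2 → MRR = (1 + 0.5) / 2
print(mrr([["d1", "d2"], ["d9", "d2"]], ["d1", "d2"]))  # 0.75
```

Averaging recall_at_k over the whole eval set gives the Recall@k number to track against the >90% target.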
Layer 2: End-to-end RAG quality (does the LLM produce correct answers?)
- Faithfulness: Does the LLM's answer actually use the retrieved chunks? Or does it hallucinate?
- Answer relevance: Does the answer address the user's question?
- Context relevance: Are the retrieved chunks relevant to the question?
Tools: RAGAS, DeepEval, or custom eval scripts.
Layer 3: Production monitoring
- Similarity score distribution: Track the average top-1 similarity over time. A drop indicates content drift or degraded embedding quality.
- "No results" rate: How often does the system fail to find anything above threshold?
- User feedback signals: Thumbs up/down, click-through on recommended answers.
Building the eval set: Create 50-100 (query, expected_document) pairs. Include edge cases: typos, synonyms, ambiguous queries, queries with no good answer. Run evals after every change to chunking strategy, embedding model, or search parameters.
Q11. Explain the trade-offs between different chunking strategies and when each is optimal.
Why interviewers ask: Tests depth of understanding and ability to make nuanced engineering decisions.
Model answer:
The five main strategies sit on a complexity-vs-quality spectrum:
Fixed-size chunking — Split every N characters. Dead simple, predictable sizes, but cuts mid-sentence and mid-word. Use for: prototyping, logs, or data where boundaries don't matter. Never use for production RAG on natural language.
Sentence-based — Split at sentence boundaries, group N sentences per chunk. Respects language structure but sentence length varies wildly. A 3-word sentence and a 50-word sentence are treated equally. Use for: well-written articles with consistent sentence structure.
Paragraph-based — Split at double-newlines. Great for structured documents where each paragraph is a self-contained topic. Fails on unstructured text or documents without clear paragraph breaks. Use for: FAQs, structured Markdown docs.
Recursive character splitting — Try the best separator available (paragraphs > lines > sentences > words). The industry default because it handles mixed content gracefully. Use for: most production systems, especially mixed code/prose.
Semantic chunking — Embed each sentence, split where meaning changes. Produces the most coherent chunks but is expensive (requires an embedding call per sentence) and slower. Use for: high-stakes domains (medical, legal) where chunk quality directly impacts outcomes and the cost is justified.
The meta-principle: The "best" strategy depends entirely on your data and query patterns. I always start with recursive splitting, build an eval set, measure recall, then consider upgrading to semantic chunking only if recall is below target. Most teams over-invest in sophisticated chunking when simple recursive splitting + good chunk size tuning gets them 90% of the way.
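The semantic-chunking boundary rule can be shown with a toy sketch. The 2-D vectors stand in for real sentence embeddings, vectors are assumed unit-normalized (so dot product equals cosine similarity), and the threshold is illustrative:

```python
def semantic_split(sentence_vectors, threshold=0.8):
    """Toy semantic chunking: start a new chunk wherever the similarity
    between consecutive sentence embeddings drops below the threshold."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    boundaries = [0]
    for i in range(1, len(sentence_vectors)):
        if dot(sentence_vectors[i - 1], sentence_vectors[i]) < threshold:
            boundaries.append(i)  # topic shift: open a new chunk here
    return boundaries

# Sentences 0-1 share a topic, sentences 2-3 share another:
print(semantic_split([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]))  # [0, 2]
```

The per-sentence embedding calls are exactly where the extra cost of this strategy comes from.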
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | What dimensions does text-embedding-3-small produce? | 1536 floating-point numbers |
| 2 | Cosine similarity of two identical vectors? | 1.0 |
| 3 | Can you mix embeddings from different models? | No — incompatible vector spaces, results are meaningless |
| 4 | Ideal chunk size for most RAG systems? | 200-500 tokens (start at 400) |
| 5 | Cost to embed 1M tokens with text-embedding-3-small? | ~$0.02 |
| 6 | Embedding model token limit (OpenAI)? | 8,191 tokens per input |
| 7 | Overlap between chunks — typical percentage? | 10-15% of chunk size |
| 8 | Cosine similarity vs dot product for normalized vectors? | Identical — use dot product (faster) |
| 9 | What is semantic chunking? | Split where meaning changes, detected by embedding each sentence |
| 10 | When to use ANN over brute-force? | When you have >50K documents and need real-time search |
← Back to 4.11 — Understanding Embeddings (README)