Episode 4 — Generative AI Engineering / 4.11 — Understanding Embeddings
4.11 — Exercise Questions: Understanding Embeddings
Practice questions for all three subtopics in Section 4.11. Mix of conceptual, calculation, code, and hands-on tasks.
How to use this material (instructions)
- Read lessons in order — README.md, then 4.11.a → 4.11.c.
- Answer closed-book first — then compare to the matching lesson.
- Build the code examples — run them to verify your understanding.
- Interview prep — 4.11-Interview-Questions.md.
- Quick review — 4.11-Quick-Revision.md.
4.11.a — What Embeddings Represent (Q1–Q12)
Q1. Define what an embedding is in one sentence. How is it different from a token ID?
Q2. What does it mean when we say two vectors are "close together" in embedding space? Give a concrete pair of sentences that would be close and a pair that would be far apart.
Q3. OpenAI's text-embedding-3-small produces 1536-dimensional vectors. What does each dimension represent? Can you label dimension #742?
Q4. Explain the difference between an embedding model and a generation model (LLM). What does each take as input and produce as output?
Q5. Why do the sentences "Dog bites man" and "Man bites dog" produce different embeddings, even though they contain the same words? What does this tell you about modern embedding models?
Q6. Calculation: If each dimension is stored as a 32-bit float (4 bytes), how much storage does a single text-embedding-3-large vector (3072 dimensions) require? How much for 5 million vectors?
Q7. You have a database of documents embedded with text-embedding-ada-002. You want to switch to text-embedding-3-small for better quality. Can you embed only new documents with the new model while keeping old embeddings? Why or why not?
Q8. Code: Write a JavaScript function using the OpenAI API that takes an array of strings and returns their embeddings. Include error handling for empty strings and rate limits.
Q9. Explain Matryoshka embeddings (dimension reduction). If text-embedding-3-small supports reducing from 1536 to 256 dimensions, what trade-off are you making? When would you use reduced dimensions?
Q10. Why does embedding the text "Click here for more info" produce a less useful vector than embedding "React useState hook manages component state in functional components"? What can you do to improve the quality of low-information text before embedding?
Q11. The classic embedding arithmetic example is: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). What relationship has the model learned? Give another example of embedding arithmetic that should work.
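The arithmetic in Q11 can be illustrated with tiny hand-crafted vectors. This is a toy sketch only — the values below are invented for demonstration (real embeddings have hundreds of uninterpretable dimensions), with the three axes loosely meaning [royalty, male-ness, female-ness]:

```javascript
// Toy illustration of embedding arithmetic. These 3-d vectors are
// hand-crafted so that king - man + woman lands exactly on queen;
// real embedding spaces only approximate this.
const vectors = {
  king:  [0.9, 0.9, 0.1],
  man:   [0.1, 0.9, 0.1],
  woman: [0.1, 0.1, 0.9],
  queen: [0.9, 0.1, 0.9],
  apple: [0.0, 0.1, 0.1],
};

const sub = (a, b) => a.map((x, i) => x - b[i]);
const add = (a, b) => a.map((x, i) => x + b[i]);

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// king - man + woman = [0.9, 0.1, 0.9]
const result = add(sub(vectors.king, vectors.man), vectors.woman);

// Nearest known word to the result vector by cosine similarity
const nearest = Object.entries(vectors)
  .map(([word, v]) => [word, cosine(result, v)])
  .sort((a, b) => b[1] - a[1])[0][0];

console.log(nearest); // prints "queen"
```

The learned relationship is a consistent "gender direction" in the space; by the same logic, vector("Paris") - vector("France") + vector("Italy") should land near vector("Rome").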
Q12. Hands-on: Embed these 5 sentences using the OpenAI API and print the cosine similarity between each pair. Which pair is most similar? Which is least similar?
- "JavaScript is a programming language"
- "Python is used for data science"
- "I love chocolate cake"
- "TypeScript adds types to JS"
- "The weather is sunny today"
4.11.b — Similarity Search (Q13–Q24)
Q13. Explain cosine similarity in plain English, without math. Why is it the most common metric for embedding search?
Q14. Calculation: Given two 3-dimensional vectors A = [1, 0, 0] and B = [0, 1, 0], calculate: (a) cosine similarity, (b) Euclidean distance, (c) dot product. Are these vectors similar?
Q15. OpenAI embeddings are normalized (length = 1). Why does this mean dot product equals cosine similarity? What computational advantage does this give you?
Q16. Compare semantic search and keyword search with a concrete example. Give a query where keyword search fails but semantic search succeeds.
Q17. You build a semantic search engine and a user queries "automobile maintenance." Your database contains a document about "car repair tips" but the word "automobile" never appears in it. Will keyword search find it? Will semantic search find it? Why?
Q18. What is a similarity threshold and why do you need one? If you set it too high (e.g., 0.95), what happens? If you set it too low (e.g., 0.3)?
Q19. Code: Implement a cosineSimilarity(vecA, vecB) function in JavaScript from scratch (no libraries). Test it with: [1, 0, 0] vs [1, 0, 0] (should be 1.0) and [1, 0, 0] vs [0, 1, 0] (should be 0.0).
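After attempting Q19 yourself, you can compare against this minimal reference sketch (one possible implementation, not the only correct one):

```javascript
// Cosine similarity from scratch: dot product divided by the product
// of the two vector magnitudes.
function cosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i];
    magA += vecA[i] * vecA[i];
    magB += vecB[i] * vecB[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // 1
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0
```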
Q20. Explain the difference between brute-force search and approximate nearest neighbor (ANN) search. At what dataset size should you switch from brute-force to ANN?
Q21. What is hybrid search (combining semantic + keyword)? Why do most production systems use it instead of pure semantic search?
Q22. You search for "React hooks" and get these results with similarity scores: (a) "Vue composition API" = 0.82, (b) "React component lifecycle" = 0.79, (c) "Baking sourdough" = 0.15. With a threshold of 0.7, which results are returned?
Q23. Why is it dangerous to mix embeddings from different models in the same vector database? What would happen if you embedded half your documents with text-embedding-3-small and half with text-embedding-ada-002?
Q24. Design: You're building a customer support chatbot that searches a knowledge base of 500 FAQ articles. Design the similarity search component: which metric, what threshold, how many results to return, and how to handle "no good results found."
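The retrieval step of a Q24-style design can be sketched as brute-force cosine search with a threshold, top-k, and an explicit fallback. The threshold of 0.7, k of 3, and the toy 2-d "embeddings" below are illustrative assumptions, not canonical values:

```javascript
// Brute-force semantic search over an in-memory store, with a
// similarity threshold and an explicit "no good results" fallback.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(queryVec, store, { threshold = 0.7, k = 3 } = {}) {
  const hits = store
    .map((doc) => ({ ...doc, score: cosine(queryVec, doc.vector) }))
    .filter((doc) => doc.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  // Return null rather than an empty list, so callers must handle the
  // "no good match" case (e.g. escalate to a human agent).
  return hits.length > 0 ? hits : null;
}

// Toy 2-d vectors standing in for real embeddings
const store = [
  { id: 'faq-12', vector: [0.95, 0.31] },
  { id: 'faq-40', vector: [0.10, 0.99] },
];
const result = search([1, 0], store);
console.log(result ? result.map((d) => d.id) : 'No good match — escalate');
```

At 500 articles, brute-force search is entirely adequate; an ANN index would be premature.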
4.11.c — Document Chunking Strategies (Q25–Q37)
Q25. Explain in 2-3 sentences why you should not embed an entire 50-page document as a single vector.
Q26. Name five chunking strategies in order from simplest to most sophisticated. Give one advantage of each.
Q27. What is the difference between a 200-token chunk and a 2000-token chunk in terms of: (a) embedding precision, (b) context richness, (c) storage cost, (d) RAG prompt usage?
Q28. Calculation: You have a 100-page document (~75,000 tokens). You chunk it into 400-token chunks with 50-token overlap. Approximately how many chunks will you produce?
Q29. Explain chunk overlap with a concrete example. What problem does it solve? What problem does too much overlap create?
Q30. Code: Write a fixedSizeChunk(text, chunkSize, overlap) function that splits text into chunks of chunkSize characters, with overlap characters shared between consecutive chunks. Test it on a 1000-character string with chunkSize=300 and overlap=50.
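A reference sketch for Q30 to check your own solution against (one reasonable implementation; edge-case handling can legitimately differ):

```javascript
// Fixed-size chunking with overlap: each new chunk starts
// chunkSize - overlap characters after the previous one.
function fixedSizeChunk(text, chunkSize, overlap) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // reached the end of the text
  }
  return chunks;
}

const text = 'x'.repeat(1000);
const chunks = fixedSizeChunk(text, 300, 50);
console.log(chunks.length); // 4 — chunks start at 0, 250, 500, 750
```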
Q31. Describe recursive character splitting step by step. What separators does it try, and in what order? Why is this better than fixed-size chunking?
Q32. What is semantic chunking and how does it decide where to split? Why is it more expensive than other methods?
Q33. You're chunking a technical manual that has code blocks mixed with explanatory paragraphs. Which chunking strategy do you choose and why? How do you handle the code blocks?
Q34. What metadata should you attach to each chunk? Name at least 6 fields and explain why each is useful.
Q35. Your RAG system returns the right chunk but the user asks "where did this information come from?" How does chunk metadata solve this? Write the code that formats a citation from chunk metadata.
Q36. Design: You have three types of documents: (a) FAQ pages (short Q&A pairs), (b) legal contracts (dense, 50-page PDFs), (c) developer blog posts (mixed code + prose). Design a chunking strategy for each, specifying method, chunk size, and overlap.
Q37. Hands-on experiment: Take a 2-page article and chunk it three ways: (a) fixed-size 200 characters, (b) sentence-based 3 sentences per chunk, (c) paragraph-based. Embed all chunks and search for a specific fact. Which strategy returns the most relevant chunk? Measure the similarity scores.
Answer Hints
| Q | Hint |
|---|---|
| Q3 | Dimensions are learned features — they have no human-interpretable labels. You cannot label dimension #742. |
| Q6 | 3072 dims x 4 bytes = 12,288 bytes ≈ 12 KB per vector. 5M vectors = ~60 GB. |
| Q7 | No — different models produce incompatible vector spaces. You must re-embed everything with the new model. |
| Q9 | Trade-off: ~8% quality loss at 256 dims vs 1536. Use when storage/speed constraints outweigh quality needs. |
| Q14 | (a) cos sim = 0 (perpendicular), (b) Euclidean = sqrt(2) ≈ 1.414, (c) dot product = 0. Not similar at all. |
| Q15 | For normalized vectors, `dot(A, B)` equals cosine similarity — both magnitudes are 1, so the denominator is 1. Advantage: a plain dot product is cheaper to compute (no square roots, no division). |
| Q18 | Too high: miss relevant results (high precision, low recall). Too low: return noise (low precision, high recall). |
| Q22 | Results (a) and (b) are returned (both >= 0.7). Result (c) is filtered out (0.15 < 0.7). |
| Q28 | Effective step = 400 - 50 = 350 tokens. 75,000 / 350 ≈ 215 chunks. |
| Q30 | With 1000 chars, step = 300-50 = 250. Chunks start at: 0, 250, 500, 750. So 4 chunks. |
| Q36 | FAQ: paragraph-based, 100-200 tokens, no overlap. Legal: recursive, 400-600 tokens, 100 overlap. Blog: recursive with code separators, 300-500 tokens, 50 overlap. |
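The numeric hints above (Q6, Q28) can be double-checked with a few lines of code. A sketch; the ≈215 figure counts the final partial chunk:

```javascript
// Q6: storage for one text-embedding-3-large vector (3072 float32 dims)
const bytesPerVector = 3072 * 4;
console.log(bytesPerVector); // 12288 bytes = 12 KB
console.log((5_000_000 * bytesPerVector / 1e9).toFixed(2) + ' GB'); // 61.44 GB, i.e. ~60 GB

// Q28: chunks for 75,000 tokens, 400-token chunks, 50-token overlap
const step = 400 - 50; // each new chunk starts 350 tokens after the last
const chunkCount = Math.ceil((75000 - 400) / step) + 1;
console.log(chunkCount); // 215
```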
← Back to 4.11 — Understanding Embeddings (README)