Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases

4.12 — Exercise Questions: Integrating Vector Databases

Practice questions for all three subtopics in Section 4.12. Mix of conceptual, calculation, code, and design tasks.

How to use this material

  1. Read lessons in order: README.md, then 4.12.a -> 4.12.c.
  2. Answer closed-book first — then compare to the matching lesson.
  3. Build something — set up Chroma locally and experiment hands-on.
  4. Interview prep: 4.12-Interview-Questions.md.
  5. Quick review: 4.12-Quick-Revision.md.

4.12.a — Storing Embeddings (Q1-Q12)

Q1. Define what a vector database is in two sentences. How does it differ from a traditional relational database like PostgreSQL?

Q2. Every vector record has three parts. Name them and explain the purpose of each.

Q3. Calculation: You have 2 million vectors, each with 1536 dimensions. A brute-force cosine similarity search requires approximately how many floating-point operations per query? Why does this make brute-force impractical at scale?
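The arithmetic behind Q3 can be sanity-checked in a few lines. The sketch assumes each cosine comparison over d dimensions costs roughly 2d floating-point operations (d multiplies for the dot product plus d additions); norm computations add a constant factor but don't change the conclusion.

```javascript
// Brute-force search compares the query against every stored vector.
const numVectors = 2_000_000;
const dimensions = 1536;

// ~2 ops per dimension per comparison (multiply + accumulate).
const flopsPerComparison = 2 * dimensions;
const flopsPerQuery = numVectors * flopsPerComparison;

console.log(flopsPerQuery); // 6144000000 — ~6.1 billion ops for ONE query
```

At thousands of queries per second, this linear scan is why approximate indexes (HNSW, IVF) exist.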

Q4. Name five popular vector databases and describe one key differentiator for each.

Q5. When is pgvector the right choice over a purpose-built vector database like Pinecone? Name three conditions.

Q6. Explain how HNSW (Hierarchical Navigable Small World) works at a high level. Why is it O(log n) instead of O(n)?

Q7. Explain how IVF (Inverted File Index) works at a high level. What is the nprobe parameter and how does it affect recall?

Q8. Compare HNSW and IVF across these dimensions: search speed, memory usage, dynamic insert support, and typical use cases.

Q9. What is the difference between a collection and a namespace? When would you use each?

Q10. Hands-on: Write JavaScript code to create a Pinecone index with 1536 dimensions using cosine similarity, then upsert three documents with metadata fields for source, category, and date.

Q11. Why must all vectors in a collection have the same number of dimensions? What happens if you try to upsert a 3072-dimension vector into a 1536-dimension index?

Q12. You need to re-embed 500,000 documents with a new embedding model that produces 1024-dimension vectors (previously 1536). Describe the migration steps required to avoid downtime.


4.12.b — Querying Similar Vectors (Q13-Q24)

Q13. Describe the complete query flow for a vector database search in 5 steps.

Q14. Why must you use the same embedding model for queries as for stored documents? What happens if you mix models?

Q15. The top-k parameter controls how many results are returned. What top-k value would you use for each of these: (a) FAQ chatbot, (b) research assistant, (c) duplicate detection, (d) retrieve-then-rerank pipeline?

Q16. Explain the difference between cosine similarity and cosine distance. Pinecone returns a score of 0.85 — is this similarity or distance? If Chroma returns a distance of 0.15, what is the corresponding similarity?

Q17. Your vector search returns these scores for the top 5 results: 0.94, 0.91, 0.72, 0.68, 0.45. You are building a Q&A chatbot. Which results would you pass to the LLM as context, and why?

Q18. What is a similarity threshold and why is it important for RAG pipelines? What happens if you don't use one?

Q19. Code task: Write a JavaScript function that queries Pinecone, filters results by a score threshold, and returns { results: [], message: "..." } — including a helpful message when no results meet the threshold.
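The filtering-and-fallback half of Q19 can be sketched without any SDK. This assumes Pinecone-style matches of shape `{ id, score, metadata }`; the actual query call is left out, so treat it as a starting point rather than a full answer:

```javascript
// Keep only matches at or above the threshold, and return a helpful
// message either way so the caller never has to guess what happened.
function filterByScore(matches, threshold = 0.75) {
  const results = matches.filter((m) => m.score >= threshold);
  if (results.length === 0) {
    return {
      results: [],
      message: `No results scored above ${threshold}; the query may be out of domain.`,
    };
  }
  return { results, message: `${results.length} result(s) above threshold.` };
}
```

Usage: `filterByScore([{ id: "a", score: 0.9 }, { id: "b", score: 0.5 }])` keeps only the first match.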

Q20. Name five factors that affect vector search query latency, and describe one mitigation strategy for each.

Q21. Explain three approaches to pagination in vector databases. Which approach does Pinecone use, and why is pagination rarely needed for RAG pipelines?

Q22. Your RAG chatbot sometimes returns irrelevant answers. Describe a systematic debugging approach for vector search results, including what metrics to examine.

Q23. Code task: Write an end-to-end RAG query function in JavaScript that: (a) embeds the user query, (b) searches Pinecone with top-k=5, (c) filters by score threshold 0.75, (d) formats context for the LLM, and (e) calls the OpenAI chat API with the context.

Q24. A user queries "What is the refund policy?" and the top result has a cosine similarity of 0.38. What does this score tell you, and what should your application do?


4.12.c — Metadata Filters (Q25-Q37)

Q25. Define metadata in the context of vector databases. Give five examples of metadata fields you would store for a help-center knowledge base.

Q26. Explain the difference between pre-filtering (what vector DBs actually do) and post-filtering (naive approach). Why does post-filtering produce worse results?

Q27. Write a Pinecone query that finds the top 5 vectors similar to a query vector, but only from documents that are: (a) in the "billing" category, (b) written in English, (c) published after January 1, 2026, and (d) NOT from the "deprecated" source.
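One possible shape for Q27's filter, using Pinecone's operator syntax. The field names (`category`, `language`, `publishedAt`, `source`) are assumptions about the schema; the date is stored as an epoch-millisecond number, since Pinecone's range operators compare numbers:

```javascript
// Four ANDed conditions expressed with Pinecone filter operators.
// Passed as the `filter` option alongside topK: 5 in the query call.
const filter = {
  $and: [
    { category: { $eq: "billing" } },
    { language: { $eq: "en" } },
    { publishedAt: { $gt: Date.UTC(2026, 0, 1) } }, // epoch ms, assumed field
    { source: { $ne: "deprecated" } },
  ],
};
```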

Q28. Translate the filter from Q27 into Qdrant filter syntax (using must, should, must_not).

Q29. Translate the filter from Q27 into Chroma filter syntax (using where).

Q30. You are building a multi-tenant SaaS application. Explain why tenant_id filtering is a critical security concern, not just a feature. What happens if you forget to include the tenant filter in a query?

Q31. Schema design: Design a metadata schema for an e-commerce product search system. The system needs to filter by: product category, price range, brand, availability (in stock / out of stock), language, and date added. Write the metadata object in JavaScript.

Q32. Name four common mistakes developers make with metadata in vector databases and how to avoid each one.

Q33. How does the selectivity of a metadata filter affect query performance? What should you do if a filter matches less than 1% of vectors?

Q34. Code task: Write a buildFilter() function in JavaScript that takes user-friendly parameters (category, dateFrom, dateTo, language, isPublished) and returns a Pinecone-compatible filter object. Handle missing parameters gracefully.
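A minimal sketch of one way to approach Q34, assuming the metadata fields share the parameter names. Only parameters that were actually provided end up in the filter, and `undefined` is returned when nothing was supplied so the caller can omit the filter entirely:

```javascript
// Build a Pinecone-compatible filter from optional user-facing params.
function buildFilter({ category, dateFrom, dateTo, language, isPublished } = {}) {
  const filter = {};
  if (category) filter.category = { $eq: category };
  if (language) filter.language = { $eq: language };
  if (typeof isPublished === "boolean") filter.isPublished = { $eq: isPublished };
  if (dateFrom || dateTo) {
    filter.date = {};
    if (dateFrom) filter.date.$gte = dateFrom;
    if (dateTo) filter.date.$lte = dateTo;
  }
  // No constraints -> no filter object at all.
  return Object.keys(filter).length > 0 ? filter : undefined;
}
```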

Q35. Your metadata includes dates stored as "March 15, 2026" in some records and "2026-03-15" in others. A range filter date: { $gte: "2026-01-01" } works for some records but not others. Explain why and how to fix it.
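The root cause in Q35 is that string range filters compare character codes, not dates, so "March 15, 2026" sorts by the letter "M" rather than chronologically. A normalization pass before upserting fixes it; this sketch uses local-time getters throughout so the parse and the formatting agree regardless of timezone:

```javascript
// Normalize any Date-parseable string to ISO 8601 (YYYY-MM-DD),
// which sorts lexicographically AND chronologically.
const toISODate = (s) => {
  const d = new Date(s); // handles prose dates like "March 15, 2026"
  const pad = (n) => String(n).padStart(2, "0");
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
};

console.log(toISODate("March 15, 2026")); // "2026-03-15"
```

Existing records would need a one-time migration re-upserting the normalized `date` field.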

Q36. Pinecone limits metadata to 40KB per vector. Your documents average 5,000 characters. How do you handle this? Describe the pattern for storing large text content alongside vector records.
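The pattern Q36 is asking about can be sketched as a split between the vector record and an external document store. Here `fullTextStore` is a stand-in for PostgreSQL or any key-value store; the 300-character preview size is an illustrative choice:

```javascript
// Keep a short preview in vector metadata; keep the full text elsewhere,
// joined on docId at query time.
const fullTextStore = new Map(); // stand-in for PostgreSQL etc.

function prepareRecord(docId, fullText, embedding) {
  fullTextStore.set(docId, fullText); // full content lives OUTSIDE the vector DB
  return {
    id: docId,
    values: embedding,
    metadata: { preview: fullText.slice(0, 300), docId }, // well under 40KB
  };
}

function hydrate(match) {
  // After similarity search, swap the preview for the full document.
  return { ...match, text: fullTextStore.get(match.metadata.docId) };
}
```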

Q37. Design challenge: You are building a company-wide knowledge search tool. Different departments (engineering, sales, HR, legal) have their own documents, and some documents are confidential. Design the complete metadata schema, explain your access control strategy, and discuss whether to use collections, namespaces, or metadata filters for department isolation.


Answer Hints

Q3: 2,000,000 x 1,536 x 2 = ~6.1 billion float operations per query
Q6: Multi-layer graph — search starts at the top (few nodes, long jumps), drops down layers (more nodes, shorter jumps) to find approximate nearest neighbors
Q8: HNSW: faster search, more memory, good inserts. IVF: less memory, faster build, poor dynamic inserts
Q11: Distance calculations require vectors of equal length — mismatched dimensions make the math undefined
Q12: Create new collection (1024 dims) -> re-embed all docs -> upsert to new collection -> switch application to new collection -> delete old collection
Q16: Pinecone returns similarity (higher = better). Chroma distance 0.15 = similarity 0.85
Q17: Pass 0.94 and 0.91 (above ~0.75 threshold). The 0.72 is borderline. 0.68 and 0.45 are likely noise.
Q20: Vector count, dimensions, top-k size, metadata filter complexity, network latency
Q24: Score 0.38 means the query is out of domain — no relevant content exists. Return "I don't know" rather than hallucinate from irrelevant context.
Q26: Post-filtering: retrieve top-100 by similarity, then filter. If only 2 of 100 match, you get 2 results instead of the 5 you wanted. Pre-filtering narrows candidates first.
Q30: Without tenant filter, Customer A's query could return Customer B's confidential data — this is a data breach, not just a UX problem.
Q33: Highly selective filters (<1%) can degrade ANN search to near brute-force. Use namespaces or separate collections instead.
Q35: "March 15, 2026" sorts alphabetically (starts with "M"), not chronologically. Only ISO 8601 ("YYYY-MM-DD") strings sort correctly for range queries.
Q36: Store truncated preview (300-500 chars) in metadata + full text in PostgreSQL or other store. Join on doc_id at query time.

<- Back to 4.12 — Integrating Vector Databases (README)