Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases

Interview Questions: Integrating Vector Databases

Model answers for storing embeddings, querying similar vectors, and metadata filtering in vector databases.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.12.a -> 4.12.c.
  2. Practice out loud — definition -> example -> pitfall.
  3. Pair with exercises: 4.12-Exercise-Questions.md.
  4. Quick review: 4.12-Quick-Revision.md.

Beginner (Q1-Q4)

Q1. What is a vector database and why do you need one?

Why interviewers ask: Tests whether you understand why embeddings require specialized storage and search infrastructure beyond traditional databases.

Model answer:

A vector database is a database designed specifically to store, index, and query high-dimensional vectors (embeddings). While a traditional database like PostgreSQL excels at exact-match queries on structured data (WHERE category = 'billing'), it cannot efficiently find which of your million stored vectors is most similar to a query vector.

The core problem is that similarity search over high-dimensional vectors requires computing distances between the query and every stored vector — a brute-force approach that scales linearly (O(n)) and becomes prohibitively slow beyond tens of thousands of vectors. Vector databases solve this using Approximate Nearest Neighbor (ANN) indexes like HNSW that reduce search time to O(log n) while maintaining 95-99%+ recall accuracy.
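The brute-force baseline that ANN indexes replace can be sketched in a few lines. This is a toy example with made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the linear scan below is exactly what becomes prohibitive at scale:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, top_k=2):
    # Score EVERY stored vector -- O(n) per query, the cost ANN avoids.
    scored = [(vec_id, cosine_similarity(query, vec))
              for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
results = brute_force_search([1.0, 0.05, 0.0], vectors)
```

An HNSW index reaches roughly the same top-k results while touching only a small fraction of the stored vectors.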

A vector database stores three things per record: a unique ID, the embedding vector (array of floats), and metadata (structured JSON with source, category, date, etc.). This combination enables you to find semantically similar content AND filter by structured constraints — the foundation of every RAG pipeline.


Q2. Explain the query flow when searching a vector database.

Why interviewers ask: Tests end-to-end understanding of how vector search works in practice, not just theory.

Model answer:

The query flow has five steps:

Step 1: Embed the query. Convert the user's natural language question into a vector using the same embedding model that was used to embed the stored documents. This is critical — different models produce vectors in different "spaces," so mixing models produces meaningless results.

Step 2: Send the query vector to the database with parameters: the vector, topK (how many results to return), whether to include metadata, and optionally metadata filters.

Step 3: The ANN index finds nearest neighbors. The database uses its index (typically HNSW) to find the approximate k-nearest vectors without comparing against every stored vector. This takes milliseconds even over millions of vectors.

Step 4: Return ranked results with similarity scores and metadata. The scores indicate how semantically similar each result is to the query.

Step 5: Apply application-level logic — filter by a score threshold (reject low-relevance results), format the retrieved text as context, and pass it to an LLM to generate a grounded answer.
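The five steps can be sketched against a toy in-memory store. `fake_embed` is a stand-in for a real embedding model (a character-frequency vector, purely illustrative) — the key point is that the same function embeds both documents and queries:

```python
import math

def fake_embed(text):
    # Stand-in embedding: normalized letter-frequency vector.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(ch) for ch in alphabet]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

store = [
    {"id": "d1", "vector": fake_embed("reset your password")},
    {"id": "d2", "vector": fake_embed("billing and invoices")},
]

def query(text, top_k=2, threshold=0.5):
    qvec = fake_embed(text)                          # Step 1: embed the query
    scored = [(d["id"],                              # Steps 2-3: score candidates
               sum(a * b for a, b in zip(qvec, d["vector"])))
              for d in store]                        # (brute force stands in for ANN)
    scored.sort(key=lambda r: r[1], reverse=True)    # Step 4: rank by similarity
    return [r for r in scored[:top_k]
            if r[1] >= threshold]                    # Step 5: threshold filter

hits = query("password reset help")
```

In a real pipeline, the surviving hits would be formatted as context and passed to the LLM.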


Q3. What is the difference between cosine similarity and cosine distance?

Why interviewers ask: Confusion between similarity and distance is a common source of bugs. Interviewers check that you can interpret scores correctly.

Model answer:

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them. It ranges from -1 to 1, where 1 means identical direction, 0 means no relationship (orthogonal), and -1 means opposite direction; in practice, scores for real text embeddings usually land between 0 and 1.

Cosine distance is simply 1 - cosine similarity. It ranges from 0 to 2, where 0 means identical and higher values mean less similar.

The practical distinction matters because different databases report scores differently. Pinecone and Qdrant return similarity (higher = better). Chroma and pgvector return distance (lower = better). If you set a threshold of 0.75 expecting similarity but the database returns distance, you'll filter out your best results and keep the worst ones.

The conversion is straightforward: similarity = 1 - distance and distance = 1 - similarity. A Pinecone score of 0.85 and a Chroma distance of 0.15 represent the same relationship.
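A small normalization helper is a cheap way to avoid the inverted-threshold bug: convert every score to similarity at the boundary of your retrieval code, then apply one threshold everywhere. The database names below reflect the defaults described above:

```python
def to_similarity(score, score_is_distance):
    # Distance and similarity are complements: similarity = 1 - distance.
    return 1.0 - score if score_is_distance else score

pinecone_score = to_similarity(0.85, score_is_distance=False)  # Pinecone: similarity
chroma_score = to_similarity(0.15, score_is_distance=True)     # Chroma: distance
```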


Q4. What is metadata in a vector database and why does it matter?

Why interviewers ask: Metadata filtering is what makes vector search production-ready rather than just a demo. Interviewers want to see practical understanding.

Model answer:

Metadata is structured key-value data attached to each vector that describes facts about the source document — things like category: "billing", date: "2026-03-15", language: "en", tenant_id: "customer_123", and is_published: true.

Without metadata filtering, every query searches your entire vector collection. This is problematic because: (1) you might return deprecated documents alongside current ones, (2) multi-tenant systems could leak data between customers, (3) irrelevant content types pollute results, and (4) you cannot restrict search to recent documents.

Metadata filters are applied during the vector search, not after. The database simultaneously evaluates semantic similarity AND metadata constraints, returning results that are both relevant and match your structural requirements. This is what transforms a basic "find similar text" capability into a production search system with access control, time sensitivity, and topic scoping.

Best practices for metadata: use flat key-value structures (most databases don't support nested objects), normalize casing, store dates as ISO 8601 strings ("YYYY-MM-DD"), and respect size limits (Pinecone caps metadata at 40KB per vector).
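These best practices can be enforced in one normalization step before upsert. This is an application-side sketch (the function name and rules are assumptions, not a database API):

```python
from datetime import date

def normalize_metadata(meta):
    # Enforce: flat structure, lowercase strings, ISO 8601 dates.
    out = {}
    for key, value in meta.items():
        if isinstance(value, dict):
            raise ValueError(f"nested metadata not supported: {key}")
        if isinstance(value, date):
            value = value.isoformat()        # -> "YYYY-MM-DD"
        elif isinstance(value, str):
            value = value.strip().lower()    # normalize casing
        out[key] = value
    return out

record = normalize_metadata({"category": " Billing ", "date": date(2026, 3, 15)})
```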


Intermediate (Q5-Q8)

Q5. Compare HNSW and IVF indexing algorithms. When would you use each?

Why interviewers ask: Tests deeper understanding of how vector databases achieve fast search, and whether you can make informed infrastructure decisions.

Model answer:

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph over the vectors. The top layer has few nodes with long-distance connections, and each lower layer adds more nodes with shorter connections. Search starts at the top and greedily navigates toward the query vector, dropping down layers for finer resolution. This gives O(log n) search time with 95-99%+ recall.

IVF (Inverted File Index) partitions the vector space into clusters using k-means. At query time, it finds the nearest cluster centroids and only searches vectors within those clusters. The nprobe parameter controls how many clusters are searched — higher nprobe gives better recall but slower queries.

HNSW advantages: Faster queries, higher recall, supports dynamic inserts (new vectors can be added to the graph without rebuilding). HNSW disadvantages: Higher memory usage (the graph structure lives in memory), slower initial build.

IVF advantages: Lower memory usage, faster to build the index, handles very large datasets well. IVF disadvantages: Poor support for dynamic inserts (adding vectors may require re-clustering), recall is more sensitive to parameter tuning.

Decision: HNSW for most applications, especially under 100M vectors. IVF for very large datasets (hundreds of millions to billions) or when memory is constrained. Most modern vector databases (Pinecone, Qdrant, Chroma) default to HNSW because it offers the best balance for typical workloads.
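The IVF idea is easy to show in miniature. This toy sketch uses two hand-picked centroids (real IVF learns them with k-means) and illustrates how nprobe trades recall for speed — with nprobe=1 the query scans only one bucket instead of the whole collection:

```python
import math

centroids = [[0.0, 0.0], [10.0, 10.0]]   # assumed pre-trained cluster centers
buckets = {0: [], 1: []}                 # inverted lists: centroid -> vectors

def add(vec_id, vec):
    # Index time: assign each vector to its nearest centroid's bucket.
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(vec, centroids[i]))
    buckets[nearest].append((vec_id, vec))

def search(query, nprobe=1, top_k=1):
    # Query time: scan only the nprobe closest buckets.
    probed = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [pair for i in probed for pair in buckets[i]]
    candidates.sort(key=lambda pair: math.dist(query, pair[1]))
    return candidates[:top_k]

add("a", [0.5, 0.5]); add("b", [9.5, 9.8]); add("c", [0.2, 1.0])
best = search([0.4, 0.6], nprobe=1, top_k=1)
```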


Q6. How do you design a metadata schema for a multi-tenant RAG application?

Why interviewers ask: Tests production engineering skills — data isolation, access control, and schema design are critical for real applications.

Model answer:

The metadata schema must serve three goals: tenant isolation (security), search relevance (quality), and query performance (speed).

Core schema:

{
  tenant_id: "customer_abc",        // MANDATORY — never omit
  text: "Preview text (300 chars)", // For display without separate lookup
  doc_id: "doc_001",                // Reference to full document
  category: "billing",              // Topic classification
  date: "2026-03-15",               // ISO 8601 for range queries
  source: "help-center",            // Origin tracking
  language: "en",                   // Language filtering
  is_published: true,               // Status flag
  access_level: "internal",         // Permission level
  version: 3                        // Document version
}

Tenant isolation strategy: Every query MUST include tenant_id as a filter. This is a security requirement, not optional. A missing tenant filter means Customer A's query could return Customer B's confidential data.

For large tenants (millions of vectors each), consider using namespaces (one per tenant) instead of metadata filtering — this provides better performance and stronger isolation. For smaller tenants, metadata filtering on tenant_id is sufficient and simpler to manage.

Performance considerations: tenant_id filters are highly selective if you have many tenants. If a single tenant has <1% of total vectors, the filter may degrade ANN search performance. In that case, namespace-based isolation is better.
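One way to make the mandatory tenant filter hard to forget is to route every query's filter through a single helper. This is an application-side pattern (the helper is hypothetical; the `$eq` operator syntax follows Pinecone/Qdrant-style filters):

```python
def build_filter(tenant_id, extra=None):
    # Refuse to build any filter without a tenant -- isolation is a
    # security boundary, not an optional optimization.
    if not tenant_id:
        raise ValueError("tenant_id is mandatory for every query")
    query_filter = {"tenant_id": {"$eq": tenant_id}}
    query_filter.update(extra or {})
    return query_filter

f = build_filter("customer_abc", {"category": {"$eq": "billing"}})
```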


Q7. Your RAG chatbot returns irrelevant answers. Walk through your debugging process.

Why interviewers ask: Demonstrates systematic problem-solving skills specific to AI systems, which are harder to debug than traditional applications.

Model answer:

I'd debug systematically through each layer of the pipeline:

Layer 1: Check the vector search results. Query the database directly (before passing to the LLM) and examine what documents were retrieved. Look at the similarity scores.

  • If all scores are very low (<0.5), the query doesn't match any stored content. The knowledge base might be missing relevant documents, or the query is out of domain.
  • If scores are high but documents are wrong, the chunking strategy may be too coarse. A 2000-word chunk about "account settings" might score high for "password reset" because they're broadly related, but the specific password reset instructions are buried in the middle.
  • If scores are tightly clustered (e.g., all between 0.71 and 0.73), the embeddings lack diversity — the content may be too homogeneous, or the embedding model is weak for this domain.

Layer 2: Verify embedding model consistency. Ensure the query is embedded with the same model used for stored documents. A mismatch produces random-looking results.

Layer 3: Examine metadata filters. If filters are too restrictive, relevant documents get excluded. Try the query without filters to see if the right documents exist in the database at all.

Layer 4: Inspect the LLM prompt. Even with good retrieved documents, the LLM might ignore or misinterpret the context. Check that the system prompt clearly instructs the model to use ONLY the provided context.

Layer 5: Evaluate the score threshold. If the threshold is too low, irrelevant documents pollute the context. If too high, relevant documents get filtered out. Plot the score distribution across many queries to calibrate.
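A quick way to calibrate that threshold is to summarize the score distribution over logged queries. The scores below are made up for illustration; with real logs, the quartiles give you a principled starting point:

```python
import statistics

# Similarity scores logged from recent production queries (illustrative).
logged_scores = [0.82, 0.79, 0.41, 0.88, 0.35, 0.76, 0.58, 0.91]

# Quartiles of the distribution; a threshold near the lower quartile
# keeps most results while cutting the low-relevance tail.
q1, q2, q3 = statistics.quantiles(logged_scores, n=4)
```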


Q8. How do you handle the 40KB metadata size limit in Pinecone when documents are long?

Why interviewers ask: Tests practical production experience — this is a real constraint that every Pinecone user encounters.

Model answer:

The solution is to separate the vector store from the content store. Store a compact metadata record in Pinecone with essential filterable fields and a text preview. Store the full document text in a separate database.

Implementation pattern:

  1. In Pinecone metadata: Store truncated text (300-500 characters for display), the document ID (doc_id), and all filterable fields (category, date, source, etc.). This typically stays well under 40KB.

  2. In a separate store (PostgreSQL, DynamoDB, Redis, or even a simple key-value store): Store the full document text keyed by the same doc_id.

  3. At query time: Perform the vector search to get ranked document IDs, then fetch the full text from the content store using those IDs. This adds one additional database call but keeps your vector records compact and well within limits.

The text preview in metadata serves two purposes: it allows quick display of search results without the extra lookup (good for UI), and it serves as a fallback if the content store is temporarily unavailable.
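The two-store pattern can be sketched with plain dicts standing in for both systems (in production they would be the vector database and e.g. PostgreSQL or Redis):

```python
full_text = "Full document text ... potentially far larger than 40KB."

# Content store: full text keyed by doc_id (stand-in for PostgreSQL/Redis).
content_store = {"doc_001": full_text}

# Vector record: compact metadata with only a preview of the text.
vector_record = {
    "id": "doc_001#chunk_0",
    "metadata": {
        "doc_id": "doc_001",
        "text": full_text[:300],   # preview for display, not the full document
        "category": "billing",
    },
}

def hydrate(search_hits):
    # After the vector search, fetch full text by doc_id -- one extra call.
    return [(hit["metadata"]["doc_id"], content_store[hit["metadata"]["doc_id"]])
            for hit in search_hits]

hydrated = hydrate([vector_record])
```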

For systems where the extra lookup adds unacceptable latency, consider using a vector database that doesn't have strict metadata limits (like Qdrant or pgvector) — but be aware that larger payloads increase memory usage and can slow queries.


Advanced (Q9-Q11)

Q9. Design a vector search system that handles 50 million documents with sub-100ms query latency.

Why interviewers ask: Tests system design thinking at scale — infrastructure choices, trade-offs, and performance optimization.

Model answer:

Embedding model: text-embedding-3-small (1536 dimensions) — good balance of quality and cost. At 50M documents, embedding cost is significant: 50M docs x ~500 tokens avg x $0.02/1M tokens = ~$500 one-time cost.

Vector database: Pinecone serverless or Qdrant Cloud with 50M+ vectors. Self-hosted Qdrant on high-memory instances is also viable if cost control is important.

Indexing: HNSW with tuned parameters. For 50M vectors, increase M to 32 (more connections per node) and efConstruction to 300 for better index quality. At query time, tune efSearch to balance recall vs latency.

Partitioning strategy:

  • Partition by coarse category into separate collections (e.g., "engineering-docs", "help-articles", "product-catalog"). This reduces the search space per query.
  • Within each collection, use metadata filters for finer scoping.

Latency budget (100ms total):

  • Embedding the query: ~50-80ms (OpenAI API call)
  • Vector search: ~20-40ms (Pinecone/Qdrant with warm cache)
  • Network overhead: ~10-20ms

Optimization critical path: The embedding API call is the bottleneck. Mitigations: (1) use a locally-hosted embedding model (e.g., ONNX runtime) to eliminate the API call and reduce embedding to ~5ms, (2) cache frequent query embeddings, (3) use batch queries when handling multiple requests.

Monitoring: Track p50/p95/p99 query latencies, recall metrics (measure against a golden test set), and index statistics. Alert when p95 exceeds 150ms.
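Mitigation (2) — caching frequent query embeddings — is often a one-decorator change. Here `slow_embed_call` is a stand-in for the real embedding API; the counter only exists to show the cache working:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the "API" is actually hit

@lru_cache(maxsize=10_000)
def cached_embed(query_text):
    # In production this body would be the embedding API call (~50-80ms).
    calls["count"] += 1
    return tuple(float(ord(c)) for c in query_text[:8])  # fake vector

cached_embed("password reset")
cached_embed("password reset")   # repeat query: served from cache, no API call
```

Note that `lru_cache` requires hashable arguments and returns, which is why the fake vector is a tuple rather than a list.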


Q10. Explain the trade-offs between using namespaces, separate collections, and metadata filters for data isolation.

Why interviewers ask: Tests nuanced understanding of vector database architecture — there's no single right answer, and the interviewer wants to see you reason about trade-offs.

Model answer:

Metadata filters (single collection, filter on a field like tenant_id):

  • Pros: Simplest to implement, no collection management overhead, single place to query.
  • Cons: Highly selective filters (e.g., tenant has <1% of vectors) degrade ANN search performance. No hard isolation — a code bug that omits the filter leaks data. All vectors share the same index, so one noisy tenant can affect others.
  • Best for: Fewer than ~100 tenants with roughly equal data sizes; non-sensitive data where filtering bugs don't cause security incidents.

Namespaces (single index, logical partitions — Pinecone-specific):

  • Pros: Lightweight isolation without managing multiple indexes. Queries are scoped to a namespace. Faster than metadata filtering for selective access patterns.
  • Cons: All namespaces share the same index configuration (dimension, metric). Pinecone-specific feature — not portable to other databases. Still a logical (not physical) boundary.
  • Best for: Multi-tenant SaaS with many tenants; environment separation (staging/production); A/B testing different embedding versions.

Separate collections/indexes (physical isolation):

  • Pros: Complete isolation — impossible for data to leak between collections. Independent configuration per collection (different dimensions, metrics, index parameters). Predictable performance per collection.
  • Cons: Higher operational complexity (manage many collections). Higher cost (each collection has its own infrastructure). Cannot cross-collection query without application logic.
  • Best for: Strict compliance requirements (HIPAA, SOC2); fundamentally different data types; very large tenants who need dedicated resources.

My recommendation for most multi-tenant apps: Start with namespace-based isolation (or metadata filters if not using Pinecone). Move high-volume or high-security tenants to dedicated collections when needed. Always treat tenant isolation as a security boundary, not just a convenience.


Q11. How would you evaluate whether your vector search system is returning good results, and how would you improve it over time?

Why interviewers ask: Tests understanding of AI system evaluation — the hardest and most important part of building production AI systems.

Model answer:

Evaluation approach:

Step 1: Build a golden evaluation dataset. Create 200+ query/expected-document pairs where humans have labeled which documents should be returned for each query. Include edge cases: ambiguous queries, out-of-domain queries, queries that should match multiple documents, and queries that should match nothing.

Step 2: Define retrieval metrics:

  • Recall@k: What fraction of relevant documents appear in the top-k results? If there are 3 relevant documents and top-5 retrieval finds 2 of them, recall@5 = 66.7%.
  • Precision@k: What fraction of top-k results are actually relevant? If top-5 returns 2 relevant and 3 irrelevant, precision@5 = 40%.
  • MRR (Mean Reciprocal Rank): Where does the first relevant result appear? If it's at position 1, MRR = 1. Position 3, MRR = 0.33. Average across all queries.
  • NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality — relevant results ranked higher contribute more to the score.
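The first three metrics are a few lines each; the sketch below scores a single query against its labeled relevant set (in a real eval suite you would average these over the whole golden dataset):

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant docs that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant result (0 if none found).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d9", "d2", "d7"]   # ranked search output
relevant = {"d1", "d2", "d5"}                # human-labeled ground truth
```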

Step 3: Run evaluation regularly. After every change to the embedding model, chunking strategy, metadata schema, or index configuration, run the eval suite. Track metrics over time to catch regressions.

Improvement strategies:

  • Improve chunking: Smaller, more focused chunks (200-500 tokens) with overlap. Include section headings in each chunk for context.
  • Hybrid search: Combine vector similarity with keyword search (BM25). Weaviate supports this natively. Helps with exact term matching that embeddings can miss.
  • Re-ranking: Retrieve top-20 with vector search, then re-rank with a cross-encoder model that considers the query-document pair jointly. More expensive but significantly improves precision.
  • Query expansion: Rephrase the user's query or generate multiple query variations, search with each, and merge results. Helps with vocabulary mismatches.
  • User feedback loop: Log which results users click, ignore, or thumbs-down. Use this signal to identify problem queries and expand the eval dataset.

Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | What indexing algorithm do most vector DBs use? | HNSW (Hierarchical Navigable Small World) |
| 2 | Pinecone returns score 0.85 — similarity or distance? | Similarity (higher = more similar) |
| 3 | Chroma returns distance 0.15 — what's the similarity? | 0.85 (similarity = 1 - distance) |
| 4 | Max metadata size per vector in Pinecone? | 40 KB |
| 5 | Must query embedding model match stored embedding model? | Yes — different models produce incompatible vector spaces |
| 6 | What does top-k=5 mean? | Return the 5 most similar vectors to the query |
| 7 | Qdrant calls metadata what? | Payload |
| 8 | What is a collection in a vector database? | A logical grouping of vectors with shared configuration (like a table) |
| 9 | ISO date format for metadata? | YYYY-MM-DD (e.g., "2026-03-15") |
| 10 | What happens if you forget the tenant_id filter? | Data leak — other tenants' documents can appear in results |
| 11 | HNSW search complexity? | O(log n) — sub-linear |
