Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases
4.12 — Integrating Vector Databases: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md->4.12.a...4.12.c.
- Practice — 4.12-Exercise-Questions.md.
- Polish answers — 4.12-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Vector database | Database designed to store, index, and search high-dimensional vectors (embeddings) |
| Embedding | Array of floats representing semantic meaning (e.g., 1536 dimensions) |
| ANN | Approximate Nearest Neighbor — finds "close enough" vectors in sub-linear time |
| HNSW | Hierarchical Navigable Small World — multi-layer graph index, O(log n) search |
| IVF | Inverted File Index — cluster-based partitioning, O(n/k) search |
| Collection | Logical grouping of vectors with shared config (like a table) |
| Namespace | Lightweight partition within an index (Pinecone-specific) |
| Top-k | Number of nearest neighbors to return |
| Cosine similarity | -1 to 1 (typically 0-1 for text embeddings), higher = more similar (Pinecone, Qdrant) |
| Cosine distance | 0-2, lower = more similar (Chroma, pgvector). distance = 1 - similarity |
| Metadata | Structured key-value data attached to each vector (source, category, date, etc.) |
| Upsert | Insert or update a vector by ID |
| Recall | Fraction of true nearest neighbors found by ANN (95-99%+) |
Vector record anatomy
{
id: "doc_001", // Unique identifier
vector: [0.023, -0.041, 0.087, ..., -0.032], // 1536 floats
metadata: { // Structured facts
text: "Preview text here...",
source: "help-center",
category: "billing",
date: "2026-03-15",
language: "en",
is_published: true,
tenant_id: "customer_abc"
}
}
Popular vector databases
| DB | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Zero-ops production |
| Chroma | Open source, local | Prototyping, dev |
| Qdrant | Open source + cloud | Performance-critical |
| Weaviate | Open source + cloud | Hybrid search (vector + keyword) |
| pgvector | PostgreSQL extension | Teams already on Postgres |
| Milvus | Open source + cloud | Billion-scale datasets |
Query flow
User question
|
v
Embed query (SAME model as stored docs)
|
v
Search vector DB (top-k nearest neighbors + optional filters)
|
v
Return results with scores + metadata
|
v
Filter by score threshold (reject low-relevance)
|
v
Pass context to LLM -> Grounded answer
Rule: Query embedding model MUST match stored embedding model.
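The flow above can be sketched as one function. This is a minimal sketch, not any vendor's API: `embed`, `search`, and `llm` are injected stand-ins for your embedding client, vector DB client, and LLM call, which also makes the pipeline easy to unit-test with stubs.

```javascript
// Minimal sketch of the query flow. `embed`, `search`, and `llm` are
// hypothetical injected functions standing in for real clients.
async function answerQuery(question, { embed, search, llm, threshold = 0.75, topK = 5 }) {
  // 1. Embed the query with the SAME model used for the stored docs.
  const queryVector = await embed(question);

  // 2. Search the vector DB for the top-k nearest neighbors.
  const matches = await search(queryVector, topK);

  // 3. Reject low-relevance matches by score threshold.
  const relevant = matches.filter((m) => m.score >= threshold);
  if (relevant.length === 0) {
    return "I don't have enough info to answer that.";
  }

  // 4. Pass the surviving context to the LLM for a grounded answer.
  const context = relevant.map((m) => m.metadata.text).join("\n---\n");
  return llm(question, context);
}
```

With stubbed dependencies, the threshold logic can be verified without touching a real database.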
Top-k guide
top-k 1-3 -> Simple classification, FAQ lookup
top-k 3-5 -> Standard RAG Q&A (most common)
top-k 10-20 -> Research, complex questions
top-k 20-50 -> Retrieve-then-rerank pipeline
top-k 50+ -> Broad retrieval for re-ranking
Similarity scores
Cosine similarity (Pinecone, Qdrant) — rough bands; exact values vary by embedding model:
0.90 - 1.00 Very high (near paraphrase)
0.80 - 0.90 High (same topic)
0.70 - 0.80 Moderate (loosely related)
0.60 - 0.70 Low (tangential)
< 0.60 Probably irrelevant
Convert: distance = 1 - similarity
similarity = 1 - distance
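The similarity metric and both conversions are a few lines of arithmetic; a minimal sketch:

```javascript
// Cosine similarity of two equal-length vectors: dot product divided by
// the product of the vector magnitudes.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The conversions from the formulas above.
const toDistance = (similarity) => 1 - similarity;
const toSimilarity = (distance) => 1 - distance;
```

Identical directions score 1, orthogonal vectors score 0, which is why the bands above top out at 1.00.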
Score threshold cheat sheet
| Use Case | Threshold | Rationale |
|---|---|---|
| Factual Q&A | 0.80+ | A wrong answer is worse than no answer |
| General chatbot | 0.70-0.75 | Balance coverage and relevance |
| Product search | 0.60-0.70 | Users expect results |
| Duplicate detection | 0.90+ | Only near-exact matches |
| Recommendation | No threshold | Always show something |
Indexing algorithms
HNSW (most common):
How: Multi-layer graph, greedy navigation top-down
Speed: O(log n)
Recall: 95-99%+
Memory: High (graph in RAM)
Inserts: Good (graph updates)
Used by: Pinecone, Qdrant, Chroma, pgvector
IVF:
How: K-means clustering, search nearest clusters only
Speed: O(n/k) where k = num clusters
Recall: 90-99% (depends on nprobe)
Memory: Lower
Inserts: Poor (may need re-clustering)
Used by: Milvus, FAISS, older systems
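For contrast with the ANN indexes above, exact brute-force search is only a few lines: score every vector, sort, slice. This is the O(n x d) baseline that HNSW and IVF approximate (a sketch, not any library's implementation):

```javascript
// Exact (brute-force) top-k: compare the query against EVERY stored vector.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

function cosine(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

function bruteForceTopK(query, records, k) {
  return records
    .map((r) => ({ id: r.id, score: cosine(query, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Fine for a few thousand vectors; at millions of vectors this is exactly the cost that pushes you to an ANN index.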
Metadata filter syntax
Pinecone (MongoDB-like)
filter: {
category: { $eq: 'billing' },
date: { $gte: '2026-01-01' },
source: { $in: ['help-center', 'docs'] },
$or: [
{ language: { $eq: 'en' } },
{ language: { $eq: 'es' } },
],
}
Chroma (where clause)
where: {
$and: [
{ category: 'billing' },
{ language: 'en' },
],
}
Qdrant (must/should/must_not)
filter: {
must: [
{ key: 'category', match: { value: 'billing' } },
],
must_not: [
{ key: 'status', match: { value: 'draft' } },
],
}
Operators quick reference
| Operation | Pinecone | Chroma | Qdrant |
|---|---|---|---|
| Equals | $eq | $eq / shorthand | match: { value } |
| Not equals | $ne | $ne | must_not + match |
| Greater than | $gt | $gt | range: { gt } |
| In list | $in | $in | match: { any } |
| AND | $and / top-level | $and | must: [...] |
| OR | $or | $or | should: [...] |
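To make the Pinecone-style operators concrete, here is a toy in-memory evaluator covering $eq, $ne, $gt, $gte, $in, $and, and $or. This is illustration only; real databases evaluate filters server-side against an index, not per-record like this.

```javascript
// Toy evaluator: does one metadata object satisfy a Pinecone-style filter?
function matchesFilter(metadata, filter) {
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$and") return cond.every((f) => matchesFilter(metadata, f));
    if (key === "$or") return cond.some((f) => matchesFilter(metadata, f));
    const value = metadata[key];
    // Shorthand: { category: "billing" } means $eq.
    if (typeof cond !== "object" || cond === null) return value === cond;
    return Object.entries(cond).every(([op, target]) => {
      switch (op) {
        case "$eq": return value === target;
        case "$ne": return value !== target;
        case "$gt": return value > target;
        case "$gte": return value >= target;
        case "$in": return target.includes(value);
        default: throw new Error(`Unsupported operator: ${op}`);
      }
    });
  });
}
```

Note that $gte on ISO date strings works because "2026-01-01"-style dates compare correctly as strings, which is one reason the best practices below insist on ISO dates.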
Metadata schema best practices
1. FLAT structure -> No nested objects (most DBs can't filter them)
2. NORMALIZE casing -> "billing" not "Billing" or "BILLING"
3. ISO dates -> "2026-03-15" not "March 15, 2026"
4. TRUNCATE text -> 300-500 chars in metadata, full text elsewhere
5. FILTER-ONLY fields -> Don't store data you won't filter on
6. BOOLEAN as boolean -> true not "true"
7. ALWAYS tenant_id -> Security requirement for multi-tenant apps
Pinecone metadata limit: 40 KB per vector
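The practices above can be enforced with a small sanitizer run before every upsert. A sketch, assuming a 500-char preview cap on the `text` field (from guideline 4) and Pinecone's 40 KB per-vector limit:

```javascript
const MAX_PREVIEW = 500;     // guideline 4: truncate preview text
const MAX_BYTES = 40 * 1024; // Pinecone's per-vector metadata limit

// Enforce the metadata best practices above before upserting.
function sanitizeMetadata(raw) {
  const out = {};
  for (const [key, value] of Object.entries(raw)) {
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      throw new Error(`Nested object at "${key}": keep metadata flat`);
    }
    if (typeof value === "string") {
      // Truncate the preview text; normalize casing on categorical fields.
      out[key] = key === "text" ? value.slice(0, MAX_PREVIEW) : value.toLowerCase();
    } else {
      out[key] = value; // booleans and numbers stay typed (guideline 6)
    }
  }
  const bytes = Buffer.byteLength(JSON.stringify(out), "utf8");
  if (bytes > MAX_BYTES) throw new Error(`Metadata is ${bytes} bytes (> 40 KB)`);
  return out;
}
```

Failing loudly at ingestion time is cheaper than debugging silent filter mismatches ("Billing" vs "billing") at query time.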
Common patterns
Multi-tenant isolation
// ALWAYS include tenant_id in every query
filter: { tenant_id: { $eq: currentTenantId } }
Time-scoped search
// Only last 30 days
const cutoff = new Date(Date.now() - 30*24*60*60*1000).toISOString().split('T')[0];
filter: { date: { $gte: cutoff } }
Score threshold
const results = searchResults.matches.filter(m => m.score >= 0.75);
if (results.length === 0) return "I don't have enough info to answer that.";
Batch ingestion
1. Batch embedding calls (up to 2048 texts per OpenAI call)
2. Batch upserts (100 vectors per Pinecone call)
3. Use idempotent IDs (safe re-runs, no duplicates)
4. Truncate metadata (respect 40KB limit)
5. Track progress (resume on failure)
6. Validate dimensions (must match index)
Embedding model dimensions
| Model | Dims | Cost |
|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens |
| text-embedding-3-large | 3072 | $0.13/1M tokens |
| voyage-3 | 1024 | $0.06/1M tokens |
| embed-v4.0 (Cohere) | 1024 | $0.10/1M tokens |
| all-MiniLM-L6-v2 | 384 | Free (open source) |
Rule: ALL vectors in a collection MUST have the same dimension. Changing models requires full re-indexing.
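The dimension rule is cheap to enforce before upserting; a minimal check:

```javascript
// Reject any vector whose dimension doesn't match the index config.
function assertDimensions(vectors, expectedDims) {
  vectors.forEach((v, i) => {
    if (v.length !== expectedDims) {
      throw new Error(`Vector ${i} has ${v.length} dims, index expects ${expectedDims}`);
    }
  });
}
```

Catching a 384-dim vector headed for a 1536-dim index at ingestion time beats debugging a rejected upsert or silently broken search later.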
Debugging checklist
[ ] Query embedding model matches stored embedding model?
[ ] Correct index/collection/namespace being queried?
[ ] Score distribution — are top scores reasonable (>0.7)?
[ ] Metadata filters not too restrictive?
[ ] Documents actually exist in the database?
[ ] Dimensions match between query vector and index?
[ ] Score threshold not filtering out all results?
[ ] Full text available for LLM context (not just preview)?
Performance tips
1. includeValues: false -> Don't return vectors (saves bandwidth)
2. Scope to namespace -> Search fewer vectors
3. Minimize metadata filters -> Complex filters slow queries
4. Warm cache -> Frequent queries get faster
5. Same-region deployment -> Reduce network latency
6. Appropriate top-k -> Don't over-fetch
Quick formulas
Cosine distance = 1 - cosine similarity
Cosine similarity = 1 - cosine distance
Storage estimate = num_vectors x dimensions x 4 bytes (float32)
= 1M vectors x 1536 dims x 4 = ~6 GB (vectors only)
Brute force ops = num_vectors x dimensions x 2 (multiply + add)
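The storage formula above, as a one-line estimator (raw vector storage only; index overhead and metadata are extra):

```javascript
// Storage estimate: num_vectors x dimensions x 4 bytes (float32).
function vectorStorageBytes(numVectors, dims, bytesPerFloat = 4) {
  return numVectors * dims * bytesPerFloat;
}

const gb = vectorStorageBytes(1_000_000, 1536) / 1e9; // ~6.1 GB, vectors only
```

Useful for sizing an index tier before ingesting: 1M vectors at 3072 dims (text-embedding-3-large) doubles this to ~12 GB.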
End of 4.12 quick revision.