Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases

4.12 — Integrating Vector Databases: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md -> 4.12.a...4.12.c.
  3. Practice with 4.12-Exercise-Questions.md.
  4. Polish answers with 4.12-Interview-Questions.md.

Core vocabulary

Term                One-liner
Vector database     Database designed to store, index, and search high-dimensional vectors (embeddings)
Embedding           Array of floats representing semantic meaning (e.g., 1536 dimensions)
ANN                 Approximate Nearest Neighbor — finds "close enough" vectors in sub-linear time
HNSW                Hierarchical Navigable Small World — multi-layer graph index, O(log n) search
IVF                 Inverted File Index — cluster-based partitioning, O(n/k) search
Collection          Logical grouping of vectors with shared config (like a table)
Namespace           Lightweight partition within an index (Pinecone-specific)
Top-k               Number of nearest neighbors to return
Cosine similarity   0-1, higher = more similar (Pinecone, Qdrant)
Cosine distance     0-2, lower = more similar (Chroma, pgvector); distance = 1 - similarity
Metadata            Structured key-value data attached to each vector (source, category, date, etc.)
Upsert              Insert or update a vector by ID
Recall              Fraction of true nearest neighbors found by ANN (typically 95-99%+)

Vector record anatomy

{
  id:       "doc_001",                              // Unique identifier
  vector:   [0.023, -0.041, 0.087, ..., -0.032],   // 1536 floats
  metadata: {                                        // Structured facts
    text: "Preview text here...",
    source: "help-center",
    category: "billing",
    date: "2026-03-15",
    language: "en",
    is_published: true,
    tenant_id: "customer_abc"
  }
}

Popular vector databases

DB         Type                  Best for
Pinecone   Managed cloud         Zero-ops production
Chroma     Open source, local    Prototyping, dev
Qdrant     Open source + cloud   Performance-critical
Weaviate   Open source + cloud   Hybrid search (vector + keyword)
pgvector   PostgreSQL extension  Teams already on Postgres
Milvus     Open source + cloud   Billion-scale datasets

Query flow

User question
    |
    v
Embed query (SAME model as stored docs)
    |
    v
Search vector DB (top-k nearest neighbors + optional filters)
    |
    v
Return results with scores + metadata
    |
    v
Filter by score threshold (reject low-relevance)
    |
    v
Pass context to LLM -> Grounded answer

Rule: Query embedding model MUST match stored embedding model.
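The flow above can be sketched end to end with an in-memory stand-in: brute-force cosine search replaces the vector DB, and toy 3-dim vectors replace real embeddings. All names here are illustrative, not any vendor's API.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search with a score threshold (stand-in for the vector DB).
function search(queryVector, records, topK, minScore) {
  return records
    .map(r => ({ ...r, score: cosineSimilarity(queryVector, r.vector) }))
    .filter(r => r.score >= minScore)   // reject low-relevance
    .sort((a, b) => b.score - a.score)  // highest similarity first
    .slice(0, topK);                    // top-k nearest neighbors
}

// Toy store (real embeddings would be e.g. 1536 dims, produced by the SAME model as the query).
const records = [
  { id: "doc_001", vector: [1, 0, 0],     metadata: { text: "billing help" } },
  { id: "doc_002", vector: [0, 1, 0],     metadata: { text: "shipping info" } },
  { id: "doc_003", vector: [0.9, 0.1, 0], metadata: { text: "refund policy" } },
];

const results = search([1, 0, 0], records, 3, 0.75);
// doc_001 and doc_003 clear the 0.75 threshold; doc_002 (score 0) is rejected.
```

The threshold step is what turns "nearest" into "relevant": the DB always returns the k closest vectors, even when none of them are close.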


Top-k guide

top-k 1-3    -> Simple classification, FAQ lookup
top-k 3-5    -> Standard RAG Q&A (most common)
top-k 10-20  -> Research, complex questions
top-k 20-50  -> Retrieve-then-rerank pipeline
top-k 50+    -> Broad retrieval for re-ranking

Similarity scores

Cosine similarity (Pinecone, Qdrant):
  0.90 - 1.00  Very high (near paraphrase)
  0.80 - 0.90  High (same topic)
  0.70 - 0.80  Moderate (loosely related)
  0.60 - 0.70  Low (tangential)
  < 0.60       Probably irrelevant

Convert: distance = 1 - similarity
         similarity = 1 - distance

Score threshold cheat sheet

Use case             Threshold     Rationale
Factual Q&A          0.80+         A wrong answer is worse than no answer
General chatbot      0.70-0.75     Balance coverage and relevance
Product search       0.60-0.70     Users expect results
Duplicate detection  0.90+         Only near-exact matches
Recommendation       No threshold  Always show something

Indexing algorithms

HNSW (most common):
  How: Multi-layer graph, greedy navigation top-down
  Speed: O(log n)
  Recall: 95-99%+
  Memory: High (graph in RAM)
  Inserts: Good (graph updates)
  Used by: Pinecone, Qdrant, Chroma, pgvector

IVF:
  How: K-means clustering, search nearest clusters only
  Speed: O(n/k) where k = num clusters
  Recall: 90-99% (depends on nprobe)
  Memory: Lower
  Inserts: Poor (may need re-clustering)
  Used by: Milvus, FAISS, older systems
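The IVF idea (partition first, then scan only the nearest cells) can be illustrated without a real k-means pass by using fixed centroids. This is a sketch of the principle, not a production index; all names are illustrative.

```javascript
// Squared Euclidean distance (enough for nearest-centroid assignment).
function dist2(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Build an inverted file: one list of vectors per centroid cell.
function buildIVF(vectors, centroids) {
  const lists = centroids.map(() => []);
  for (const v of vectors) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (dist2(v.vector, centroids[c]) < dist2(v.vector, centroids[best])) best = c;
    }
    lists[best].push(v);
  }
  return lists;
}

// Search: rank centroids by distance to the query, scan only the nprobe nearest lists.
function ivfSearch(query, centroids, lists, nprobe, topK) {
  const nearestCells = centroids
    .map((c, i) => ({ i, d: dist2(query, c) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, nprobe);
  return nearestCells
    .flatMap(({ i }) => lists[i])
    .map(v => ({ ...v, d: dist2(query, v.vector) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, topK);
}

// Two cells: each query scans only ~half the vectors, which is the O(n/k) win.
const centroids = [[0, 0], [10, 10]];
const lists = buildIVF(
  [
    { id: "a", vector: [1, 0] },
    { id: "b", vector: [0, 1] },
    { id: "c", vector: [9, 10] },
    { id: "d", vector: [10, 9] },
  ],
  centroids
);
const hits = ivfSearch([9.5, 9.5], centroids, lists, 1, 2);
// With nprobe = 1, only the [10, 10] cell is scanned; "c" and "d" are returned.
```

Raising nprobe trades speed for recall: vectors near a cell boundary may live in a cell the search skipped, which is why IVF recall depends on nprobe.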

Metadata filter syntax

Pinecone (MongoDB-like)

filter: {
  category: { $eq: 'billing' },
  date: { $gte: '2026-01-01' },
  source: { $in: ['help-center', 'docs'] },
  $or: [
    { language: { $eq: 'en' } },
    { language: { $eq: 'es' } },
  ],
}

Chroma (where clause)

where: {
  $and: [
    { category: 'billing' },
    { language: 'en' },
  ],
}

Qdrant (must/should/must_not)

filter: {
  must: [
    { key: 'category', match: { value: 'billing' } },
  ],
  must_not: [
    { key: 'status', match: { value: 'draft' } },
  ],
}

Operators quick reference

Operation     Pinecone          Chroma           Qdrant
Equals        $eq               $eq / shorthand  match: { value }
Not equals    $ne               $ne              must_not + match
Greater than  $gt               $gt              range: { gt }
In list       $in               $in              match: { any }
AND           $and / top-level  $and             must: [...]
OR            $or               $or              should: [...]
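To see how the Mongo-style operators compose, here is a minimal in-memory evaluator for Pinecone-like filter objects. It is an illustrative sketch of the semantics, not any vendor's implementation.

```javascript
// Evaluate a Mongo-style filter object against one metadata record.
function matches(filter, metadata) {
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$and") return cond.every(f => matches(f, metadata));
    if (key === "$or")  return cond.some(f => matches(f, metadata));
    const value = metadata[key];
    // Shorthand equality, e.g. { category: "billing" }
    if (typeof cond !== "object" || cond === null) return value === cond;
    return Object.entries(cond).every(([op, operand]) => {
      switch (op) {
        case "$eq":  return value === operand;
        case "$ne":  return value !== operand;
        case "$gt":  return value > operand;
        case "$gte": return value >= operand;
        case "$lt":  return value < operand;
        case "$lte": return value <= operand;
        case "$in":  return operand.includes(value);
        default: throw new Error(`Unsupported operator: ${op}`);
      }
    });
  });
}

const filter = {
  category: { $eq: "billing" },
  date: { $gte: "2026-01-01" },  // ISO dates compare correctly as strings
  $or: [{ language: { $eq: "en" } }, { language: { $eq: "es" } }],
};

matches(filter, { category: "billing", date: "2026-03-15", language: "en" }); // true
matches(filter, { category: "billing", date: "2025-12-31", language: "en" }); // false
```

Note that string comparison on dates only works because the dates are ISO-formatted, which is one reason the schema best practices below insist on ISO dates.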

Metadata schema best practices

1. FLAT structure     -> No nested objects (most DBs can't filter them)
2. NORMALIZE casing   -> "billing" not "Billing" or "BILLING"
3. ISO dates          -> "2026-03-15" not "March 15, 2026"
4. TRUNCATE text      -> 300-500 chars in metadata, full text elsewhere
5. FILTER-ONLY fields -> Don't store data you won't filter on
6. BOOLEAN as boolean -> true not "true"
7. ALWAYS tenant_id   -> Security requirement for multi-tenant apps

Pinecone metadata limit: 40 KB per vector


Common patterns

Multi-tenant isolation

// ALWAYS include tenant_id in every query
filter: { tenant_id: { $eq: currentTenantId } }

Time-scoped search

// Only last 30 days
const cutoff = new Date(Date.now() - 30*24*60*60*1000).toISOString().split('T')[0];
filter: { date: { $gte: cutoff } }

Score threshold

const results = searchResults.matches.filter(m => m.score >= 0.75);
if (results.length === 0) return "I don't have enough info to answer that.";

Batch ingestion

1. Batch embedding calls    (up to 2048 texts per OpenAI call)
2. Batch upserts            (100 vectors per Pinecone call)
3. Use idempotent IDs       (safe re-runs, no duplicates)
4. Truncate metadata        (respect 40KB limit)
5. Track progress           (resume on failure)
6. Validate dimensions      (must match index)

Embedding model dimensions

Model                   Dims  Cost
text-embedding-3-small  1536  $0.02 / 1M tokens
text-embedding-3-large  3072  $0.13 / 1M tokens
voyage-3                1024  $0.06 / 1M tokens
embed-v4.0 (Cohere)     1024  $0.10 / 1M tokens
all-MiniLM-L6-v2        384   Free (open source)

Rule: ALL vectors in a collection MUST have the same dimension. Changing models requires full re-indexing.
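The dimension rule is cheap to enforce with a guard before every upsert; a minimal sketch (the function name is illustrative):

```javascript
// Reject any vector whose dimension does not match the index configuration.
function assertDimension(records, expectedDims) {
  for (const r of records) {
    if (r.vector.length !== expectedDims) {
      throw new Error(
        `Vector ${r.id} has ${r.vector.length} dims, expected ${expectedDims}`
      );
    }
  }
}
```

Catching a mismatch client-side gives a clear error naming the offending record, instead of a generic rejection from the database.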


Debugging checklist

[ ] Query embedding model matches stored embedding model?
[ ] Correct index/collection/namespace being queried?
[ ] Score distribution — are top scores reasonable (>0.7)?
[ ] Metadata filters not too restrictive?
[ ] Documents actually exist in the database?
[ ] Dimensions match between query vector and index?
[ ] Score threshold not filtering out all results?
[ ] Full text available for LLM context (not just preview)?

Performance tips

1. includeValues: false     -> Don't return vectors (saves bandwidth)
2. Scope to namespace       -> Search fewer vectors
3. Minimize metadata filters -> Complex filters slow queries
4. Warm cache               -> Frequent queries get faster
5. Same-region deployment   -> Reduce network latency
6. Appropriate top-k        -> Don't over-fetch

Quick formulas

Cosine distance    = 1 - cosine similarity
Cosine similarity  = 1 - cosine distance
Storage estimate   = num_vectors x dimensions x 4 bytes (float32)
                   = 1M vectors x 1536 dims x 4 = ~6 GB (vectors only)
Brute force ops    = num_vectors x dimensions x 2 (multiply + add)
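The storage estimate above can be sanity-checked in a line of code (float32 vectors only; index structures and metadata add overhead on top):

```javascript
// Raw vector storage: count x dims x 4 bytes (float32), excluding index/metadata.
function storageBytes(numVectors, dims) {
  return numVectors * dims * 4;
}

const gb = storageBytes(1_000_000, 1536) / 1e9;
// 1M x 1536 x 4 = 6,144,000,000 bytes, i.e. ~6.1 GB, matching the ~6 GB estimate above
```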

End of 4.12 quick revision.