Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases

4.12.b — Querying Similar Vectors

In one sentence: Querying a vector database follows a consistent flow — embed the query, search the database for the top-k nearest neighbors, interpret similarity scores, and optionally combine with metadata filters — with tuning knobs for relevance thresholds, performance, and pagination that determine whether your RAG pipeline returns good results or garbage.

Navigation: <- 4.12.a — Storing Embeddings | 4.12.c — Metadata Filters ->


1. The Query Flow

Every vector database query follows the same fundamental pattern, regardless of which database you use:

┌──────────────────────────────────────────────────────────────────────┐
│                    VECTOR SEARCH QUERY FLOW                          │
│                                                                      │
│  Step 1: User asks a question                                        │
│  "How do I reset my password?"                                       │
│       │                                                              │
│       ▼                                                              │
│  Step 2: Embed the query using the SAME model used for storage       │
│  text-embedding-3-small("How do I reset my password?")               │
│  → [0.023, -0.041, 0.087, ..., 0.015]  (1536 dimensions)           │
│       │                                                              │
│       ▼                                                              │
│  Step 3: Send query vector to the vector database                    │
│  "Find the top 5 vectors most similar to this query vector"          │
│       │                                                              │
│       ▼                                                              │
│  Step 4: Vector DB uses ANN index (HNSW/IVF) to find neighbors      │
│  Searches millions of vectors in milliseconds                        │
│       │                                                              │
│       ▼                                                              │
│  Step 5: Return top-k results with scores and metadata               │
│  [                                                                   │
│    { id: "doc_42", score: 0.94, text: "To reset your password..." },│
│    { id: "doc_87", score: 0.89, text: "Password recovery steps..." },│
│    { id: "doc_15", score: 0.82, text: "Account security FAQ..." },  │
│  ]                                                                   │
│       │                                                              │
│       ▼                                                              │
│  Step 6: Pass retrieved text as context to the LLM                   │
│  "Based on these documents, answer the user's question..."           │
└──────────────────────────────────────────────────────────────────────┘

Critical rule: same embedding model

The query must be embedded with the exact same model that was used to embed the stored documents. Mixing models produces meaningless results because different models produce vectors in different "spaces."

✅ Stored with text-embedding-3-small → Query with text-embedding-3-small
❌ Stored with text-embedding-3-small → Query with text-embedding-3-large
❌ Stored with text-embedding-3-small → Query with Cohere embed-v4.0
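One cheap safeguard is to check the query vector's dimensionality before searching, since a model mix-up usually shows up as a dimension mismatch (for example, text-embedding-3-small produces 1536 dimensions while text-embedding-3-large produces 3072). A minimal sketch; the helper name and the expected dimension are assumptions for illustration:

```javascript
// Hypothetical guard: dimension the index was created with
const EXPECTED_DIMENSIONS = 1536;

function assertQueryVector(queryVector, expectedDims = EXPECTED_DIMENSIONS) {
  if (queryVector.length !== expectedDims) {
    throw new Error(
      `Query vector has ${queryVector.length} dimensions, expected ${expectedDims}. ` +
      'Was the query embedded with a different model than the stored documents?'
    );
  }
  return queryVector;
}
```

Note this only catches mix-ups between models of different sizes; two models with the same dimensionality still produce incompatible spaces, so storing the model name alongside your index configuration remains the real fix.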

2. Basic Query Examples

2.1 Querying Pinecone

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI();

async function queryPinecone(queryText, topK = 5) {
  // Step 1: Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Step 2: Search the vector database
  const index = pinecone.index('knowledge-base');
  const results = await index.query({
    vector: queryVector,
    topK: topK,
    includeMetadata: true,        // Return stored metadata with results
    includeValues: false,         // Don't return the full vector (saves bandwidth)
  });

  // Step 3: Process results
  return results.matches.map((match) => ({
    id: match.id,
    score: match.score,           // Similarity score (0 to 1 for cosine)
    text: match.metadata.text,
    category: match.metadata.category,
    source: match.metadata.source,
  }));
}

// ─── Usage ───

const results = await queryPinecone('How do I reset my password?');

console.log('Search results:');
results.forEach((result, i) => {
  console.log(`${i + 1}. [Score: ${result.score.toFixed(3)}] ${result.text}`);
});

// Output:
// 1. [Score: 0.943] To reset your password, go to Settings > Security > Change Password.
// 2. [Score: 0.891] Password recovery: click "Forgot Password" on the login page.
// 3. [Score: 0.823] Two-factor authentication adds an extra layer of security.
// 4. [Score: 0.756] Account settings allow you to update your profile information.
// 5. [Score: 0.701] Contact support if you are locked out of your account.

2.2 Querying Chroma

import { ChromaClient } from 'chromadb';
import OpenAI from 'openai';

const chroma = new ChromaClient();
const openai = new OpenAI();

async function queryChroma(queryText, nResults = 5) {
  const collection = await chroma.getCollection({ name: 'knowledge-base' });

  // Generate query embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Query the collection
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: nResults,
    include: ['documents', 'metadatas', 'distances'],  // What to return
  });

  // Chroma returns arrays of arrays (supports multiple queries at once)
  return results.ids[0].map((id, i) => ({
    id: id,
    distance: results.distances[0][i],     // Distance (lower = more similar for cosine)
    text: results.documents[0][i],
    metadata: results.metadatas[0][i],
  }));
}

// ─── Usage ───

const results = await queryChroma('How do I reset my password?');

results.forEach((result, i) => {
  console.log(`${i + 1}. [Distance: ${result.distance.toFixed(4)}] ${result.text}`);
});

// Note: Chroma returns DISTANCE, not similarity.
// For cosine: distance = 1 - similarity
// So distance 0.057 = similarity 0.943

2.3 Querying Qdrant

import { QdrantClient } from '@qdrant/js-client-rest';
import OpenAI from 'openai';

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });
const openai = new OpenAI();

async function queryQdrant(queryText, limit = 5) {
  // Generate query embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Search
  const results = await qdrant.search('knowledge-base', {
    vector: queryVector,
    limit: limit,
    with_payload: true,           // Return payload (metadata)
    with_vectors: false,          // Don't return vectors (saves bandwidth)
  });

  return results.map((point) => ({
    id: point.id,
    score: point.score,
    text: point.payload.text,
    category: point.payload.category,
  }));
}

3. Understanding the Top-K Parameter

The top-k (or topK, nResults, limit depending on the database) parameter controls how many results the vector database returns. This is one of the most important tuning knobs in your RAG pipeline.

How top-k affects your system

top-k = 1:
  Only the single most similar result
  ✅ Most relevant, least noise
  ❌ Misses important context if the question spans multiple documents

top-k = 3-5:
  A handful of highly relevant results
  ✅ Good balance for most RAG use cases
  ✅ Keeps LLM context manageable
  ❌ May miss edge cases in large knowledge bases

top-k = 10-20:
  Broader retrieval
  ✅ More comprehensive context
  ✅ Better for complex questions that need multiple sources
  ❌ Includes less relevant results (noise)
  ❌ Uses more LLM tokens (cost + latency)

top-k = 50-100:
  Very broad retrieval
  ✅ Useful for re-ranking (retrieve many, then re-rank to top 5)
  ❌ Too much context for direct LLM consumption
  ❌ Increases latency and cost significantly

Choosing the right top-k

| Use Case | Recommended top-k | Why |
| --- | --- | --- |
| FAQ chatbot | 3-5 | Questions usually map to 1-2 documents |
| Research assistant | 10-20 | Complex questions need multiple sources |
| Code search | 5-10 | Multiple relevant code snippets |
| Product recommendation | 10-20 | Users want variety |
| Retrieve then re-rank | 20-50 | Broad retrieval, then ML re-ranker picks top 5 |
| Simple classification | 1-3 | Just need the closest match |

Dynamic top-k

In production, you might adjust top-k based on the query:

function determineTopK(query) {
  // Short, specific questions need fewer results
  if (query.split(' ').length < 5) return 3;

  // Questions with "compare", "list", "all" need more results
  const broadTerms = ['compare', 'list', 'all', 'different', 'options', 'alternatives'];
  if (broadTerms.some((term) => query.toLowerCase().includes(term))) return 15;

  // Default
  return 5;
}

4. Similarity Scores: What They Mean

Different vector databases return scores differently. Understanding what the numbers mean is essential for setting thresholds and debugging.

4.1 Cosine similarity

Range:  -1 to 1  (for normalized vectors: 0 to 1)
1.0  = identical vectors (perfect match)
0.0  = completely unrelated (orthogonal)
-1.0 = opposite meaning (rare in practice with embeddings)

In practice with text embeddings:
0.90 - 1.00  → Very high similarity (likely same topic, near paraphrase)
0.80 - 0.90  → High similarity (related content, same domain)
0.70 - 0.80  → Moderate similarity (loosely related)
0.60 - 0.70  → Low similarity (tangentially related)
< 0.60       → Probably not relevant
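For intuition, cosine similarity is straightforward to compute yourself, which is handy for spot-checking scores the database returns. A minimal implementation (the function name is ours):

```javascript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vector dimensions must match');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]);   // 1  (identical direction)
cosineSimilarity([1, 0], [0, 1]);   // 0  (orthogonal)
cosineSimilarity([1, 0], [-1, 0]);  // -1 (opposite direction)
```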

4.2 How different databases report scores

| Database | Score Name | Metric | Interpretation |
| --- | --- | --- | --- |
| Pinecone | score | Cosine similarity | 0-1, higher = more similar |
| Chroma | distance | Cosine distance | 0-2, lower = more similar |
| Qdrant | score | Cosine similarity | 0-1, higher = more similar |
| pgvector | <=> operator | Cosine distance | 0-2, lower = more similar |

Converting between similarity and distance:

// Cosine distance = 1 - cosine similarity
const similarity = 0.943;
const distance = 1 - similarity; // 0.057

// Chroma returns distance, convert to similarity:
const chromaDistance = 0.057;
const chromaSimilarity = 1 - chromaDistance; // 0.943

4.3 Score interpretation varies by embedding model

Different embedding models produce different score distributions. A score of 0.85 from text-embedding-3-small does NOT mean the same thing as 0.85 from text-embedding-3-large or from Cohere's model.

text-embedding-3-small typical ranges:
  Same sentence:           0.95 - 1.00
  Paraphrase:              0.85 - 0.95
  Same topic:              0.70 - 0.85
  Related:                 0.55 - 0.70
  Unrelated:               0.30 - 0.55

text-embedding-3-large typical ranges:
  Same sentence:           0.97 - 1.00
  Paraphrase:              0.88 - 0.97
  Same topic:              0.72 - 0.88
  Related:                 0.58 - 0.72
  Unrelated:               0.30 - 0.58

IMPORTANT: Always calibrate thresholds to YOUR embedding model and YOUR data.
Run experiments on your actual dataset to determine what scores indicate
"relevant" vs "irrelevant" for your specific use case.
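One rough way to start that calibration, sketched below with a hypothetical helper: score a set of hand-labeled relevant and irrelevant query/document pairs, then place the threshold between the two groups. Real calibration should examine the full distributions and the precision/recall trade-off, but this gives a first cut:

```javascript
// Hypothetical calibration sketch: given similarity scores from pairs you
// labeled as relevant vs irrelevant, pick a threshold halfway between the
// two means.
function calibrateThreshold(relevantScores, irrelevantScores) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const relevantMean = mean(relevantScores);
  const irrelevantMean = mean(irrelevantScores);
  if (relevantMean <= irrelevantMean) {
    throw new Error('Relevant scores should be higher; check your labels or model');
  }
  return (relevantMean + irrelevantMean) / 2;
}

// e.g. with hand-labeled pairs scored by your embedding model:
calibrateThreshold([0.88, 0.91, 0.84], [0.52, 0.61, 0.48]);  // ≈ 0.71
```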

5. Similarity Thresholds: Filtering by Relevance

Just because the vector database returns top-k results doesn't mean all results are relevant. A similarity threshold filters out low-quality results.

Implementing a score threshold

async function queryWithThreshold(queryText, options = {}) {
  const {
    topK = 10,
    scoreThreshold = 0.75,       // Minimum similarity score to include
  } = options;

  // Query with a higher topK than needed (retrieve more, filter later)
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  const index = pinecone.index('knowledge-base');
  const results = await index.query({
    vector: queryVector,
    topK: topK,
    includeMetadata: true,
  });

  // Filter by score threshold
  const relevantResults = results.matches.filter(
    (match) => match.score >= scoreThreshold
  );

  if (relevantResults.length === 0) {
    return {
      results: [],
      message: 'No sufficiently relevant documents found.',
      bestScore: results.matches[0]?.score || 0,
    };
  }

  return {
    results: relevantResults.map((match) => ({
      id: match.id,
      score: match.score,
      text: match.metadata.text,
    })),
    message: `Found ${relevantResults.length} relevant documents.`,
  };
}

// ─── Usage ───

// Good query — matches exist
const goodResult = await queryWithThreshold('How do I reset my password?');
// { results: [{ score: 0.94, text: "..." }, { score: 0.89, text: "..." }], ... }

// Bad query — nothing relevant in the database
const badResult = await queryWithThreshold('What is the meaning of life?');
// { results: [], message: "No sufficiently relevant documents found.", bestScore: 0.41 }

Adaptive thresholds

Different types of queries may need different thresholds:

function getThresholdForQuery(query) {
  // Factual/specific questions need high threshold
  // (wrong answers are worse than no answers)
  const factualPatterns = /^(how|what|when|where|why|who)\s/i;
  if (factualPatterns.test(query)) return 0.80;

  // Exploratory queries can use lower threshold
  const exploratoryPatterns = /\b(related|similar|like|about|explore)\b/i;
  if (exploratoryPatterns.test(query)) return 0.65;

  // Default threshold
  return 0.75;
}

When to use thresholds vs not

| Scenario | Use Threshold? | Why |
| --- | --- | --- |
| RAG Q&A | Yes (0.75-0.85) | Better to say "I don't know" than give irrelevant context |
| Product search | Maybe (0.60-0.70) | Users expect results, even loosely related |
| Duplicate detection | Yes (0.90+) | Only want near-exact matches |
| Recommendation | No | Always want to show something |
| Classification | No | Always pick the closest class |

6. Combining Vector Search with Metadata Filtering

One of the most powerful features of vector databases is combining semantic similarity with structured filters. This is covered in depth in 4.12.c, but here is the core concept.

// ─── Pinecone: Vector search + metadata filter ───

const results = await index.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
  filter: {
    category: { $eq: 'billing' },        // Only search billing documents
    timestamp: { $gte: 1767225600 },     // Only recent documents (Unix seconds for 2026-01-01 — Pinecone range operators require numbers, not date strings)
  },
});

// ─── Chroma: Vector search + metadata filter ───

const results = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: {
    category: 'billing',                  // Only search billing documents
  },
});

// ─── Qdrant: Vector search + payload filter ───

const results = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must: [
      { key: 'category', match: { value: 'billing' } },
      { key: 'date', range: { gte: '2026-01-01T00:00:00Z' } },  // RFC 3339; needs a datetime payload index
    ],
  },
});

The vector database applies both the semantic similarity ranking AND the metadata filter simultaneously — you get results that are both semantically relevant and match your structured constraints.


7. Performance Considerations

Vector search performance depends on several factors. Understanding these helps you design for your latency and throughput requirements.

7.1 What affects query latency

| Factor | Impact | Mitigation |
| --- | --- | --- |
| Vector count | More vectors = slower search (sub-linear with ANN) | Partition into namespaces/collections |
| Vector dimensions | Higher dimensions = slower distance calculations | Use smaller models if quality allows |
| Top-k size | Larger k = slightly slower | Use minimum k needed |
| Metadata filters | Complex filters add overhead | Index frequently filtered fields |
| Include vectors | Returning full vectors increases payload size | Set includeValues: false |
| Network latency | Remote DB adds round-trip time | Use same region, or local DB for dev |
| Index type | HNSW vs IVF have different characteristics | HNSW for most cases |

7.2 Typical latency benchmarks

Pinecone (managed, same region):
  1M vectors, top-5:   ~20-50ms
  10M vectors, top-5:  ~30-80ms
  With metadata filter: +10-30ms

Qdrant (self-hosted, local):
  1M vectors, top-5:   ~5-15ms
  10M vectors, top-5:  ~10-30ms
  With payload filter:  +5-20ms

Chroma (local/embedded):
  100K vectors, top-5:  ~5-20ms
  1M vectors, top-5:    ~20-80ms

Note: These are approximate. Actual performance depends on hardware,
index configuration, vector dimensions, and query complexity.
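When gathering your own numbers, time the embedding call and the search call separately, since the embedding step often dominates end-to-end latency. A small hypothetical timing wrapper:

```javascript
// Hypothetical helper: wrap any async step and report how long it took.
async function timed(label, fn) {
  const start = performance.now();
  const result = await fn();
  const ms = performance.now() - start;
  console.log(`${label}: ${ms.toFixed(1)}ms`);
  return { result, ms };
}

// Usage (embedQuery and searchIndex are assumed wrappers around your own
// embedding and vector-search calls):
// const { result: vector } = await timed('embed', () => embedQuery(text));
// const { result: matches } = await timed('search', () => searchIndex(vector));
```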

7.3 Optimizing query performance

// ─── Optimization 1: Don't return vectors ───

// BAD: Returns full 1536-dim vectors (wastes bandwidth)
const slow = await index.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
  includeValues: true,             // 1536 floats x 5 results = large payload
});

// GOOD: Only return metadata
const fast = await index.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
  includeValues: false,            // Skip the vectors — you rarely need them
});


// ─── Optimization 2: Namespace/collection scoping ───

// BAD: Search all 10M vectors
const slow2 = await index.query({ vector: queryVector, topK: 5 });

// GOOD: Search only the relevant namespace (e.g., 500K vectors)
const fast2 = await index.namespace('help-articles').query({
  vector: queryVector,
  topK: 5,
});


// ─── Optimization 3: Know where the latency goes ───

// Embedding and search are inherently sequential (the search needs the
// query vector), so the embedding call usually dominates end-to-end
// latency, not the vector search itself.

async function optimizedRAGQuery(queryText) {
  // Generate embedding (typically ~100-300ms, the slowest step)
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Search (typically ~20-50ms) can only start once the embedding is ready
  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
    includeValues: false,
  });

  return results.matches;
}

8. Pagination: Handling Large Result Sets

Sometimes you need more results than a single query returns, or you want to let users browse through results.

8.1 Offset-based pagination (Chroma)

async function paginatedQuery(queryText, page = 1, pageSize = 10) {
  const collection = await chroma.getCollection({ name: 'knowledge-base' });

  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Chroma supports offset-based pagination
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: pageSize,
    offset: (page - 1) * pageSize,    // Skip previous pages
    include: ['documents', 'metadatas', 'distances'],
  });

  return {
    page: page,
    pageSize: pageSize,
    results: results.ids[0].map((id, i) => ({
      id: id,
      distance: results.distances[0][i],
      text: results.documents[0][i],
      metadata: results.metadatas[0][i],
    })),
  };
}

// ─── Usage ───

const page1 = await paginatedQuery('machine learning tutorials', 1, 10);
const page2 = await paginatedQuery('machine learning tutorials', 2, 10);

8.2 Cursor-based pagination (Qdrant)

Qdrant's scroll API pages through stored points with a cursor. Note that scroll walks the collection itself (it does not rank by similarity to a query), so it suits browsing or exporting points rather than paginating search results. For similarity-ranked pagination, qdrant.search accepts an offset parameter instead.

async function scrollAllPoints(batchSize = 10) {
  let allResults = [];
  let offset = null;

  // Scroll through stored points in batches
  while (true) {
    const response = await qdrant.scroll('knowledge-base', {
      limit: batchSize,
      offset: offset,
      with_payload: true,
      with_vectors: false,
    });

    allResults.push(...response.points);

    // next_page_offset is null once the collection is exhausted
    if (!response.next_page_offset) break;
    offset = response.next_page_offset;
  }

  return allResults;
}

8.3 Over-fetch and client-side paginate (Pinecone)

Pinecone does not natively support offset/pagination on query results. The common pattern is to fetch a larger topK and paginate client-side.

async function paginatedPineconeQuery(queryText, page = 1, pageSize = 5) {
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Fetch enough results to cover the requested page
  const totalNeeded = page * pageSize;
  const index = pinecone.index('knowledge-base');
  const results = await index.query({
    vector: queryVector,
    topK: Math.min(totalNeeded, 100),  // Cap the over-fetch — Pinecone enforces a maximum topK
    includeMetadata: true,
  });

  // Client-side pagination
  const start = (page - 1) * pageSize;
  const end = start + pageSize;
  const pageResults = results.matches.slice(start, end);

  return {
    page: page,
    pageSize: pageSize,
    totalAvailable: results.matches.length,
    results: pageResults.map((match) => ({
      id: match.id,
      score: match.score,
      text: match.metadata.text,
    })),
  };
}

Pagination considerations

| Approach | Pros | Cons |
| --- | --- | --- |
| Offset-based | Simple, familiar (like SQL OFFSET) | Less efficient for deep pages |
| Cursor-based | Efficient for sequential scanning | Can't jump to arbitrary page |
| Over-fetch + client-side | Works with any DB | Wastes bandwidth and compute on earlier pages |
| No pagination (just top-k) | Simplest, fastest | Limited results |

For RAG pipelines, pagination is rarely needed — you almost always just want the top 5-10 results. Pagination is more useful for search UIs where users browse through results.


9. End-to-End RAG Query Function

Here is a production-ready function that combines everything from this section into a complete RAG query pipeline:

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI();

/**
 * Complete RAG query pipeline:
 * 1. Embed the user's question
 * 2. Search the vector database
 * 3. Filter by relevance threshold
 * 4. Format context for the LLM
 * 5. Generate a grounded answer
 */
async function ragQuery(userQuestion, options = {}) {
  const {
    indexName = 'knowledge-base',
    namespace = '',
    topK = 5,
    scoreThreshold = 0.75,
    model = 'gpt-4o',
    embeddingModel = 'text-embedding-3-small',
    systemPrompt = 'You are a helpful assistant. Answer the user\'s question based ONLY on the provided context. If the context does not contain enough information, say "I don\'t have enough information to answer that."',
  } = options;

  // Step 1: Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: embeddingModel,
    input: userQuestion,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Step 2: Search the vector database
  const index = pinecone.index(indexName);
  const ns = namespace ? index.namespace(namespace) : index;

  const searchResults = await ns.query({
    vector: queryVector,
    topK: topK,
    includeMetadata: true,
    includeValues: false,
  });

  // Step 3: Filter by relevance threshold
  const relevantDocs = searchResults.matches.filter(
    (match) => match.score >= scoreThreshold
  );

  if (relevantDocs.length === 0) {
    return {
      answer: "I don't have enough information in my knowledge base to answer that question.",
      sources: [],
      topScore: searchResults.matches[0]?.score || 0,
    };
  }

  // Step 4: Format context for the LLM
  const context = relevantDocs
    .map((doc, i) => `[Document ${i + 1}] (relevance: ${doc.score.toFixed(2)})\n${doc.metadata.text}`)
    .join('\n\n');

  // Step 5: Generate a grounded answer
  const completion = await openai.chat.completions.create({
    model: model,
    temperature: 0,                // Deterministic for factual answers
    messages: [
      { role: 'system', content: systemPrompt },
      {
        role: 'user',
        content: `Context:\n${context}\n\n---\n\nQuestion: ${userQuestion}`,
      },
    ],
  });

  return {
    answer: completion.choices[0].message.content,
    sources: relevantDocs.map((doc) => ({
      id: doc.id,
      score: doc.score,
      text: doc.metadata.text,
      category: doc.metadata.category,
    })),
    topScore: relevantDocs[0].score,
    tokensUsed: completion.usage,
  };
}

// ─── Usage ───

const result = await ragQuery('How do I reset my password?');

console.log('Answer:', result.answer);
console.log('Sources:', result.sources.length);
console.log('Top relevance score:', result.topScore);
console.log('Tokens used:', result.tokensUsed);

10. Debugging Vector Search Results

When your search results look wrong, here's a systematic debugging approach:

async function debugQuery(queryText) {
  // 1. Check the query embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;
  console.log('Query embedding dimensions:', queryVector.length);
  console.log('First 5 values:', queryVector.slice(0, 5));

  // 2. Query with a high topK to see the full score distribution
  const index = pinecone.index('knowledge-base');
  const results = await index.query({
    vector: queryVector,
    topK: 20,
    includeMetadata: true,
  });

  // 3. Analyze score distribution
  console.log('\nScore distribution:');
  results.matches.forEach((match, i) => {
    const bar = '█'.repeat(Math.round(match.score * 50));
    console.log(
      `  ${(i + 1).toString().padStart(2)}. [${match.score.toFixed(3)}] ${bar} ${match.metadata.text?.slice(0, 60)}...`
    );
  });

  // 4. Check for common issues
  const scores = results.matches.map((m) => m.score);
  const maxScore = Math.max(...scores);
  const minScore = Math.min(...scores);
  const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;

  console.log('\nDiagnostics:');
  console.log(`  Max score: ${maxScore.toFixed(3)}`);
  console.log(`  Min score: ${minScore.toFixed(3)}`);
  console.log(`  Avg score: ${avgScore.toFixed(3)}`);
  console.log(`  Score range: ${(maxScore - minScore).toFixed(3)}`);

  if (maxScore < 0.5) {
    console.log('  WARNING: All scores very low — query may not match any content.');
    console.log('  Check: Is the query domain-relevant? Is the correct index being searched?');
  }

  if (maxScore - minScore < 0.05) {
    console.log('  WARNING: Scores are clustered — results may all be equally (ir)relevant.');
    console.log('  Check: Are embeddings diverse enough? Is the collection too homogeneous?');
  }

  return results;
}

Common query issues and fixes

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| All scores very low (<0.5) | Query is out of domain | Check that the query matches stored content topics |
| All scores very high (>0.95) | Duplicate or near-duplicate vectors | Deduplicate your stored documents |
| Scores tightly clustered | Embeddings lack diversity | Use a better embedding model, or improve document chunking |
| Wrong documents ranked first | Poor chunking strategy | Smaller, more focused chunks; include section headings in chunk text |
| Results missing expected docs | Documents not stored, or wrong namespace | Verify upsert succeeded; check namespace/collection name |
| Inconsistent results | Different embedding models for store vs query | Ensure same model for both |
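For the duplicate-vector symptom, the cheapest fix is deduplicating chunks by normalized text before upserting. A minimal sketch (the helper name is ours; real pipelines may also want fuzzy or embedding-based dedup):

```javascript
// Hypothetical dedup sketch: drop chunks whose text is identical after
// lowercasing and collapsing whitespace.
function dedupeChunks(chunks) {
  const seen = new Set();
  return chunks.filter((chunk) => {
    const key = chunk.text.toLowerCase().replace(/\s+/g, ' ').trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

dedupeChunks([
  { id: 'a', text: 'Reset your password in Settings.' },
  { id: 'b', text: 'reset your  password in settings.' },  // duplicate after normalization
  { id: 'c', text: 'Enable two-factor authentication.' },
]);
// → keeps 'a' and 'c'
```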

11. Key Takeaways

  1. The query flow is always: embed query -> search DB -> return top-k — and you must use the same embedding model for queries as for stored documents.
  2. Top-k controls result count — use 3-5 for focused Q&A, 10-20 for broad research, 20-50+ for retrieve-then-rerank patterns.
  3. Similarity scores vary by database and model — Pinecone/Qdrant return similarity (higher = better), Chroma returns distance (lower = better). Calibrate thresholds to your specific model and data.
  4. Score thresholds prevent irrelevant results from reaching the LLM — better to say "I don't know" than to hallucinate from poor context.
  5. Performance depends on vector count, dimensions, top-k, and filters — scope queries to the right namespace/collection and avoid returning full vectors.
  6. Debug systematically — check score distributions, verify embedding model consistency, and examine what the top results actually contain.

Explain-It Challenge

  1. A teammate's RAG chatbot returns irrelevant answers for specific product questions. Walk through the debugging steps you would take.
  2. Explain to a junior developer why setting topK: 100 and passing all results to the LLM is a bad idea, even if the LLM's context window is large enough.
  3. Your vector database returns cosine similarity scores of 0.72, 0.71, 0.70, 0.69 for the top 4 results. Should you use all four as context? How do you decide?

Navigation: <- 4.12.a — Storing Embeddings | 4.12.c — Metadata Filters ->