Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases
4.12.b — Querying Similar Vectors
In one sentence: Querying a vector database follows a consistent flow — embed the query, search the database for the top-k nearest neighbors, interpret similarity scores, and optionally combine with metadata filters — with tuning knobs for relevance thresholds, performance, and pagination that determine whether your RAG pipeline returns good results or garbage.
Navigation: <- 4.12.a — Storing Embeddings | 4.12.c — Metadata Filters ->
1. The Query Flow
Every vector database query follows the same fundamental pattern, regardless of which database you use:
┌──────────────────────────────────────────────────────────────────────┐
│ VECTOR SEARCH QUERY FLOW │
│ │
│ Step 1: User asks a question │
│ "How do I reset my password?" │
│ │ │
│ ▼ │
│ Step 2: Embed the query using the SAME model used for storage │
│ text-embedding-3-small("How do I reset my password?") │
│ → [0.023, -0.041, 0.087, ..., 0.015] (1536 dimensions) │
│ │ │
│ ▼ │
│ Step 3: Send query vector to the vector database │
│ "Find the top 5 vectors most similar to this query vector" │
│ │ │
│ ▼ │
│ Step 4: Vector DB uses ANN index (HNSW/IVF) to find neighbors │
│ Searches millions of vectors in milliseconds │
│ │ │
│ ▼ │
│ Step 5: Return top-k results with scores and metadata │
│ [ │
│ { id: "doc_42", score: 0.94, text: "To reset your password..." },│
│ { id: "doc_87", score: 0.89, text: "Password recovery steps..." },│
│ { id: "doc_15", score: 0.82, text: "Account security FAQ..." }, │
│ ] │
│ │ │
│ ▼ │
│ Step 6: Pass retrieved text as context to the LLM │
│ "Based on these documents, answer the user's question..." │
└──────────────────────────────────────────────────────────────────────┘
Critical rule: same embedding model
The query must be embedded with the exact same model that was used to embed the stored documents. Mixing models produces meaningless results because different models produce vectors in different "spaces."
✅ Stored with text-embedding-3-small → Query with text-embedding-3-small
❌ Stored with text-embedding-3-small → Query with text-embedding-3-large
❌ Stored with text-embedding-3-small → Query with Cohere embed-v4.0
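One lightweight safeguard is to record which embedding model an index was built with (e.g. in your app config or index metadata) and check it before every query. The sketch below assumes a small lookup table of known output dimensions for the two OpenAI models used in this episode; the function names are illustrative, not part of any SDK:

```javascript
// Known output dimensions for the embedding models used in this episode.
// (Each model's dimension is fixed unless you explicitly request a reduced size.)
const MODEL_DIMENSIONS = {
  'text-embedding-3-small': 1536,
  'text-embedding-3-large': 3072,
};

// Guard: refuse to query when the query embedding doesn't match the
// model (and therefore the vector space) the index was built with.
function assertQueryCompatible(indexModel, queryModel, queryVector) {
  if (indexModel !== queryModel) {
    throw new Error(
      `Embedding model mismatch: index uses ${indexModel}, query uses ${queryModel}`
    );
  }
  const expected = MODEL_DIMENSIONS[indexModel];
  if (expected !== undefined && queryVector.length !== expected) {
    throw new Error(
      `Dimension mismatch: expected ${expected}, got ${queryVector.length}`
    );
  }
  return true;
}
```

A dimension check alone is not enough — two different models can share a dimension count while producing incompatible spaces — which is why the model name check comes first.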
2. Basic Query Examples
2.1 Querying Pinecone
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI();
async function queryPinecone(queryText, topK = 5) {
// Step 1: Embed the query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Step 2: Search the vector database
const index = pinecone.index('knowledge-base');
const results = await index.query({
vector: queryVector,
topK: topK,
includeMetadata: true, // Return stored metadata with results
includeValues: false, // Don't return the full vector (saves bandwidth)
});
// Step 3: Process results
return results.matches.map((match) => ({
id: match.id,
score: match.score, // Similarity score (0 to 1 for cosine)
text: match.metadata.text,
category: match.metadata.category,
source: match.metadata.source,
}));
}
// ─── Usage ───
const results = await queryPinecone('How do I reset my password?');
console.log('Search results:');
results.forEach((result, i) => {
console.log(`${i + 1}. [Score: ${result.score.toFixed(3)}] ${result.text}`);
});
// Output:
// 1. [Score: 0.943] To reset your password, go to Settings > Security > Change Password.
// 2. [Score: 0.891] Password recovery: click "Forgot Password" on the login page.
// 3. [Score: 0.823] Two-factor authentication adds an extra layer of security.
// 4. [Score: 0.756] Account settings allow you to update your profile information.
// 5. [Score: 0.701] Contact support if you are locked out of your account.
2.2 Querying Chroma
import { ChromaClient } from 'chromadb';
import OpenAI from 'openai';
const chroma = new ChromaClient();
const openai = new OpenAI();
async function queryChroma(queryText, nResults = 5) {
const collection = await chroma.getCollection({ name: 'knowledge-base' });
// Generate query embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Query the collection
const results = await collection.query({
queryEmbeddings: [queryVector],
nResults: nResults,
include: ['documents', 'metadatas', 'distances'], // What to return
});
// Chroma returns arrays of arrays (supports multiple queries at once)
return results.ids[0].map((id, i) => ({
id: id,
distance: results.distances[0][i], // Distance (lower = more similar for cosine)
text: results.documents[0][i],
metadata: results.metadatas[0][i],
}));
}
// ─── Usage ───
const results = await queryChroma('How do I reset my password?');
results.forEach((result, i) => {
console.log(`${i + 1}. [Distance: ${result.distance.toFixed(4)}] ${result.text}`);
});
// Note: Chroma returns DISTANCE, not similarity.
// For cosine: distance = 1 - similarity
// So distance 0.057 = similarity 0.943
2.3 Querying Qdrant
import { QdrantClient } from '@qdrant/js-client-rest';
import OpenAI from 'openai';
const qdrant = new QdrantClient({ url: 'http://localhost:6333' });
const openai = new OpenAI();
async function queryQdrant(queryText, limit = 5) {
// Generate query embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Search
const results = await qdrant.search('knowledge-base', {
vector: queryVector,
limit: limit,
with_payload: true, // Return payload (metadata)
with_vectors: false, // Don't return vectors (saves bandwidth)
});
return results.map((point) => ({
id: point.id,
score: point.score,
text: point.payload.text,
category: point.payload.category,
}));
}
3. Understanding the Top-K Parameter
The top-k (or topK, nResults, limit depending on the database) parameter controls how many results the vector database returns. This is one of the most important tuning knobs in your RAG pipeline.
How top-k affects your system
top-k = 1:
Only the single most similar result
✅ Most relevant, least noise
❌ Misses important context if the question spans multiple documents
top-k = 3-5:
A handful of highly relevant results
✅ Good balance for most RAG use cases
✅ Keeps LLM context manageable
❌ May miss edge cases in large knowledge bases
top-k = 10-20:
Broader retrieval
✅ More comprehensive context
✅ Better for complex questions that need multiple sources
❌ Includes less relevant results (noise)
❌ Uses more LLM tokens (cost + latency)
top-k = 50-100:
Very broad retrieval
✅ Useful for re-ranking (retrieve many, then re-rank to top 5)
❌ Too much context for direct LLM consumption
❌ Increases latency and cost significantly
Choosing the right top-k
| Use Case | Recommended top-k | Why |
|---|---|---|
| FAQ chatbot | 3-5 | Questions usually map to 1-2 documents |
| Research assistant | 10-20 | Complex questions need multiple sources |
| Code search | 5-10 | Multiple relevant code snippets |
| Product recommendation | 10-20 | Users want variety |
| Retrieve then re-rank | 20-50 | Broad retrieval, then ML re-ranker picks top 5 |
| Simple classification | 1-3 | Just need the closest match |
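The retrieve-then-re-rank row deserves a sketch of its shape: over-fetch candidates (say topK = 30), re-score each one against the query, and keep only the best few for the LLM. Production systems use a cross-encoder model or a reranking API for the re-scoring step; the toy scorer below just counts query-term overlap to keep the example self-contained:

```javascript
// Retrieve-then-rerank sketch: `candidates` are results from a broad
// vector search; re-score them and keep only the top `keep`.
// NOTE: the term-overlap scorer is a stand-in — real pipelines use a
// cross-encoder or reranking service here.
function rerank(query, candidates, keep = 5) {
  const terms = new Set(query.toLowerCase().split(/\s+/));
  const scored = candidates.map((c) => {
    const overlap = c.text
      .toLowerCase()
      .split(/\s+/)
      .filter((word) => terms.has(word)).length;
    return { ...c, rerankScore: overlap };
  });
  return scored
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, keep);
}
```

The point of the pattern: the ANN search is cheap but approximate, so you let it cast a wide net, then spend more compute ranking only the 20-50 survivors.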
Dynamic top-k
In production, you might adjust top-k based on the query:
function determineTopK(query) {
// Short, specific questions need fewer results
if (query.split(' ').length < 5) return 3;
// Questions with "compare", "list", "all" need more results
const broadTerms = ['compare', 'list', 'all', 'different', 'options', 'alternatives'];
if (broadTerms.some((term) => query.toLowerCase().includes(term))) return 15;
// Default
return 5;
}
4. Similarity Scores: What They Mean
Different vector databases return scores differently. Understanding what the numbers mean is essential for setting thresholds and debugging.
4.1 Cosine similarity
Range: -1 to 1 (for normalized vectors: 0 to 1)
1.0 = identical vectors (perfect match)
0.0 = completely unrelated (orthogonal)
-1.0 = opposite meaning (rare in practice with embeddings)
In practice with text embeddings:
0.90 - 1.00 → Very high similarity (likely same topic, near paraphrase)
0.80 - 0.90 → High similarity (related content, same domain)
0.70 - 0.80 → Moderate similarity (loosely related)
0.60 - 0.70 → Low similarity (tangentially related)
< 0.60 → Probably not relevant
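If it helps to make those bands concrete in code, here is a minimal bucketing helper — the cutoffs mirror the list above and should be treated as starting points to calibrate, not ground truth:

```javascript
// Rough relevance buckets for cosine similarity from typical text
// embeddings. The cutoffs are illustrative — calibrate to your model.
function relevanceBand(score) {
  if (score >= 0.90) return 'very high';
  if (score >= 0.80) return 'high';
  if (score >= 0.70) return 'moderate';
  if (score >= 0.60) return 'low';
  return 'probably irrelevant';
}
```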
4.2 How different databases report scores
| Database | Score Name | Metric | Interpretation |
|---|---|---|---|
| Pinecone | score | Cosine similarity | 0-1, higher = more similar |
| Chroma | distance | Cosine distance | 0-2, lower = more similar |
| Qdrant | score | Cosine similarity | 0-1, higher = more similar |
| pgvector | <=> operator | Cosine distance | 0-2, lower = more similar |
Converting between similarity and distance:
// Cosine distance = 1 - cosine similarity
const similarity = 0.943;
const distance = 1 - similarity; // 0.057
// Chroma returns distance, convert to similarity:
const chromaDistance = 0.057;
const chromaSimilarity = 1 - chromaDistance; // 0.943
4.3 Score interpretation varies by embedding model
Different embedding models produce different score distributions. A score of 0.85 from text-embedding-3-small does NOT mean the same thing as 0.85 from text-embedding-3-large or from Cohere's model.
text-embedding-3-small typical ranges:
Same sentence: 0.95 - 1.00
Paraphrase: 0.85 - 0.95
Same topic: 0.70 - 0.85
Related: 0.55 - 0.70
Unrelated: 0.30 - 0.55
text-embedding-3-large typical ranges:
Same sentence: 0.97 - 1.00
Paraphrase: 0.88 - 0.97
Same topic: 0.72 - 0.88
Related: 0.58 - 0.72
Unrelated: 0.30 - 0.58
IMPORTANT: Always calibrate thresholds to YOUR embedding model and YOUR data.
Run experiments on your actual dataset to determine what scores indicate
"relevant" vs "irrelevant" for your specific use case.
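One way to run that calibration experiment: embed a handful of (query, document) pairs you have hand-labeled as relevant or irrelevant, then compare the score distributions each label produces — your threshold should sit in the gap between them. The sketch below takes pre-computed embeddings as input so it works with any model; the pair format is an assumption of this example:

```javascript
// Plain cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// pairs: [{ queryVec, docVec, label: 'relevant' | 'irrelevant' }]
// Returns min/max/mean similarity per label — pick a threshold that
// separates the 'relevant' range from the 'irrelevant' range.
function scoreStats(pairs) {
  const byLabel = {};
  for (const { queryVec, docVec, label } of pairs) {
    (byLabel[label] ??= []).push(cosineSimilarity(queryVec, docVec));
  }
  const summary = {};
  for (const [label, scores] of Object.entries(byLabel)) {
    summary[label] = {
      min: Math.min(...scores),
      max: Math.max(...scores),
      mean: scores.reduce((a, b) => a + b, 0) / scores.length,
    };
  }
  return summary;
}
```

If the two distributions overlap heavily, no threshold will separate them cleanly — that is a signal to improve chunking or switch embedding models, not to keep tuning the cutoff.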
5. Similarity Thresholds: Filtering by Relevance
Just because the vector database returns top-k results doesn't mean all results are relevant. A similarity threshold filters out low-quality results.
Implementing a score threshold
async function queryWithThreshold(queryText, options = {}) {
const {
topK = 10,
scoreThreshold = 0.75, // Minimum similarity score to include
} = options;
// Query with a higher topK than needed (retrieve more, filter later)
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
const index = pinecone.index('knowledge-base');
const results = await index.query({
vector: queryVector,
topK: topK,
includeMetadata: true,
});
// Filter by score threshold
const relevantResults = results.matches.filter(
(match) => match.score >= scoreThreshold
);
if (relevantResults.length === 0) {
return {
results: [],
message: 'No sufficiently relevant documents found.',
bestScore: results.matches[0]?.score || 0,
};
}
return {
results: relevantResults.map((match) => ({
id: match.id,
score: match.score,
text: match.metadata.text,
})),
message: `Found ${relevantResults.length} relevant documents.`,
};
}
// ─── Usage ───
// Good query — matches exist
const goodResult = await queryWithThreshold('How do I reset my password?');
// { results: [{ score: 0.94, text: "..." }, { score: 0.89, text: "..." }], ... }
// Bad query — nothing relevant in the database
const badResult = await queryWithThreshold('What is the meaning of life?');
// { results: [], message: "No sufficiently relevant documents found.", bestScore: 0.41 }
Adaptive thresholds
Different types of queries may need different thresholds:
function getThresholdForQuery(query) {
// Factual/specific questions need high threshold
// (wrong answers are worse than no answers)
const factualPatterns = /^(how|what|when|where|why|who)\s/i;
if (factualPatterns.test(query)) return 0.80;
// Exploratory queries can use lower threshold
const exploratoryPatterns = /\b(related|similar|like|about|explore)\b/i;
if (exploratoryPatterns.test(query)) return 0.65;
// Default threshold
return 0.75;
}
When to use thresholds vs not
| Scenario | Use Threshold? | Why |
|---|---|---|
| RAG Q&A | Yes (0.75-0.85) | Better to say "I don't know" than give irrelevant context |
| Product search | Maybe (0.60-0.70) | Users expect results, even loosely related |
| Duplicate detection | Yes (0.90+) | Only want near-exact matches |
| Recommendation | No | Always want to show something |
| Classification | No | Always pick the closest class |
6. Combining Vector Search with Metadata Filtering
One of the most powerful features of vector databases is combining semantic similarity with structured filters. This is covered in depth in 4.12.c, but here is the core concept.
// ─── Pinecone: Vector search + metadata filter ───
const results = await index.query({
vector: queryVector,
topK: 5,
includeMetadata: true,
filter: {
category: { $eq: 'billing' }, // Only search billing documents
date: { $gte: '2026-01-01' }, // Only recent documents
},
});
// ─── Chroma: Vector search + metadata filter ───
const results = await collection.query({
queryEmbeddings: [queryVector],
nResults: 5,
where: {
category: 'billing', // Only search billing documents
},
});
// ─── Qdrant: Vector search + payload filter ───
const results = await qdrant.search('knowledge-base', {
vector: queryVector,
limit: 5,
filter: {
must: [
{ key: 'category', match: { value: 'billing' } },
{ key: 'date', range: { gte: '2026-01-01' } },
],
},
});
The vector database applies both the semantic similarity ranking AND the metadata filter simultaneously — you get results that are both semantically relevant and match your structured constraints.
7. Performance Considerations
Vector search performance depends on several factors. Understanding these helps you design for your latency and throughput requirements.
7.1 What affects query latency
| Factor | Impact | Mitigation |
|---|---|---|
| Vector count | More vectors = slower search (sub-linear with ANN) | Partition into namespaces/collections |
| Vector dimensions | Higher dimensions = slower distance calculations | Use smaller models if quality allows |
| Top-k size | Larger k = slightly slower | Use minimum k needed |
| Metadata filters | Complex filters add overhead | Index frequently filtered fields |
| Include vectors | Returning full vectors increases payload size | Set includeValues: false |
| Network latency | Remote DB adds round-trip time | Use same region, or local DB for dev |
| Index type | HNSW vs IVF have different characteristics | HNSW for most cases |
7.2 Typical latency benchmarks
Pinecone (managed, same region):
1M vectors, top-5: ~20-50ms
10M vectors, top-5: ~30-80ms
With metadata filter: +10-30ms
Qdrant (self-hosted, local):
1M vectors, top-5: ~5-15ms
10M vectors, top-5: ~10-30ms
With payload filter: +5-20ms
Chroma (local/embedded):
100K vectors, top-5: ~5-20ms
1M vectors, top-5: ~20-80ms
Note: These are approximate. Actual performance depends on hardware,
index configuration, vector dimensions, and query complexity.
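Given how much these numbers vary, it is worth measuring latency on your own setup rather than trusting published figures. A minimal harness — `queryFn` is any async function you want to time, such as a wrapper around your `index.query` call:

```javascript
// Run `queryFn` repeatedly and report p50/p95 latency in milliseconds.
// Assumes Node 16+ (global `performance`). A real benchmark would also
// include warm-up runs and concurrent load.
async function benchmark(queryFn, runs = 20) {
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await queryFn();
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  const pct = (p) =>
    timings[Math.min(timings.length - 1, Math.floor((p / 100) * timings.length))];
  return { p50: pct(50), p95: pct(95), runs };
}
```

Prefer p95 over the average for capacity planning — a RAG pipeline's user-facing latency is dominated by its slow tail, not its typical case.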
7.3 Optimizing query performance
// ─── Optimization 1: Don't return vectors ───
// BAD: Returns full 1536-dim vectors (wastes bandwidth)
const slow = await index.query({
vector: queryVector,
topK: 5,
includeMetadata: true,
includeValues: true, // 1536 floats x 5 results = large payload
});
// GOOD: Only return metadata
const fast = await index.query({
vector: queryVector,
topK: 5,
includeMetadata: true,
includeValues: false, // Skip the vectors — you rarely need them
});
// ─── Optimization 2: Namespace/collection scoping ───
// BAD: Search all 10M vectors
const slow2 = await index.query({ vector: queryVector, topK: 5 });
// GOOD: Search only the relevant namespace (e.g., 500K vectors)
const fast2 = await index.namespace('help-articles').query({
vector: queryVector,
topK: 5,
});
// ─── Optimization 3: Know where the time goes ───
// Embedding and search are inherently sequential — the search needs the
// query vector — so the embedding call usually dominates total latency.
// Caching embeddings for repeated/common queries is the biggest win here.
async function optimizedRAGQuery(queryText) {
// Generate embedding (takes ~100-300ms — typically the bottleneck)
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Search (takes ~20-50ms) — happens after embedding is ready
const results = await index.query({
vector: queryVector,
topK: 5,
includeMetadata: true,
includeValues: false,
});
return results.matches;
}
8. Pagination: Handling Large Result Sets
Sometimes you need more results than a single query returns, or you want to let users browse through results.
8.1 Over-fetch and slice (Chroma)
Chroma's query API does not take an offset parameter, so the usual pattern is to fetch enough results to cover the requested page and slice client-side. (collection.get does support limit/offset, but it browses stored documents without similarity ranking.)
async function paginatedQuery(queryText, page = 1, pageSize = 10) {
const collection = await chroma.getCollection({ name: 'knowledge-base' });
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Fetch enough results to cover the requested page
const results = await collection.query({
queryEmbeddings: [queryVector],
nResults: page * pageSize,
include: ['documents', 'metadatas', 'distances'],
});
// Slice out the requested page client-side
const start = (page - 1) * pageSize;
const pageIds = results.ids[0].slice(start, start + pageSize);
return {
page: page,
pageSize: pageSize,
results: pageIds.map((id, i) => ({
id: id,
distance: results.distances[0][start + i],
text: results.documents[0][start + i],
metadata: results.metadatas[0][start + i],
})),
};
}
// ─── Usage ───
const page1 = await paginatedQuery('machine learning tutorials', 1, 10);
const page2 = await paginatedQuery('machine learning tutorials', 2, 10);
8.2 Cursor-based scrolling (Qdrant)
Qdrant's scroll API pages through stored points with a cursor. Note that scroll iterates over the collection's contents — the results are not ranked by similarity to a query — so it suits exports and audits rather than search pagination. (For paginated similarity search, Qdrant's search accepts an offset parameter.)
async function scrollAllPoints(batchSize = 100) {
let allResults = [];
let offset = null;
// Scroll through the collection in batches, following the cursor
while (true) {
const response = await qdrant.scroll('knowledge-base', {
limit: batchSize,
offset: offset,
with_payload: true,
with_vectors: false,
});
allResults.push(...response.points);
// No cursor returned — we've reached the end
if (!response.next_page_offset) break;
offset = response.next_page_offset;
}
return allResults;
}
8.3 Over-fetch and client-side paginate (Pinecone)
Pinecone does not natively support offset/pagination on query results. The common pattern is to fetch a larger topK and paginate client-side.
async function paginatedPineconeQuery(queryText, page = 1, pageSize = 5) {
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
// Fetch enough results to cover the requested page
const totalNeeded = page * pageSize;
const index = pinecone.index('knowledge-base');
const results = await index.query({
vector: queryVector,
topK: Math.min(totalNeeded, 100), // Cap the over-fetch; Pinecone also enforces its own topK limit
includeMetadata: true,
});
// Client-side pagination
const start = (page - 1) * pageSize;
const end = start + pageSize;
const pageResults = results.matches.slice(start, end);
return {
page: page,
pageSize: pageSize,
totalAvailable: results.matches.length,
results: pageResults.map((match) => ({
id: match.id,
score: match.score,
text: match.metadata.text,
})),
};
}
Pagination considerations
| Approach | Pros | Cons |
|---|---|---|
| Offset-based | Simple, familiar (like SQL OFFSET) | Less efficient for deep pages |
| Cursor-based | Efficient for sequential scanning | Can't jump to arbitrary page |
| Over-fetch + client-side | Works with any DB | Wastes bandwidth and compute on earlier pages |
| No pagination (just top-k) | Simplest, fastest | Limited results |
For RAG pipelines, pagination is rarely needed — you almost always just want the top 5-10 results. Pagination is more useful for search UIs where users browse through results.
9. End-to-End RAG Query Function
Here is a production-ready function that combines everything from this section into a complete RAG query pipeline:
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI();
/**
* Complete RAG query pipeline:
* 1. Embed the user's question
* 2. Search the vector database
* 3. Filter by relevance threshold
* 4. Format context for the LLM
* 5. Generate a grounded answer
*/
async function ragQuery(userQuestion, options = {}) {
const {
indexName = 'knowledge-base',
namespace = '',
topK = 5,
scoreThreshold = 0.75,
model = 'gpt-4o',
embeddingModel = 'text-embedding-3-small',
systemPrompt = 'You are a helpful assistant. Answer the user\'s question based ONLY on the provided context. If the context does not contain enough information, say "I don\'t have enough information to answer that."',
} = options;
// Step 1: Embed the query
const embeddingResponse = await openai.embeddings.create({
model: embeddingModel,
input: userQuestion,
});
const queryVector = embeddingResponse.data[0].embedding;
// Step 2: Search the vector database
const index = pinecone.index(indexName);
const ns = namespace ? index.namespace(namespace) : index;
const searchResults = await ns.query({
vector: queryVector,
topK: topK,
includeMetadata: true,
includeValues: false,
});
// Step 3: Filter by relevance threshold
const relevantDocs = searchResults.matches.filter(
(match) => match.score >= scoreThreshold
);
if (relevantDocs.length === 0) {
return {
answer: "I don't have enough information in my knowledge base to answer that question.",
sources: [],
topScore: searchResults.matches[0]?.score || 0,
};
}
// Step 4: Format context for the LLM
const context = relevantDocs
.map((doc, i) => `[Document ${i + 1}] (relevance: ${doc.score.toFixed(2)})\n${doc.metadata.text}`)
.join('\n\n');
// Step 5: Generate a grounded answer
const completion = await openai.chat.completions.create({
model: model,
temperature: 0, // Deterministic for factual answers
messages: [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: `Context:\n${context}\n\n---\n\nQuestion: ${userQuestion}`,
},
],
});
return {
answer: completion.choices[0].message.content,
sources: relevantDocs.map((doc) => ({
id: doc.id,
score: doc.score,
text: doc.metadata.text,
category: doc.metadata.category,
})),
topScore: relevantDocs[0].score,
tokensUsed: completion.usage,
};
}
// ─── Usage ───
const result = await ragQuery('How do I reset my password?');
console.log('Answer:', result.answer);
console.log('Sources:', result.sources.length);
console.log('Top relevance score:', result.topScore);
console.log('Tokens used:', result.tokensUsed);
10. Debugging Vector Search Results
When your search results look wrong, here's a systematic debugging approach:
async function debugQuery(queryText) {
// 1. Check the query embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: queryText,
});
const queryVector = embeddingResponse.data[0].embedding;
console.log('Query embedding dimensions:', queryVector.length);
console.log('First 5 values:', queryVector.slice(0, 5));
// 2. Query with a high topK to see the full score distribution
const index = pinecone.index('knowledge-base');
const results = await index.query({
vector: queryVector,
topK: 20,
includeMetadata: true,
});
// 3. Analyze score distribution
console.log('\nScore distribution:');
results.matches.forEach((match, i) => {
const bar = '█'.repeat(Math.max(0, Math.round(match.score * 50))); // clamp — scores can be negative
console.log(
` ${(i + 1).toString().padStart(2)}. [${match.score.toFixed(3)}] ${bar} ${match.metadata.text?.slice(0, 60)}...`
);
});
// 4. Check for common issues
const scores = results.matches.map((m) => m.score);
const maxScore = Math.max(...scores);
const minScore = Math.min(...scores);
const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;
console.log('\nDiagnostics:');
console.log(` Max score: ${maxScore.toFixed(3)}`);
console.log(` Min score: ${minScore.toFixed(3)}`);
console.log(` Avg score: ${avgScore.toFixed(3)}`);
console.log(` Score range: ${(maxScore - minScore).toFixed(3)}`);
if (maxScore < 0.5) {
console.log(' WARNING: All scores very low — query may not match any content.');
console.log(' Check: Is the query domain-relevant? Is the correct index being searched?');
}
if (maxScore - minScore < 0.05) {
console.log(' WARNING: Scores are clustered — results may all be equally (ir)relevant.');
console.log(' Check: Are embeddings diverse enough? Is the collection too homogeneous?');
}
return results;
}
Common query issues and fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| All scores very low (<0.5) | Query is out of domain | Check that the query matches stored content topics |
| All scores very high (>0.95) | Duplicate or near-duplicate vectors | Deduplicate your stored documents |
| Scores tightly clustered | Embeddings lack diversity | Use a better embedding model, or improve document chunking |
| Wrong documents ranked first | Poor chunking strategy | Smaller, more focused chunks; include section headings in chunk text |
| Results missing expected docs | Documents not stored, or wrong namespace | Verify upsert succeeded; check namespace/collection name |
| Inconsistent results | Different embedding models for store vs query | Ensure same model for both |
11. Key Takeaways
- The query flow is always: embed query -> search DB -> return top-k — and you must use the same embedding model for queries as for stored documents.
- Top-k controls result count — use 3-5 for focused Q&A, 10-20 for broad research, 20-50+ for retrieve-then-rerank patterns.
- Similarity scores vary by database and model — Pinecone/Qdrant return similarity (higher = better), Chroma returns distance (lower = better). Calibrate thresholds to your specific model and data.
- Score thresholds prevent irrelevant results from reaching the LLM — better to say "I don't know" than to hallucinate from poor context.
- Performance depends on vector count, dimensions, top-k, and filters — scope queries to the right namespace/collection and avoid returning full vectors.
- Debug systematically — check score distributions, verify embedding model consistency, and examine what the top results actually contain.
Explain-It Challenge
- A teammate's RAG chatbot returns irrelevant answers for specific product questions. Walk through the debugging steps you would take.
- Explain to a junior developer why setting topK: 100 and passing all results to the LLM is a bad idea, even if the LLM's context window is large enough.
- Your vector database returns cosine similarity scores of 0.72, 0.71, 0.70, 0.69 for the top 4 results. Should you use all four as context? How do you decide?
Navigation: <- 4.12.a — Storing Embeddings | 4.12.c — Metadata Filters ->