Episode 4 — Generative AI Engineering / 4.12 — Integrating Vector Databases

4.12.c — Metadata Filters

In one sentence: Metadata filters let you attach structured data (tags, categories, dates, source info) to every vector and then constrain your similarity search to only vectors matching specific criteria — turning a pure "find similar meaning" query into a precise "find similar meaning within these constraints" query that powers real-world RAG applications.


1. What Is Metadata?

When you store a vector in a database, the embedding captures the semantic meaning of the text. But meaning alone is not enough. You also need to know where the text came from, when it was written, what category it belongs to, and other structured facts. This structured information is called metadata.

┌──────────────────────────────────────────────────────────────────────┐
│  Vector Record with Metadata                                          │
│                                                                        │
│  id:      "doc_4821"                                                  │
│  vector:  [0.023, -0.041, 0.087, ...]     ← Semantic meaning          │
│  metadata: {                               ← Structured facts          │
│    "text": "Refunds are processed within 5-7 business days...",       │
│    "source": "help-center",                ← Where it came from        │
│    "category": "billing",                  ← Topic classification      │
│    "date": "2026-03-15",                   ← When it was written       │
│    "author": "support-team",               ← Who wrote it              │
│    "language": "en",                       ← What language              │
│    "version": 3,                           ← Which version              │
│    "is_published": true,                   ← Whether it's live          │
│    "tags": ["refund", "payment", "billing"]← Searchable tags           │
│  }                                                                     │
└──────────────────────────────────────────────────────────────────────┘

Why metadata matters

Without metadata filtering, every query searches your entire vector collection. This creates problems:

Scenario: Customer support chatbot with 500,000 vectors

Without metadata filtering:
  User asks: "What is your refund policy?"
  Search: Find top-5 most similar across ALL 500,000 vectors
  Problem: Results might include:
    - Internal engineering docs about "refund microservice"
    - Deprecated policy from 2023 (no longer accurate)
    - Spanish-language version of the policy
    - Draft article that was never published

With metadata filtering:
  User asks: "What is your refund policy?"
  Search: Find top-5 most similar WHERE source="help-center"
          AND language="en" AND is_published=true AND date >= "2026-01-01"
  Result: Only current, published, English help articles about refunds
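The scenario above can be reproduced on a toy in-memory store. This is a minimal sketch, not how a vector database implements filtering: the five records, their 3-dimensional vectors, and the `filteredTopK` helper are all made up for illustration, and only exact-match filters are handled.

```javascript
// Toy records; vectors and metadata values are invented for this example.
const records = [
  { id: 'a', vector: [0.9, 0.1, 0.0], metadata: { source: 'help-center', language: 'en', is_published: true } },
  { id: 'b', vector: [0.8, 0.2, 0.1], metadata: { source: 'engineering', language: 'en', is_published: true } },
  { id: 'c', vector: [0.7, 0.3, 0.0], metadata: { source: 'help-center', language: 'es', is_published: true } },
  { id: 'd', vector: [0.6, 0.4, 0.1], metadata: { source: 'help-center', language: 'en', is_published: false } },
  { id: 'e', vector: [0.1, 0.9, 0.2], metadata: { source: 'help-center', language: 'en', is_published: true } },
];

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep records whose metadata equals every key/value in `where`,
// then rank the survivors by similarity to the query vector.
function filteredTopK(queryVector, where, k) {
  return records
    .filter((r) => Object.entries(where).every(([key, value]) => r.metadata[key] === value))
    .map((r) => ({ id: r.id, score: cosineSimilarity(queryVector, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const query = [1, 0, 0];
const unfiltered = filteredTopK(query, {}, 3).map((r) => r.id);
const filtered = filteredTopK(
  query,
  { source: 'help-center', language: 'en', is_published: true },
  3
).map((r) => r.id);

console.log(unfiltered); // includes the engineering doc and the Spanish doc
console.log(filtered);   // only published English help-center records
```

Without a filter, the engineering and Spanish records rank near the top purely because their vectors are close to the query. With the filter, they are never candidates.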

2. Types of Metadata

Metadata falls into several categories, each serving different filtering purposes:

Category	Examples	Filter Use Case
Source/Origin	`source: "help-center"`, `url: "..."`, `file_name: "..."`	Restrict search to specific data sources
Classification	`category: "billing"`, `department: "engineering"`, `type: "faq"`	Filter by topic or content type
Temporal	`date: "2026-03-15"`, `created_at: "..."`, `updated_at: "..."`	Only recent documents, time-range queries
Author/Owner	`author: "jane"`, `team: "support"`, `tenant_id: "cust_123"`	Multi-tenant isolation, ownership filtering
Status	`is_published: true`, `status: "active"`, `version: 3`	Only live/active content
Language	`language: "en"`, `locale: "en-US"`	Language-specific search
Structural	`chunk_index: 2`, `section: "pricing"`, `parent_doc: "doc_100"`	Navigate document structure
Tags/Labels	`tags: ["refund", "payment"]`	Flexible multi-label filtering
Numeric	`price: 49.99`, `word_count: 350`, `relevance_score: 0.8`	Range queries on numeric values

3. Filtering During Search: The Core Concept

Metadata filtering happens during the vector search, not after. This is an important distinction:

Approach 1: Post-filtering (INEFFICIENT)
  1. Search all 1M vectors for top-100 most similar
  2. Filter the 100 results by metadata
  3. Return the filtered results
  Problem: If only 3 out of 100 results match the filter,
           you get only 3 results instead of the 10 you wanted.

Approach 2: Pre-filtering (WHAT VECTOR DBs DO)
  1. Apply metadata filter to narrow the candidate set
  2. Search only within matching vectors for top-10 most similar
  3. Return results that are BOTH similar AND match the filter
  Benefit: You get up to the number of results you asked for,
           as long as enough vectors match the filter.
  Caveat:  Filtering is not free. Very selective filters can slow
           the search down (see section 7).

Reality: Most vector DBs use a hybrid approach — they interleave
filtering and search in their index traversal for optimal performance.
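The gap between the two approaches shows up even on a toy list. A minimal sketch, assuming each item already carries a precomputed similarity `score` (a stand-in for the vector comparison) and that one item in five is a billing document; `postFilter` and `preFilter` are hypothetical helper names.

```javascript
// 20 fake candidates, already scored. Every fifth one is a billing document.
const candidates = Array.from({ length: 20 }, (_, i) => ({
  id: `doc_${i}`,
  score: 1 - i * 0.01,
  category: i % 5 === 4 ? 'billing' : 'other',
}));

const byScoreDesc = (a, b) => b.score - a.score;
const isBilling = (r) => r.category === 'billing';

// Post-filtering: take the top `fetchK` by score first, then drop non-matching items.
function postFilter(items, predicate, fetchK, wantK) {
  return [...items].sort(byScoreDesc).slice(0, fetchK).filter(predicate).slice(0, wantK);
}

// Pre-filtering: drop non-matching items first, then take the top `wantK` by score.
function preFilter(items, predicate, wantK) {
  return items.filter(predicate).sort(byScoreDesc).slice(0, wantK);
}

const post = postFilter(candidates, isBilling, 10, 3);
const pre = preFilter(candidates, isBilling, 3);

console.log(post.map((r) => r.id)); // fewer than the 3 requested
console.log(pre.map((r) => r.id));  // all 3 requested
```

Post-filtering comes up short because only two of the ten fetched candidates are billing documents. Fetching more candidates narrows the gap but never guarantees it closes.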

4. Filter Syntax Across Different Vector Databases

Each vector database has its own filter syntax. Here is a comprehensive comparison.

4.1 Pinecone Filters

Pinecone uses a MongoDB-like query syntax:

// ─── Pinecone Filter Operators ───

const index = pinecone.index('knowledge-base');

// Equality
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    category: { $eq: 'billing' },        // category equals "billing"
  },
});

// Not equal
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    status: { $ne: 'draft' },            // status is NOT "draft"
  },
});

// Numeric comparisons
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    version: { $gt: 2 },                 // version greater than 2
    price: { $lte: 100.00 },             // price less than or equal to 100
  },
});

// In / Not In (match against a list)
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    category: { $in: ['billing', 'account', 'payment'] },  // category is one of these
    source: { $nin: ['deprecated', 'internal'] },           // source is NOT one of these
  },
});

// Combining conditions (AND — all conditions must match)
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    category: { $eq: 'billing' },
    language: { $eq: 'en' },
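    date_ts: { $gte: 1767225600 },      // 2026-01-01 UTC as Unix seconds; Pinecone range operators compare numbers only (date_ts is an assumed numeric field)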
    is_published: { $eq: true },
  },
});

// Explicit AND / OR with $and / $or
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    $or: [
      { category: { $eq: 'billing' } },
      { category: { $eq: 'payment' } },
    ],
  },
});

// Complex nested filter
await index.query({
  vector: queryVector,
  topK: 5,
  filter: {
    $and: [
      {
        $or: [
          { category: { $eq: 'billing' } },
          { category: { $eq: 'account' } },
        ],
      },
      { date_ts: { $gte: 1767225600 } },  // 2026-01-01 UTC as Unix seconds; range operators compare numbers only
      { is_published: { $eq: true } },
    ],
  },
});

Pinecone filter operators reference

Operator	Description	Example
`$eq`	Equals	`{ category: { $eq: "billing" } }`
`$ne`	Not equals	`{ status: { $ne: "draft" } }`
`$gt`	Greater than	`{ version: { $gt: 2 } }`
`$gte`	Greater than or equal	`{ version: { $gte: 2 } }`
`$lt`	Less than	`{ price: { $lt: 50 } }`
`$lte`	Less than or equal	`{ score: { $lte: 0.5 } }`
`$in`	In list	`{ category: { $in: ["a", "b"] } }`
`$nin`	Not in list	`{ source: { $nin: ["x", "y"] } }`
`$exists`	Field exists	`{ author: { $exists: true } }`
`$and`	All conditions must match	`{ $and: [...conditions] }`
`$or`	Any condition must match	`{ $or: [...conditions] }`
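To make the operator semantics concrete, here is a small in-memory evaluator for this MongoDB-like syntax. It illustrates what the operators mean and is not Pinecone's implementation. Treating a missing field as "not equal" for `$ne` and `$nin` is an assumption of this sketch, and `matchesFilter` is a hypothetical helper name.

```javascript
// One comparison function per operator from the table above.
const comparators = {
  $eq: (actual, expected) => actual === expected,
  $ne: (actual, expected) => actual !== expected,
  $gt: (actual, expected) => typeof actual === 'number' && actual > expected,
  $gte: (actual, expected) => typeof actual === 'number' && actual >= expected,
  $lt: (actual, expected) => typeof actual === 'number' && actual < expected,
  $lte: (actual, expected) => typeof actual === 'number' && actual <= expected,
  $in: (actual, list) => list.includes(actual),
  $nin: (actual, list) => !list.includes(actual),
  $exists: (actual, flag) => (actual !== undefined) === flag,
};

// Returns true when `metadata` satisfies `filter`.
// Top-level keys are combined with AND; $and / $or take arrays of sub-filters.
function matchesFilter(metadata, filter) {
  return Object.entries(filter).every(([key, condition]) => {
    if (key === '$and') return condition.every((sub) => matchesFilter(metadata, sub));
    if (key === '$or') return condition.some((sub) => matchesFilter(metadata, sub));
    // Shorthand { key: value } is treated as equality.
    if (condition === null || typeof condition !== 'object') return metadata[key] === condition;
    return Object.entries(condition).every(([op, expected]) => {
      const compare = comparators[op];
      if (!compare) throw new Error(`Unsupported operator: ${op}`);
      return compare(metadata[key], expected);
    });
  });
}

const nestedFilter = {
  $and: [
    { $or: [{ category: { $eq: 'billing' } }, { category: { $eq: 'account' } }] },
    { version: { $gte: 2 } },
    { is_published: { $eq: true } },
  ],
};

const billingDoc = { category: 'billing', version: 3, is_published: true };
const shippingDoc = { category: 'shipping', version: 3, is_published: true };

console.log(matchesFilter(billingDoc, nestedFilter));  // true
console.log(matchesFilter(shippingDoc, nestedFilter)); // false
```

An evaluator like this is also handy in unit tests: you can check that a filter built by your application selects the records you expect before sending it to the database.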

4.2 Chroma Filters

Chroma uses a where clause with its own syntax:

const collection = await chroma.getCollection({ name: 'knowledge-base' });

// Basic equality
const results = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: {
    category: 'billing',                  // Shorthand for equals
  },
});

// Comparison operators
const results2 = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: {
    version: { $gt: 2 },                 // Greater than
  },
});

// Combining with $and / $or
const results3 = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: {
    $and: [
      { category: 'billing' },
      { language: 'en' },
      { version: { $gte: 2 } },
    ],
  },
});

// OR conditions
const results4 = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: {
    $or: [
      { category: 'billing' },
      { category: 'payment' },
    ],
  },
});

// Chroma also supports filtering on the document text itself
const results5 = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  whereDocument: {
    $contains: 'refund',                  // Document text contains "refund"
  },
});

// Combine metadata filter AND document filter
const results6 = await collection.query({
  queryEmbeddings: [queryVector],
  nResults: 5,
  where: { category: 'billing' },
  whereDocument: { $contains: 'refund' },
});

Chroma filter operators reference

Operator	Description	Example
`$eq`	Equals (or shorthand: `{ key: value }`)	`{ category: "billing" }`
`$ne`	Not equals	`{ status: { $ne: "draft" } }`
`$gt`	Greater than	`{ version: { $gt: 2 } }`
`$gte`	Greater than or equal	`{ version: { $gte: 2 } }`
`$lt`	Less than	`{ price: { $lt: 50 } }`
`$lte`	Less than or equal	`{ price: { $lte: 50 } }`
`$in`	In list	`{ category: { $in: ["a", "b"] } }`
`$nin`	Not in list	`{ source: { $nin: ["x", "y"] } }`
`$and`	All conditions match	`{ $and: [...] }`
`$or`	Any condition matches	`{ $or: [...] }`
`$contains`	Document text contains (whereDocument)	`{ $contains: "refund" }`
`$not_contains`	Document text doesn't contain	`{ $not_contains: "draft" }`

4.3 Qdrant Filters

Qdrant uses a structured filter format with must, should, and must_not clauses:

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Basic filter
const results = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must: [
      { key: 'category', match: { value: 'billing' } },
    ],
  },
});

// Multiple conditions (AND = must)
const results2 = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must: [
      { key: 'category', match: { value: 'billing' } },
      { key: 'language', match: { value: 'en' } },
      { key: 'is_published', match: { value: true } },
    ],
  },
});

// OR conditions (should — at least one must match)
const results3 = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    should: [
      { key: 'category', match: { value: 'billing' } },
      { key: 'category', match: { value: 'payment' } },
    ],
  },
});

// NOT conditions (must_not — none can match)
const results4 = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must_not: [
      { key: 'status', match: { value: 'draft' } },
      { key: 'source', match: { value: 'deprecated' } },
    ],
  },
});

// Range filter
const results5 = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must: [
      {
        key: 'version',
        range: { gte: 2, lt: 5 },        // version >= 2 AND version < 5
      },
    ],
  },
});

// Complex combined filter
const results6 = await qdrant.search('knowledge-base', {
  vector: queryVector,
  limit: 5,
  filter: {
    must: [
      { key: 'is_published', match: { value: true } },
      { key: 'date', range: { gte: '2026-01-01' } },
    ],
    should: [
      { key: 'category', match: { value: 'billing' } },
      { key: 'category', match: { value: 'account' } },
    ],
    must_not: [
      { key: 'source', match: { value: 'internal' } },
    ],
  },
});

Qdrant filter clauses reference

Clause	Description	Behavior
`must`	All conditions must match	AND logic
`should`	At least one condition must match	OR logic
`must_not`	No condition can match	NOT logic
`match`	Exact value match	`{ key: "field", match: { value: "x" } }`
`range`	Numeric or datetime range	`{ key: "field", range: { gte: 1, lt: 10 } }`
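If you move between databases, it can help to see how the MongoDB-like operators map onto Qdrant's clauses. The sketch below converts a flat filter (top-level AND only, no `$and` / `$or`) into `must` / `must_not` form. `toQdrantFilter` is a hypothetical helper, not part of any client library, and it covers only the operators shown.

```javascript
// Translate a flat Mongo-style filter into Qdrant's must / must_not structure.
function toQdrantFilter(filter) {
  const must = [];
  const mustNot = [];

  for (const [key, condition] of Object.entries(filter)) {
    // Shorthand { key: value } is treated as equality.
    if (condition === null || typeof condition !== 'object') {
      must.push({ key, match: { value: condition } });
      continue;
    }

    const range = {};
    for (const [op, value] of Object.entries(condition)) {
      switch (op) {
        case '$eq': must.push({ key, match: { value } }); break;
        case '$ne': mustNot.push({ key, match: { value } }); break;
        case '$in': must.push({ key, match: { any: value } }); break;
        case '$nin': mustNot.push({ key, match: { any: value } }); break;
        case '$gt':
        case '$gte':
        case '$lt':
        case '$lte':
          range[op.slice(1)] = value; // "$gte" becomes "gte"
          break;
        default:
          throw new Error(`Unsupported operator: ${op}`);
      }
    }
    // All range bounds on one field collapse into a single range condition.
    if (Object.keys(range).length > 0) must.push({ key, range });
  }

  const result = {};
  if (must.length > 0) result.must = must;
  if (mustNot.length > 0) result.must_not = mustNot;
  return result;
}

const qdrantFilter = toQdrantFilter({
  category: { $in: ['billing', 'account'] },
  status: { $ne: 'draft' },
  version: { $gte: 2, $lt: 5 },
});

console.log(JSON.stringify(qdrantFilter, null, 2));
```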

4.4 pgvector Filters (SQL)

pgvector uses standard SQL WHERE clauses — the most familiar syntax if you know SQL:

-- Basic vector search with metadata filter
SELECT id, text, category,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE category = 'billing'
  AND language = 'en'
  AND is_published = true
  AND date >= '2026-01-01'
ORDER BY embedding <=> query_embedding
LIMIT 5;

-- OR conditions
SELECT id, text, category,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE (category = 'billing' OR category = 'payment')
  AND is_published = true
ORDER BY embedding <=> query_embedding
LIMIT 5;

-- Range + LIKE + IN
SELECT id, text,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE version >= 2
  AND source IN ('help-center', 'docs', 'faq')
  AND text LIKE '%refund%'
ORDER BY embedding <=> query_embedding
LIMIT 5;
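In application code, `query_embedding` and the filter values would normally be passed as bound parameters rather than spliced into the SQL string. A minimal sketch of a query builder, assuming the `documents` table from the examples above and a hypothetical allow-list of filterable columns. The returned `{ text, values }` shape is intended to suit a parameterized query call such as the one in node-postgres, but nothing here touches a database.

```javascript
// Build a parameterized pgvector query with equality filters.
// Column names come from an allow-list, because identifiers cannot be bound as parameters.
function buildPgvectorQuery(queryEmbedding, filters = {}, limit = 5) {
  const allowedColumns = new Set(['category', 'language', 'is_published', 'source']);

  // $1 is the query vector, sent in pgvector's text form "[0.1,0.2,...]".
  const values = [`[${queryEmbedding.join(',')}]`];
  const clauses = [];

  for (const [column, value] of Object.entries(filters)) {
    if (!allowedColumns.has(column)) throw new Error(`Column not allowed: ${column}`);
    values.push(value);
    clauses.push(`${column} = $${values.length}`);
  }

  values.push(limit);
  const where = clauses.length > 0 ? `WHERE ${clauses.join(' AND ')}` : '';

  const text = [
    'SELECT id, text, 1 - (embedding <=> $1::vector) AS similarity',
    'FROM documents',
    where,
    'ORDER BY embedding <=> $1::vector',
    `LIMIT $${values.length}`,
  ].filter(Boolean).join('\n');

  return { text, values };
}

const pgQuery = buildPgvectorQuery([0.1, 0.2], { category: 'billing', is_published: true });
console.log(pgQuery.text);
console.log(pgQuery.values);
```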

4.5 Filter syntax comparison table

Operation	Pinecone	Chroma	Qdrant	pgvector (SQL)
Equals	`{ $eq: "x" }`	`{ $eq: "x" }` or `"x"`	`match: { value: "x" }`	`= 'x'`
Not equals	`{ $ne: "x" }`	`{ $ne: "x" }`	`must_not` + `match`	`!= 'x'`
Greater than	`{ $gt: 5 }`	`{ $gt: 5 }`	`range: { gt: 5 }`	`> 5`
In list	`{ $in: [...] }`	`{ $in: [...] }`	`match: { any: [...] }`	`IN (...)`
AND	`$and` or top-level	`$and`	`must: [...]`	`AND`
OR	`$or`	`$or`	`should: [...]`	`OR`
NOT	`$ne` / `$nin`	`$ne` / `$nin`	`must_not: [...]`	`NOT` / `!=`

5. Real-World Use Cases for Metadata Filtering

5.1 Multi-tenant SaaS application

Each customer's data is isolated by tenant_id:

async function queryForTenant(tenantId, queryText) {
  const queryVector = await generateEmbedding(queryText);

  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
    filter: {
      tenant_id: { $eq: tenantId },       // CRITICAL: Tenant isolation
    },
  });

  return results.matches;
}

// Customer A can only search their own data
const customerAResults = await queryForTenant('tenant_abc', 'refund policy');

// Customer B's data is completely invisible to Customer A
const customerBResults = await queryForTenant('tenant_xyz', 'refund policy');
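Because a missing tenant filter leaks data, one option is to make the tenant condition impossible to forget. The sketch below wraps any caller-supplied filter so the tenant condition is always ANDed in. `withTenant` is a hypothetical helper, and the filter shape follows the Pinecone-style syntax used above.

```javascript
// Always combine the caller's filter with the tenant condition.
// Using $and means a caller-supplied tenant_id condition cannot replace ours;
// both must hold for a record to match.
function withTenant(tenantId, userFilter = {}) {
  if (typeof tenantId !== 'string' || tenantId.length === 0) {
    throw new Error('tenantId is required for every query');
  }
  const tenantCondition = { tenant_id: { $eq: tenantId } };
  if (Object.keys(userFilter).length === 0) return tenantCondition;
  return { $and: [tenantCondition, userFilter] };
}

const tenantOnly = withTenant('tenant_abc');
const tenantAndCategory = withTenant('tenant_abc', { category: { $eq: 'billing' } });

console.log(JSON.stringify(tenantOnly));
console.log(JSON.stringify(tenantAndCategory));
```

Routing every query through one function like this keeps the isolation rule in a single place, which is easier to review and test than checking each call site.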

5.2 Time-sensitive knowledge base

Only search recent, current documents:

async function queryRecentDocs(queryText, daysBack = 30) {
  const queryVector = await generateEmbedding(queryText);

  // Calculate the cutoff as Unix seconds. Pinecone's range operators ($gt, $gte, $lt, $lte)
  // compare numbers only, so this assumes each record stores a numeric date_ts field.
  const cutoffTs = Math.floor(Date.now() / 1000) - daysBack * 24 * 60 * 60;

  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
    filter: {
      date_ts: { $gte: cutoffTs },        // Only docs from the last `daysBack` days
      is_published: { $eq: true },         // Only published
    },
  });
  });

  return results.matches;
}

// Search only recent docs
const recentResults = await queryRecentDocs('shipping policy update', 30);
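Some stores accept only numbers in range operators (Pinecone's `$gt` / `$gte` / `$lt` / `$lte`, for example), so a date cutoff often has to be a numeric timestamp rather than a string. A minimal sketch of the conversion, with hypothetical helper names. It assumes dates are stored as Unix seconds and that date-only ISO strings mean midnight UTC, which is how `Date.parse` treats them.

```javascript
// Convert an ISO 8601 date string such as "2026-01-01" to Unix seconds.
function toUnixSeconds(isoDate) {
  const ms = Date.parse(isoDate);
  if (Number.isNaN(ms)) throw new Error(`Invalid date: ${isoDate}`);
  return Math.floor(ms / 1000);
}

// Cutoff for "the last N days", relative to `now` (milliseconds since the epoch).
function daysAgoUnixSeconds(daysBack, now = Date.now()) {
  return Math.floor(now / 1000) - daysBack * 24 * 60 * 60;
}

const jan1 = toUnixSeconds('2026-01-01');
console.log(jan1); // 1767225600

// A fixed `now` keeps the example reproducible.
const fixedNow = Date.parse('2026-01-31T00:00:00Z');
console.log(daysAgoUnixSeconds(30, fixedNow) === jan1); // true
```

Store the same number on each record at upsert time (for example in a `date_ts` field) and keep the ISO string alongside it for display.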

5.3 Category-scoped search

Let users search within a specific section of your knowledge base:

async function categorySearch(queryText, category, options = {}) {
  const { language = 'en', includeArchived = false } = options;
  const queryVector = await generateEmbedding(queryText);

  const filter = {
    category: { $eq: category },
    language: { $eq: language },
  };

  if (!includeArchived) {
    filter.status = { $ne: 'archived' };
  }

  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
    filter: filter,
  });

  return results.matches;
}

// Search only billing articles, in English, excluding archived
const billingResults = await categorySearch('payment methods', 'billing');

// Search engineering docs, including archived
const engineeringResults = await categorySearch(
  'database migration',
  'engineering',
  { includeArchived: true }
);

5.4 Document version control

Only search the latest version of each document:

async function queryLatestVersions(queryText) {
  const queryVector = await generateEmbedding(queryText);

  const results = await index.query({
    vector: queryVector,
    topK: 10,
    includeMetadata: true,
    filter: {
      is_latest_version: { $eq: true },   // Only current versions
    },
  });

  return results.matches;
}

5.5 Permission-aware search

Only return documents the current user has access to:

async function permissionAwareSearch(queryText, userRoles) {
  const queryVector = await generateEmbedding(queryText);

  // userRoles holds the access levels this user may read, e.g. ["public", "internal"]
  // Each document stores one access_level: "public", "internal", "engineering", or "admin"

  const results = await index.query({
    vector: queryVector,
    topK: 10,
    includeMetadata: true,
    filter: {
      access_level: { $in: userRoles },   // Only docs the user can access
      is_published: { $eq: true },
    },
  });

  return results.matches;
}

// Regular employee sees public + internal docs
const employeeResults = await permissionAwareSearch(
  'company policies',
  ['public', 'internal']
);

// Admin sees everything
const adminResults = await permissionAwareSearch(
  'company policies',
  ['public', 'internal', 'engineering', 'admin']
);
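In practice a user record usually carries roles rather than a ready-made list of access levels, so something has to translate one into the other before the filter is built. A minimal sketch of that step. The `ROLE_ACCESS` mapping is entirely hypothetical and would come from your own authorization model.

```javascript
// Hypothetical mapping from a user's role to the document access levels it may read.
const ROLE_ACCESS = {
  employee: ['public', 'internal'],
  engineering: ['public', 'internal', 'engineering'],
  admin: ['public', 'internal', 'engineering', 'admin'],
};

// Union of access levels across all of a user's roles.
// Unknown roles grant nothing beyond "public".
function accessLevelsForRoles(roles) {
  const levels = new Set(['public']);
  for (const role of roles) {
    for (const level of ROLE_ACCESS[role] || []) {
      levels.add(level);
    }
  }
  return [...levels];
}

console.log(accessLevelsForRoles(['employee']));
console.log(accessLevelsForRoles(['employee', 'engineering']));
console.log(accessLevelsForRoles(['contractor'])); // unknown role
```

The returned array can be passed straight to an `$in` condition on `access_level`. Resolve it on the server from the authenticated session, never from a value the client sends.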

5.6 Chunked document reassembly

When a document is split into chunks, retrieve the surrounding chunks for context:

async function queryWithSurroundingChunks(queryText) {
  const queryVector = await generateEmbedding(queryText);

  // First, find the most relevant chunk
  const results = await index.query({
    vector: queryVector,
    topK: 1,
    includeMetadata: true,
  });

  if (results.matches.length === 0) return null;

  const bestMatch = results.matches[0];
  const parentDocId = bestMatch.metadata.parent_doc_id;
  const chunkIndex = bestMatch.metadata.chunk_index;

  // Then, fetch the neighboring chunks with a second query filtered by metadata.
  // The query vector is reused, so matches are ordered by similarity, not by chunk_index.
  const surroundingChunks = await index.query({
    vector: queryVector,             // Still use the query vector for relevance
    topK: 5,
    includeMetadata: true,
    filter: {
      parent_doc_id: { $eq: parentDocId },
      chunk_index: {
        $gte: chunkIndex - 1,        // One chunk before
        $lte: chunkIndex + 1,        // One chunk after
      },
    },
  });

  return {
    mainChunk: bestMatch,
    context: surroundingChunks.matches,
  };
}
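Matches from a vector query come back ordered by similarity, so the neighboring chunks usually need to be put back into document order before they are handed to the model. A minimal sketch, assuming each match carries `chunk_index` and `text` in its metadata as in the examples above; `assembleContext` is a hypothetical helper.

```javascript
// Restore document order and join chunk texts into one context string.
function assembleContext(matches) {
  return [...matches] // copy so the caller's array is not reordered
    .sort((a, b) => a.metadata.chunk_index - b.metadata.chunk_index)
    .map((m) => m.metadata.text)
    .join('\n');
}

// Matches as they might arrive: best-scoring chunk first.
const sampleMatches = [
  { id: 'doc_100#3', score: 0.91, metadata: { chunk_index: 3, text: 'Refunds take 5-7 business days.' } },
  { id: 'doc_100#4', score: 0.84, metadata: { chunk_index: 4, text: 'Contact support if it takes longer.' } },
  { id: 'doc_100#2', score: 0.80, metadata: { chunk_index: 2, text: 'To request a refund, open your order.' } },
];

const context = assembleContext(sampleMatches);
console.log(context);
```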

6. Best Practices for Metadata Schema Design

How you design your metadata schema has a direct impact on query flexibility, performance, and maintainability.

6.1 Design principles

┌──────────────────────────────────────────────────────────────────────┐
│  METADATA SCHEMA DESIGN PRINCIPLES                                    │
│                                                                        │
│  1. FILTER-FIRST: Only store metadata you will FILTER on.             │
│     Don't store fields you'll never query — they waste space.         │
│                                                                        │
│  2. FLAT OVER NESTED: Most vector DBs don't support nested objects.   │
│     Use flat key-value pairs, not deeply nested structures.           │
│                                                                        │
│  3. CONSISTENT TYPES: Use the same type for the same field across     │
│     all vectors. Don't mix strings and numbers for "version".         │
│                                                                        │
│  4. NORMALIZE VALUES: "Billing", "billing", "BILLING" are different.  │
│     Normalize to lowercase before storing.                             │
│                                                                        │
│  5. DATES: Keep ISO strings ("2026-03-15") for display, and add a    │
│     numeric timestamp for range filters (Pinecone and Chroma         │
│     compare numbers only in $gt/$gte/$lt/$lte).                      │
│                                                                        │
│  6. SIZE LIMITS: Pinecone limits metadata to 40KB per vector.         │
│     Don't store full document text in metadata — store a reference.   │
└──────────────────────────────────────────────────────────────────────┘
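Several of these principles can be checked mechanically before a record is written. The sketch below flags nested values, non-normalized casing, and oversized metadata. `validateMetadata` is a hypothetical helper, and the list of lowercase-only keys and the 1 KB = 1024 bytes conversion are assumptions. Lists of strings are allowed because the record example in section 1 uses one for `tags`.

```javascript
const MAX_METADATA_BYTES = 40 * 1024; // the 40KB limit from principle 6, assuming 1 KB = 1024 bytes

// Returns a list of human-readable problems; an empty list means the metadata passed.
function validateMetadata(metadata, lowercaseKeys = ['category', 'source', 'language']) {
  const problems = [];

  for (const [key, value] of Object.entries(metadata)) {
    const isScalar = ['string', 'number', 'boolean'].includes(typeof value);
    const isStringList = Array.isArray(value) && value.every((item) => typeof item === 'string');

    if (!isScalar && !isStringList) {
      // Catches nested objects, Date objects, null, and mixed arrays.
      problems.push(`${key}: use a flat string, number, boolean, or list of strings`);
    } else if (lowercaseKeys.includes(key) && typeof value === 'string' && value !== value.toLowerCase()) {
      problems.push(`${key}: normalize to lowercase`);
    }
  }

  // Measure the UTF-8 size of the serialized metadata.
  const bytes = new TextEncoder().encode(JSON.stringify(metadata)).length;
  if (bytes > MAX_METADATA_BYTES) {
    problems.push(`metadata is ${bytes} bytes, over the ${MAX_METADATA_BYTES}-byte limit`);
  }

  return problems;
}

const schemaProblems = validateMetadata({
  category: 'Billing',          // wrong casing
  author: { name: 'Jane' },     // nested object
  created: new Date(0),         // Date object instead of a string or number
  version: 3,
});
console.log(schemaProblems);
```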

6.2 Good vs bad metadata schemas

// ─── BAD: Over-engineered, nested, inconsistent ───

const badMetadata = {
  content: {
    text: 'Full 10,000 character article text here...',  // Too large
    html: '<div>Full HTML version...</div>',              // Redundant
  },
  info: {
    created: new Date(),           // Date object — not portable
    author: {                      // Nested — most DBs can't filter on this
      name: 'Jane',
      email: 'jane@example.com',
    },
  },
  category: 'Billing',            // Not normalized (uppercase B)
  tags: { primary: 'refund', secondary: 'payment' }, // Nested tags
};


// ─── GOOD: Flat, normalized, filterable ───

const goodMetadata = {
  text: 'Refunds are processed within 5-7 business days...',  // Truncated to ~500 chars
  source: 'help-center',          // Normalized lowercase
  category: 'billing',            // Normalized lowercase
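  date: '2026-03-15',             // ISO string (display, exact match)
  date_ts: 1773532800,            // Same date as Unix seconds, for numeric range filters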
  author: 'jane',                 // Flat string
  language: 'en',                 // ISO language code
  is_published: true,             // Boolean
  version: 3,                     // Number
  chunk_index: 0,                 // Number
  parent_doc_id: 'doc_100',       // Reference to parent document
  tags: ['refund', 'payment', 'billing'], // List of strings (Pinecone matches any element with $in)
  word_count: 85,                 // Number (enables range queries)
};

6.3 Metadata schema template

Here is a template you can adapt for most use cases:

// ─── Universal metadata schema template ───

function createMetadata(doc) {
  return {
    // ─── Identity ───
    doc_id: doc.id,                           // Original document ID
    chunk_index: doc.chunkIndex || 0,         // Position within parent document
    parent_doc_id: doc.parentId || doc.id,    // Parent document reference

    // ─── Content preview ───
    text: doc.text.slice(0, 500),             // Truncated text for display
    title: doc.title || '',                   // Document title

    // ─── Classification ───
    category: (doc.category || '').toLowerCase(),
    type: (doc.type || 'article').toLowerCase(), // article, faq, guide, etc.
    tags: doc.tags || [],                     // List of strings (check that your store accepts list values)

    // ─── Source ───
    source: (doc.source || '').toLowerCase(),
    url: doc.url || '',

    // ─── Temporal ───
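    date: doc.date || new Date().toISOString().split('T')[0],           // ISO string for display
    updated_at: doc.updatedAt || new Date().toISOString().split('T')[0],
    date_ts: Math.floor((doc.date ? Date.parse(doc.date) : Date.now()) / 1000), // Unix seconds for range filters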

    // ─── Access control ───
    access_level: (doc.accessLevel || 'public').toLowerCase(),
    tenant_id: doc.tenantId || 'default',

    // ─── Status ───
    is_published: doc.isPublished !== false,   // Default true
    is_latest_version: doc.isLatest !== false,  // Default true
    version: doc.version || 1,

    // ─── Language ───
    language: doc.language || 'en',
  };
}

6.4 Metadata size limits

Database	Max Metadata Size	Notes
Pinecone	40 KB per vector	Total metadata JSON must be under 40KB
Chroma	No hard limit	But large metadata slows queries
Qdrant	No hard limit	Payload can be any size; large payloads affect memory
pgvector	PostgreSQL column limits	Effectively unlimited (JSONB column)

Strategy for large text: Don't store the full document text in metadata. Instead, store a truncated preview (200-500 chars) in metadata and keep the full text in a separate database (PostgreSQL, Redis, or a file store). Use the vector ID or doc_id metadata field to join them.

// ─── Pattern: Metadata preview + full text in separate store ───

// Store in vector DB (lightweight metadata)
await index.upsert([{
  id: 'doc_001',
  values: embedding,
  metadata: {
    text_preview: doc.text.slice(0, 300),     // Short preview
    doc_id: 'doc_001',                        // Reference to full text
    category: 'billing',
  },
}]);

// Store full text in PostgreSQL
await db.query(
  'INSERT INTO documents (id, full_text) VALUES ($1, $2)',
  ['doc_001', doc.fullText]
);

// At query time: get IDs from vector search, fetch full text from Postgres
const vectorResults = await index.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,   // Required: without it, m.metadata is undefined below
});
const docIds = vectorResults.matches.map(m => m.metadata.doc_id);
const fullTexts = await db.query(
  'SELECT id, full_text FROM documents WHERE id = ANY($1)',
  [docIds]
);

7. Performance Impact of Metadata Filters

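Metadata filters affect query performance. Understanding the trade-offs helps you design efficient schemas. The slowdown figures below are rough illustrations, not benchmarks; actual numbers depend on the database, index type, and data distribution.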

7.1 How filters affect performance

No filter:
  Vector DB searches entire ANN index
  Speed: Fastest (pure ANN search)

Broad filter (matches 50%+ of vectors):
  Minor impact — ANN index still effective
  Speed: ~1.1-1.5x slower

Moderate filter (matches 10-50% of vectors):
  ANN index partially effective, some post-filtering
  Speed: ~1.5-3x slower

Highly selective filter (matches <10% of vectors):
  ANN index less effective, more candidates needed
  Speed: ~2-5x slower

Extremely selective filter (matches <1% of vectors):
  May degrade to near-brute-force on matching subset
  Speed: Significantly slower — consider using namespaces instead
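Before choosing between a filter and a separate namespace, it helps to measure how selective a filter actually is on a sample of your metadata. A minimal sketch: the 1% and 10% cut-offs mirror the tiers above, while the wording of the suggestions and both helper names are assumptions, so treat the output as a heuristic rather than a rule.

```javascript
// Fraction of sampled records that pass the filter predicate.
function selectivity(sample, predicate) {
  if (sample.length === 0) return 0;
  return sample.filter(predicate).length / sample.length;
}

// Rough suggestion based on the tiers described above.
function suggestStrategy(fraction) {
  if (fraction < 0.01) return 'separate namespace or collection';
  if (fraction < 0.10) return 'metadata filter, with a field index if the store supports one';
  return 'plain metadata filter';
}

// 200 sampled records, only one of which belongs to the tenant of interest.
const sample = Array.from({ length: 200 }, (_, i) => ({
  tenant_id: i === 0 ? 'tenant_abc' : 'tenant_other',
}));

const fraction = selectivity(sample, (r) => r.tenant_id === 'tenant_abc');
console.log(fraction);                  // 0.005
console.log(suggestStrategy(fraction)); // separate namespace or collection
```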

7.2 Optimization strategies

Strategy	When to Use	Example
Use namespaces/collections	Filter always selects the same group	One namespace per tenant instead of `tenant_id` filter
Avoid highly selective filters	Filter matches <1% of vectors	Pre-filter by moving data to separate collections
Index filterable fields	Qdrant/Weaviate support field indexes	Create payload index on frequently filtered fields
Combine fewer filters	Many filters compound the performance hit	Merge related filters into a single field
Use numeric ranges wisely	Date/version range queries	Store dates as numeric timestamps (e.g., Unix seconds)

7.3 Creating payload indexes (Qdrant)

// Qdrant lets you create indexes on specific payload fields for faster filtering

await qdrant.createPayloadIndex('knowledge-base', {
  field_name: 'category',
  field_schema: 'keyword',        // keyword | integer | float | bool | datetime
});

await qdrant.createPayloadIndex('knowledge-base', {
  field_name: 'date',
  field_schema: 'datetime',       // Needed for range filters on dates; a keyword index only serves exact matches
});

await qdrant.createPayloadIndex('knowledge-base', {
  field_name: 'version',
  field_schema: 'integer',
});

// After creating indexes, filtered queries on these fields are significantly faster

8. Building a Complete Filtered Search Function

Here is a production-ready search function that combines semantic search, metadata filtering, and score thresholds:

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI();

/**
 * Production-ready filtered vector search.
 *
 * @param {string} queryText - The user's search query
 * @param {object} filters - Metadata filters to apply
 * @param {object} options - Search configuration
 * @returns {object} Search results with metadata
 */
async function filteredSearch(queryText, filters = {}, options = {}) {
  const {
    indexName = 'knowledge-base',
    namespace = '',
    topK = 5,
    scoreThreshold = 0.70,
    embeddingModel = 'text-embedding-3-small',
  } = options;

  // Step 1: Build the metadata filter
  const metadataFilter = buildFilter(filters);

  // Step 2: Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: embeddingModel,
    input: queryText,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // Step 3: Query with filters
  const index = pinecone.index(indexName);
  const ns = namespace ? index.namespace(namespace) : index;

  const queryParams = {
    vector: queryVector,
    topK: topK,
    includeMetadata: true,
    includeValues: false,
  };

  // Only add filter if there are actual filter conditions
  if (metadataFilter && Object.keys(metadataFilter).length > 0) {
    queryParams.filter = metadataFilter;
  }

  const results = await ns.query(queryParams);

  // Step 4: Apply score threshold
  const filteredResults = results.matches.filter(
    (match) => match.score >= scoreThreshold
  );

  // Step 5: Format response
  return {
    query: queryText,
    filtersApplied: filters,
    totalResults: filteredResults.length,
    totalCandidates: results.matches.length,
    results: filteredResults.map((match) => ({
      id: match.id,
      score: match.score,
      metadata: match.metadata,
    })),
  };
}

/**
 * Build a Pinecone-compatible filter object from user-friendly filter params.
 */
function buildFilter(filters) {
  const conditions = [];

  if (filters.category) {
    if (Array.isArray(filters.category)) {
      conditions.push({ category: { $in: filters.category } });
    } else {
      conditions.push({ category: { $eq: filters.category } });
    }
  }

  if (filters.source) {
    conditions.push({ source: { $eq: filters.source } });
  }

  if (filters.language) {
    conditions.push({ language: { $eq: filters.language } });
  }

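  // Pinecone range operators compare numbers only, so ISO date strings are converted
  // to Unix seconds and matched against a numeric date_ts field (assumed to be stored at upsert).
  if (filters.dateFrom) {
    conditions.push({ date_ts: { $gte: Math.floor(Date.parse(filters.dateFrom) / 1000) } });
  }

  if (filters.dateTo) {
    conditions.push({ date_ts: { $lte: Math.floor(Date.parse(filters.dateTo) / 1000) } });
  }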

  if (filters.isPublished !== undefined) {
    conditions.push({ is_published: { $eq: filters.isPublished } });
  }

  if (filters.tenantId) {
    conditions.push({ tenant_id: { $eq: filters.tenantId } });
  }

  if (filters.accessLevel) {
    if (Array.isArray(filters.accessLevel)) {
      conditions.push({ access_level: { $in: filters.accessLevel } });
    } else {
      conditions.push({ access_level: { $eq: filters.accessLevel } });
    }
  }

  if (conditions.length === 0) return {};
  if (conditions.length === 1) return conditions[0];
  return { $and: conditions };
}

// ─── Usage examples ───

// Search all published billing articles from the last 3 months
const billingResults = await filteredSearch(
  'How do refunds work?',
  {
    category: 'billing',
    isPublished: true,
    dateFrom: '2026-01-11',
    language: 'en',
  }
);

// Multi-tenant search
const tenantResults = await filteredSearch(
  'API rate limits',
  {
    tenantId: 'customer_abc',
    category: ['docs', 'guides'],
  },
  { scoreThreshold: 0.75 }
);

// Permission-aware search
const userResults = await filteredSearch(
  'Company roadmap',
  {
    accessLevel: ['public', 'internal'],
    isPublished: true,
  }
);

// No filters — pure semantic search
const openResults = await filteredSearch('machine learning basics');

console.log(`Found ${billingResults.totalResults} results`);
billingResults.results.forEach((r, i) => {
  console.log(`  ${i + 1}. [${r.score.toFixed(3)}] ${r.metadata.text}`);
});

9. Common Mistakes and How to Avoid Them

Mistake	Problem	Solution
Inconsistent casing	`"Billing"` vs `"billing"` are different values	Normalize to lowercase before storing
Dates as various formats	`"03/15/2026"` vs `"2026-03-15"` break range queries	Use one format: ISO 8601 `"YYYY-MM-DD"` strings, plus a numeric timestamp where range operators accept only numbers (Pinecone, Chroma)
Storing full text in metadata	Exceeds size limits, wastes memory	Store preview (300-500 chars), full text elsewhere
Not storing text at all	Can't display results without a separate DB lookup	Always store at least a text preview
Missing tenant_id filter	Data leaks between customers	ALWAYS include tenant filter for multi-tenant apps
Too many filter fields	Complex filters slow queries	Only store fields you actually filter on
Nested metadata objects	Most vector DBs don't support nested filters	Flatten: `author_name` instead of `author.name`
Boolean as string	`"true"` (string) vs `true` (boolean) behave differently	Use actual booleans
Forgetting to update metadata	Stale metadata returns wrong results	Update metadata when source documents change

10. Key Takeaways

Metadata is structured data attached to every vector — it describes the source, category, date, permissions, and other facts about what the vector represents.
Metadata filters constrain vector search so results are both semantically similar AND match structured criteria — essential for production RAG systems.
Filter syntax differs across databases — Pinecone uses MongoDB-like operators, Chroma uses where clauses, Qdrant uses must/should/must_not, and pgvector uses standard SQL WHERE.
Design metadata schemas to be flat, normalized, and filter-oriented — only store fields you will actually filter on, normalize casing and date formats, and respect size limits.
Highly selective filters can degrade performance — if a filter matches less than 1% of vectors, consider using separate namespaces or collections instead.
Multi-tenant isolation is a critical security concern — always filter by tenant_id and never rely on semantic search alone to separate customer data.

Explain-It Challenge

A product manager asks: "Why can't we just filter search results after the vector search returns them?" Explain the difference between pre-filtering and post-filtering, and why it matters.
Your team is building a knowledge base for 50 enterprise customers. Design the metadata schema and explain how you ensure one customer never sees another customer's data.
A developer stores dates as "March 15, 2026" in metadata and wonders why their date range filter ($gte: "2026-01-01") doesn't work. Explain the problem and the fix.
