Traditional keyword search fails silently. A user searches for "laptop won't turn on" but your documentation says "computer fails to boot." A customer asks about "cancellation policy" while your help center uses "subscription termination." These vocabulary mismatches happen constantly — and your search returns nothing while the perfect answer exists in your database.
The problem runs deeper than synonyms. Keyword search can't handle conceptual queries ("how do I speed up my workflow"), negations ("error without internet connection"), or questions where the answer exists but uses completely different terminology. In a world where users expect Google-level understanding, traditional search feels broken.
TL;DR: Semantic search uses embeddings to match queries by meaning rather than keywords. Best results come from hybrid approaches combining vector similarity with BM25 keyword matching, plus re-ranking for precision. This guide covers embedding selection, vector storage with pgvector, hybrid search implementation, cross-encoder re-ranking, and relevance tuning — with production code examples.
Keyword search operates on a simple premise: find documents containing the query terms. This approach, refined over decades with techniques like TF-IDF and BM25, works well when users and content creators share the same vocabulary. In practice, they often don't.
Consider these real-world search failures:

- A user searches for "laptop won't turn on", but the documentation says "computer fails to boot": zero results.
- A customer asks about the "cancellation policy", but the help center calls it "subscription termination": zero results.
Synonym lists help, but they require manual maintenance and can't scale to cover every possible variation. Users express the same intent in countless ways.
Beyond vocabulary, keyword search struggles with queries that require understanding:

- Conceptual queries such as "how do I speed up my workflow", where no single keyword maps cleanly to the answer
- Negations such as "error without internet connection", where matching every keyword inverts the intent
- Questions whose answer exists but is written in entirely different terminology
Poor search has measurable consequences: users abandon their task, open support tickets for answers that already exist, or leave for products where finding information is easier.
Semantic search solves the vocabulary mismatch problem by comparing meaning rather than words. Instead of asking "which documents contain these terms?", semantic search asks "which documents are about the same thing as this query?"
The core technology is the embedding model — a neural network trained to convert text into dense vectors (arrays of numbers). These vectors have a remarkable property: semantically similar texts produce vectors that are close together in the embedding space.
"Car repair" and "automobile maintenance" have almost no word overlap, but their embeddings are nearly identical. The model has learned that these phrases mean the same thing.
Modern embedding models are trained on massive text corpora, learning relationships between concepts. They understand that "CEO" is similar to "chief executive," that "Python" in a programming context differs from "python" in a biology context, and that "running" can mean jogging, operating software, or a flowing stream — with context determining which meaning applies.
Semantic search follows this flow:

1. At indexing time, convert every document to an embedding and store the vectors.
2. At query time, embed the user's query with the same model.
3. Run a nearest-neighbor search to find the stored vectors closest to the query vector.
4. Return the corresponding documents, ranked by similarity.
This process matches by meaning regardless of specific word choices. A search for "laptop won't boot" finds documentation about "computer startup failures" because their embeddings are similar.
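Similarity between vectors is typically measured with cosine similarity. The toy three-dimensional vectors below are illustrative stand-ins (real embeddings have hundreds or thousands of dimensions), but they show the mechanics:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 = same direction/meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, values chosen only to illustrate the idea.
laptop_wont_boot = [0.81, 0.52, 0.11]
startup_failure = [0.79, 0.55, 0.09]
pasta_recipe = [0.05, 0.17, 0.98]

print(cosine_similarity(laptop_wont_boot, startup_failure))  # close to 1.0
print(cosine_similarity(laptop_wont_boot, pasta_recipe))     # much lower
```

The two phrasings of the same problem point in nearly the same direction, while the unrelated text does not.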
Embeddings excel at:

- Synonyms and paraphrases ("car repair" vs. "automobile maintenance")
- Conceptual queries where intent matters more than exact wording
- Cross-lingual matching, when the model is trained for it
Embeddings struggle with:

- Exact identifiers: error codes, product names, version numbers
- Rare domain jargon the model seldom saw during training
- Negation and precise logical constraints
These limitations are why production systems typically combine semantic search with keyword matching.
Choosing the right embedding model significantly impacts search quality. Models differ in dimensions, training data, supported languages, and performance characteristics.
| Model | Provider | Dimensions | Strengths | Considerations |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | Best general performance, easy API | Costs per token, requires API calls |
| text-embedding-3-small | OpenAI | 1536 | Good balance of quality and cost | Slightly lower quality than large |
| embed-v3 | Cohere | 1024 | Excellent multilingual, search/classification variants | Commercial API |
| bge-large-en-v1.5 | BAAI (open-source) | 1024 | Near-commercial quality, self-hostable | English-focused |
| e5-large-v2 | Microsoft (open-source) | 1024 | Strong retrieval performance, instruction-tuned version available | Requires query prefixes for best results |
| all-MiniLM-L6-v2 | Sentence Transformers | 384 | Fast, small, good baseline | Lower quality than larger models |
| multilingual-e5-large | Microsoft (open-source) | 1024 | 100+ languages, strong cross-lingual | Larger model size |
Choose based on your requirements:

- General-purpose quality with minimal setup: text-embedding-3-small, moving to -large if quality falls short
- Self-hosting to control costs or keep data in-house: bge-large-en-v1.5 or e5-large-v2
- Multilingual content: embed-v3 or multilingual-e5-large
- Tight latency or resource budgets: all-MiniLM-L6-v2 as a fast baseline
Here's how to generate embeddings with both OpenAI and open-source models:
# OpenAI embeddings
from openai import OpenAI
client = OpenAI()
def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
"""Generate embeddings using OpenAI API."""
response = client.embeddings.create(
input=texts,
model=model
)
return [item.embedding for item in response.data]
# Usage
documents = ["laptop won't turn on", "computer fails to boot"]
embeddings = embed_openai(documents)
# Open-source embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer
# Load model once, reuse for all embeddings
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
def embed_local(texts: list[str]) -> list[list[float]]:
"""Generate embeddings using local model."""
    # Note: BGE adds an instruction prefix to queries only; see embed_query_bge below
embeddings = model.encode(texts, normalize_embeddings=True)
return embeddings.tolist()
# For BGE models, add prefix for queries (not documents)
def embed_query_bge(query: str) -> list[float]:
"""Embed a search query with BGE instruction prefix."""
prefixed = f"Represent this sentence for searching relevant passages: {query}"
return model.encode(prefixed, normalize_embeddings=True).tolist()
Once you have embeddings, you need efficient storage and retrieval. For a comprehensive comparison of options, see our vector databases guide. Here we'll focus on practical implementation with pgvector, which integrates with existing PostgreSQL infrastructure.
pgvector adds vector similarity search to PostgreSQL. Benefits include:

- No new infrastructure: vectors live next to your relational data
- Transactions and joins work across documents and embeddings
- Metadata filtering with ordinary WHERE clauses and JSONB operators
- Mature PostgreSQL tooling for backups, replication, and monitoring
pgvector works well for datasets up to a few million vectors. For larger scale or specialized features, consider purpose-built vector databases like Pinecone, Weaviate, or Qdrant.
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table for document embeddings
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536), -- Match your model's dimensions
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Index for metadata filtering
CREATE INDEX ON documents USING GIN (metadata);
import json

import psycopg2
from psycopg2.extras import execute_values

def store_documents(conn, documents: list[dict]):
    """Store documents with their embeddings."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO documents (title, content, embedding, metadata)
            VALUES %s
            """,
            [
                (
                    doc["title"],
                    doc["content"],
                    doc["embedding"],
                    # psycopg2 can't adapt a dict directly; serialize for the jsonb cast
                    json.dumps(doc.get("metadata", {})),
                )
                for doc in documents
            ],
            # Positional placeholders to match the tuples above
            template="(%s, %s, %s::vector, %s::jsonb)"
        )
    conn.commit()

def semantic_search(conn, query_embedding: list[float], limit: int = 10) -> list[dict]:
    """Find documents most similar to query embedding."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_embedding, query_embedding, limit)
        )
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

def filtered_search(conn, query_embedding: list[float],
                    filters: dict, limit: int = 10) -> list[dict]:
    """Semantic search with metadata filters."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE metadata @> %s::jsonb
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            # Serialize the filters dict so psycopg2 can pass it to jsonb
            (query_embedding, json.dumps(filters), query_embedding, limit)
        )
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
HNSW index parameters affect the accuracy/speed tradeoff:
- m (16 in the example above): number of connections per node. Higher values improve recall, at the cost of index size and build time.
- ef_construction (64 above): size of the candidate list while building the index. Higher values build a more accurate index, more slowly.
- ef_search: size of the candidate list at query time, set per session with SET hnsw.ef_search = 100. Higher values improve recall at the cost of latency.

Start with defaults and tune based on your accuracy requirements. Benchmark with representative queries to find the right balance.
Pure semantic search misses exact matches. Pure keyword search misses semantic matches. Hybrid search combines both, capturing the strengths of each approach.
Consider these queries:

- "how do I speed up my workflow": purely conceptual, so semantic search wins
- "ERR_CONNECTION_TIMEOUT": an exact error code, so keyword search wins
- "fix ERR_CONNECTION_TIMEOUT on a slow network": needs both the exact code and the surrounding intent
Hybrid search handles all three cases by running both search types and combining results.
Add full-text search capability alongside vector search:
-- Add tsvector column for BM25-style search
ALTER TABLE documents ADD COLUMN search_vector tsvector;
-- Populate search vector from title and content
UPDATE documents SET search_vector =
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(content, '')), 'B');
-- Create GIN index for fast text search
CREATE INDEX ON documents USING GIN (search_vector);
-- Trigger to keep search_vector updated
CREATE OR REPLACE FUNCTION update_search_vector() RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(NEW.content, '')), 'B');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER documents_search_update
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION update_search_vector();
Reciprocal Rank Fusion (RRF) is a simple, effective method to combine ranked results from multiple sources:
def reciprocal_rank_fusion(
rankings: list[list[str]],
k: int = 60
) -> list[tuple[str, float]]:
"""
Combine multiple rankings using Reciprocal Rank Fusion.
Args:
rankings: List of ranked document ID lists
k: Constant to prevent high ranks from dominating (default 60)
Returns:
List of (doc_id, score) tuples, sorted by fused score
"""
scores = {}
for ranking in rankings:
for rank, doc_id in enumerate(ranking, start=1):
if doc_id not in scores:
scores[doc_id] = 0
scores[doc_id] += 1 / (k + rank)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
def hybrid_search(conn, query: str, query_embedding: list[float],
limit: int = 10) -> list[dict]:
"""
Combine semantic and keyword search using RRF.
"""
# Get semantic search results
with conn.cursor() as cur:
cur.execute(
"""
SELECT id::text FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 50
""",
(query_embedding,)
)
semantic_ids = [row[0] for row in cur.fetchall()]
# Get keyword search results
with conn.cursor() as cur:
cur.execute(
"""
SELECT id::text FROM documents
WHERE search_vector @@ plainto_tsquery('english', %s)
ORDER BY ts_rank(search_vector, plainto_tsquery('english', %s)) DESC
LIMIT 50
""",
(query, query)
)
keyword_ids = [row[0] for row in cur.fetchall()]
# Fuse rankings
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
top_ids = [doc_id for doc_id, score in fused[:limit]]
# Fetch full documents
with conn.cursor() as cur:
cur.execute(
"""
SELECT id, title, content, metadata
FROM documents
WHERE id = ANY(%s::int[])
""",
([int(i) for i in top_ids],)
)
docs = {row[0]: dict(zip(["id", "title", "content", "metadata"], row))
for row in cur.fetchall()}
# Return in fused rank order
return [docs[int(doc_id)] for doc_id in top_ids if int(doc_id) in docs]
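To see why RRF rewards documents that surface in both rankings, here is a worked example; the rrf helper repeats the logic of reciprocal_rank_fusion above so the snippet runs standalone:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # Same logic as reciprocal_rank_fusion above, repeated here so
    # the snippet runs on its own.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

semantic_ranking = ["a", "b"]   # vector search: "a" first, "b" second
keyword_ranking = ["b", "c"]    # BM25: "b" first, "c" second
fused = rrf([semantic_ranking, keyword_ranking])
print([doc for doc, _ in fused])  # ['b', 'a', 'c']
```

Document "b" appears in both lists and wins overall, even though it only tops one of them: agreement between retrievers is strong evidence of relevance.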
Initial retrieval (whether semantic, keyword, or hybrid) optimizes for recall — getting relevant documents into the candidate set. Re-ranking optimizes for precision — putting the best results at the top.
Embedding models are bi-encoders: they encode queries and documents independently. This enables fast retrieval but limits accuracy because the model can't directly compare query and document.
Cross-encoders process query and document together, enabling deeper interaction between them. They're more accurate but too slow for searching large collections. The solution: use bi-encoders for initial retrieval, then cross-encoders to re-rank the top candidates.
from sentence_transformers import CrossEncoder
# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
def rerank_results(query: str, documents: list[dict],
top_k: int = 10) -> list[dict]:
"""
Re-rank search results using a cross-encoder.
Args:
query: User's search query
documents: Initial search results with 'content' field
top_k: Number of results to return after re-ranking
Returns:
Re-ranked documents with relevance scores
"""
if not documents:
return []
# Create query-document pairs
pairs = [(query, doc["content"]) for doc in documents]
# Score all pairs
scores = reranker.predict(pairs)
# Add scores and sort
for doc, score in zip(documents, scores):
doc["rerank_score"] = float(score)
reranked = sorted(documents, key=lambda x: x["rerank_score"], reverse=True)
return reranked[:top_k]
# Alternative: Use Cohere Rerank API for production
import cohere
co = cohere.Client("your-api-key")
def rerank_cohere(query: str, documents: list[dict], top_k: int = 10) -> list[dict]:
"""Re-rank using Cohere's Rerank API."""
response = co.rerank(
query=query,
documents=[doc["content"] for doc in documents],
model="rerank-english-v3.0",
top_n=top_k
)
return [
{**documents[result.index], "rerank_score": result.relevance_score}
for result in response.results
]
For maximum accuracy, LLMs can evaluate relevance directly. This is slower and more expensive but can outperform cross-encoders for complex queries:
from openai import OpenAI
client = OpenAI()
def rerank_with_llm(query: str, documents: list[dict], top_k: int = 5) -> list[dict]:
"""
Re-rank using an LLM to evaluate relevance.
Best for complex queries where cross-encoders struggle.
"""
prompt = f"""Rate the relevance of each document to the query on a scale of 1-10.
Query: {query}
Documents:
"""
for i, doc in enumerate(documents):
prompt += f"\n[{i}] {doc['content'][:500]}\n"
prompt += "\nRespond with JSON: {\"scores\": [score0, score1, ...]}"
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
import json
scores = json.loads(response.choices[0].message.content)["scores"]
for doc, score in zip(documents, scores):
doc["rerank_score"] = score
reranked = sorted(documents, key=lambda x: x["rerank_score"], reverse=True)
return reranked[:top_k]
Building search is iterative. Initial implementations always need tuning based on real usage patterns. Systematic evaluation enables data-driven improvements.
An evaluation set contains queries paired with their relevant documents (ground truth):
# Evaluation set structure
evaluation_set = [
{
"query": "how to reset password",
"relevant_docs": ["doc_123", "doc_456"], # Document IDs
"notes": "Should find both self-service and admin reset procedures"
},
{
"query": "ERR_CONNECTION_TIMEOUT python requests",
"relevant_docs": ["doc_789"],
"notes": "Technical error code - needs exact match"
},
# ... 50-200 queries for meaningful evaluation
]
Sources for evaluation queries:

- Search logs: what users actually type, including misspellings and shorthand
- Support tickets and chat transcripts: questions your search failed to answer
- Domain experts: high-stakes queries the system must get right
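Search logs are often the richest source. As a rough sketch (the log format, thresholds, and function name here are illustrative assumptions), you can pre-filter raw queries before manually labeling relevant documents:

```python
from collections import Counter

def candidate_queries(log_queries: list[str], min_count: int = 2,
                      min_words: int = 2) -> list[str]:
    """Frequent, non-trivial queries from raw logs, as candidates for an
    evaluation set (relevant documents still need manual labeling)."""
    counts = Counter(q.strip().lower() for q in log_queries)
    return [
        query for query, n in counts.most_common()
        if n >= min_count and len(query.split()) >= min_words
    ]

logs = [
    "reset password", "Reset Password", "asdf",
    "how to cancel subscription", "how to cancel subscription",
]
print(candidate_queries(logs))  # noise like "asdf" is filtered out
```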
from typing import List, Dict
import numpy as np
def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""Precision@K: What fraction of top-k results are relevant?"""
retrieved_k = retrieved[:k]
relevant_set = set(relevant)
hits = sum(1 for doc in retrieved_k if doc in relevant_set)
return hits / k
def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""Recall@K: What fraction of relevant docs are in top-k?"""
retrieved_k = set(retrieved[:k])
relevant_set = set(relevant)
if not relevant_set:
return 0.0
hits = len(retrieved_k & relevant_set)
return hits / len(relevant_set)
def mrr(retrieved: List[str], relevant: List[str]) -> float:
"""Mean Reciprocal Rank: How high is the first relevant result?"""
relevant_set = set(relevant)
for rank, doc in enumerate(retrieved, start=1):
if doc in relevant_set:
return 1.0 / rank
return 0.0
def ndcg_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""NDCG@K: Normalized Discounted Cumulative Gain."""
relevant_set = set(relevant)
# DCG: sum of relevance / log2(rank+1)
dcg = sum(
1.0 / np.log2(rank + 2) # +2 because rank is 0-indexed
for rank, doc in enumerate(retrieved[:k])
if doc in relevant_set
)
# Ideal DCG: all relevant docs at top
ideal_length = min(len(relevant), k)
idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_length))
return dcg / idcg if idcg > 0 else 0.0
def evaluate_search(search_fn, evaluation_set: List[Dict], k: int = 10) -> Dict:
"""
Run evaluation across all queries.
Args:
search_fn: Function that takes query and returns list of doc IDs
evaluation_set: List of {query, relevant_docs} dicts
k: Cutoff for metrics
Returns:
Dict with average metrics
"""
metrics = {"precision": [], "recall": [], "mrr": [], "ndcg": []}
for item in evaluation_set:
retrieved = search_fn(item["query"])
relevant = item["relevant_docs"]
metrics["precision"].append(precision_at_k(retrieved, relevant, k))
metrics["recall"].append(recall_at_k(retrieved, relevant, k))
metrics["mrr"].append(mrr(retrieved, relevant))
metrics["ndcg"].append(ndcg_at_k(retrieved, relevant, k))
return {
f"precision@{k}": np.mean(metrics["precision"]),
f"recall@{k}": np.mean(metrics["recall"]),
"mrr": np.mean(metrics["mrr"]),
f"ndcg@{k}": np.mean(metrics["ndcg"]),
}
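To sanity-check the metric definitions, here is the arithmetic worked through on a tiny example, using inline computations that mirror the functions above:

```python
import math

retrieved = ["d1", "d9", "d2"]   # what search returned, in rank order
relevant = {"d1", "d2", "d3"}    # ground truth

# Precision@3: 2 of the 3 returned docs are relevant -> 2/3
precision = sum(d in relevant for d in retrieved) / 3

# Recall@3: 2 of the 3 relevant docs were returned -> 2/3
recall = sum(d in relevant for d in retrieved) / len(relevant)

# MRR: first relevant doc is at rank 1 -> 1.0
mrr = next(1 / r for r, d in enumerate(retrieved, 1) if d in relevant)

# NDCG@3: hits at ranks 1 and 3 -> 1/log2(2) + 1/log2(4) = 1.5
dcg = sum(1 / math.log2(rank + 1)
          for rank, d in enumerate(retrieved, 1) if d in relevant)
idcg = sum(1 / math.log2(i + 1) for i in range(1, min(len(relevant), 3) + 1))
ndcg = dcg / idcg  # 1.5 / 2.1309... ~ 0.704
```

Note how NDCG penalizes the relevant document sitting at rank 3 instead of rank 2, which plain precision and recall cannot see.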
Once you can measure quality, systematically test improvements: swap embedding models, toggle re-ranking on and off, and adjust the RRF constant k and the number of candidates fed to the re-ranker. Change one variable at a time and compare each run's metrics against the baseline.
Track metrics over time as your content and query patterns evolve.
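One lightweight way to do that tracking is a regression check between evaluation runs; this sketch (names and tolerance are illustrative) flags any metric that dropped beyond a threshold:

```python
def metric_regressions(baseline: dict[str, float], current: dict[str, float],
                       tolerance: float = 0.02) -> list[str]:
    """Return names of metrics that dropped by more than `tolerance`
    compared to the baseline run."""
    return [
        name for name, base_value in baseline.items()
        if name in current and base_value - current[name] > tolerance
    ]

baseline = {"precision@10": 0.72, "recall@10": 0.85, "mrr": 0.61}
current = {"precision@10": 0.74, "recall@10": 0.79, "mrr": 0.60}
print(metric_regressions(baseline, current))  # ['recall@10']
```

Wiring a check like this into CI catches quality regressions when content, models, or fusion parameters change.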
Moving from prototype to production requires attention to reliability, performance, and maintainability.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import psycopg2
from sentence_transformers import SentenceTransformer, CrossEncoder
from contextlib import contextmanager
import os
app = FastAPI(title="Semantic Search API")
# Initialize models once at startup.
# Note: bge-large-en-v1.5 produces 1024-dimensional vectors, so the
# documents table must be declared with vector(1024) to match.
embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
rerank_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
# Database connection pool
from psycopg2 import pool
db_pool = pool.ThreadedConnectionPool(
minconn=2,
maxconn=10,
dsn=os.environ["DATABASE_URL"]
)
@contextmanager
def get_db():
conn = db_pool.getconn()
try:
yield conn
finally:
db_pool.putconn(conn)
class SearchRequest(BaseModel):
query: str
limit: int = 10
filters: dict = {}
use_rerank: bool = True
class SearchResult(BaseModel):
id: int
title: str
content: str
score: float
metadata: dict
class SearchResponse(BaseModel):
results: list[SearchResult]
query: str
total_candidates: int
@app.post("/search", response_model=SearchResponse)
def search(request: SearchRequest):
"""
Hybrid semantic search with optional re-ranking.
"""
# Generate query embedding
query_prefixed = f"Represent this sentence for searching relevant passages: {request.query}"
query_embedding = embedding_model.encode(query_prefixed, normalize_embeddings=True).tolist()
    with get_db() as conn:
        # Hybrid search: combine semantic and keyword results.
        # Note: hybrid_search as defined above takes no filters argument;
        # to honor request.filters, add a WHERE metadata @> ... clause
        # (as in filtered_search) to its queries.
        candidates = hybrid_search(
            conn,
            request.query,
            query_embedding,
            limit=50  # Get more candidates for re-ranking
        )
total_candidates = len(candidates)
# Re-rank if requested and we have results
if request.use_rerank and candidates:
pairs = [(request.query, doc["content"]) for doc in candidates]
scores = rerank_model.predict(pairs)
for doc, score in zip(candidates, scores):
doc["score"] = float(score)
candidates.sort(key=lambda x: x["score"], reverse=True)
else:
# Use similarity scores from initial retrieval
for i, doc in enumerate(candidates):
doc["score"] = doc.get("similarity", 1.0 - i * 0.01)
results = [
SearchResult(
id=doc["id"],
title=doc["title"],
content=doc["content"][:500], # Truncate for response
score=doc["score"],
metadata=doc.get("metadata", {})
)
for doc in candidates[:request.limit]
]
return SearchResponse(
results=results,
query=request.query,
total_candidates=total_candidates
)
@app.post("/index")
def index_document(title: str, content: str, metadata: dict = {}):
"""Add a document to the search index."""
embedding = embedding_model.encode(content, normalize_embeddings=True).tolist()
with get_db() as conn:
with conn.cursor() as cur:
cur.execute(
"""
INSERT INTO documents (title, content, embedding, metadata)
VALUES (%s, %s, %s::vector, %s::jsonb)
RETURNING id
""",
(title, content, embedding, metadata)
)
doc_id = cur.fetchone()[0]
conn.commit()
return {"id": doc_id, "status": "indexed"}
@app.get("/health")
def health_check():
"""Health check endpoint."""
with get_db() as conn:
with conn.cursor() as cur:
cur.execute("SELECT 1")
return {"status": "healthy"}
Avoid these frequent mistakes in semantic search implementations:

- Relying on vector search alone, which misses exact matches for error codes and product names
- Embedding queries with a different model (or dimension count) than the indexed documents
- Omitting model-specific query prefixes, such as the BGE instruction prefix
- Tuning by intuition instead of building an evaluation set first
- Re-ranking hundreds of candidates and blowing the latency budget
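A cheap guard against model and dimension mismatches is to validate embeddings before they reach the database. A minimal sketch, assuming a vector(1536) column (the constant and function name are illustrative):

```python
EXPECTED_DIM = 1536  # must match the vector(N) declared in your schema

def validate_embedding(embedding: list[float],
                       expected_dim: int = EXPECTED_DIM) -> list[float]:
    """Fail fast on dimension mismatches and NaNs before they reach
    the database."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected_dim}"
        )
    if any(x != x for x in embedding):  # NaN is the only value unequal to itself
        raise ValueError("embedding contains NaN")
    return embedding
```

Calling this in the indexing path turns a silent "similarity scores look wrong" bug into an immediate, debuggable error.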
At Virtido, we help enterprises build production-grade search infrastructure that understands meaning, not just keywords — combining vector search expertise with practical AI engineering through our AI Hub.
We've built semantic search systems for clients across FinTech, legal tech, healthcare, and enterprise software. Our staff augmentation model provides vetted talent with Swiss contracts and full IP protection.
Semantic search represents a fundamental shift in how applications connect users with information. Rather than requiring users to guess the exact words content creators used, semantic search understands intent and matches by meaning. The technology is mature, the tools are production-ready, and the user experience improvements are substantial.
The most effective implementations combine multiple techniques: embeddings for semantic understanding, BM25 for exact matches, and re-ranking for precision. This hybrid approach handles the full range of queries users actually make — from conceptual questions to specific error codes. Building evaluation infrastructure early enables data-driven improvements rather than guesswork.
Whether you're building a knowledge base, document search, or the retrieval layer for a RAG system, semantic search transforms "I know the answer exists somewhere" frustration into "found it on the first try" satisfaction. The implementation patterns in this guide provide a foundation for search that truly understands what users are looking for.
Keyword search finds documents containing specific words — it requires exact or near-exact matches between query terms and document text. Semantic search uses embeddings to match by meaning, finding relevant documents even when they use completely different words. "Car repair" finds "automobile maintenance" with semantic search because the embeddings capture that both phrases mean the same thing.
For most applications, OpenAI's text-embedding-3-small offers the best balance of quality and simplicity. If you need self-hosted embeddings to avoid API costs, BGE-large or E5-large provide near-commercial quality. For multilingual content, use Cohere embed-v3 or multilingual-e5. Start with a general-purpose model, then evaluate domain-specific options if quality is insufficient.
Costs have three components: embedding generation, vector storage, and compute for search. OpenAI embeddings cost roughly $0.02 per million tokens (about 750,000 words). Vector storage in pgvector uses your existing PostgreSQL infrastructure. Managed vector databases like Pinecone start around $70/month for production workloads. Self-hosted options reduce per-query costs but require infrastructure management. A typical 1 million document index costs $50-200/month to operate.
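As a back-of-envelope check using the roughly $0.02 per million tokens figure above (the corpus size and tokens per document are illustrative assumptions):

```python
# Rough one-time embedding cost for a corpus, using the ~$0.02 per
# 1M tokens figure quoted above for text-embedding-3-small.
PRICE_PER_MILLION_TOKENS = 0.02

num_documents = 1_000_000       # assumed corpus size
avg_tokens_per_doc = 500        # assumed document length

total_tokens = num_documents * avg_tokens_per_doc   # 500M tokens
cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost:.2f}")  # $10.00 to embed the whole corpus once
```

Embedding is usually the cheap part; ongoing storage and query compute dominate the monthly figure.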
Hybrid search is recommended for almost all production systems. Vector-only search misses exact matches — important for error codes, product names, and technical terms. Hybrid search combines semantic similarity with keyword matching (BM25), capturing both conceptual matches and exact terms. The overhead is minimal, and hybrid search rarely performs worse than either method alone.
Re-ranking improves precision for the top results, which matters most for user experience. Cross-encoders are more accurate than bi-encoders because they can directly compare query and document. If your search results are "close but not quite right," re-ranking often fixes the ordering. Skip re-ranking if latency is critical or if initial retrieval quality is already sufficient for your use case.
Build an evaluation set with queries and their relevant documents (ground truth). Key metrics include Precision@K (what fraction of top-k results are relevant), Recall@K (what fraction of relevant docs are in top-k), MRR (Mean Reciprocal Rank — how high is the first relevant result), and NDCG (accounts for ranking position). Collect queries from search logs and support tickets to reflect real usage patterns.
Use multilingual embedding models like Cohere embed-v3 or multilingual-e5-large, which create comparable embeddings across 100+ languages. A query in English will match relevant documents in German, Spanish, or Japanese. For best results, consider language-specific tuning and ensure your evaluation set includes cross-lingual queries. Some applications benefit from separate indexes per language with language detection routing.
Update strategy depends on content freshness requirements. Near-real-time indexing (seconds to minutes) suits dynamic content like support tickets or chat history. Daily batch indexing works for documentation or knowledge bases that change infrequently. Always use incremental updates where possible — re-embed only changed documents rather than rebuilding the entire index.
Embedding generation takes 10-50ms per query depending on model and hardware. Vector search with HNSW indexes returns results in 5-20ms for million-scale indexes. Re-ranking adds 50-200ms depending on the number of candidates and re-ranker model. Total end-to-end latency of 100-300ms is typical for production systems. Latency increases with index size and result set size.
Not necessarily. PostgreSQL with pgvector handles millions of vectors and integrates with existing infrastructure — often the best starting point. Purpose-built vector databases (Pinecone, Weaviate, Qdrant) become valuable at larger scale (tens of millions of vectors), when you need advanced features like built-in hybrid search, or when operational simplicity of a managed service outweighs the cost premium.