Traditional keyword search fails silently. A user searches for "laptop won't turn on" but your documentation says "computer fails to boot." A customer asks about "cancellation policy" while your help center uses "subscription termination." These vocabulary mismatches happen constantly — and your search returns nothing while the perfect answer exists in your database.
The problem runs deeper than synonyms. Keyword search can't handle conceptual queries ("how do I speed up my workflow"), negations ("error without internet connection"), or questions where the answer exists but uses completely different terminology. In a world where users expect Google-level understanding, traditional search feels broken.
TL;DR: Semantic search uses embeddings to match queries by meaning rather than keywords. Best results come from hybrid approaches combining vector similarity with BM25 keyword matching, plus re-ranking for precision. This guide covers embedding selection, vector storage with pgvector, hybrid search implementation, cross-encoder re-ranking, and relevance tuning — with production code examples.
Keyword search operates on a simple premise: find documents containing the query terms. This approach, refined over decades with techniques like TF-IDF and BM25, works well when users and content creators share the same vocabulary. In practice, they often don't.
Consider these real-world search failures:

- A user searches for "laptop won't turn on", but the documentation says "computer fails to boot": zero results.
- A customer asks about the "cancellation policy", but the help center calls it "subscription termination": zero results.
Synonym lists help, but they require manual maintenance and can't scale to cover every possible variation. Users express the same intent in countless ways.
Beyond vocabulary, keyword search struggles with queries that require understanding:

- Conceptual queries such as "how do I speed up my workflow", where no single keyword maps cleanly to the answer
- Negations such as "error without internet connection", where matching every keyword inverts the intent
- Questions whose answer exists but is written in entirely different terminology
Poor search has measurable consequences: users abandon their task, open support tickets for answers that already exist, or leave for products where finding information is easier.
Semantic search solves the vocabulary mismatch problem by comparing meaning rather than words. Instead of asking "which documents contain these terms?", semantic search asks "which documents are about the same thing as this query?"
The core technology is the embedding model — a neural network trained to convert text into dense vectors (arrays of numbers). These vectors have a remarkable property: semantically similar texts produce vectors that are close together in the embedding space.
"Car repair" and "automobile maintenance" have almost no word overlap, but their embeddings are nearly identical. The model has learned that these phrases mean the same thing.
Modern embedding models are trained on massive text corpora, learning relationships between concepts. They understand that "CEO" is similar to "chief executive," that "Python" in a programming context differs from "python" in a biology context, and that "running" can mean jogging, operating software, or a flowing stream — with context determining which meaning applies.
Semantic search follows this flow:

1. At indexing time, convert every document to an embedding and store the vectors.
2. At query time, embed the user's query with the same model.
3. Run a nearest-neighbor search to find the stored vectors closest to the query vector.
4. Return the corresponding documents, ranked by similarity.
This process matches by meaning regardless of specific word choices. A search for "laptop won't boot" finds documentation about "computer startup failures" because their embeddings are similar.
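Similarity between vectors is typically measured with cosine similarity. The toy three-dimensional vectors below are illustrative stand-ins (real embeddings have hundreds or thousands of dimensions), but they show the mechanics:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 = same direction/meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, values chosen only to illustrate the idea.
laptop_wont_boot = [0.81, 0.52, 0.11]
startup_failure = [0.79, 0.55, 0.09]
pasta_recipe = [0.05, 0.17, 0.98]

print(cosine_similarity(laptop_wont_boot, startup_failure))  # close to 1.0
print(cosine_similarity(laptop_wont_boot, pasta_recipe))     # much lower
```

The two phrasings of the same problem point in nearly the same direction, while the unrelated text does not.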
Embeddings excel at:

- Synonyms and paraphrases ("car repair" vs. "automobile maintenance")
- Conceptual queries where intent matters more than exact wording
- Cross-lingual matching, when the model is trained for it
Embeddings struggle with:

- Exact identifiers: error codes, product names, version numbers
- Rare domain jargon the model seldom saw during training
- Negation and precise logical constraints
These limitations are why production systems typically combine semantic search with keyword matching.
Choosing the right embedding model significantly impacts search quality. Models differ in dimensions, training data, supported languages, and performance characteristics.
| Model | Provider | Dimensions | Strengths | Considerations |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | Best general performance, easy API | Costs per token, requires API calls |
| text-embedding-3-small | OpenAI | 1536 | Good balance of quality and cost | Slightly lower quality than large |
| embed-v3 | Cohere | 1024 | Excellent multilingual, search/classification variants | Commercial API |
| bge-large-en-v1.5 | BAAI (open-source) | 1024 | Near-commercial quality, self-hostable | English-focused |
| e5-large-v2 | Microsoft (open-source) | 1024 | Strong retrieval performance, instruction-tuned version available | Requires query prefixes for best results |
| all-MiniLM-L6-v2 | Sentence Transformers | 384 | Fast, small, good baseline | Lower quality than larger models |
| multilingual-e5-large | Microsoft (open-source) | 1024 | 100+ languages, strong cross-lingual | Larger model size |
Choose based on your requirements:

- General-purpose quality with minimal setup: text-embedding-3-small, moving to -large if quality falls short
- Self-hosting to control costs or keep data in-house: bge-large-en-v1.5 or e5-large-v2
- Multilingual content: embed-v3 or multilingual-e5-large
- Tight latency or resource budgets: all-MiniLM-L6-v2 as a fast baseline
Here's how to generate embeddings with both OpenAI and open-source models:
# OpenAI embeddings
from openai import OpenAI
client = OpenAI()
def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
"""Generate embeddings using OpenAI API."""
response = client.embeddings.create(
input=texts,
model=model
)
return [item.embedding for item in response.data]
# Usage
documents = ["laptop won't turn on", "computer fails to boot"]
embeddings = embed_openai(documents)
# Open-source embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer
# Load model once, reuse for all embeddings
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
def embed_local(texts: list[str]) -> list[list[float]]:
"""Generate embeddings using local model."""
    # Note: BGE adds an instruction prefix to queries only; see embed_query_bge below
embeddings = model.encode(texts, normalize_embeddings=True)
return embeddings.tolist()
# For BGE models, add prefix for queries (not documents)
def embed_query_bge(query: str) -> list[float]:
"""Embed a search query with BGE instruction prefix."""
prefixed = f"Represent this sentence for searching relevant passages: {query}"
return model.encode(prefixed, normalize_embeddings=True).tolist()
Once you have embeddings, you need efficient storage and retrieval. For a comprehensive comparison of options, see our vector databases guide. Here we'll focus on practical implementation with pgvector, which integrates with existing PostgreSQL infrastructure.
pgvector adds vector similarity search to PostgreSQL. Benefits include:

- No new infrastructure: vectors live next to your relational data
- Transactions and joins work across documents and embeddings
- Metadata filtering with ordinary WHERE clauses and JSONB operators
- Mature PostgreSQL tooling for backups, replication, and monitoring
pgvector works well for datasets up to a few million vectors. For larger scale or specialized features, consider purpose-built vector databases like Pinecone, Weaviate, or Qdrant.
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table for document embeddings
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536), -- Match your model's dimensions
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Index for metadata filtering
CREATE INDEX ON documents USING GIN (metadata);
import json

import psycopg2
from psycopg2.extras import execute_values

def store_documents(conn, documents: list[dict]):
    """Store documents with their embeddings."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO documents (title, content, embedding, metadata)
            VALUES %s
            """,
            [
                (
                    doc["title"],
                    doc["content"],
                    doc["embedding"],
                    # psycopg2 can't adapt a dict directly; serialize for the jsonb cast
                    json.dumps(doc.get("metadata", {})),
                )
                for doc in documents
            ],
            # Positional placeholders to match the tuples above
            template="(%s, %s, %s::vector, %s::jsonb)"
        )
    conn.commit()

def semantic_search(conn, query_embedding: list[float], limit: int = 10) -> list[dict]:
    """Find documents most similar to query embedding."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_embedding, query_embedding, limit)
        )
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

def filtered_search(conn, query_embedding: list[float],
                    filters: dict, limit: int = 10) -> list[dict]:
    """Semantic search with metadata filters."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE metadata @> %s::jsonb
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            # Serialize the filters dict so psycopg2 can pass it to jsonb
            (query_embedding, json.dumps(filters), query_embedding, limit)
        )
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
HNSW index parameters affect the accuracy/speed tradeoff:
- m (16 in the example above): number of connections per node. Higher values improve recall, at the cost of index size and build time.
- ef_construction (64 above): size of the candidate list while building the index. Higher values build a more accurate index, more slowly.
- ef_search: size of the candidate list at query time, set per session with SET hnsw.ef_search = 100. Higher values improve recall at the cost of latency.

Start with defaults and tune based on your accuracy requirements. Benchmark with representative queries to find the right balance.
Pure semantic search misses exact matches. Pure keyword search misses semantic matches. Hybrid search combines both, capturing the strengths of each approach.
Consider these queries:

- "how do I speed up my workflow": purely conceptual, so semantic search wins
- "ERR_CONNECTION_TIMEOUT": an exact error code, so keyword search wins
- "fix ERR_CONNECTION_TIMEOUT on a slow network": needs both the exact code and the surrounding intent
Hybrid search handles all three cases by running both search types and combining results.
Add full-text search capability alongside vector search:
-- Add tsvector column for BM25-style search
ALTER TABLE documents ADD COLUMN search_vector tsvector;
-- Populate search vector from title and content
UPDATE documents SET search_vector =
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(content, '')), 'B');
-- Create GIN index for fast text search
CREATE INDEX ON documents USING GIN (search_vector);
-- Trigger to keep search_vector updated
CREATE OR REPLACE FUNCTION update_search_vector() RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(NEW.content, '')), 'B');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER documents_search_update
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION update_search_vector();
Reciprocal Rank Fusion (RRF) is a simple, effective method to combine ranked results from multiple sources:
def reciprocal_rank_fusion(
rankings: list[list[str]],
k: int = 60
) -> list[tuple[str, float]]:
"""
Combine multiple rankings using Reciprocal Rank Fusion.
Args:
rankings: List of ranked document ID lists
k: Constant to prevent high ranks from dominating (default 60)
Returns:
List of (doc_id, score) tuples, sorted by fused score
"""
scores = {}
for ranking in rankings:
for rank, doc_id in enumerate(ranking, start=1):
if doc_id not in scores:
scores[doc_id] = 0
scores[doc_id] += 1 / (k + rank)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
def hybrid_search(conn, query: str, query_embedding: list[float],
limit: int = 10) -> list[dict]:
"""
Combine semantic and keyword search using RRF.
"""
# Get semantic search results
with conn.cursor() as cur:
cur.execute(
"""
SELECT id::text FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 50
""",
(query_embedding,)
)
semantic_ids = [row[0] for row in cur.fetchall()]
# Get keyword search results
with conn.cursor() as cur:
cur.execute(
"""
SELECT id::text FROM documents
WHERE search_vector @@ plainto_tsquery('english', %s)
ORDER BY ts_rank(search_vector, plainto_tsquery('english', %s)) DESC
LIMIT 50
""",
(query, query)
)
keyword_ids = [row[0] for row in cur.fetchall()]
# Fuse rankings
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
top_ids = [doc_id for doc_id, score in fused[:limit]]
# Fetch full documents
with conn.cursor() as cur:
cur.execute(
"""
SELECT id, title, content, metadata
FROM documents
WHERE id = ANY(%s::int[])
""",
([int(i) for i in top_ids],)
)
docs = {row[0]: dict(zip(["id", "title", "content", "metadata"], row))
for row in cur.fetchall()}
# Return in fused rank order
return [docs[int(doc_id)] for doc_id in top_ids if int(doc_id) in docs]
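To see why RRF rewards documents that surface in both rankings, here is a worked example; the rrf helper repeats the logic of reciprocal_rank_fusion above so the snippet runs standalone:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # Same logic as reciprocal_rank_fusion above, repeated here so
    # the snippet runs on its own.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

semantic_ranking = ["a", "b"]   # vector search: "a" first, "b" second
keyword_ranking = ["b", "c"]    # BM25: "b" first, "c" second
fused = rrf([semantic_ranking, keyword_ranking])
print([doc for doc, _ in fused])  # ['b', 'a', 'c']
```

Document "b" appears in both lists and wins overall, even though it only tops one of them: agreement between retrievers is strong evidence of relevance.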
Initial retrieval (whether semantic, keyword, or hybrid) optimizes for recall — getting relevant documents into the candidate set. Re-ranking optimizes for precision — putting the best results at the top.
Embedding models are bi-encoders: they encode queries and documents independently. This enables fast retrieval but limits accuracy because the model can't directly compare query and document.
Cross-encoders process query and document together, enabling deeper interaction between them. They're more accurate but too slow for searching large collections. The solution: use bi-encoders for initial retrieval, then cross-encoders to re-rank the top candidates.
from sentence_transformers import CrossEncoder
# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
def rerank_results(query: str, documents: list[dict],
top_k: int = 10) -> list[dict]:
"""
Re-rank search results using a cross-encoder.
Args:
query: User's search query
documents: Initial search results with 'content' field
top_k: Number of results to return after re-ranking
Returns:
Re-ranked documents with relevance scores
"""
if not documents:
return []
# Create query-document pairs
pairs = [(query, doc["content"]) for doc in documents]
# Score all pairs
scores = reranker.predict(pairs)
# Add scores and sort
for doc, score in zip(documents, scores):
doc["rerank_score"] = float(score)
reranked = sorted(documents, key=lambda x: x["rerank_score"], reverse=True)
return reranked[:top_k]
# Alternative: Use Cohere Rerank API for production
import cohere
co = cohere.Client("your-api-key")
def rerank_cohere(query: str, documents: list[dict], top_k: int = 10) -> list[dict]:
"""Re-rank using Cohere's Rerank API."""
response = co.rerank(
query=query,
documents=[doc["content"] for doc in documents],
model="rerank-english-v3.0",
top_n=top_k
)
return [
{**documents[result.index], "rerank_score": result.relevance_score}
for result in response.results
]
For maximum accuracy, LLMs can evaluate relevance directly. This is slower and more expensive but can outperform cross-encoders for complex queries:
from openai import OpenAI
client = OpenAI()
def rerank_with_llm(query: str, documents: list[dict], top_k: int = 5) -> list[dict]:
"""
Re-rank using an LLM to evaluate relevance.
Best for complex queries where cross-encoders struggle.
"""
prompt = f"""Rate the relevance of each document to the query on a scale of 1-10.
Query: {query}
Documents:
"""
for i, doc in enumerate(documents):
prompt += f"\n[{i}] {doc['content'][:500]}\n"
prompt += "\nRespond with JSON: {\"scores\": [score0, score1, ...]}"
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
import json
scores = json.loads(response.choices[0].message.content)["scores"]
for doc, score in zip(documents, scores):
doc["rerank_score"] = score
reranked = sorted(documents, key=lambda x: x["rerank_score"], reverse=True)
return reranked[:top_k]
Building search is iterative. Initial implementations always need tuning based on real usage patterns. Systematic evaluation enables data-driven improvements.
An evaluation set contains queries paired with their relevant documents (ground truth):
# Evaluation set structure
evaluation_set = [
{
"query": "how to reset password",
"relevant_docs": ["doc_123", "doc_456"], # Document IDs
"notes": "Should find both self-service and admin reset procedures"
},
{
"query": "ERR_CONNECTION_TIMEOUT python requests",
"relevant_docs": ["doc_789"],
"notes": "Technical error code - needs exact match"
},
# ... 50-200 queries for meaningful evaluation
]
Sources for evaluation queries:

- Search logs: what users actually type, including misspellings and shorthand
- Support tickets and chat transcripts: questions your search failed to answer
- Domain experts: high-stakes queries the system must get right
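Search logs are often the richest source. As a rough sketch (the log format, thresholds, and function name here are illustrative assumptions), you can pre-filter raw queries before manually labeling relevant documents:

```python
from collections import Counter

def candidate_queries(log_queries: list[str], min_count: int = 2,
                      min_words: int = 2) -> list[str]:
    """Frequent, non-trivial queries from raw logs, as candidates for an
    evaluation set (relevant documents still need manual labeling)."""
    counts = Counter(q.strip().lower() for q in log_queries)
    return [
        query for query, n in counts.most_common()
        if n >= min_count and len(query.split()) >= min_words
    ]

logs = [
    "reset password", "Reset Password", "asdf",
    "how to cancel subscription", "how to cancel subscription",
]
print(candidate_queries(logs))  # noise like "asdf" is filtered out
```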
from typing import List, Dict
import numpy as np
def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""Precision@K: What fraction of top-k results are relevant?"""
retrieved_k = retrieved[:k]
relevant_set = set(relevant)
hits = sum(1 for doc in retrieved_k if doc in relevant_set)
return hits / k
def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""Recall@K: What fraction of relevant docs are in top-k?"""
retrieved_k = set(retrieved[:k])
relevant_set = set(relevant)
if not relevant_set:
return 0.0
hits = len(retrieved_k & relevant_set)
return hits / len(relevant_set)
def mrr(retrieved: List[str], relevant: List[str]) -> float:
"""Mean Reciprocal Rank: How high is the first relevant result?"""
relevant_set = set(relevant)
for rank, doc in enumerate(retrieved, start=1):
if doc in relevant_set:
return 1.0 / rank
return 0.0
def ndcg_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""NDCG@K: Normalized Discounted Cumulative Gain."""
relevant_set = set(relevant)
# DCG: sum of relevance / log2(rank+1)
dcg = sum(
1.0 / np.log2(rank + 2) # +2 because rank is 0-indexed
for rank, doc in enumerate(retrieved[:k])
if doc in relevant_set
)
# Ideal DCG: all relevant docs at top
ideal_length = min(len(relevant), k)
idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_length))
return dcg / idcg if idcg > 0 else 0.0
def evaluate_search(search_fn, evaluation_set: List[Dict], k: int = 10) -> Dict:
"""
Run evaluation across all queries.
Args:
search_fn: Function that takes query and returns list of doc IDs
evaluation_set: List of {query, relevant_docs} dicts
k: Cutoff for metrics
Returns:
Dict with average metrics
"""
metrics = {"precision": [], "recall": [], "mrr": [], "ndcg": []}
for item in evaluation_set:
retrieved = search_fn(item["query"])
relevant = item["relevant_docs"]
metrics["precision"].append(precision_at_k(retrieved, relevant, k))
metrics["recall"].append(recall_at_k(retrieved, relevant, k))
metrics["mrr"].append(mrr(retrieved, relevant))
metrics["ndcg"].append(ndcg_at_k(retrieved, relevant, k))
return {
f"precision@{k}": np.mean(metrics["precision"]),
f"recall@{k}": np.mean(metrics["recall"]),
"mrr": np.mean(metrics["mrr"]),
f"ndcg@{k}": np.mean(metrics["ndcg"]),
}
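To sanity-check the metric definitions, here is the arithmetic worked through on a tiny example, using inline computations that mirror the functions above:

```python
import math

retrieved = ["d1", "d9", "d2"]   # what search returned, in rank order
relevant = {"d1", "d2", "d3"}    # ground truth

# Precision@3: 2 of the 3 returned docs are relevant -> 2/3
precision = sum(d in relevant for d in retrieved) / 3

# Recall@3: 2 of the 3 relevant docs were returned -> 2/3
recall = sum(d in relevant for d in retrieved) / len(relevant)

# MRR: first relevant doc is at rank 1 -> 1.0
mrr = next(1 / r for r, d in enumerate(retrieved, 1) if d in relevant)

# NDCG@3: hits at ranks 1 and 3 -> 1/log2(2) + 1/log2(4) = 1.5
dcg = sum(1 / math.log2(rank + 1)
          for rank, d in enumerate(retrieved, 1) if d in relevant)
idcg = sum(1 / math.log2(i + 1) for i in range(1, min(len(relevant), 3) + 1))
ndcg = dcg / idcg  # 1.5 / 2.1309... ~ 0.704
```

Note how NDCG penalizes the relevant document sitting at rank 3 instead of rank 2, which plain precision and recall cannot see.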
Once you can measure quality, systematically test improvements: swap embedding models, toggle re-ranking on and off, and adjust the RRF constant k and the number of candidates fed to the re-ranker. Change one variable at a time and compare each run's metrics against the baseline.
Track metrics over time as your content and query patterns evolve.
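One lightweight way to do that tracking is a regression check between evaluation runs; this sketch (names and tolerance are illustrative) flags any metric that dropped beyond a threshold:

```python
def metric_regressions(baseline: dict[str, float], current: dict[str, float],
                       tolerance: float = 0.02) -> list[str]:
    """Return names of metrics that dropped by more than `tolerance`
    compared to the baseline run."""
    return [
        name for name, base_value in baseline.items()
        if name in current and base_value - current[name] > tolerance
    ]

baseline = {"precision@10": 0.72, "recall@10": 0.85, "mrr": 0.61}
current = {"precision@10": 0.74, "recall@10": 0.79, "mrr": 0.60}
print(metric_regressions(baseline, current))  # ['recall@10']
```

Wiring a check like this into CI catches quality regressions when content, models, or fusion parameters change.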
Moving from prototype to production requires attention to reliability, performance, and maintainability.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import psycopg2
from sentence_transformers import SentenceTransformer, CrossEncoder
from contextlib import contextmanager
import os
app = FastAPI(title="Semantic Search API")
# Initialize models once at startup.
# Note: bge-large-en-v1.5 produces 1024-dimensional vectors, so the
# documents table must be declared with vector(1024) to match.
embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
rerank_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
# Database connection pool
from psycopg2 import pool
db_pool = pool.ThreadedConnectionPool(
minconn=2,
maxconn=10,
dsn=os.environ["DATABASE_URL"]
)
@contextmanager
def get_db():
conn = db_pool.getconn()
try:
yield conn
finally:
db_pool.putconn(conn)
class SearchRequest(BaseModel):
query: str
limit: int = 10
filters: dict = {}
use_rerank: bool = True
class SearchResult(BaseModel):
id: int
title: str
content: str
score: float
metadata: dict
class SearchResponse(BaseModel):
results: list[SearchResult]
query: str
total_candidates: int
@app.post("/search", response_model=SearchResponse)
def search(request: SearchRequest):
"""
Hybrid semantic search with optional re-ranking.
"""
# Generate query embedding
query_prefixed = f"Represent this sentence for searching relevant passages: {request.query}"
query_embedding = embedding_model.encode(query_prefixed, normalize_embeddings=True).tolist()
    with get_db() as conn:
        # Hybrid search: combine semantic and keyword results.
        # Note: hybrid_search as defined above takes no filters argument;
        # to honor request.filters, add a WHERE metadata @> ... clause
        # (as in filtered_search) to its queries.
        candidates = hybrid_search(
            conn,
            request.query,
            query_embedding,
            limit=50  # Get more candidates for re-ranking
        )
total_candidates = len(candidates)
# Re-rank if requested and we have results
if request.use_rerank and candidates:
pairs = [(request.query, doc["content"]) for doc in candidates]
scores = rerank_model.predict(pairs)
for doc, score in zip(candidates, scores):
doc["score"] = float(score)
candidates.sort(key=lambda x: x["score"], reverse=True)
else:
# Use similarity scores from initial retrieval
for i, doc in enumerate(candidates):
doc["score"] = doc.get("similarity", 1.0 - i * 0.01)
results = [
SearchResult(
id=doc["id"],
title=doc["title"],
content=doc["content"][:500], # Truncate for response
score=doc["score"],
metadata=doc.get("metadata", {})
)
for doc in candidates[:request.limit]
]
return SearchResponse(
results=results,
query=request.query,
total_candidates=total_candidates
)
@app.post("/index")
def index_document(title: str, content: str, metadata: dict = {}):
"""Add a document to the search index."""
embedding = embedding_model.encode(content, normalize_embeddings=True).tolist()
with get_db() as conn:
with conn.cursor() as cur:
cur.execute(
"""
INSERT INTO documents (title, content, embedding, metadata)
VALUES (%s, %s, %s::vector, %s::jsonb)
RETURNING id
""",
(title, content, embedding, metadata)
)
doc_id = cur.fetchone()[0]
conn.commit()
return {"id": doc_id, "status": "indexed"}
@app.get("/health")
def health_check():
"""Health check endpoint."""
with get_db() as conn:
with conn.cursor() as cur:
cur.execute("SELECT 1")
return {"status": "healthy"}
Avoid these frequent mistakes in semantic search implementations:

- Relying on vector search alone, which misses exact matches for error codes and product names
- Embedding queries with a different model (or dimension count) than the indexed documents
- Omitting model-specific query prefixes, such as the BGE instruction prefix
- Tuning by intuition instead of building an evaluation set first
- Re-ranking hundreds of candidates and blowing the latency budget
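A cheap guard against model and dimension mismatches is to validate embeddings before they reach the database. A minimal sketch, assuming a vector(1536) column (the constant and function name are illustrative):

```python
EXPECTED_DIM = 1536  # must match the vector(N) declared in your schema

def validate_embedding(embedding: list[float],
                       expected_dim: int = EXPECTED_DIM) -> list[float]:
    """Fail fast on dimension mismatches and NaNs before they reach
    the database."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected_dim}"
        )
    if any(x != x for x in embedding):  # NaN is the only value unequal to itself
        raise ValueError("embedding contains NaN")
    return embedding
```

Calling this in the indexing path turns a silent "similarity scores look wrong" bug into an immediate, debuggable error.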
At Virtido, we help enterprises build production-grade search infrastructure that understands meaning, not just keywords — combining vector search expertise with practical AI engineering through our AI Hub.
We've built semantic search systems for clients across FinTech, legal tech, healthcare, and enterprise software. Our staff augmentation model provides vetted talent with Swiss contracts and full IP protection.
Semantic search represents a fundamental shift in how applications connect users with information. Rather than requiring users to guess the exact words content creators used, semantic search understands intent and matches by meaning. The technology is mature, the tools are production-ready, and the user experience improvements are substantial.
The most effective implementations combine multiple techniques: embeddings for semantic understanding, BM25 for exact matches, and re-ranking for precision. This hybrid approach handles the full range of queries users actually make — from conceptual questions to specific error codes. Building evaluation infrastructure early enables data-driven improvements rather than guesswork.
Whether you're building a knowledge base, document search, or the retrieval layer for a RAG system, semantic search transforms "I know the answer exists somewhere" frustration into "found it on the first try" satisfaction. The implementation patterns in this guide provide a foundation for search that truly understands what users are looking for.
Keyword search finds documents containing specific words — it requires exact or near-exact matches between query terms and document text. Semantic search uses embeddings to match by meaning, finding relevant documents even when they use completely different words. "Car repair" finds "automobile maintenance" with semantic search because the embeddings capture that both phrases mean the same thing.
For most applications, OpenAI's text-embedding-3-small offers the best balance of quality and simplicity. If you need self-hosted embeddings to avoid API costs, BGE-large or E5-large provide near-commercial quality. For multilingual content, use Cohere embed-v3 or multilingual-e5. Start with a general-purpose model, then evaluate domain-specific options if quality is insufficient.
Costs have three components: embedding generation, vector storage, and compute for search. OpenAI embeddings cost roughly $0.02 per million tokens (about 750,000 words). Vector storage in pgvector uses your existing PostgreSQL infrastructure. Managed vector databases like Pinecone start around $70/month for production workloads. Self-hosted options reduce per-query costs but require infrastructure management. A typical 1 million document index costs $50-200/month to operate.
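As a back-of-envelope check using the roughly $0.02 per million tokens figure above (the corpus size and tokens per document are illustrative assumptions):

```python
# Rough one-time embedding cost for a corpus, using the ~$0.02 per
# 1M tokens figure quoted above for text-embedding-3-small.
PRICE_PER_MILLION_TOKENS = 0.02

num_documents = 1_000_000       # assumed corpus size
avg_tokens_per_doc = 500        # assumed document length

total_tokens = num_documents * avg_tokens_per_doc   # 500M tokens
cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost:.2f}")  # $10.00 to embed the whole corpus once
```

Embedding is usually the cheap part; ongoing storage and query compute dominate the monthly figure.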
Hybrid search is recommended for almost all production systems. Vector-only search misses exact matches — important for error codes, product names, and technical terms. Hybrid search combines semantic similarity with keyword matching (BM25), capturing both conceptual matches and exact terms. The overhead is minimal, and hybrid search rarely performs worse than either method alone.
Re-ranking improves precision for the top results, which matters most for user experience. Cross-encoders are more accurate than bi-encoders because they can directly compare query and document. If your search results are "close but not quite right," re-ranking often fixes the ordering. Skip re-ranking if latency is critical or if initial retrieval quality is already sufficient for your use case.
Build an evaluation set with queries and their relevant documents (ground truth). Key metrics include Precision@K (what fraction of top-k results are relevant), Recall@K (what fraction of relevant docs are in top-k), MRR (Mean Reciprocal Rank — how high is the first relevant result), and NDCG (accounts for ranking position). Collect queries from search logs and support tickets to reflect real usage patterns.
Use multilingual embedding models like Cohere embed-v3 or multilingual-e5-large, which create comparable embeddings across 100+ languages. A query in English will match relevant documents in German, Spanish, or Japanese. For best results, consider language-specific tuning and ensure your evaluation set includes cross-lingual queries. Some applications benefit from separate indexes per language with language detection routing.
Update strategy depends on content freshness requirements. Near-real-time indexing (seconds to minutes) suits dynamic content like support tickets or chat history. Daily batch indexing works for documentation or knowledge bases that change infrequently. Always use incremental updates where possible — re-embed only changed documents rather than rebuilding the entire index.
Embedding generation takes 10-50ms per query depending on model and hardware. Vector search with HNSW indexes returns results in 5-20ms for million-scale indexes. Re-ranking adds 50-200ms depending on the number of candidates and re-ranker model. Total end-to-end latency of 100-300ms is typical for production systems. Latency increases with index size and result set size.
Not necessarily. PostgreSQL with pgvector handles millions of vectors and integrates with existing infrastructure — often the best starting point. Purpose-built vector databases (Pinecone, Weaviate, Qdrant) become valuable at larger scale (tens of millions of vectors), when you need advanced features like built-in hybrid search, or when operational simplicity of a managed service outweighs the cost premium.