(1) Chunk your documents thoughtfully. (2) Embed each chunk with an off-the-shelf model. (3) Store the vectors in a database that supports approximate nearest-neighbor search. (4) At query time, run hybrid retrieval — vector search plus BM25 — and fuse the results. (5) Rerank the top N with a cross-encoder. Evaluate the whole thing on a labeled query set in CI. The system that ships is the one with that evaluation loop, not the one with the fanciest model.
"Build a semantic search engine" sounds like a research project. It isn't. Most production semantic search systems in 2026 are roughly the same five components stitched together, and a small team can ship a working version in a couple of weeks. The hard parts are the parts no one talks about: chunking strategy, evaluation discipline, and the gap between "the demo works" and "it works on the long tail of weird queries your users actually type."
This is a practical guide to each piece — what to use, what to skip, and what the production-grade version looks like by the end. It's aimed at engineers who already know what an embedding is and want a usable architecture, not a paper review.
The five-component architecture (and what each one does)
| Component | Job | The variable to tune |
|---|---|---|
| Chunker | Split documents into searchable units | Chunk size, overlap, semantic vs fixed |
| Embedder | Convert chunks and queries into vectors | Model choice, dimension, modality |
| Vector store | Index vectors and find nearest neighbors | Index type (HNSW, IVF), metric, filters |
| Hybrid retriever | Combine vector and keyword search | Fusion method, weight balance |
| Reranker | Reorder the top N by relevance | Cross-encoder choice, top-N cutoff |
Build them in order. Skip components only after you've measured why you don't need them. The most common production regression is teams skipping the reranker because "vector search seems good" — and then watching precision drop when real-world queries arrive.
Step 1: Chunking — the part nobody respects enough
Before any embeddings, you have to decide what unit you're going to search over. A single huge document embedded as one vector is almost always a bad idea — the embedding gets diluted by everything in the document. A single sentence per chunk is the other extreme and usually too granular. The right answer is in the middle, and it depends on what your users are searching for.
The three patterns that work:
- Fixed-size character or token chunks with overlap. 500 to 1,000 tokens per chunk, 100 to 200 tokens of overlap between adjacent chunks. The overlap means a phrase that straddles a boundary still ends up in at least one whole chunk. This is the boring default that ships most systems.
- Semantic chunking. Split on natural boundaries — section headings, paragraph breaks, code blocks. More work to implement; gives noticeably better retrieval on long-form content where the structure is meaningful.
- Multi-granularity. Store both small chunks (sentences) and large chunks (sections or whole documents) as separate vectors. Retrieve small chunks for precision, then expand to the parent chunk for context. The "parent document retriever" pattern.
One thing to skip: don't index PDFs as a single chunk. Don't index a whole web page as a single chunk. Don't index a 50-paragraph blog post as a single chunk. You will get vague retrieval and you will not understand why. The chunk is the unit of meaning your retrieval sees — make it the right size.
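The fixed-size default can be sketched in a few lines. This is a minimal illustration, assuming the document is already tokenized; `chunk_tokens` and its parameter names are hypothetical, and whitespace splitting stands in for whatever tokenizer your embedding model uses.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, size=4, overlap=2)
# chunks[0] ends with "three four" and chunks[1] starts with "three four"
```

With `size=4` and `overlap=2`, each window repeats the last two tokens of the previous one, so a two-token phrase on a boundary still appears whole in at least one chunk.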
Step 2: Embeddings — pick a model, don't fine-tune yet
An embedding model takes a chunk of text and returns a vector — a fixed-length list of floating-point numbers that represents the chunk's meaning. Two chunks with similar meaning end up close together in vector space; unrelated chunks end up far apart.
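To make "close together" concrete, here is a small sketch of cosine similarity, one common way to compare embeddings. The three vectors are made-up toy values, not output from any real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumption); real embeddings have hundreds of dimensions.
refund_policy = [0.9, 0.1, 0.0]
return_policy = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.1, 0.9]

similar = cosine_similarity(refund_policy, return_policy)
unrelated = cosine_similarity(refund_policy, gpu_drivers)
# similar is close to 1.0, unrelated is close to 0.0
```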
In 2026 there are a handful of standard choices: hosted models from the major AI providers, and several strong open-source options that you can self-host. The honest answer for which to pick: start with whatever is one API call away in your existing stack, measure on your queries, and switch later if needed.
The three things to actually decide:
- Hosted vs self-hosted. Hosted is simpler, has per-token pricing, and ties you to a vendor. Self-hosted gives you cost control and data residency, but you have to operate GPU infrastructure yourself.
- Embedding dimension. Higher-dimensional embeddings often retrieve marginally better; they also cost more to store and query. 768 to 1,536 dimensions is the common production range. Matryoshka-style embeddings let you truncate to fewer dimensions when storage matters.
- Modality. Text-only is the default. If you're searching images, code, or audio, pick a model trained for that modality — using a text model on code or images usually underperforms.
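Matryoshka-style truncation is simple mechanically: keep the first few components and rescale to unit length. The sketch below assumes your model was trained for this (truncating an ordinary embedding usually degrades quality), and `truncate_embedding` is a hypothetical helper name.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector (assumption), truncated to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0, 0.5], dims=2)
```

Renormalizing matters if your vector store uses dot product as the metric, since truncated vectors no longer have unit length on their own.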
And one thing not to do on day one: don't fine-tune the embedding model. Off-the-shelf models are good enough that the marginal lift from fine-tuning is rarely worth the data-collection effort. The first lift you should reach for is a reranker, which we'll get to.
A minimal Python example for embedding a chunk:
```python
# Pseudo-code; substitute the embedding provider of your choice
from embedding_client import EmbeddingClient

client = EmbeddingClient(model="your-embedding-model")

def embed(texts: list[str]) -> list[list[float]]:
    # Batch to keep request count down
    out = []
    for i in range(0, len(texts), 100):
        batch = texts[i:i + 100]
        resp = client.embed(batch)
        out.extend(resp.vectors)
    return out
```
The unsexy detail that matters in production: always batch your embedding calls. Per-request overhead dominates for one-at-a-time embedding, and at index-build time you'll be embedding hundreds of thousands of chunks. Batching can cut wall-clock time by an order of magnitude or more.
Step 3: The vector store
You need somewhere to put the vectors that can do fast approximate nearest-neighbor (ANN) search over them. The common options:
- pgvector — vector search inside Postgres. If you already run Postgres, this is by far the lowest-friction starting point. Good enough for many production workloads up to several million vectors.
- Qdrant, Weaviate, Milvus — purpose-built open-source vector databases with rich filtering, strong write throughput, and hosted offerings. Common picks once you outgrow pgvector or need rich metadata filtering.
- Pinecone — managed-only, simple to integrate, popular when teams don't want to run vector infrastructure. Per-vector pricing.
- Elasticsearch / OpenSearch with vector support — useful if you're already operating a Lucene-based search stack and want vector and BM25 in one place.
What to ignore until you've measured: the choice of ANN index (HNSW vs IVF vs DiskANN) and the choice of distance metric (cosine vs dot product vs Euclidean). Use the defaults of whichever database you pick. These knobs matter for advanced tuning; they almost never matter on day one.
What to think about up front: filters. Real-world search nearly always has filters: date range, category, language, user-permitted set. Make sure your chosen vector database supports efficient pre-filtering or hybrid filtering. Post-filtering (searching first, then discarding results that fail the filter) forces you to over-fetch from a 1M-vector index, which is slow, and can still return too few hits when the filter is selective.
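A toy sketch of why filter placement matters. Exhaustive scoring stands in for the ANN index here, and the documents, vectors, and function names are all made up for illustration.

```python
from typing import Callable

Doc = dict  # each doc: {"id": str, "lang": str, "vector": list[float]}


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], docs: list[Doc], k: int) -> list[Doc]:
    # Exhaustive scoring stands in for an ANN index (assumption).
    return sorted(docs, key=lambda d: dot(query, d["vector"]), reverse=True)[:k]


def post_filter_search(query, docs, k: int, keep: Callable[[Doc], bool]) -> list[Doc]:
    # Search first, filter afterwards: may return fewer than k hits.
    return [d for d in top_k(query, docs, k) if keep(d)]


def pre_filter_search(query, docs, k: int, keep: Callable[[Doc], bool]) -> list[Doc]:
    # Filter first, then search only the permitted subset.
    return top_k(query, [d for d in docs if keep(d)], k)


def is_german(doc: Doc) -> bool:
    return doc["lang"] == "de"


docs = [
    {"id": "a", "lang": "en", "vector": [0.9, 0.1]},
    {"id": "b", "lang": "en", "vector": [0.8, 0.2]},
    {"id": "c", "lang": "de", "vector": [0.7, 0.3]},
    {"id": "d", "lang": "de", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]

post_ids = [d["id"] for d in post_filter_search(query, docs, 2, is_german)]
pre_ids = [d["id"] for d in pre_filter_search(query, docs, 2, is_german)]
```

Here the two best matches overall are both English, so post-filtering for German returns nothing, while pre-filtering returns the two German documents. The usual workaround for post-filtering is to over-fetch, which costs latency and still gives no guarantee.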
Step 4: Hybrid retrieval — the unloved step that wins production
Pure semantic search has a well-known weakness: it's bad at exact matches. A query for an exact product SKU, a specific employee name, a version number, or a piece of rare jargon will often miss documents that contain the exact string. The reason is that the embedding model doesn't memorize identifiers; it represents meaning.
The fix is hybrid retrieval: at query time, run both vector search and a keyword-based search (BM25), then fuse the results. The two retrieval methods complement each other — vector search finds semantic matches, BM25 finds exact matches.
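If you have not seen BM25 written out, here is a minimal sketch of Okapi BM25 with a Lucene-style non-negative IDF. In practice you would use the implementation in your search engine or database; the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "reset your password from the account page",
    "order SKU-48213 ships in two days",
    "shipping times for international orders",
]
scores = bm25_scores("SKU-48213", docs)
# Only the second document contains the exact identifier, so only it scores above zero.
```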
The standard fusion approach is Reciprocal Rank Fusion (RRF), which has the rare property of being almost embarrassingly simple and almost always working:
```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings is a list of ranked lists of doc IDs from each retriever
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```
You take the top, say, 50 results from each retriever, fuse them with RRF, and pass the resulting list to the reranker. The k=60 constant is a well-known starting value; nudge it later if needed.
Some vector databases ship native hybrid search and do the fusion for you. If yours does, use it. If it doesn't, the few lines above are the entirety of what you need.
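The component table lists weight balance as a tuning variable for the hybrid retriever. One way to expose that dial is a weighted variant of RRF. This is a sketch offered as an assumption, not something this guide prescribes: `weighted_rrf` is a hypothetical name, and with equal weights it reduces to plain RRF.

```python
from typing import Optional


def weighted_rrf(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists of doc IDs; each list contributes weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(rankings)  # equal weights: standard RRF
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for position, doc_id in enumerate(ranking, start=1):  # 1-based rank
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + position)
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings (assumption): doc IDs as returned by each retriever, best first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_d", "doc_b", "doc_a"]

equal = weighted_rrf([vector_hits, bm25_hits])
keyword_heavy = weighted_rrf([vector_hits, bm25_hits], weights=[1.0, 2.0])
```

With equal weights, documents found by both retrievers (`doc_a`, `doc_b`) outrank the BM25-only top hit `doc_d`. Doubling the BM25 weight moves `doc_b` ahead of `doc_a`, because `doc_b` ranks higher in the keyword list.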
Step 5: Reranking — the precision step
The retrieval step optimizes for recall: "did the right answer make it into the top 50?" The reranker optimizes for precision: "is the right answer in the top 3?" Without a reranker, you'll have good retrieval and mediocre user-facing quality. With one, the same retrieval feels noticeably more intelligent.
A reranker is typically a cross-encoder model that takes a query and a document together and returns a relevance score for the pair. Bi-encoders (the embedding models you used for retrieval) embed query and document separately, which is fast but less accurate. Cross-encoders attend to query and document together, which is slower but more accurate. You can't use a cross-encoder for first-stage retrieval because you'd have to score every document in your corpus for every query. Over the top 50 from retrieval, though, a cross-encoder is cheap and the quality gain is often dramatic.
Two common patterns:
- Hosted reranker. Several providers offer hosted rerankers as an API. Lowest-friction. Pay per query.
- Self-hosted cross-encoder. Open-source cross-encoder models can be self-hosted on a small GPU. Cheaper at scale; more operational overhead.
The rule of thumb: rerank the top 50, return the top 5 to 10 to the user. Adjusting top-N is one of the cheapest dials you have for trading off cost and quality.
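The retrieve-then-rerank shape looks roughly like this. The sketch takes the pair scorer as a parameter so you can plug in a hosted reranker or a self-hosted cross-encoder; `overlap_score` is a deliberately crude stand-in for a real model, and all names here are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 50,
    return_k: int = 5,
) -> list[str]:
    """Score the top_n retrieved candidates as (query, doc) pairs and keep the best return_k."""
    pool = candidates[:top_n]  # only the head of the retrieval list gets reranked
    scored = [(score_pair(query, doc), doc) for doc in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [doc for _, doc in scored[:return_k]]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


# Candidates in the order the (hypothetical) retriever returned them.
candidates = [
    "shipping times for international orders",
    "how to reset a forgotten password",
    "password strength requirements",
]
best = rerank("reset password", candidates, overlap_score, top_n=50, return_k=2)
```

Both `top_n` and `return_k` are the cost and quality dials: a larger `top_n` means more pair scorings per query.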
The thing that actually matters: evaluation
Most semantic search systems that fail in production fail because the team built the pipeline and never built an evaluation loop. Without evaluation you have no way to know whether swapping the embedding model helped or hurt. You'll make changes on vibes, regress quietly, and one day notice quality dropping with no idea why.
The evaluation loop has three parts.
1. A labeled query set
Collect 100 to 500 real queries that your users (or representative users) would ask. For each query, identify the relevant documents — manually if you have to, more efficiently with LLM-assisted labeling if your corpus is large. This is unglamorous, time-consuming work. It is also the highest-leverage thing you'll do.
Keep this query set under version control. Never edit it without recording what changed. When you swap a model, run against the same set and compare results.
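One workable storage format is a JSON Lines file checked in next to the code, one labeled query per line. The field names (`query_id`, `query`, `relevant_ids`) and the sample records below are assumptions for illustration, not a standard.

```python
import json

# Inline sample standing in for a checked-in file such as a .jsonl in your repo (assumption).
SAMPLE_JSONL = """\
{"query_id": "q1", "query": "reset password", "relevant_ids": ["doc_12", "doc_40"]}
{"query_id": "q2", "query": "SKU-48213 shipping time", "relevant_ids": ["doc_7"]}
"""


def load_query_set(text: str) -> dict[str, dict]:
    """Parse JSON Lines into {query_id: {"query": str, "relevant_ids": set[str]}}."""
    queries: dict[str, dict] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["query_id"] in queries:
            raise ValueError(f"line {line_no}: duplicate query_id {record['query_id']}")
        if not record["relevant_ids"]:
            raise ValueError(f"line {line_no}: query has no labeled relevant documents")
        queries[record["query_id"]] = {
            "query": record["query"],
            "relevant_ids": set(record["relevant_ids"]),
        }
    return queries


query_set = load_query_set(SAMPLE_JSONL)
```

A line-oriented format keeps diffs readable, which makes it easier to see exactly which labels changed in a given commit.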
2. The right metrics
- Recall@k. Of the relevant documents, what fraction appear in the top k results? Measures whether retrieval found the right answer at all.
- MRR (Mean Reciprocal Rank). For each query, take 1 divided by the rank of the first relevant result (0 if none is returned), then average across queries. Higher is better. Sensitive to top-1 quality.
- NDCG (Normalized Discounted Cumulative Gain). Quality of the full ranked list, weighted toward the top. The metric most production systems converge on.
Report all three. Different changes affect different metrics — a better reranker often boosts MRR and NDCG without changing recall, because it's reordering the same retrieved set.
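All three metrics fit in a few lines each. This sketch assumes binary relevance labels (a document is relevant or it is not) and unique document IDs in each ranked list; graded relevance would need a different gain term in NDCG.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One toy query: the system's ranking and the labeled relevant set.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall = recall_at_k(ranked, relevant, k=3)   # one of two relevant docs in the top 3
rr = reciprocal_rank(ranked, relevant)        # first relevant doc is at rank 2
ndcg = ndcg_at_k(ranked, relevant, k=4)
```

MRR is the mean of `reciprocal_rank` over every query in the labeled set; recall and NDCG are averaged across queries the same way.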
3. Run evaluation in CI
Every PR that changes the embedding model, retrieval method, chunking strategy, or reranker should run the evaluation set and report the deltas. Treat regressions like test failures. Without this discipline, "small" changes silently degrade quality and nobody catches it until users complain — by which point you've shipped half a dozen other changes and don't know which one caused the regression.
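The CI gate itself can be very small: compare the current run's metrics against a stored baseline and fail when anything drops by more than a tolerance. The metric names, numbers, and the 0.01 tolerance below are placeholders, not measurements.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than `tolerance`."""
    regressions = {}
    for name, base_value in baseline.items():
        # A metric missing from the current run counts as 0.0, so it shows up as a regression.
        delta = current.get(name, 0.0) - base_value
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Placeholder numbers for illustration only.
baseline = {"recall@50": 0.92, "mrr": 0.61, "ndcg@10": 0.67}
current = {"recall@50": 0.92, "mrr": 0.57, "ndcg@10": 0.665}

regressions = find_regressions(baseline, current)
# A CI step would print the deltas and exit non-zero when `regressions` is non-empty.
```

The tolerance absorbs small metric noise; here the small NDCG dip passes while the MRR drop is flagged.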
What to skip on day one
- Fine-tuning embeddings. Off-the-shelf is good enough until evaluation says otherwise.
- Custom ANN indexes. Database defaults are fine for the first several million vectors.
- Query rewriting / HyDE / multi-query. All useful at the margin. Build the base system first.
- Agentic retrieval. Tempting, expensive, and often worse than the boring single-shot pipeline.
The order in which to add complexity is: get the five components working → build the evaluation loop → identify the weakest stage from data → fix that stage. Most teams add complexity in the opposite order and end up with a system that's hard to reason about and not measurably better.
Cost discipline
The three places semantic search systems leak money:
- Re-embedding everything when the model changes. Plan for it. Budget for it. Build your indexing job to be re-runnable.
- Reranking on every query. Cheap per query, dominant at scale. Cache rerank scores for popular queries.
- Storing high-dimensional vectors at large scale. 1,536-dimension float32 vectors weigh roughly 6 KB each. A million vectors is 6 GB before any index overhead. Consider quantization or Matryoshka-style truncation if storage cost becomes a concern.
None of these will hurt you at prototype scale. All of them will hurt you at production scale if you didn't plan for them.
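The storage arithmetic is worth running for your own corpus before you pick a dimension. A rough sketch, counting raw vector payload only; the int8 and 512-dimension rows are example settings, not recommendations.

```python
def vector_payload_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; ANN index structures and metadata come on top."""
    return num_vectors * dims * bytes_per_value


full = vector_payload_bytes(1_000_000, 1536)                     # float32: 4 bytes per value
int8 = vector_payload_bytes(1_000_000, 1536, bytes_per_value=1)  # scalar quantization to 1 byte
truncated = vector_payload_bytes(1_000_000, 512)                 # truncated to 512 dims, float32

for label, size in [("float32 x 1536", full), ("int8 x 1536", int8), ("float32 x 512", truncated)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

For a million vectors this works out to about 6.14 GB at full size, 1.54 GB with one byte per value, and 2.05 GB at 512 dimensions.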
What "good" looks like
A production-quality semantic search system in 2026 looks like this: documents flow through a deterministic chunking pipeline, get embedded with a versioned embedder, are stored in a vector database with rich metadata filters, and are queried via hybrid retrieval with reranking. There's a labeled evaluation set checked in alongside the code, run in CI on every change, with metrics published on a dashboard. Quality regressions block merges. New features go through the same loop.
None of that is research. All of it is engineering. The hardest part is the evaluation set, which is also the part that lets every other change in the system be safe.
A note on RAG vs search
"Semantic search" and "retrieval-augmented generation" share the same retrieval layer. If you build the system above, you also have the retriever for a RAG application — the only thing you add is the LLM that consumes the retrieved chunks and produces an answer. Many teams skip the search-first step and jump straight to RAG, then wonder why their LLM gives bad answers. The answer is almost always that retrieval is the bottleneck. Build search first, evaluate it, then layer generation on top.
If you want the RAG-specific layer, the same five-component architecture applies — just with an LLM call at the end consuming the reranked chunks. See our guides on RAG architecture and chunking strategies for the generation-side details.
Pulling it together
A working semantic search system in 2026 is not a research moonshot. It's five well-understood components — chunking, embedding, a vector store, hybrid retrieval, reranking — wired together with a serious evaluation loop. The teams that ship great search are not the ones with the fanciest models. They're the ones with the labeled query set, the CI-enforced metrics, and the discipline to swap one component at a time and measure the effect.
If you're starting from zero this week, the order is: build the boring version in a week, build the evaluation set in week two, then iterate on whichever component the metrics tell you is weakest. Within a month you'll have a system that genuinely outperforms keyword search on real queries — and you'll know why.