RAG pipeline at scale
Reduced retrieval latency by 60% and cost by 40% for a 10M-document knowledge base using hybrid search and custom reranking.
Tags: RAG, LLM
Context
The client needed a retrieval pipeline that could handle 10M+ documents with sub-second latency for their internal knowledge base. We evaluated hybrid search (BM25 + vector), reranking models, and caching strategies.
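A hybrid scorer of this kind typically blends a lexical (BM25-style) score with a vector-similarity score per candidate. The sketch below is illustrative only, not the client's actual scorer: the min-max normalization, the `alpha` weight, and the example scores are all assumptions.

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, vector_scores, alpha=0.5):
    """Blend normalized lexical and vector scores; alpha weights the lexical side."""
    b = normalize(bm25_scores)
    v = normalize(vector_scores)
    return [alpha * bs + (1 - alpha) * vs for bs, vs in zip(b, v)]

# Example: three candidate documents with hypothetical raw scores.
blended = hybrid_scores([12.0, 3.5, 8.1], [0.82, 0.91, 0.40], alpha=0.6)
best = max(range(len(blended)), key=blended.__getitem__)
```

Normalizing before blending matters because raw BM25 scores are unbounded while cosine similarities sit in a fixed range; without it, one signal dominates regardless of `alpha`.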
Approach
We implemented a two-stage retrieval pipeline: fast candidate retrieval with a custom hybrid scorer, then a cross-encoder reranker. Results were cached with a TTL tuned to the update frequency of the corpus.
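The two-stage shape with a TTL cache in front can be sketched as below. This is a minimal stand-in, not the production pipeline: the candidate scores, the reranker callable, the shortlist size of 10, and the TTL value are all placeholder assumptions.

```python
import time

class TTLCache:
    """Tiny time-based cache: entries expire after `ttl` seconds."""
    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stamp = hit
        if time.monotonic() - stamp > self.ttl:
            del self._store[key]
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

def retrieve(query, candidates, rerank, cache, top_k=3):
    """Stage 1: cheap candidate scoring; stage 2: expensive rerank of the shortlist."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    # Stage 1: assume each candidate already carries a fast hybrid score.
    shortlist = sorted(candidates, key=lambda d: d["fast_score"], reverse=True)[:10]
    # Stage 2: rerank only the shortlist with the (expensive) cross-encoder stand-in.
    ranked = sorted(shortlist, key=lambda d: rerank(query, d), reverse=True)[:top_k]
    cache.put(query, ranked)
    return ranked

# Hypothetical usage: 20 synthetic documents, reranker reads a "quality" field.
cache = TTLCache(ttl=60.0)
docs = [{"id": i, "fast_score": i, "quality": -i} for i in range(20)]
top = retrieve("onboarding policy", docs, lambda q, d: d["quality"], cache)
```

The point of the split is cost asymmetry: the cheap stage-1 score touches every candidate, while the reranker runs on only a fixed-size shortlist, so latency stays bounded as the corpus grows. Tuning the TTL to the corpus update frequency bounds how stale a cached answer can be.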
Results & metrics
- Latency reduction: 60%
- Cost reduction: 40%