RAG pipeline at scale
Reduced retrieval latency by 60% and cost by 40% for a 10M-document knowledge base using hybrid search and custom reranking.
Tags: RAG, LLM
Context
The client needed a retrieval pipeline that could handle 10M+ documents with sub-second latency for their internal knowledge base. We evaluated hybrid search (BM25 + vector), reranking models, and caching strategies.
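A hybrid scorer of this kind typically blends a lexical (BM25-style) score with a vector-similarity score per candidate. The sketch below is illustrative only, not the client's actual scorer: the min-max normalization, the `alpha` weight, and the example scores are all assumptions.

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, vector_scores, alpha=0.5):
    """Blend normalized lexical and vector scores; alpha weights the lexical side."""
    b = normalize(bm25_scores)
    v = normalize(vector_scores)
    return [alpha * bs + (1 - alpha) * vs for bs, vs in zip(b, v)]

# Example: three candidate documents with hypothetical raw scores.
blended = hybrid_scores([12.0, 3.5, 8.1], [0.82, 0.91, 0.40], alpha=0.6)
best = max(range(len(blended)), key=blended.__getitem__)
```

Normalizing before blending matters because raw BM25 scores are unbounded while cosine similarities sit in a fixed range; without it, one signal dominates regardless of `alpha`.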
Approach
We implemented a two-stage retrieval pipeline: fast candidate retrieval with a custom hybrid scorer, then a cross-encoder reranker. Results were cached with a TTL tuned to the update frequency of the corpus.
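The two-stage shape with a TTL cache in front can be sketched as below. This is a minimal stand-in, not the production pipeline: the candidate scores, the reranker callable, the shortlist size of 10, and the TTL value are all placeholder assumptions.

```python
import time

class TTLCache:
    """Tiny time-based cache: entries expire after `ttl` seconds."""
    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stamp = hit
        if time.monotonic() - stamp > self.ttl:
            del self._store[key]
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

def retrieve(query, candidates, rerank, cache, top_k=3):
    """Stage 1: cheap candidate scoring; stage 2: expensive rerank of the shortlist."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    # Stage 1: assume each candidate already carries a fast hybrid score.
    shortlist = sorted(candidates, key=lambda d: d["fast_score"], reverse=True)[:10]
    # Stage 2: rerank only the shortlist with the (expensive) cross-encoder stand-in.
    ranked = sorted(shortlist, key=lambda d: rerank(query, d), reverse=True)[:top_k]
    cache.put(query, ranked)
    return ranked

# Hypothetical usage: 20 synthetic documents, reranker reads a "quality" field.
cache = TTLCache(ttl=60.0)
docs = [{"id": i, "fast_score": i, "quality": -i} for i in range(20)]
top = retrieve("onboarding policy", docs, lambda q, d: d["quality"], cache)
```

The point of the split is cost asymmetry: the cheap stage-1 score touches every candidate, while the reranker runs on only a fixed-size shortlist, so latency stays bounded as the corpus grows. Tuning the TTL to the corpus update frequency bounds how stale a cached answer can be.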
Results & metrics
- Latency reduction: 60%
- Cost reduction: 40%