Six months before go-live, a mid-sized fintech threw away their first two RAG architectures. Both failed under real production load: queries timed out, costs spiked past budget, and compliance teams blocked the data flow. They needed a system that could handle 50K+ daily queries on a legal document corpus without breaking SLAs or regulatory requirements.
Here’s how we built a production-grade RAG pipeline using Qdrant that cut p95 latency from 1.2s to 180ms, reduced infrastructure costs by 67%, and hit 99.9% uptime for six consecutive months.
The Production Reality
The client’s use case was deceptively simple: their internal legal and compliance teams needed to query a corpus of contracts, regulatory filings, and policy documents. Users expected sub-second responses for queries like “What are the data retention requirements for customer KYC data under EU regulation?”
The first architecture, a managed vector database service with a single orchestrator node, looked fine on paper. In production, it collapsed under peak load. P95 latency crawled past 1.2s, API timeouts spiked to 15% during business hours, and the managed service bill hit $12K/month for 10M embeddings. Worse, the vendor’s data residency policy couldn’t satisfy the client’s compliance team.
The second attempt tried to fix cost by self-hosting Milvus, but operational complexity overwhelmed the team. Cluster management, backup/recovery, and monitoring became a full-time job. When a Milvus node failed during a quarterly audit, recovery took 12 hours.
We needed a third approach that balanced three constraints: sub-200ms p95 latency, under $5K/month infrastructure, and ops burden small enough for a three-person infra team.
The Architecture That Finally Worked
┌─────────────┐
│ Users │
└──────┬──────┘
│
┌──────▼────────────────────────────────────────────────────────────┐
│ API Gateway (Kong) │
│ Rate limiting, Auth │
└──────┬─────────────────────────────────────────────────────────────┘
│
┌──────▼────────────────────────────────────────────────────────────┐
│ Orchestrator Service │
│ - Query parsing and rewriting │
│ - Cache lookups (Redis) │
│ - Parallel search dispatch │
│ - Response synthesis │
└──────┬─────────────────────────────────────────────────────────────┘
│
├───┬─────────────────────┬──────────────────┐
│ │ │ │
┌──────▼─┐│ ┌──────────▼────────┐ ┌───────▼─────┐
│ Redis ││ │ Qdrant Cluster │ │ LLM Service │
│ Cache ││ │ (3 nodes, HNSW) │ │ (GPT-4) │
│ ││ └───────────────────┘ └─────────────┘
└────────┘│
└──────────────────┬───────────────────┐
│ │
┌───────────▼─────────┐ ┌──────▼──────┐
│ Embedding Service │ │ Monitoring │
│ (Text-Embedding-3) │ │ (Prometheus) │
└─────────────────────┘ └─────────────┘

The pipeline has four key components:
API Gateway
Kong handles rate limiting (100 req/min per user), authentication, and request routing. It also injects tracing headers for observability.
Orchestrator Service
A Node.js service built on Express that:
- Parses user queries and extracts key entities
- Checks Redis cache before hitting vector search (60% cache hit rate)
- Dispatches parallel queries to Qdrant (hybrid search) and the LLM for generation
- Synthesizes the final response with citations
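The steps above can be sketched as a single handler. This is a minimal sketch, not the production service: the `redis`, `search`, and `llm` objects are injected stand-ins for ioredis, the Qdrant client, and the LLM service, and their method names are assumptions.

```javascript
// Sketch of the orchestrator's hot path with injected dependencies (assumed interfaces)
async function handleQuery(query, { redis, search, llm }) {
  // 1. Cache lookup: roughly 60% of traffic ends here
  const cached = await redis.get(query)
  if (cached) return JSON.parse(cached)

  // 2. Retrieval and query rewriting run in parallel
  const [hits, rewritten] = await Promise.all([
    search(query),      // hybrid search against Qdrant
    llm.rewrite(query)  // query parsing / expansion
  ])

  // 3. Synthesize the answer from retrieved context, keeping document IDs as citations
  const context = hits.map(h => `[${h.id}] ${h.payload.content}`).join('\n')
  const answer = await llm.generate(rewritten, context)
  const response = { answer, citations: hits.map(h => h.id) }

  // 4. Write back to the cache (2h TTL, per the lessons later in this article)
  await redis.set(query, JSON.stringify(response), 'EX', 7200)
  return response
}
```

Injecting the clients keeps the handler testable without standing up Redis or Qdrant.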
Qdrant Cluster
Three-node deployment on AWS EKS (m6i.2xlarge instances, 8 vCPU, 32GB RAM each) with:
- HNSW index with m=16, ef_construction=128 for fast approximate search
- Hybrid search combining dense vectors (OpenAI text-embedding-3-large) and sparse BM25 keywords
- Payload filtering by document type, jurisdiction, and date range
- Snapshot-based backups to S3 every 6 hours
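The collection setup behind these settings can be sketched as follows. The named vectors (`dense`, `sparse`) and the payload index schemas are assumptions about this deployment; note that Qdrant’s HNSW config key is spelled `ef_construct`, matching the `ef_construction=128` above.

```javascript
// Collection config mirroring the article's settings (vector names are assumptions)
const legalDocsConfig = {
  vectors: { dense: { size: 3072, distance: 'Cosine' } }, // text-embedding-3-large is 3072-dim
  sparse_vectors: { sparse: {} },                          // sparse index for BM25-style keywords
  hnsw_config: { m: 16, ef_construct: 128 }
}

// qdrant is an injected @qdrant/js-client-rest client (assumption)
async function setupCollection(qdrant) {
  await qdrant.createCollection('legal_docs', legalDocsConfig)
  // Index the payload fields used for filtering
  for (const field of ['doc_type', 'jurisdiction']) {
    await qdrant.createPayloadIndex('legal_docs', { field_name: field, field_schema: 'keyword' })
  }
  await qdrant.createPayloadIndex('legal_docs', { field_name: 'effective_date', field_schema: 'datetime' })
}
```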
Caching Layer
Redis Cluster with 3 nodes stores:
- Query-response pairs (2h TTL, tuned down from an initial 24h)
- Frequent query patterns for pre-warming
- Failed queries for rate-limiting abuse detection
Monitoring Stack
Prometheus + Grafana tracks:
- Query latency (p50, p95, p99)
- Cache hit/miss ratios
- Qdrant search latency and memory usage
- LLM token counts and costs
- Error rates per component
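In production these latency percentiles come from Prometheus histograms; for spot-checking raw samples against the dashboards, a nearest-rank computation is enough. A minimal sketch:

```javascript
// Nearest-rank percentile over raw latency samples (ms). Prometheus estimates
// percentiles from histogram buckets; this exact version is useful for spot checks.
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100) // nearest-rank method
  return sorted[Math.max(0, rank - 1)]
}
```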
Why We Chose Qdrant (Over Pinecone and Milvus)
We evaluated three vector databases seriously. Here’s how they compared:
Pinecone
Managed service with excellent developer experience. But at our client’s scale, the economics didn’t work:
- $0.10 per million read operations vs $0.01 with self-hosted Qdrant
- No data residency guarantees in the client’s required regions
- Vendor lock-in would complicate future migration
Pinecone works great for prototypes and smaller workloads, but once you cross ~20K queries/day, the managed premium gets hard to justify.
Milvus
Powerful open-source vector database with strong feature set. But operational complexity was a dealbreaker:
- Requires multiple components (Milvus, etcd, MinIO, Pulsar) to coordinate
- Backup/recovery is fragile; snapshots need careful orchestration
- Documentation is scattered across multiple repos and versions
For a team without a dedicated database engineer, Milvus became a time sink.
Qdrant
Built in Rust with a single binary, simple configuration, and excellent hybrid search. The decision drivers were:
- Single binary deployment: Just run qdrant and it works. No dependency hell.
- Hybrid search out of the box: Combines dense vectors and BM25 in one query, which is critical for legal documents where exact terms matter (e.g., “GDPR Article 17”)
- Cost at scale: 60% cheaper than Pinecone at 50K queries/day
- Compliance: Self-hosted on the client’s VPC with full data control
The trade-off: we’re responsible for ops. But Qdrant’s stability and simple ops model meant that burden was manageable.
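For reference, a hybrid query against this setup can be expressed with Qdrant’s Query API. This is a sketch under assumptions: the vector names (`dense`, `sparse`), prefetch limits, and RRF fusion are illustrative choices, not confirmed settings of this deployment.

```javascript
// Builds a hybrid-search request: dense and sparse prefetch branches
// fused with reciprocal rank fusion (RRF)
function hybridQuery(denseVector, sparseVector, filter, limit = 10) {
  return {
    prefetch: [
      { query: denseVector, using: 'dense', limit: 50 },   // semantic similarity
      { query: sparseVector, using: 'sparse', limit: 50 }  // exact terms, e.g. "GDPR Article 17"
    ],
    query: { fusion: 'rrf' },
    filter, // e.g. { must: [{ key: 'jurisdiction', match: { value: 'EU' } }] }
    limit,
    with_payload: true
  }
}

// Usage with an assumed @qdrant/js-client-rest client:
//   const results = await qdrant.query('legal_docs', hybridQuery(dense, sparse, filter))
```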
The Ingestion Pipeline
Keeping 10M embeddings up to date required its own pipeline. Documents arrive from three sources: CMS (policy updates), OCR (scanned contracts), and API feeds (regulatory updates).
```javascript
const { QdrantClient } = require('@qdrant/js-client-rest')
const { OpenAI } = require('openai')
const Redis = require('ioredis')

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL })
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const redis = new Redis(process.env.REDIS_URL)

async function processBatch(documents) {
  // One embedding request per document (text-embedding-3-large, 3072 dims)
  const embeddings = await Promise.all(
    documents.map(doc =>
      openai.embeddings.create({
        model: 'text-embedding-3-large',
        input: doc.content
      })
    )
  )

  const points = documents.map((doc, i) => ({
    id: doc.id,
    vector: embeddings[i].data[0].embedding,
    payload: {
      title: doc.title,
      content: doc.content,
      doc_type: doc.type,
      jurisdiction: doc.jurisdiction,
      effective_date: doc.effective_date,
      text: doc.content // For sparse search
    }
  }))

  // Upsert in batches of 100; wait: true blocks until each write is applied
  for (let i = 0; i < points.length; i += 100) {
    await qdrant.upsert('legal_docs', { wait: true, points: points.slice(i, i + 100) })
  }

  // Invalidate cache for affected queries
  await redis.flushdb()
}

// Run every hour (in production this loop is a Kubernetes CronJob);
// fetchDocumentsSince polls the document sources (not shown)
setInterval(async () => {
  const newDocs = await fetchDocumentsSince(lastRun)
  await processBatch(newDocs)
}, 60 * 60 * 1000)
```
The ingestion pipeline runs as a Kubernetes CronJob:
- Pulls new documents from the CMS
- Generates embeddings in batches of 100
- Upserts to Qdrant with wait=true for consistency
- Flushes Redis cache to serve fresh results
Performance and Cost Results
After six months in production, here are the actual metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| p95 Latency | 1,200ms | 180ms | 6.7× faster |
| p99 Latency | 2,400ms | 420ms | 5.7× faster |
| Infrastructure Cost | $12K/month | $4K/month | 67% reduction |
| Uptime (6 months) | 97.5% | 99.9% | 25× less downtime |
| Max Throughput | 25K queries/day | 200K queries/day | 8× headroom |
| Cache Hit Rate | N/A | 60% | 60% of queries served from cache |
Cost breakdown per month:
- Qdrant cluster (3× m6i.2xlarge): $1,080
- Redis cluster (3× t3.medium): $180
- Orchestrator service (4× t3.large): $360
- LLM API (GPT-4): $1,800
- Monitoring & observability: $180
- Load balancers & ingress: $200
- Total: ~$3,800/month
What Didn’t Work (And Why)
Attempt 1: Managed Service + Single Orchestrator
What we tried: Pinecone with a single Node.js orchestrator instance. Why it failed: Single point of failure, no horizontal scaling, vendor lock-in. Lesson: Managed services scale horizontally, but your orchestration layer must too.
Attempt 2: Milvus Self-Hosted
What we tried: 3-node Milvus cluster on bare metal. Why it failed: Ops complexity exceeded team capacity. Backup failures and configuration drift caused outages. Lesson: If you don’t have a dedicated DBA, avoid systems that require coordinating multiple components.
Attempt 3: Cache-First Architecture
What we tried: Aggressive caching with 24h TTL, pre-warming frequent queries. Why it failed: Cache invalidation became a nightmare. Legal documents update frequently, and stale answers violated compliance requirements. Lesson: Cache is great for latency, but TTL must match your document update cadence. We settled on 2h TTL.
Known Limitations and Open Questions
This architecture works well for the client’s current scale (50K queries/day), but we see three potential constraints:
- LLM cost scaling: At 200K queries/day, GPT-4 API costs would hit $7.2K/month. We’re evaluating smaller models (Llama 3.1 70B) via vLLM for cost reduction.
- Cold starts: The first query after a deployment can take 3-4 seconds while Qdrant warms up. We’re experimenting with connection pooling and request batching.
- Multi-tenant isolation: The client wants to expose this system to external partners. We need tenant-aware isolation in Qdrant (either separate collections or payload filtering with RBAC).
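The $7.2K figure in the first point is easy to reproduce. A back-of-envelope model, assuming the 60% cache hit rate holds at higher volume and roughly $0.003 per uncached GPT-4 call (the per-call cost implied by $1,800/month at 50K queries/day):

```javascript
// Back-of-envelope monthly LLM spend: only cache misses reach the model.
// costPerCall (~$0.003) is implied by the article's own figures, not a quoted rate.
function monthlyLlmCost(queriesPerDay, cacheHitRate, costPerCall) {
  const uncachedPerMonth = queriesPerDay * 30 * (1 - cacheHitRate)
  return uncachedPerMonth * costPerCall
}

// monthlyLlmCost(50000, 0.6, 0.003)  -> ~$1,800/month (matches the cost table)
// monthlyLlmCost(200000, 0.6, 0.003) -> ~$7,200/month (the scaling concern above)
```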
Where This Approach Breaks Down
This architecture is not a universal solution. Avoid it if:
- Your corpus is <100K documents: Managed services (Pinecone, Weaviate Cloud) will be cheaper and simpler.
- You lack ops capacity: Self-hosted Qdrant requires someone to manage Kubernetes, backups, and monitoring.
- Your latency SLA is <50ms: Vector search at scale rarely gets below 50ms. Consider precomputed answers or traditional search (Elasticsearch) instead.
- Your queries are highly complex: Multi-hop reasoning or deep synthesis will require more sophisticated agents than this pipeline supports.
How to Replicate This
If you’re scaling RAG beyond 20K queries/day, here’s your starting point:
- Deploy Qdrant on Kubernetes: Use the official Helm chart. Start with 2 nodes and scale to 3 for high availability.
- Enable hybrid search: Configure both dense and sparse indexes. Legal and financial documents need exact term matching.
- Add a cache layer: Redis works. Cache query-response pairs, not just search results.
- Monitor everything: Latency per component, cache hit rates, Qdrant memory usage, LLM costs.
- Plan for ingestion: Don’t make ingestion an afterthought. Automate document processing and embedding generation.
If you’re dealing with compliance requirements, multi-jurisdictional data, or scale beyond 100K queries/day, we’ve shipped this architecture. Talk to us; we can help you avoid the three rebuilds this client went through.
This article originally appeared on lightrains.com