How We Rebuilt a Fintech's RAG Pipeline to Handle 50K Queries/Day

Learn how we scaled a RAG pipeline to 10M embeddings with sub-200ms p95 latency using Qdrant, cutting infrastructure costs by 67% while achieving 99.9% uptime.

Thu Feb 19 2026


Six months before go-live, a mid-sized fintech threw away their first two RAG architectures. Both failed under real production load: queries timed out, costs spiked past budget, and compliance teams blocked the data flow. They needed a system that could handle 50K+ daily queries on a legal document corpus without breaking SLAs or regulatory requirements.

Here’s how we built a production-grade RAG pipeline using Qdrant that cut p95 latency from 1.2s to 180ms, reduced infrastructure costs by 67%, and hit 99.9% uptime for six consecutive months.

The Production Reality

The client’s use case was deceptively simple: their internal legal and compliance teams needed to query a corpus of contracts, regulatory filings, and policy documents. Users expected sub-second responses for queries like “What are the data retention requirements for customer KYC data under EU regulation?”

The first architecture, a managed vector-database service fronted by a single orchestrator node, looked fine on paper. In production, it collapsed under peak load. P95 latency crawled past 1.2s, API timeouts spiked to 15% during business hours, and the managed service bill hit $12K/month for 10M embeddings. Worse, the vendor’s data residency policy couldn’t satisfy the client’s compliance team.

The second attempt tried to fix cost by self-hosting Milvus, but operational complexity overwhelmed the team. Cluster management, backup/recovery, and monitoring became a full-time job. When a Milvus node failed during a quarterly audit, recovery took 12 hours.

We needed a third approach that balanced three constraints: sub-200ms p95 latency, under $5K/month infrastructure, and ops burden small enough for a three-person infra team.

The Architecture That Finally Worked

┌─────────────┐
│    Users    │
└──────┬──────┘
       │
┌──────▼──────────────────────────────────────────────┐
│                 API Gateway (Kong)                  │
│                Rate limiting, Auth                  │
└──────┬──────────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────┐
│                Orchestrator Service                 │
│  - Query parsing and rewriting                      │
│  - Cache lookups (Redis)                            │
│  - Parallel search dispatch                         │
│  - Response synthesis                               │
└──────┬──────────────────────────────────────────────┘
       │
       ├──────────────┬──────────────┬─────────────────┐
       │              │              │                 │
┌──────▼──────┐ ┌─────▼─────────┐ ┌──▼──────────┐      │
│ Redis Cache │ │ Qdrant Cluster│ │ LLM Service │      │
│             │ │(3 nodes, HNSW)│ │   (GPT-4)  │      │
└─────────────┘ └───────────────┘ └────────────┘      │
                                                      │
                            ┌─────────────────────────┤
                            │                         │
                ┌───────────▼─────────┐        ┌──────▼──────┐
                │  Embedding Service  │        │  Monitoring │
                │ (Text-Embedding-3)  │        │ (Prometheus)│
                └─────────────────────┘        └─────────────┘

The pipeline has five key components:

API Gateway

Kong handles rate limiting (100 req/min per user), authentication, and request routing. It also injects tracing headers for observability.
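A minimal sketch of what that gateway policy looks like in Kong's declarative config. The service name, upstream URL, and route path are illustrative, not the client's actual values:

```yaml
# kong.yml -- illustrative declarative config sketch
_format_version: "3.0"
services:
  - name: rag-orchestrator
    url: http://orchestrator.internal:3000   # hypothetical upstream
    routes:
      - name: rag-query
        paths:
          - /v1/query
    plugins:
      - name: key-auth          # API-key authentication
      - name: rate-limiting
        config:
          minute: 100           # 100 req/min per consumer
          policy: local
```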

Orchestrator Service

A Node.js service built on Express that:

  • Parses user queries and extracts key entities
  • Checks Redis cache before hitting vector search (60% cache hit rate)
  • Dispatches parallel queries to Qdrant (hybrid search) and the LLM for generation
  • Synthesizes the final response with citations

Qdrant Cluster

Three-node deployment on AWS EKS (m6i.2xlarge instances, 8 vCPU, 32GB RAM each) with:

  • HNSW index with m=16, ef_construction=128 for fast approximate search
  • Hybrid search combining dense vectors (OpenAI text-embedding-3-large) and sparse BM25 keywords
  • Payload filtering by document type, jurisdiction, and date range
  • Snapshot-based backups to S3 every 6 hours
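Those settings translate into a collection config roughly like the one below. The schema keys follow Qdrant's REST API as we used it (note it spells the build parameter `ef_construct`); the `bm25` sparse-vector name and `replication_factor` value are illustrative, so verify against your client version:

```javascript
// Collection settings behind the numbers above, shaped for
// QdrantClient.createCollection.
const collectionConfig = {
  vectors: {
    size: 3072,               // text-embedding-3-large dimensionality
    distance: 'Cosine'
  },
  sparse_vectors: {
    bm25: {}                  // sparse index for keyword (BM25-style) search
  },
  hnsw_config: {
    m: 16,                    // links per node: recall vs. memory trade-off
    ef_construct: 128         // build-time beam width
  },
  replication_factor: 2       // survive one node loss in the 3-node cluster
}

// Usage against a live cluster:
// await qdrant.createCollection('legal_docs', collectionConfig)
```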

Caching Layer

Redis Cluster with 3 nodes stores:

  • Query-response pairs (2h TTL)
  • Frequent query patterns for pre-warming
  • Failed queries for rate-limiting abuse detection

Monitoring Stack

Prometheus + Grafana tracks:

  • Query latency (p50, p95, p99)
  • Cache hit/miss ratios
  • Qdrant search latency and memory usage
  • LLM token counts and costs
  • Error rates per component
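In production these quantiles come from Prometheus histogram buckets via `histogram_quantile`; conceptually, though, a p95 is just a nearest-rank percentile over a window of latency samples. A minimal version of that calculation:

```javascript
// Nearest-rank percentile: smallest value with at least p% of samples
// at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Example: 100 latency samples, 1..100 ms
const latencies = Array.from({ length: 100 }, (_, i) => i + 1)
console.log(percentile(latencies, 50)) // 50
console.log(percentile(latencies, 95)) // 95
console.log(percentile(latencies, 99)) // 99
```

The practical takeaway: a p99 computed over too few samples is noise, so alert on p95 over a meaningful window and treat p99 as a trend line.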

Why We Chose Qdrant (Over Pinecone and Milvus)

We evaluated three vector databases seriously. Here’s how they compared:

Pinecone

Managed service with excellent developer experience. But at our client’s scale, the economics didn’t work:

  • $0.10 per million read operations vs $0.01 with self-hosted Qdrant
  • No data residency guarantees in the client’s required regions
  • Vendor lock-in would complicate future migration

Pinecone works great for prototypes and smaller workloads, but once you cross ~20K queries/day, the managed premium gets hard to justify.

Milvus

Powerful open-source vector database with strong feature set. But operational complexity was a dealbreaker:

  • Requires multiple components (Milvus, etcd, MinIO, Pulsar) to coordinate
  • Backup/recovery is fragile; snapshots need careful orchestration
  • Documentation is scattered across multiple repos and versions

For a team without a dedicated database engineer, Milvus became a time sink.

Qdrant

Built in Rust with a single binary, simple configuration, and excellent hybrid search. The decision drivers were:

  • Single binary deployment: Just run qdrant and it works. No dependency hell.
  • Hybrid search out of the box: Combines dense vectors and BM25 in one query, which is critical for legal documents where exact terms matter (e.g., “GDPR Article 17”).
  • Cost at scale: 60% cheaper than Pinecone at 50K queries/day
  • Compliance: Self-hosted on the client’s VPC with full data control

The trade-off: we’re responsible for ops. But Qdrant’s stability and simple ops model meant that burden was manageable.
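Concretely, a hybrid query is a single Query API request with two prefetch legs fused by reciprocal rank fusion. The shape below follows Qdrant's Query API as we used it; the dummy vectors and the `bm25` sparse-vector name are illustrative assumptions:

```javascript
// Illustrative inputs; real dense vectors are 3072-dim embeddings and the
// sparse leg comes from a BM25-style term weighting of the query.
const denseVector = [0.1, 0.2, 0.3]
const sparseIndices = [101, 2048]
const sparseValues = [1.2, 0.7]

const hybridQuery = {
  prefetch: [
    {
      query: denseVector,      // dense leg: semantic similarity
      limit: 50
    },
    {
      // sparse leg: exact terms like "GDPR Article 17" match literally
      query: { indices: sparseIndices, values: sparseValues },
      using: 'bm25',
      limit: 50
    }
  ],
  query: { fusion: 'rrf' },    // reciprocal rank fusion of the two legs
  filter: {
    must: [
      { key: 'jurisdiction', match: { value: 'EU' } },
      { key: 'doc_type', match: { value: 'regulation' } }
    ]
  },
  limit: 10,
  with_payload: true
}

// Usage against a live cluster:
// await qdrant.query('legal_docs', hybridQuery)
```

Running the payload filter inside the same request, rather than post-filtering results, is what keeps jurisdiction-scoped queries fast.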

The Ingestion Pipeline

Keeping 10M embeddings up to date required its own pipeline. Documents arrive from three sources: CMS (policy updates), OCR (scanned contracts), and API feeds (regulatory updates).

const { QdrantClient } = require('@qdrant/js-client-rest')
const { OpenAI } = require('openai')
const Redis = require('ioredis')

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL })
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const redis = new Redis(process.env.REDIS_URL)

const BATCH_SIZE = 100

async function processBatch(documents) {
  for (let i = 0; i < documents.length; i += BATCH_SIZE) {
    const batch = documents.slice(i, i + BATCH_SIZE)

    // One embeddings request per batch: the API accepts an array of inputs
    const { data } = await openai.embeddings.create({
      model: 'text-embedding-3-large',
      input: batch.map(doc => doc.content)
    })

    const points = batch.map((doc, j) => ({
      id: doc.id,
      vector: data[j].embedding,
      payload: {
        title: doc.title,
        content: doc.content,
        doc_type: doc.type,
        jurisdiction: doc.jurisdiction,
        effective_date: doc.effective_date,
        text: doc.content // For sparse search
      }
    }))

    // wait: true blocks until the points are durably applied
    await qdrant.upsert('legal_docs', { wait: true, points })
  }

  // Invalidate cached answers that may now be stale
  await redis.flushdb()
}

// Run every hour
let lastRun = new Date(0)
setInterval(async () => {
  const since = lastRun
  lastRun = new Date()
  const newDocs = await fetchDocumentsSince(since)
  if (newDocs.length > 0) await processBatch(newDocs)
}, 60 * 60 * 1000)

The ingestion pipeline runs as a Kubernetes CronJob:

  • Pulls new documents from the CMS
  • Generates embeddings in batches of 100
  • Upserts to Qdrant with wait=true for consistency
  • Flushes Redis cache to serve fresh results

Performance and Cost Results

After six months in production, here are the actual metrics:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| p95 latency | 1,200ms | 180ms | 6.7× faster |
| p99 latency | 2,400ms | 420ms | 5.7× faster |
| Infrastructure cost | $12K/month | $4K/month | 67% reduction |
| Uptime (6 months) | 97.5% | 99.9% | 25× less downtime |
| Max throughput | 25K queries/day | 200K queries/day | 8× headroom |
| Cache hit rate | N/A | 60% | 60% of queries served from cache |

Cost breakdown per month:

  • Qdrant cluster (3× m6i.2xlarge): $1,080
  • Redis cluster (3× t3.medium): $180
  • Orchestrator service (4× t3.large): $360
  • LLM API (GPT-4): $1,800
  • Monitoring & observability: $180
  • Load balancers & ingress: $200
  • Total: ~$3,800/month

What Didn’t Work (And Why)

Attempt 1: Managed Service + Single Orchestrator

What we tried: Pinecone with a single Node.js orchestrator instance.
Why it failed: Single point of failure, no horizontal scaling, vendor lock-in.
Lesson: Managed services scale horizontally, but your orchestration layer must too.

Attempt 2: Milvus Self-Hosted

What we tried: 3-node Milvus cluster on bare metal.
Why it failed: Ops complexity exceeded team capacity. Backup failures and configuration drift caused outages.
Lesson: If you don’t have a dedicated DBA, avoid systems that require coordinating multiple components.

Attempt 3: Cache-First Architecture

What we tried: Aggressive caching with 24h TTL, pre-warming frequent queries.
Why it failed: Cache invalidation became a nightmare. Legal documents update frequently, and stale answers violated compliance requirements.
Lesson: Cache is great for latency, but TTL must match your document update cadence. We settled on 2h TTL.

Known Limitations and Open Questions

This architecture works well for the client’s current scale (50K queries/day), but we see three potential constraints:

  1. LLM cost scaling: At 200K queries/day, GPT-4 API costs would hit $7.2K/month. We’re evaluating smaller models (Llama 3.1 70B) via vLLM for cost reduction.
  2. Cold starts: The first query after a deployment can take 3-4 seconds while Qdrant warms up. We’re experimenting with connection pooling and request batching.
  3. Multi-tenant isolation: The client wants to expose this system to external partners. We need tenant-aware isolation in Qdrant (either separate collections or payload filtering with RBAC).

Where This Approach Breaks Down

This architecture is not a universal solution. Avoid it if:

  • Your corpus is <100K documents: Managed services (Pinecone, Weaviate Cloud) will be cheaper and simpler.
  • You lack ops capacity: Self-hosted Qdrant requires someone to manage Kubernetes, backups, and monitoring.
  • Your latency SLA is <50ms: Vector search at scale rarely gets below 50ms. Consider precomputed answers or traditional search (Elasticsearch) instead.
  • Your queries are highly complex: Multi-hop reasoning or deep synthesis will require more sophisticated agents than this pipeline supports.

How to Replicate This

If you’re scaling RAG beyond 20K queries/day, here’s your starting point:

  1. Deploy Qdrant on Kubernetes: Use the official Helm chart. Start with 2 nodes and scale to 3 for high availability.
  2. Enable hybrid search: Configure both dense and sparse indexes. Legal and financial documents need exact term matching.
  3. Add a cache layer: Redis works. Cache query-response pairs, not just search results.
  4. Monitor everything: Latency per component, cache hit rates, Qdrant memory usage, LLM costs.
  5. Plan for ingestion: Don’t make ingestion an afterthought. Automate document processing and embedding generation.
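For step 1, the deployment itself is a couple of commands. The chart repo URL and the `replicaCount` value key are from the Qdrant Helm chart as we remember it; check `helm show values qdrant/qdrant` before relying on them:

```shell
# Add the official Qdrant chart and install a 3-node cluster
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
helm install qdrant qdrant/qdrant \
  --set replicaCount=3 \
  --namespace rag --create-namespace
```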

If you’re dealing with compliance requirements, multi-jurisdictional data, or scale beyond 100K queries/day, we’ve shipped this architecture. Talk to us; we can help you avoid the three rebuilds this client went through.

This article originally appeared on lightrains.com
