Six months before go-live, a mid-sized fintech threw away their first two RAG architectures. Both failed under real production load: queries timed out, costs spiked past budget, and compliance teams blocked the data flow. They needed a system that could handle 50K+ daily queries on a legal document corpus without breaking SLAs or regulatory requirements.
Here’s how we built a production-grade RAG pipeline using Qdrant that cut p95 latency from 1.2s to 180ms, reduced infrastructure costs by 67%, and hit 99.9% uptime for six consecutive months.
The Production Reality
The client’s use case was deceptively simple: their internal legal and compliance teams needed to query a corpus of contracts, regulatory filings, and policy documents. Users expected sub-second responses for queries like “What are the data retention requirements for customer KYC data under EU regulation?”
The first architecture, a managed vector database service with a single orchestrator node, looked fine on paper. In production, it collapsed under peak load. P95 latency crawled past 1.2s, API timeouts spiked to 15% during business hours, and the managed service bill hit $12K/month for 10M embeddings. Worse, the vendor’s data residency policy couldn’t satisfy the client’s compliance team.
The second attempt tried to fix cost by self-hosting Milvus, but operational complexity overwhelmed the team. Cluster management, backup/recovery, and monitoring became a full-time job. When a Milvus node failed during a quarterly audit, recovery took 12 hours.
We needed a third approach that balanced three constraints: sub-200ms p95 latency, under $5K/month infrastructure, and ops burden small enough for a three-person infra team.
The Architecture That Finally Worked
┌─────────────┐
│ Users │
└──────┬──────┘
│
┌──────▼────────────────────────────────────────────────────────────┐
│ API Gateway (Kong) │
│ Rate limiting, Auth │
└──────┬─────────────────────────────────────────────────────────────┘
│
┌──────▼────────────────────────────────────────────────────────────┐
│ Orchestrator Service │
│ - Query parsing and rewriting │
│ - Cache lookups (Redis) │
│ - Parallel search dispatch │
│ - Response synthesis │
└──────┬─────────────────────────────────────────────────────────────┘
│
├───┬─────────────────────┬──────────────────┐
│ │ │ │
┌──────▼─┐│ ┌──────────▼────────┐ ┌───────▼─────┐
│ Redis ││ │ Qdrant Cluster │ │ LLM Service │
│ Cache ││ │ (3 nodes, HNSW) │ │ (GPT-4) │
│ ││ └───────────────────┘ └─────────────┘
└────────┘│
└──────────────────┬───────────────────┐
│ │
┌───────────▼─────────┐ ┌──────▼──────┐
│ Embedding Service │ │ Monitoring │
│ (Text-Embedding-3) │ │ (Prometheus) │
└─────────────────────┘ └─────────────┘

The pipeline has four key components:
API Gateway
Kong handles rate limiting (100 req/min per user), authentication, and request routing. It also injects tracing headers for observability.
Orchestrator Service
A Node.js service built on Express that:
- Parses user queries and extracts key entities
- Checks Redis cache before hitting vector search (60% cache hit rate)
- Dispatches parallel queries to Qdrant (hybrid search) and the LLM for generation
- Synthesizes the final response with citations
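The steps above can be sketched as a single handler. This is a minimal sketch, not the production service: the `redis`, `search`, and `llm` objects are injected stand-ins for ioredis, the Qdrant client, and the LLM service, and their method names are assumptions.

```javascript
// Sketch of the orchestrator's hot path with injected dependencies (assumed interfaces)
async function handleQuery(query, { redis, search, llm }) {
  // 1. Cache lookup: roughly 60% of traffic ends here
  const cached = await redis.get(query)
  if (cached) return JSON.parse(cached)

  // 2. Retrieval and query rewriting run in parallel
  const [hits, rewritten] = await Promise.all([
    search(query),      // hybrid search against Qdrant
    llm.rewrite(query)  // query parsing / expansion
  ])

  // 3. Synthesize the answer from retrieved context, keeping document IDs as citations
  const context = hits.map(h => `[${h.id}] ${h.payload.content}`).join('\n')
  const answer = await llm.generate(rewritten, context)
  const response = { answer, citations: hits.map(h => h.id) }

  // 4. Write back to the cache (2h TTL, per the lessons later in this article)
  await redis.set(query, JSON.stringify(response), 'EX', 7200)
  return response
}
```

Injecting the clients keeps the handler testable without standing up Redis or Qdrant.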
Qdrant Cluster
Three-node deployment on AWS EKS (m6i.2xlarge instances, 8 vCPU, 32GB RAM each) with:
- HNSW index with m=16, ef_construction=128 for fast approximate search
- Hybrid search combining dense vectors (OpenAI text-embedding-3-large) and sparse BM25 keywords
- Payload filtering by document type, jurisdiction, and date range
- Snapshot-based backups to S3 every 6 hours
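The collection setup behind these settings can be sketched as follows. The named vectors (`dense`, `sparse`) and the payload index schemas are assumptions about this deployment; note that Qdrant’s HNSW config key is spelled `ef_construct`, matching the `ef_construction=128` above.

```javascript
// Collection config mirroring the article's settings (vector names are assumptions)
const legalDocsConfig = {
  vectors: { dense: { size: 3072, distance: 'Cosine' } }, // text-embedding-3-large is 3072-dim
  sparse_vectors: { sparse: {} },                          // sparse index for BM25-style keywords
  hnsw_config: { m: 16, ef_construct: 128 }
}

// qdrant is an injected @qdrant/js-client-rest client (assumption)
async function setupCollection(qdrant) {
  await qdrant.createCollection('legal_docs', legalDocsConfig)
  // Index the payload fields used for filtering
  for (const field of ['doc_type', 'jurisdiction']) {
    await qdrant.createPayloadIndex('legal_docs', { field_name: field, field_schema: 'keyword' })
  }
  await qdrant.createPayloadIndex('legal_docs', { field_name: 'effective_date', field_schema: 'datetime' })
}
```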
Caching Layer
Redis Cluster with 3 nodes stores:
- Query-response pairs (2h TTL, tuned down from an initial 24h)
- Frequent query patterns for pre-warming
- Failed queries for rate-limiting abuse detection
Monitoring Stack
Prometheus + Grafana tracks:
- Query latency (p50, p95, p99)
- Cache hit/miss ratios
- Qdrant search latency and memory usage
- LLM token counts and costs
- Error rates per component
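In production these latency percentiles come from Prometheus histograms; for spot-checking raw samples against the dashboards, a nearest-rank computation is enough. A minimal sketch:

```javascript
// Nearest-rank percentile over raw latency samples (ms). Prometheus estimates
// percentiles from histogram buckets; this exact version is useful for spot checks.
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100) // nearest-rank method
  return sorted[Math.max(0, rank - 1)]
}
```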
Why We Chose Qdrant (Over Pinecone and Milvus)
We evaluated three vector databases seriously. Here’s how they compared:
Pinecone
Managed service with excellent developer experience. But at our client’s scale, the economics didn’t work:
- $0.10 per million read operations vs $0.01 with self-hosted Qdrant
- No data residency guarantees in the client’s required regions
- Vendor lock-in would complicate future migration
Pinecone works great for prototypes and smaller workloads, but once you cross ~20K queries/day, the managed premium gets hard to justify.
Milvus
Powerful open-source vector database with strong feature set. But operational complexity was a dealbreaker:
- Requires multiple components (Milvus, etcd, MinIO, Pulsar) to coordinate
- Backup/recovery is fragile; snapshots need careful orchestration
- Documentation is scattered across multiple repos and versions
For a team without a dedicated database engineer, Milvus became a time sink.
Qdrant
Built in Rust with a single binary, simple configuration, and excellent hybrid search. The decision drivers were:
- Single binary deployment: Just run qdrant and it works. No dependency hell.
- Hybrid search out of the box: Combines dense vectors and BM25 in one query, which is critical for legal documents where exact terms matter (e.g., “GDPR Article 17”)
- Cost at scale: 60% cheaper than Pinecone at 50K queries/day
- Compliance: Self-hosted on the client’s VPC with full data control
The trade-off: we’re responsible for ops. But Qdrant’s stability and simple ops model meant that burden was manageable.
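For reference, a hybrid query against this setup can be expressed with Qdrant’s Query API. This is a sketch under assumptions: the vector names (`dense`, `sparse`), prefetch limits, and RRF fusion are illustrative choices, not confirmed settings of this deployment.

```javascript
// Builds a hybrid-search request: dense and sparse prefetch branches
// fused with reciprocal rank fusion (RRF)
function hybridQuery(denseVector, sparseVector, filter, limit = 10) {
  return {
    prefetch: [
      { query: denseVector, using: 'dense', limit: 50 },   // semantic similarity
      { query: sparseVector, using: 'sparse', limit: 50 }  // exact terms, e.g. "GDPR Article 17"
    ],
    query: { fusion: 'rrf' },
    filter, // e.g. { must: [{ key: 'jurisdiction', match: { value: 'EU' } }] }
    limit,
    with_payload: true
  }
}

// Usage with an assumed @qdrant/js-client-rest client:
//   const results = await qdrant.query('legal_docs', hybridQuery(dense, sparse, filter))
```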
The Ingestion Pipeline
Keeping 10M embeddings up to date required its own pipeline. Documents arrive from three sources: CMS (policy updates), OCR (scanned contracts), and API feeds (regulatory updates).
```javascript
const { QdrantClient } = require('@qdrant/js-client-rest')
const { OpenAI } = require('openai')
const Redis = require('ioredis')

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL })
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const redis = new Redis(process.env.REDIS_URL)

async function processBatch(documents) {
  // One embedding request per document (text-embedding-3-large, 3072 dims)
  const embeddings = await Promise.all(
    documents.map(doc =>
      openai.embeddings.create({
        model: 'text-embedding-3-large',
        input: doc.content
      })
    )
  )

  const points = documents.map((doc, i) => ({
    id: doc.id,
    vector: embeddings[i].data[0].embedding,
    payload: {
      title: doc.title,
      content: doc.content,
      doc_type: doc.type,
      jurisdiction: doc.jurisdiction,
      effective_date: doc.effective_date,
      text: doc.content // For sparse search
    }
  }))

  // Upsert in batches of 100; wait: true blocks until each write is applied
  for (let i = 0; i < points.length; i += 100) {
    await qdrant.upsert('legal_docs', { wait: true, points: points.slice(i, i + 100) })
  }

  // Invalidate cache for affected queries
  await redis.flushdb()
}

// Run every hour (in production this loop is a Kubernetes CronJob);
// fetchDocumentsSince polls the document sources (not shown)
setInterval(async () => {
  const newDocs = await fetchDocumentsSince(lastRun)
  await processBatch(newDocs)
}, 60 * 60 * 1000)
```
The ingestion pipeline runs as a Kubernetes CronJob:
- Pulls new documents from the CMS
- Generates embeddings in batches of 100
- Upserts to Qdrant with wait=true for consistency
- Flushes Redis cache to serve fresh results
Performance and Cost Results
After six months in production, here are the actual metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| p95 Latency | 1,200ms | 180ms | 6.7× faster |
| p99 Latency | 2,400ms | 420ms | 5.7× faster |
| Infrastructure Cost | $12K/month | $4K/month | 67% reduction |
| Uptime (6 months) | 97.5% | 99.9% | 25× less downtime |
| Max Throughput | 25K queries/day | 200K queries/day | 8× headroom |
| Cache Hit Rate | N/A | 60% | 60% of queries served from cache |
Cost breakdown per month:
- Qdrant cluster (3× m6i.2xlarge): $1,080
- Redis cluster (3× t3.medium): $180
- Orchestrator service (4× t3.large): $360
- LLM API (GPT-4): $1,800
- Monitoring & observability: $180
- Load balancers & ingress: $200
- Total: ~$3,800/month
What Didn’t Work (And Why)
Attempt 1: Managed Service + Single Orchestrator
What we tried: Pinecone with a single Node.js orchestrator instance. Why it failed: Single point of failure, no horizontal scaling, vendor lock-in. Lesson: Managed services scale horizontally, but your orchestration layer must too.
Attempt 2: Milvus Self-Hosted
What we tried: 3-node Milvus cluster on bare metal. Why it failed: Ops complexity exceeded team capacity. Backup failures and configuration drift caused outages. Lesson: If you don’t have a dedicated DBA, avoid systems that require coordinating multiple components.
Attempt 3: Cache-First Architecture
What we tried: Aggressive caching with 24h TTL, pre-warming frequent queries. Why it failed: Cache invalidation became a nightmare. Legal documents update frequently, and stale answers violated compliance requirements. Lesson: Cache is great for latency, but TTL must match your document update cadence. We settled on 2h TTL.
Known Limitations and Open Questions
This architecture works well for the client’s current scale (50K queries/day), but we see three potential constraints:
- LLM cost scaling: At 200K queries/day, GPT-4 API costs would hit $7.2K/month. We’re evaluating smaller models (Llama 3.1 70B) via vLLM for cost reduction.
- Cold starts: The first query after a deployment can take 3-4 seconds while Qdrant warms up. We’re experimenting with connection pooling and request batching.
- Multi-tenant isolation: The client wants to expose this system to external partners. We need tenant-aware isolation in Qdrant (either separate collections or payload filtering with RBAC).
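The $7.2K figure in the first point is easy to reproduce. A back-of-envelope model, assuming the 60% cache hit rate holds at higher volume and roughly $0.003 per uncached GPT-4 call (the per-call cost implied by $1,800/month at 50K queries/day):

```javascript
// Back-of-envelope monthly LLM spend: only cache misses reach the model.
// costPerCall (~$0.003) is implied by the article's own figures, not a quoted rate.
function monthlyLlmCost(queriesPerDay, cacheHitRate, costPerCall) {
  const uncachedPerMonth = queriesPerDay * 30 * (1 - cacheHitRate)
  return uncachedPerMonth * costPerCall
}

// monthlyLlmCost(50000, 0.6, 0.003)  -> ~$1,800/month (matches the cost table)
// monthlyLlmCost(200000, 0.6, 0.003) -> ~$7,200/month (the scaling concern above)
```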
Where This Approach Breaks Down
This architecture is not a universal solution. Avoid it if:
- Your corpus is <100K documents: Managed services (Pinecone, Weaviate Cloud) will be cheaper and simpler.
- You lack ops capacity: Self-hosted Qdrant requires someone to manage Kubernetes, backups, and monitoring.
- Your latency SLA is <50ms: Vector search at scale rarely gets below 50ms. Consider precomputed answers or traditional search (Elasticsearch) instead.
- Your queries are highly complex: Multi-hop reasoning or deep synthesis will require more sophisticated agents than this pipeline supports.
How to Replicate This
If you’re scaling RAG beyond 20K queries/day, here’s your starting point:
- Deploy Qdrant on Kubernetes: Use the official Helm chart. Start with 2 nodes and scale to 3 for high availability.
- Enable hybrid search: Configure both dense and sparse indexes. Legal and financial documents need exact term matching.
- Add a cache layer: Redis works. Cache query-response pairs, not just search results.
- Monitor everything: Latency per component, cache hit rates, Qdrant memory usage, LLM costs.
- Plan for ingestion: Don’t make ingestion an afterthought. Automate document processing and embedding generation.
If you’re dealing with compliance requirements, multi-jurisdictional data, or scale beyond 100K queries/day, we’ve shipped this architecture. Talk to us; we can help you avoid the three rebuilds this client went through.
This article originally appeared on lightrains.com