From Fragmented Health Records to Unified AI Queries

How a genomic testing specialist rebuilt their health data infrastructure with AI-powered document processing to enable natural language queries

Published: April 9, 2026

A mid-sized healthcare provider specializing in genomic testing faced a crisis that showed up in every metric that mattered: patient turnaround times were climbing, clinical staff were spending 45 minutes per report on manual data entry, and the promise of AI-assisted healthcare remained out of reach. The root cause was simple. Their patients’ most valuable health data existed in a state that made it nearly useless.

Thousands of genetic reports, bloodwork panels, and medical documents were scattered across PDF files in formats that varied by originating lab. Clinicians spent over 30 minutes per patient just finding relevant information. There was no way to ask natural questions like “What cardiovascular risks does this patient have?” and get an answer. The data existed, but it was locked in unstructured documents with no consistent schema.

This is the story of how they solved it.

The Operational Cost

The genomic testing lab processed thousands of reports monthly, each containing trait interpretations, risk assessments, and personalized recommendations. The clinical team needed access to this information in a usable form. They wanted to ask their systems questions and get answers, not browse through files.

The Manual Processing Bottleneck

The operations team faced a different reality. Every incoming report required manual extraction, categorization, and data entry. Staff classified risk levels, identified trait categories, and input data into separate systems by hand. This was not scalable, and it was not sustainable.

The bottlenecks manifested in several ways:

  • Time-intensive data entry: Clinical staff spent 45 minutes per report on manual transcription, translating PDF contents into structured database fields
  • Error-prone categorization: Manual classification of traits across 20 health domains introduced inconsistencies
  • Version control chaos: Multiple file versions existed across different systems with no single source of truth
  • Retrieval inefficiency: Searching for a patient’s historical data required manually scanning through dozens of PDFs

The Business Impact

The business impact was direct: patient turnaround slowed, clinical staff grew frustrated, and any possibility of leveraging AI for automated insights was out of the question. The organization was essentially sitting on a goldmine of health data but couldn’t access it in any meaningful way.

More critically, the inability to query patient data programmatically meant they couldn’t leverage modern AI capabilities like semantic search or natural language interfaces. Their data was locked in a format that humans could read but machines could not process.

The Solution Architecture

We built an event-driven pipeline designed around three principles: asynchronous processing for scale, hybrid storage combining structured records with semantic search, and strict data isolation for compliance.

Document Processing Pipeline

The processing flow moves through six stages. Uploaded PDFs are parsed and, if scanned, run through optical character recognition (OCR). Extracted text is classified by type: genomic report, bloodwork panel, or general medical record. This classification determines downstream processing.

Each stage in the pipeline was designed with specific goals:

  1. Ingestion: PDFs arrive via secure upload, triggering processing workflow
  2. Preprocessing: OCR applied to scanned documents, format normalization
  3. Classification: ML-based document type detection routes content appropriately
  4. Extraction: Domain-specific parsing extracts relevant fields
  5. Normalization: Data standardized to consistent schema
  6. Indexing: Documents embedded and indexed for retrieval

Technical Pipeline Architecture

The document processing pipeline follows a state machine pattern with the following transitions:

uploaded → parsing → classified → normalized → embedding → indexed → ready
                                    ↓ (on failure)
                               needs_review → failed
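
The transition rules in the diagram above can be sketched as a lookup table. This is an illustrative sketch, not the production implementation; the state names come from the diagram, and the assumption that `needs_review` can re-enter the pipeline via manual retry is ours:

```javascript
// Valid pipeline state transitions (sketch; mirrors the state diagram).
// Each processing stage may also fall through to 'needs_review' on failure.
const TRANSITIONS = {
  uploaded: ['parsing'],
  parsing: ['classified', 'needs_review'],
  classified: ['normalized', 'needs_review'],
  normalized: ['embedding', 'needs_review'],
  embedding: ['indexed', 'needs_review'],
  indexed: ['ready'],
  needs_review: ['failed', 'parsing'], // assumed: manual retry re-enters
  ready: [],
  failed: []
};

// Returns true if moving from `from` to `to` is an allowed transition.
function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}
```

Encoding the transitions as data rather than scattered `if` checks makes it cheap for each worker to validate a document's state before touching it.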

Each stage is implemented as an independent worker in a BullMQ queue backed by Redis:

// Job queue configuration
const JOB_QUEUES = [
  'extract-pdf', // Parse PDF and extract raw text
  'classify-document', // Determine document type
  'normalize-data', // Standardize to schema
  'chunk-content', // Split into semantic segments
  'embed-vectors', // Generate embeddings (BGE-base-en-v1.5)
  'index-qdrant', // Store in vector database
  'audit-log' // Compliance audit trail
]

The processing latency follows this distribution:

  • P50: 45 seconds (simple bloodwork panel)
  • P95: 58 seconds (15-page genomic report)
  • P99: 72 seconds (complex multi-section reports)

Embedding and Retrieval Formulas

The semantic search uses cosine similarity for retrieval. Given a query embedding q and document chunk embeddings d₁, d₂, …, dₙ, the retrieval score is calculated as:

Cosine Similarity:

similarity(q, d) = (q · d) / (||q|| × ||d||)

Where the dot product is computed across the 768-dimensional embedding space:

similarity = Σ(qj × dj) / (√(Σqj²) × √(Σdj²))

Final Score (Hybrid Retrieval):

final_score = α × vector_score + (1 - α) × structured_score

Where:

  • α = 0.7 for natural language queries
  • α = 0.3 for structured lookups

Chunking Formula:

chunks = ceil((document_length - overlap) / (chunk_size - overlap))

Where chunk_size = 512 tokens and overlap = 50 tokens.
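
As a concrete check on the three formulas above, here is a small sketch. The α values, chunk size, and overlap come from the text; the function names and vector inputs are illustrative:

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(q, d) {
  let dot = 0, qNorm = 0, dNorm = 0;
  for (let j = 0; j < q.length; j++) {
    dot += q[j] * d[j];
    qNorm += q[j] * q[j];
    dNorm += d[j] * d[j];
  }
  return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
}

// Hybrid retrieval score: alpha weights the vector score against the
// structured score (alpha = 0.7 for natural language, 0.3 for lookups).
function hybridScore(vectorScore, structuredScore, alpha) {
  return alpha * vectorScore + (1 - alpha) * structuredScore;
}

// Number of chunks for a document, with sizes and overlap in tokens.
function chunkCount(documentLength, chunkSize = 512, overlap = 50) {
  return Math.ceil((documentLength - overlap) / (chunkSize - overlap));
}
```

For example, a 2,000-token report yields ceil((2000 − 50) / (512 − 50)) = 5 chunks.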

Genomic Report Processing

For genomic reports, the system extracts patient demographics and parses each page into individual traits. Each trait gets categorized across 20 health domains from nutrition to cardiovascular function to metabolic health. Risk levels are assigned as high, intermediate, normal, or low.

The extraction engine handles complex multi-page reports by:

  • Identifying report sections through header/footer patterns
  • Parsing tabular data within documents
  • Extracting variant information with associated confidence scores
  • Normalizing gene nomenclature to standard databases
  • Building trait relationships across multiple pages

Bloodwork Processing

Bloodwork follows a parallel path, recognizing laboratory markers and normalizing them to standard formats. Values are extracted with units, and results outside reference ranges are flagged automatically.

Key capabilities include:

  • Contextual understanding of reference ranges
  • Unit conversion between different lab standards
  • Historical comparison flagging for trend analysis
  • Abnormal value prioritization for clinical review
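
A minimal sketch of the normalization and flagging steps described above. The marker shape matches the `markers` entries in the Document schema shown later; the conversion table, its single glucose entry, and the function names are illustrative (the production parser handles many more lab standards):

```javascript
// Hypothetical unit-conversion table; only glucose is shown. One mmol/L of
// glucose is about 18.016 mg/dL (molar mass of glucose ≈ 180.16 g/mol).
const UNIT_CONVERSIONS = {
  'glucose:mmol/L->mg/dL': v => v * 18.016
};

// Convert a marker to the target unit when a conversion rule exists.
function normalizeMarker(marker, targetUnit) {
  const convert = UNIT_CONVERSIONS[`${marker.name}:${marker.unit}->${targetUnit}`];
  if (!convert) return marker; // already in target unit, or no rule known
  return { ...marker, value: convert(marker.value), unit: targetUnit };
}

// Flag values outside the reference range, matching the schema's isAbnormal.
function flagAbnormal(marker) {
  const { min, max } = marker.referenceRange;
  return { ...marker, isAbnormal: marker.value < min || marker.value > max };
}
```

The point of normalizing before flagging is that a reference range only makes sense in one unit system; comparing a mmol/L value against a mg/dL range would silently mislabel results.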

Semantic Indexing and Storage

Once normalized, documents are chunked into coherent segments, embedded for semantic search, and indexed with metadata for secure retrieval. The entire pipeline runs asynchronously, returning immediately while processing continues in the background.

The hybrid storage architecture combines:

  • Structured database (MongoDB): Patient records, extracted traits, laboratory values
  • Vector database (Qdrant): Semantic embeddings for natural language retrieval
  • Document store: Original PDFs with access controls

Data Schema and Query Patterns

The MongoDB schema uses Mongoose models with selective field projection for performance:

// Document model - stores extracted data
{
  userId: ObjectId, // owning user; every query filters on this field
  patientId: ObjectId,
  documentType: 'genomic_report' | 'bloodwork' | 'medical_record',
  extractedData: {
    traits: [{
      name: String,
      category: String, // 20 health domains
      value: Mixed,
      riskLevel: 'high' | 'intermediate' | 'normal' | 'low',
      confidence: Number // 0-1
    }],
    markers: [{
      name: String,
      value: Number,
      unit: String,
      referenceRange: { min: Number, max: Number },
      isAbnormal: Boolean
    }]
  },
  metadata: {
    originalFilename: String,
    pageCount: Number,
    processingDuration: Number,
    embeddingTokens: Number
  },
  status: 'parsing' | 'classified' | 'normalized' | 'ready'
}

// Query always filters by userId at database level
const documents = await Document.find({ userId: currentUser.id })
  .select('patientId documentType extractedData status');

This approach ensures row-level security: every query is automatically filtered by patient ID ownership, and the semantic search layer enforces that vector results never cross user boundaries.
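
One way to make that ownership filter hard to forget is to route every query through a helper that injects it. This is a sketch under our own assumptions (the helper name and error behavior are illustrative, not the production code):

```javascript
// Build a query filter that is always scoped to the requesting user.
// Throws rather than silently producing an unscoped, cross-tenant query.
function scopedFilter(userId, criteria = {}) {
  if (!userId) throw new Error('userId is required for every query');
  // Spreading criteria first means callers can never override ownership.
  return { ...criteria, userId };
}
```

Callers would then write something like `Document.find(scopedFilter(currentUser.id, { documentType: 'bloodwork' }))`, so forgetting the filter becomes a thrown error instead of a data leak.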

AI Integration Layer

Enabling AI systems to query patient data required a carefully designed interface. We implemented data access functions covering the core patterns: genomic summaries with trait categorization, specific trait lookup, blood panel retrieval with abnormal results highlighted, and historical tracking across multiple tests.

Query Interface Design

For natural language queries, the system accepts conversational requests and returns relevant document segments ranked by relevance. A hybrid approach combines structured database queries with semantic search to handle both precise lookups and open-ended questions.

Example query patterns supported:

  • “Show me cardiovascular risks for patient X”
  • “What is the trend for cholesterol over the last 3 tests?”
  • “List all abnormal findings in the latest blood panel”
  • “Compare vitamin D levels across all tests”
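
The choice between the two α values from the retrieval formula can be sketched as a simple router. The keyword heuristic below is purely illustrative; as noted in the limitations section, the real routing logic between modes continues to evolve:

```javascript
// Pick the hybrid-retrieval weight for a query. Structured lookups lean on
// the database (alpha = 0.3); open-ended natural language questions lean on
// vector search (alpha = 0.7). The regex is a hypothetical stand-in.
const STRUCTURED_HINTS = /^(list|show|get|compare)\b|\blatest\b/i;

function chooseAlpha(query) {
  return STRUCTURED_HINTS.test(query) ? 0.3 : 0.7;
}
```
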

Security and Compliance

Every query enforces patient identification, and all access is filtered at the database level for strict isolation. Complete audit trails capture response times and result counts for compliance.

The security model includes:

  • Row-level security: Queries filtered by patient ID ownership
  • Role-based access: Different access levels for clinicians, admins, patients
  • Audit logging: Complete request/response logging for compliance
  • Encryption at rest: All stored data encrypted
  • API rate limiting: Prevents abuse and ensures fair resource allocation

Results After Six Months

Production metrics validated the approach. A typical 15-page genomic report processes from upload to searchable in under a minute. Search queries respond at the 95th percentile in under 200 milliseconds.

Performance Metrics

| Metric | Before | After | Improvement |
|---|---|---|---|
| Report processing time | 45 min (manual) | < 1 min (automated) | 98%+ faster |
| Patient data retrieval | 30+ min search | < 5 sec query | 99% faster |
| Trait extraction accuracy | Manual, error-prone | 90%+ automated | Significantly improved |
| Abnormal flagging | Manual review | 98% automated | Consistent accuracy |
| Monthly report capacity | 500 reports | 15,000+ reports | 30x scale |

Normalization extracts 50 to 150 traits per genomic report with categorization accuracy above 90%. Bloodwork processing handles 40 to 60 markers per panel with abnormal flagging accuracy of 98%.

Clinical Team Impact

The clinical team retrieves patient information in seconds instead of minutes. AI-assisted queries during consultations surface relevant history without manual searching. This has transformed their workflow:

  • Faster consultations: Clinicians spend less time searching, more time advising
  • Better-informed decisions: Complete historical context available instantly
  • Reduced errors: Automated extraction eliminates transcription mistakes
  • Scalable operations: Same team handles 10x volume without additional staff

Limitations and Boundaries

Documents exceeding 200 pages require architectural changes that are scheduled for the next development cycle, and accuracy on non-English reports degrades relative to English-language ones.

Current Constraints

Semantic search alone cannot handle questions requiring mathematical comparison across historical data points. The hybrid approach addresses this but routing logic between modes continues to evolve.

Known limitations include:

  • Document size limits: 200-page max current threshold
  • Language support: Best results for English; other languages vary
  • Complex calculations: Historical trend comparison requires hybrid query routing
  • Image-only documents: Requires high-quality OCR; degraded results possible

Roadmap for Enhancement

The next development cycle addresses these limitations:

  • Distributed processing for larger documents
  • Multi-language model fine-tuning
  • Enhanced calculation engine for trend analysis
  • Improved image preprocessing for scanned documents

Strategic Position

The provider now has the foundation to integrate larger language models for conversational health insights. The data layer serves as the substrate for any AI initiative, and new document types are supported through additional processing configurations.

Lessons Learned

The fundamental lesson: healthcare AI success depends entirely on data infrastructure. The most capable clinical AI model provides no value without access to organized, searchable patient records. Building the document pipeline first created the foundation for every capability that followed.

Key takeaways for organizations considering similar initiatives:

  1. Start with data infrastructure: Don’t invest in AI models before your data is accessible
  2. Design for compliance first: Healthcare data requires strict access controls from day one
  3. Build hybrid systems: Combine structured databases with semantic search for flexibility
  4. Plan for scale: Design pipelines that can handle 10x current volume
  5. Measure everything: Track processing times, accuracy, and user satisfaction

Ready to Modernize Your Healthcare Data Infrastructure?

Lightrains specializes in building AI-powered document processing pipelines for healthcare organizations. Our expertise spans AI & Machine Learning development, data engineering, and enterprise AI solutions that transform unstructured documents into actionable insights.

Our healthcare AI services include:

  • Document processing and OCR pipelines
  • Medical record digitization and indexing
  • RAG-based semantic search systems
  • HIPAA-compliant AI infrastructure

Contact us to discuss how we can help you unlock the value in your healthcare data.

This article originally appeared on lightrains.com
