Performance Analysis

Detailed performance metrics of Apolo across different medical imaging tasks and datasets

Overview of Performance Evaluation

Apolo's performance has been extensively evaluated on established medical imaging datasets spanning multiple specialties. The dual-stage architecture shows strong performance across various tasks while maintaining transparency and privacy protection.

This analysis presents the results of both Stage 1 (description generation) and Stage 2 (diagnostic inference) evaluations, along with detailed ablation studies on the impact of architectural choices on performance metrics.

Key Findings

  • Apolo achieves comparable or superior performance to state-of-the-art end-to-end models while providing explicit reasoning and privacy preservation
  • The decoupled architecture maintains high performance while enabling local deployment of the inference component
  • Quantizing Stage 2 to 4-bit precision has minimal impact on performance (≤ 0.008 AUC drop)

Stage 1: Description Quality

Apolo-Vision generates detailed, medically accurate descriptions of visual findings in medical images. The performance of the description generation component was evaluated using both automated metrics and expert clinical assessments.

Automated Text Quality Metrics

| Dataset   | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-4 |
| --------- | ------- | ------- | ------- | ------ |
| MIMIC-CXR | 0.51    | 0.37    | 0.49    | 0.32   |
| EyePACS   | 0.47    | 0.34    | 0.46    | 0.30   |
| ROCO      | 0.53    | 0.39    | 0.52    | 0.35   |
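For context, the ROUGE-L scores above are F1 scores based on the longest common subsequence (LCS) between a generated description and its reference. A minimal pure-Python sketch of that computation (whitespace tokenization only; production scorers typically add stemming and normalization):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS precision and recall over tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("the optic disc appears normal", "the disc appears normal")` scores about 0.89: all four candidate tokens appear in order in the reference (precision 1.0), but one reference token is missed (recall 0.8).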

Expert Clinical Assessment

Four board-certified specialists (two ophthalmologists and two radiologists) evaluated 250 descriptions per modality on a 5-point Likert scale for different quality aspects.

| Quality Metric     | Ophthalmology (Avg. Score) | Radiology (Avg. Score) | Overall (Avg. Score) |
| ------------------ | -------------------------- | ---------------------- | -------------------- |
| Accuracy           | 4.7/5                      | 4.5/5                  | 4.6/5                |
| Completeness       | 4.5/5                      | 4.6/5                  | 4.55/5               |
| Clarity            | 4.8/5                      | 4.7/5                  | 4.75/5               |
| Objectivity        | 4.9/5                      | 4.8/5                  | 4.85/5               |
| Clinical Relevance | 4.3/5                      | 4.5/5                  | 4.4/5                |

Inter-rater reliability was high, with Cohen's kappa = 0.84 (95% CI: 0.81-0.87), indicating almost-perfect agreement on the Landis and Koch scale.
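Cohen's kappa corrects raw rater agreement for the agreement expected by chance. A minimal two-rater sketch over categorical labels (illustrative only, not the study's analysis code):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - chance) / (1 - chance)
```

With raters who agree on 3 of 4 binary labels, for instance, raw agreement is 0.75 but kappa is only 0.5 once chance agreement (0.5 here) is factored out.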

Stage 2: Diagnostic Performance

The Apolo-Dx module (quantized to 4-bit precision) demonstrated strong performance across multiple diagnostic tasks, operating solely on the textual descriptions produced by Stage 1.

Classification Performance

| Task                      | AUC-ROC (95% CI) | F1-Score | Accuracy | Sensitivity | Specificity |
| ------------------------- | ---------------- | -------- | -------- | ----------- | ----------- |
| DR Detection (≥ Moderate) | 0.94 (0.93-0.95) | 0.89     | 0.91     | 0.87        | 0.93        |
| AMD Detection             | 0.92 (0.90-0.93) | 0.87     | 0.89     | 0.85        | 0.91        |
| Pneumonia Detection       | 0.93 (0.92-0.94) | 0.88     | 0.90     | 0.86        | 0.92        |
| Pleural Effusion          | 0.94 (0.93-0.95) | 0.89     | 0.92     | 0.88        | 0.93        |
| Cardiomegaly              | 0.91 (0.90-0.92) | 0.86     | 0.88     | 0.84        | 0.90       |
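Apart from AUC-ROC, which is threshold-free, the metrics above all derive from the four cells of a confusion matrix at a fixed decision threshold. A small sketch of those definitions (not the evaluation harness itself):

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),        # recall on positive cases
        "specificity": tn / (tn + fp),        # recall on negative cases
        "f1": 2 * tp / (2 * tp + fp + fn),    # harmonic mean of precision and recall
    }
```

In a screening context the sensitivity/specificity trade-off matters most: the same model can trade one for the other by moving the threshold, which is why the table reports both alongside the threshold-free AUC.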

Reasoning Quality Assessment

The explicit reasoning process contained within <think> tags was evaluated by clinical experts for logical coherence, clinical relevance, and alignment with standard diagnostic approaches.
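Because the reasoning trace is delimited by literal <think> tags in the model output, it can be separated from the final answer mechanically before being shown to reviewers. A minimal sketch (the tag name comes from the source; the function name and output format are assumptions):

```python
import re

def split_reasoning(output: str):
    """Split a model response into its <think> reasoning trace and the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Remove the reasoning block; whatever remains is the user-facing answer.
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer
```

This separation is what makes rubric-based scoring of the reasoning (coherence, relevance, alignment, completeness) practical at scale.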

| Reasoning Aspect                 | Ophthalmology (Avg. Score) | Radiology (Avg. Score) | Overall (Avg. Score) |
| -------------------------------- | -------------------------- | ---------------------- | -------------------- |
| Logical Coherence                | 4.6/5                      | 4.7/5                  | 4.65/5               |
| Clinical Relevance               | 4.5/5                      | 4.6/5                  | 4.55/5               |
| Alignment with Standard Practice | 4.3/5                      | 4.4/5                  | 4.35/5               |
| Completeness of Analysis         | 4.4/5                      | 4.5/5                  | 4.45/5               |

Ablation Studies & Architectural Analysis

Impact of Direct Preference Optimization (DPO)

To assess the impact of DPO fine-tuning using expert preference data, we compared the base model with the DPO-enhanced model:

| Model Version   | ROUGE-L (vs. Reference) | Expert Rating (Avg.) | Downstream AUC (DR Detection) |
| --------------- | ----------------------- | -------------------- | ----------------------------- |
| Without DPO     | 0.44                    | 4.1/5                | 0.90                          |
| With DPO (Final)| 0.49                    | 4.6/5                | 0.94                          |
| Improvement     | +0.05                   | +0.5                 | +0.04                         |
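DPO optimizes the model directly on preference pairs: for each prompt, a chosen (expert-preferred) and a rejected description. A minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model (the `beta` value and inputs are illustrative, not Apolo's training configuration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled log-ratio margin."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the expert-preferred description, which is the mechanism behind the ROUGE-L and expert-rating gains in the table.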

Impact of Quantization on Stage 2

To evaluate the effect of 4-bit quantization on diagnostic performance, we compared the full-precision model with the quantized version:

| Model Version        | Size           | Inference Latency (A100 GPU) | AUC-ROC (Avg. across tasks) |
| -------------------- | -------------- | ---------------------------- | --------------------------- |
| Full Precision (BF16)| ~23 GB         | ~290 ms                      | 0.934                       |
| Quantized (4-bit)    | ~6.5 GB        | ~145 ms                      | 0.926                       |
| Impact               | ~72% reduction | ~50% reduction               | -0.008                      |
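The specific quantization scheme is not detailed here; as a rough intuition for where the ~72% size reduction comes from, a minimal sketch of symmetric round-to-nearest 4-bit quantization, which maps each weight to an integer in [-8, 7] plus a shared scale:

```python
def quantize_int4(weights):
    """Symmetric per-tensor 4-bit quantization: floats -> integers in [-8, 7] + scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]
```

Each weight shrinks from 16 bits (BF16) to 4 bits, and the reconstruction error per weight is bounded by half the scale, which is why the average AUC moves by only 0.008. Production schemes (e.g. per-group scales) refine this idea rather than replace it.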

Clinical Validation

Beyond laboratory metrics, Apolo underwent a prospective pilot study in a clinical setting with 120 cases, demonstrating:

  • Mean time-to-report reduced by 18% when using Apolo as an assistant
  • No critical misses compared to senior radiologist ground truth
  • Positive feedback on explanation quality from clinical users (mean rating 4.6/5)

A larger multisite clinical validation study is currently underway, with preliminary results expected in Q3 2025.

Conclusion

Apolo demonstrates that a decoupled, privacy-preserving approach to medical image analysis can achieve high performance while providing multiple levels of explainability. The approach enables local deployment of the diagnostic component with minimal performance impact through quantization, making it practical for real-world clinical deployment.

Future work will focus on expanding the range of supported modalities, enhancing the fine-tuning process, and conducting larger-scale clinical validation studies.