Performance Analysis
Detailed performance metrics of Apolo across different medical imaging tasks and datasets
Overview of Performance Evaluation
Apolo has been evaluated extensively on established medical imaging datasets spanning multiple specialties. The dual-stage architecture delivers strong performance across a range of tasks while preserving transparency and patient privacy.
This analysis presents results for both Stage 1 (description generation) and Stage 2 (diagnostic inference), along with ablation studies quantifying the impact of key architectural choices.
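To make the division of labor concrete, the sketch below traces a single case through both stages. It is a minimal illustration, not Apolo's actual API: the function names, stub outputs, and file path are placeholders.

```python
def describe_image(image_path: str) -> str:
    """Stage 1 (Apolo-Vision): turn pixels into a textual description of
    findings. Stub standing in for the vision-language model."""
    return "Bilateral lower-lobe opacities; no pleural effusion."

def diagnose_from_text(description: str) -> str:
    """Stage 2 (Apolo-Dx): infer a diagnosis from text alone. Stub for the
    locally deployed, 4-bit quantized inference model."""
    return "<think>Opacities without effusion suggest pneumonia.</think> Likely pneumonia."

def analyze(image_path: str) -> dict:
    description = describe_image(image_path)  # the image is consumed only here
    report = diagnose_from_text(description)  # text-only, can run on-site
    return {"description": description, "report": report}

print(analyze("chest_xray_001.png"))
```

Because Stage 2 consumes only text, the image never needs to leave the clinical site, which is what enables the local deployment discussed below.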
Key Findings
- Apolo achieves performance comparable or superior to state-of-the-art end-to-end models while providing explicit reasoning and privacy preservation
- The decoupled architecture maintains high performance while enabling local deployment of the inference component
- Quantizing Stage 2 to 4-bit precision has minimal impact on performance (0.008 average AUC drop)
Stage 1: Description Quality
Apolo-Vision generates detailed, medically accurate descriptions of visual findings in medical images. The performance of the description generation component was evaluated using both automated metrics and expert clinical assessments.
Automated Text Quality Metrics
| Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-4 |
|---|---|---|---|---|
| MIMIC-CXR | 0.51 | 0.37 | 0.49 | 0.32 |
| EyePACS | 0.47 | 0.34 | 0.46 | 0.30 |
| ROCO | 0.53 | 0.39 | 0.52 | 0.35 |
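For reference, metrics like those in the table can be computed with the open-source `rouge-score` and `nltk` packages. The snippet below is a minimal sketch; the reference and candidate strings are illustrative, not drawn from any of the datasets above.

```python
# pip install rouge-score nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Bilateral lower lobe opacities consistent with consolidation."
candidate = "Opacities in both lower lobes, consistent with consolidation."

# ROUGE-1/2/L F1 against the reference report
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# BLEU-4: uniform weights over 1- to 4-grams; smoothing avoids zero
# scores on short texts
bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

for name, score in rouge.items():
    print(f"{name}: F1={score.fmeasure:.2f}")
print(f"BLEU-4: {bleu4:.2f}")
```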
Expert Clinical Assessment
Four board-certified specialists (two ophthalmologists and two radiologists) rated 250 descriptions per modality on a 5-point Likert scale across five quality dimensions.
| Quality Metric | Ophthalmology (Avg. Score) | Radiology (Avg. Score) | Overall (Avg. Score) |
|---|---|---|---|
| Accuracy | 4.7/5 | 4.5/5 | 4.6/5 |
| Completeness | 4.5/5 | 4.6/5 | 4.55/5 |
| Clarity | 4.8/5 | 4.7/5 | 4.75/5 |
| Objectivity | 4.9/5 | 4.8/5 | 4.85/5 |
| Clinical Relevance | 4.3/5 | 4.5/5 | 4.4/5 |
Inter-rater reliability was substantial, with Cohen's Kappa = 0.84 (95% CI: 0.81-0.87).
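The sketch below shows one way such an agreement figure can be computed, assuming each rating is stored as one integer (1-5) per rater per description. With four raters, Cohen's Kappa is taken pairwise and averaged here, with a bootstrap 95% CI over items; both the aggregation and the synthetic data are assumptions, not the study's exact protocol.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
ratings = rng.integers(3, 6, size=(4, 250))  # 4 raters x 250 descriptions (synthetic)

def mean_pairwise_kappa(r: np.ndarray) -> float:
    """Average Cohen's Kappa over all rater pairs."""
    pairs = combinations(range(r.shape[0]), 2)
    return float(np.mean([cohen_kappa_score(r[i], r[j]) for i, j in pairs]))

point = mean_pairwise_kappa(ratings)

# Bootstrap over the 250 rated descriptions for a 95% CI
boot = [
    mean_pairwise_kappa(ratings[:, rng.integers(0, 250, size=250)])
    for _ in range(1000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {point:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```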
Stage 2: Diagnostic Performance
The Apolo-Dx module (quantized to 4-bit precision) demonstrated strong performance across multiple diagnostic tasks using only the textual descriptions produced in Stage 1.
Classification Performance
| Task | AUC-ROC (95% CI) | F1-Score | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| DR Detection (≥ Moderate) | 0.94 (0.93-0.95) | 0.89 | 0.91 | 0.87 | 0.93 |
| AMD Detection | 0.92 (0.90-0.93) | 0.87 | 0.89 | 0.85 | 0.91 |
| Pneumonia Detection | 0.93 (0.92-0.94) | 0.88 | 0.90 | 0.86 | 0.92 |
| Pleural Effusion | 0.94 (0.93-0.95) | 0.89 | 0.92 | 0.88 | 0.93 |
| Cardiomegaly | 0.91 (0.90-0.92) | 0.86 | 0.88 | 0.84 | 0.90 |
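The per-task figures in the table can be derived from case-level labels and model probabilities as sketched below. The synthetic data, the 0.5 decision threshold, and the bootstrap CI procedure are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                        # per-case labels
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 500), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)                         # assumed threshold

auc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on positive cases
specificity = tn / (tn + fp)   # recall on negative cases

# Bootstrap over cases for the 95% CI reported alongside each AUC
boot = [roc_auc_score(y_true[i], y_prob[i])
        for i in (rng.integers(0, 500, 500) for _ in range(1000))]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"AUC={auc:.2f} ({lo:.2f}-{hi:.2f}) F1={f1_score(y_true, y_pred):.2f} "
      f"Acc={accuracy_score(y_true, y_pred):.2f} "
      f"Sens={sensitivity:.2f} Spec={specificity:.2f}")
```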
Reasoning Quality Assessment
The explicit reasoning process contained within `<think>` tags was evaluated by clinical experts for logical coherence, clinical relevance, and alignment with standard diagnostic approaches (see the extraction sketch after the table).
| Reasoning Aspect | Ophthalmology (Avg. Score) | Radiology (Avg. Score) | Overall (Avg. Score) |
|---|---|---|---|
| Logical Coherence | 4.6/5 | 4.7/5 | 4.65/5 |
| Clinical Relevance | 4.5/5 | 4.6/5 | 4.55/5 |
| Alignment with Standard Practice | 4.3/5 | 4.4/5 | 4.35/5 |
| Completeness of Analysis | 4.4/5 | 4.5/5 | 4.45/5 |
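The sketch below shows one way the `<think>` reasoning can be separated from the final answer for expert review, assuming the tagged output format described above; the example output is fabricated for illustration.

```python
import re

output = (
    "<think>Hard exudates and microaneurysms in multiple quadrants suggest "
    "at least moderate NPDR; no neovascularization is described.</think>"
    "Assessment: moderate non-proliferative diabetic retinopathy."
)

# Pull out the chain of thought for the expert rubric above ...
match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
# ... and keep the remainder as the diagnostic conclusion itself
answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()

print("Reasoning:", reasoning)
print("Answer:", answer)
```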
Ablation Studies & Architectural Analysis
Impact of Direct Preference Optimization (DPO)
To assess the impact of DPO fine-tuning using expert preference data, we compared the base model with the DPO-enhanced model (see the training sketch after the table):
| Model Version | ROUGE-L (vs. Reference) | Expert Rating (Avg.) | Downstream AUC (DR Detection) |
|---|---|---|---|
| Without DPO | 0.44 | 4.1/5 | 0.90 |
| With DPO (Final) | 0.49 | 4.6/5 | 0.94 |
| Improvement | +0.05 | +0.5 | +0.04 |
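The snippet below sketches what DPO training on expert preference pairs can look like with the open-source `trl` library. The source does not specify Apolo's training stack, so the library choice, model identifier, hyperparameters, and example records are all assumptions; a text-only causal LM stands in for the vision-language Stage 1 model for brevity, and the API shown reflects recent `trl` releases.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-stage1-base-model"  # placeholder identifier
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Each record pairs an expert-preferred description ("chosen") with a
# less-preferred one ("rejected") for the same prompt (illustrative data)
prefs = Dataset.from_dict({
    "prompt":   ["Describe the findings in this fundus photograph."],
    "chosen":   ["Scattered microaneurysms and hard exudates in the macula; "
                 "no neovascularization visible."],
    "rejected": ["The image shows diabetic retinopathy."],  # skips to a diagnosis
})

args = DPOConfig(output_dir="apolo-dpo", beta=0.1,  # beta scales the KL penalty
                 per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs,
                     processing_class=tokenizer)
trainer.train()
```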
Impact of Quantization on Stage 2
To evaluate the effect of 4-bit quantization on diagnostic performance, we compared the full-precision model with the quantized version (see the loading sketch after the table):
| Model Version | Size | Inference Latency (A100 GPU) | AUC-ROC (Avg. across tasks) |
|---|---|---|---|
| Full Precision (BF16) | ~23 GB | ~290 ms | 0.934 |
| Quantized (4-bit) | ~6.5 GB | ~145 ms | 0.926 |
| Impact | ~72% reduction | ~50% reduction | -0.008 |
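A common way to obtain such a 4-bit model is NF4 quantization via `transformers` and `bitsandbytes`, sketched below. The model identifier and the exact configuration are assumptions; the source does not state which quantization scheme Apolo-Dx uses.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in BF16
    bnb_4bit_use_double_quant=True,          # also quantize the scaling constants
)

model = AutoModelForCausalLM.from_pretrained(
    "your-stage2-model",                     # placeholder identifier
    quantization_config=bnb_config,
    device_map="auto",
)
# Weights shrink to roughly a quarter of the BF16 footprint, consistent
# with the ~23 GB -> ~6.5 GB reduction reported above.
```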
Clinical Validation
Beyond laboratory metrics, Apolo was evaluated in a prospective pilot study of 120 cases in a clinical setting, which demonstrated:
- An 18% reduction in mean time-to-report when Apolo was used as an assistant
- No critical misses relative to senior-radiologist ground truth
- Positive feedback from clinical users on explanation quality (mean rating 4.6/5)
A larger multisite clinical validation study is currently underway, with preliminary results expected in Q3 2025.
Conclusion
Apolo demonstrates that a decoupled, privacy-preserving approach to medical image analysis can achieve high performance while providing explainability at multiple levels. Quantization lets the diagnostic component run locally with minimal loss of accuracy, making the approach practical for real-world clinical deployment.
Future work will focus on expanding the range of supported modalities, enhancing the fine-tuning process, and conducting larger-scale clinical validation studies.