| Rank | Model | Modality | Retrieval | Overall | Read. | Insh. | Stru. | Vef. | Con. | Cov. | Fid. | Sem. | Acc. | VQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini Deep Research (Gemini 3 Pro) | Deep Research | Agent | 49.41 | 84.53 | 89.56 | 70.86 | 35.71 | 56.17 | 52.84 | 31.29 | 41.29 | 87.54 | 28.45 |
| 2 | Gemini 3 Pro | Multimodal | Web Search | 44.68 | 58.05 | 75.39 | 49.85 | 46.43 | 37.98 | 41.85 | 6.46 | 40.69 | 80.44 | 23.15 |
| 3 | Gemini 3 Flash | Multimodal | Web Search | 44.43 | 81.22 | 90.22 | 52.00 | 45.71 | 31.95 | 35.07 | 15.42 | 36.61 | 87.31 | 18.99 |
| 4 | DeepSeek-V3.2 | Single-Modal | Offline | 43.71 | 75.37 | 87.82 | 58.16 | 19.28 | 33.34 | 45.48 | 18.77 | 42.19 | 83.85 | 12.88 |
| 5 | GPT-5 mini | Multimodal | Offline | 38.49 | 70.06 | 81.73 | 47.18 | 39.29 | 20.02 | 26.64 | 32.61 | 33.90 | 94.23 | 15.60 |
| 6 | Gemini 2.5 Flash | Multimodal | Web Search | 38.40 | 56.22 | 68.58 | 55.44 | 32.86 | 25.35 | 27.77 | 38.30 | 40.67 | 75.96 | 25.49 |
| 7 | Gemini 2.5 Pro | Multimodal | Web Search | 38.04 | 80.04 | 85.94 | 51.44 | 38.57 | 30.18 | 28.77 | 14.98 | 19.47 | 92.86 | 12.50 |
| 8 | Perplexity Sonar Deep Research | Deep Research | Agent | 37.55 | 62.29 | 64.35 | 47.80 | 27.86 | 33.12 | 41.51 | 16.68 | 50.79 | 87.75 | 21.22 |
| 9 | GPT-4.1 | Multimodal | Offline | 36.95 | 79.34 | 89.04 | 53.00 | 39.29 | 15.90 | 10.06 | 5.61 | 29.66 | 80.56 | 19.92 |
| 10 | Kimi K2 (Thinking) | Single-Modal | Offline | 36.91 | 71.34 | 77.27 | 47.34 | 17.14 | 23.54 | 24.62 | 27.20 | 42.00 | 90.00 | 9.50 |
| 11 | Grok-4 (Fast Reasoning) | Multimodal | Offline | 36.10 | 60.62 | 80.49 | 52.99 | 36.43 | 17.30 | 14.62 | 6.12 | 28.46 | 87.45 | 19.34 |
| 12 | Qwen3 235B (A22B) | Single-Modal | Offline | 36.04 | 77.56 | 85.74 | 54.05 | 17.14 | 35.60 | 45.73 | 22.98 | 20.43 | 53.09 | 4.95 |
| 13 | Qwen 3 VL 235B (A22B) | Multimodal | Offline | 35.08 | 77.01 | 86.48 | 52.21 | 43.57 | 18.34 | 15.25 | 10.68 | 30.58 | 93.52 | 16.98 |
| 14 | GPT-4.1 mini | Multimodal | Offline | 34.23 | 71.25 | 83.62 | 49.60 | 12.86 | 24.20 | 25.44 | 12.33 | 32.62 | 89.91 | 13.21 |
| 15 | Claude 4.5 Opus | Multimodal | Web Search | 33.84 | 77.81 | 83.86 | 50.70 | 35.00 | 30.64 | 41.14 | 21.97 | 21.30 | 77.21 | 14.75 |
| 16 | Claude 4.5 Haiku | Multimodal | Web Search | 33.67 | 74.60 | 81.80 | 53.22 | 28.57 | 17.90 | 14.10 | 18.56 | 25.98 | 76.90 | 11.70 |
| 17 | Claude 4.5 Sonnet | Multimodal | Web Search | 33.61 | 77.63 | 82.31 | 51.65 | 32.14 | 14.36 | 15.09 | 16.11 | 20.73 | 70.13 | 14.41 |
| 18 | GPT-5.2 | Multimodal | Offline | 32.76 | 69.75 | 83.92 | 54.31 | 46.43 | 14.00 | 1.43 | 5.30 | 12.83 | 50.00 | 9.16 |
| 19 | GPT-5.1 | Multimodal | Offline | 32.69 | 79.34 | 89.04 | 53.00 | 35.71 | 15.90 | 2.30 | 13.67 | 22.03 | 84.29 | 14.32 |
| 20 | OpenAI o3-mini | Single-Modal | Offline | 31.96 | 53.75 | 52.65 | 37.11 | 13.57 | 28.45 | 33.74 | 48.35 | 15.47 | 90.00 | 12.60 |
| 21 | Grok-3 | Multimodal | Offline | 29.89 | 75.17 | 86.13 | 52.24 | 20.00 | 12.57 | 5.79 | 2.80 | 22.18 | 68.39 | 13.89 |
| 22 | ChatGPT Deep Research (o3-mini) | Deep Research | Agent | 29.50 | 52.40 | 63.61 | 37.30 | 29.29 | 10.19 | 4.16 | 11.07 | 27.32 | 73.44 | 21.75 |
| 23 | Tongyi Deep Research (30B-A3B) | Deep Research | Agent | 29.02 | 54.27 | 62.67 | 40.07 | 12.86 | 25.99 | 30.87 | 24.25 | 20.39 | 93.33 | 20.39 |
| 24 | GPT-4o | Multimodal | Offline | 28.62 | 52.52 | 68.41 | 40.90 | 10.04 | 10.94 | 4.61 | 11.89 | 24.10 | 71.43 | 18.72 |
| 25 | GPT-4.1 nano | Multimodal | Offline | 28.07 | 49.77 | 64.82 | 37.28 | 10.79 | 18.99 | 19.86 | 24.42 | 27.02 | 76.30 | 13.04 |
All scores are on a 0–100 scale (higher is better) and are averaged across tasks.
Overall: one-number summary (0–100; higher is better). It combines three metric groups, FLAE (report quality), TRACE (citation grounding), and MOSAIC (visual grounding), weighted as Overall = 0.2·FLAE + 0.5·TRACE + 0.3·MOSAIC.
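The weighted combination above can be sketched as a small helper. The group scores passed in below are made up for illustration; they are not taken from the table.

```python
def overall(flae: float, trace: float, mosaic: float) -> float:
    """Combine the three metric-group scores (each on a 0-100 scale)
    into the Overall score using the published weights."""
    return 0.2 * flae + 0.5 * trace + 0.3 * mosaic

# Illustrative example with made-up group scores:
print(overall(80.0, 40.0, 30.0))  # 0.2*80 + 0.5*40 + 0.3*30 = 45.0
```

Because the weights sum to 1, Overall stays on the same 0–100 scale as its inputs.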
FLAE evaluates writing quality along three dimensions, reported in the table as the Read., Insh., and Stru. columns.
Why it matters: This captures how readable and complete the report is—but good writing alone doesn’t guarantee correct grounding.
TRACE checks whether cited sources truly support the claims and whether the report stays faithful to the task, including its visual requirements; its sub-scores appear in the Vef., Con., Cov., and Fid. columns.
Why it matters: High TRACE usually means claims are verifiable and citations are disciplined (fewer claims that look supported but aren't).
MOSAIC evaluates whether statements that reference images (charts, tables, diagrams, photos) match the visuals, scored in the Sem., Acc., and VQA columns.
Why it matters: A report can sound plausible yet be wrong if it misreads a chart/table or fails multi-step visual reasoning.
Each row also lists the model's modality (Multimodal, Single-Modal, or Deep Research) and its retrieval setting (Offline, Web Search, or Agent).
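As a minimal sketch of working with these fields offline, rows can be represented as plain dicts and then re-sorted or filtered; only two rows and the Overall column are reproduced here for brevity.

```python
# Two rows copied from the leaderboard (Overall column only, for brevity).
rows = [
    {"model": "Gemini 3 Pro", "modality": "Multimodal",
     "retrieval": "Web Search", "overall": 44.68},
    {"model": "DeepSeek-V3.2", "modality": "Single-Modal",
     "retrieval": "Offline", "overall": 43.71},
]

# Highest Overall first, mirroring the leaderboard's descending order.
ranked = sorted(rows, key=lambda r: r["overall"], reverse=True)
print(ranked[0]["model"])  # Gemini 3 Pro

# Keep only offline (no-retrieval) systems.
offline = [r for r in rows if r["retrieval"] == "Offline"]
print(len(offline))  # 1
```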
Reference: MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents (arXiv:2601.12346).