MMDeepResearch Model Rankings — Sortable Table

Click a column header to sort (first click sorts descending; click again to toggle). Hover headers for quick tooltips.

| Rank | Model | Modality | Retrieval | Overall | Read. | Insh. | Stru. | Vef. | Con. | Cov. | Fid. | Sem. | Acc. | VQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini Deep Research (Gemini 3 Pro) | Deep Research | Agent | 49.41 | 84.53 | 89.56 | 70.86 | 35.71 | 56.17 | 52.84 | 31.29 | 41.29 | 87.54 | 28.45 |
| 2 | Gemini 3 Pro | Multimodal | Web Search | 44.68 | 58.05 | 75.39 | 49.85 | 46.43 | 37.98 | 41.85 | 6.46 | 40.69 | 80.44 | 23.15 |
| 3 | Gemini 3 Flash | Multimodal | Web Search | 44.43 | 81.22 | 90.22 | 52.00 | 45.71 | 31.95 | 35.07 | 15.42 | 36.61 | 87.31 | 18.99 |
| 4 | DeepSeek-V3.2 | Single-Modal | Offline | 43.71 | 75.37 | 87.82 | 58.16 | 19.28 | 33.34 | 45.48 | 18.77 | 42.19 | 83.85 | 12.88 |
| 5 | GPT-5 mini | Multimodal | Offline | 38.49 | 70.06 | 81.73 | 47.18 | 39.29 | 20.02 | 26.64 | 32.61 | 33.90 | 94.23 | 15.60 |
| 6 | Gemini 2.5 Flash | Multimodal | Web Search | 38.40 | 56.22 | 68.58 | 55.44 | 32.86 | 25.35 | 27.77 | 38.30 | 40.67 | 75.96 | 25.49 |
| 7 | Gemini 2.5 Pro | Multimodal | Web Search | 38.04 | 80.04 | 85.94 | 51.44 | 38.57 | 30.18 | 28.77 | 14.98 | 19.47 | 92.86 | 12.50 |
| 8 | Perplexity Sonar Deep Research | Deep Research | Agent | 37.55 | 62.29 | 64.35 | 47.80 | 27.86 | 33.12 | 41.51 | 16.68 | 50.79 | 87.75 | 21.22 |
| 9 | GPT-4.1 | Multimodal | Offline | 36.95 | 79.34 | 89.04 | 53.00 | 39.29 | 15.90 | 10.06 | 5.61 | 29.66 | 80.56 | 19.92 |
| 10 | Kimi K2 (Thinking) | Single-Modal | Offline | 36.91 | 71.34 | 77.27 | 47.34 | 17.14 | 23.54 | 24.62 | 27.20 | 42.00 | 90.00 | 9.50 |
| 11 | Grok-4 (Fast Reasoning) | Multimodal | Offline | 36.10 | 60.62 | 80.49 | 52.99 | 36.43 | 17.30 | 14.62 | 6.12 | 28.46 | 87.45 | 19.34 |
| 12 | Qwen3 235B (A22B) | Single-Modal | Offline | 36.04 | 77.56 | 85.74 | 54.05 | 17.14 | 35.60 | 45.73 | 22.98 | 20.43 | 53.09 | 4.95 |
| 13 | Qwen 3 VL 235B (A22B) | Multimodal | Offline | 35.08 | 77.01 | 86.48 | 52.21 | 43.57 | 18.34 | 15.25 | 10.68 | 30.58 | 93.52 | 16.98 |
| 14 | GPT-4.1 mini | Multimodal | Offline | 34.23 | 71.25 | 83.62 | 49.60 | 12.86 | 24.20 | 25.44 | 12.33 | 32.62 | 89.91 | 13.21 |
| 15 | Claude 4.5 Opus | Multimodal | Web Search | 33.84 | 77.81 | 83.86 | 50.70 | 35.00 | 30.64 | 41.14 | 21.97 | 21.30 | 77.21 | 14.75 |
| 16 | Claude 4.5 Haiku | Multimodal | Web Search | 33.67 | 74.60 | 81.80 | 53.22 | 28.57 | 17.90 | 14.10 | 18.56 | 25.98 | 76.90 | 11.70 |
| 17 | Claude 4.5 Sonnet | Multimodal | Web Search | 33.61 | 77.63 | 82.31 | 51.65 | 32.14 | 14.36 | 15.09 | 16.11 | 20.73 | 70.13 | 14.41 |
| 18 | GPT-5.2 | Multimodal | Offline | 32.76 | 69.75 | 83.92 | 54.31 | 46.43 | 14.00 | 1.43 | 5.30 | 12.83 | 50.00 | 9.16 |
| 19 | GPT-5.1 | Multimodal | Offline | 32.69 | 79.34 | 89.04 | 53.00 | 35.71 | 15.90 | 2.30 | 13.67 | 22.03 | 84.29 | 14.32 |
| 20 | OpenAI o3-mini | Single-Modal | Offline | 31.96 | 53.75 | 52.65 | 37.11 | 13.57 | 28.45 | 33.74 | 48.35 | 15.47 | 90.00 | 12.60 |
| 21 | Grok-3 | Multimodal | Offline | 29.89 | 75.17 | 86.13 | 52.24 | 20.00 | 12.57 | 5.79 | 2.80 | 22.18 | 68.39 | 13.89 |
| 22 | ChatGPT Deep Research (o3-mini) | Deep Research | Agent | 29.50 | 52.40 | 63.61 | 37.30 | 29.29 | 10.19 | 4.16 | 11.07 | 27.32 | 73.44 | 21.75 |
| 23 | Tongyi Deep Research (30B-A3B) | Deep Research | Agent | 29.02 | 54.27 | 62.67 | 40.07 | 12.86 | 25.99 | 30.87 | 24.25 | 20.39 | 93.33 | 20.39 |
| 24 | GPT-4o | Multimodal | Offline | 28.62 | 52.52 | 68.41 | 40.90 | 10.04 | 10.94 | 4.61 | 11.89 | 24.10 | 71.43 | 18.72 |
| 25 | GPT-4.1 nano | Multimodal | Offline | 28.07 | 49.77 | 64.82 | 37.28 | 10.79 | 18.99 | 19.86 | 24.42 | 27.02 | 76.30 | 13.04 |
What do these columns mean?

All scores are on a 0–100 scale (higher is better) and are averaged across tasks.

Overall score

Overall: a one-number summary (0–100; higher is better). It combines three groups of metrics: FLAE (report quality), TRACE (citation grounding), and MOSAIC (visual grounding), weighted as Overall = 0.2·FLAE + 0.5·TRACE + 0.3·MOSAIC.
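
For concreteness, the weighting can be reproduced in a few lines. This is a minimal sketch of the published weights only; the function name `overall_score` and the example inputs are illustrative, since the table above reports per-dimension scores rather than FLAE/TRACE/MOSAIC group totals.

```python
# Minimal sketch of the Overall weighting described above.
# The weights (0.2 / 0.5 / 0.3) come from this page; the example
# inputs are hypothetical group scores, not values from the table.

def overall_score(flae: float, trace: float, mosaic: float) -> float:
    """Combine the three metric groups (each on a 0-100 scale)."""
    return 0.2 * flae + 0.5 * trace + 0.3 * mosaic

# Hypothetical group scores, for illustration only:
print(overall_score(flae=80.0, trace=40.0, mosaic=50.0))  # 51.0
```

Note how the 0.5 weight on TRACE makes citation grounding the largest single contributor to a model's Overall score.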

FLAE — report quality

FLAE evaluates writing quality using three dimensions:

  • READ (Readability): how clear, coherent, and easy to read the report is.
  • INSH (Insightfulness): whether the report provides meaningful analysis and helpful takeaways (not just restating facts).
  • STRU (Structural Completeness): whether the report is well-structured (sections, coverage of requested items, completeness).

Why it matters: This captures how readable and complete the report is—but good writing alone doesn’t guarantee correct grounding.

TRACE — citation grounding

TRACE checks whether cited sources truly support the claims and whether the report stays faithful to the task (including visual requirements):

  • VEF (Visual Evidence Fidelity): whether claims that depend on the images actually match what’s in the provided visuals. Within TRACE this is treated as a strict prompt-faithfulness check, scored pass/fail against a threshold (see the sketch below).
  • CON (Consistency): whether claims align with their cited sources (not contradicted or mismatched).
  • COV (Coverage): how much of the report’s important content is properly supported with citations (few missing/uncited key claims).
  • FID (Textual Fidelity): how faithfully a claim matches the cited content—penalizing issues like over‑specific claims, unsupported details, or incorrect cause/effect direction.

Why it matters: High TRACE usually means claims are verifiable and citations are disciplined (fewer “looks supported but isn’t”).
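
To make the pass/fail treatment of VEF concrete, here is one possible reading. The threshold value below is a hypothetical placeholder, not the benchmark's actual cutoff; the MMDeepResearch-Bench paper defines the real scoring rule.

```python
# Hedged sketch of VEF as a strict gate inside TRACE: a claim's
# visual-evidence fidelity either clears the threshold or fails
# outright, with no partial credit. The cutoff is a placeholder.

VEF_THRESHOLD = 50.0  # assumed cutoff on the 0-100 scale

def vef_passes(vef_score: float) -> bool:
    """Strict prompt-faithfulness check: pass/fail, nothing in between."""
    return vef_score >= VEF_THRESHOLD

print(vef_passes(62.0))  # True under this assumed cutoff
```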

MOSAIC — visual grounding

MOSAIC evaluates whether statements that reference images (charts, tables, diagrams, photos) match the visuals:

  • SEM (Visual‑Semantic Alignment): whether the text that references an image matches the image at a meaning/semantic level.
  • ACC (Visual Data Interpretation Accuracy): whether the model correctly reads precise visual details (numbers, labels, table cells, chart values).
  • VQA (Complex Visual QA / Reasoning): how well the model answers harder, multi‑step questions that require reasoning over visual evidence (e.g., comparisons, trends, derived quantities).

Why it matters: A report can sound plausible yet be wrong if it misreads a chart/table or fails multi-step visual reasoning.

Row tags

Each row also includes:

  • Modality:
    • Single‑Modal = text-only.
    • Multimodal = text + images.
    • Deep Research = an agent-style system built for multi-step research/reporting.
  • Retrieval:
    • Offline = no web access.
    • Web Search = can browse/search the web.
    • Agent = tool-using research agent workflow.

Quick interpretation tips

  • High FLAE + low TRACE: well-written, but citations may not truly support claims.
  • High Coverage (Cov.) + low Fidelity (Fid.): lots of citations, but they may be mismatched or over-stretched.
  • High Sem. + low Acc.: matches the “gist” of an image but gets exact numbers/details wrong.
  • Low Vef.: likely misinterprets the provided visual evidence or violates key visual requirements.
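
If you want to scan the table for these patterns programmatically, the tips translate directly into screening rules. The sketch below is one possible encoding: the hi/lo cutoffs are arbitrary choices (not part of the benchmark, and worth tuning to each column's distribution), and the FLAE-vs-TRACE tip is omitted because the table lists per-dimension scores rather than group totals.

```python
# Illustrative screening rules for the interpretation tips above.
# The hi/lo cutoffs are arbitrary placeholders, not benchmark values.

def interpretation_flags(row: dict[str, float],
                         hi: float = 30.0, lo: float = 15.0) -> list[str]:
    flags = []
    if row["cov"] >= hi and row["fid"] <= lo:
        flags.append("many citations, possibly mismatched or over-stretched")
    if row["sem"] >= hi and row["acc"] <= lo:
        flags.append("matches the image gist but misreads exact details")
    if row["vef"] <= lo:
        flags.append("likely misreads or violates visual requirements")
    return flags

# Example: GPT-4o's row from the table above.
row = {"vef": 10.04, "cov": 4.61, "fid": 11.89, "sem": 24.10, "acc": 71.43}
print(interpretation_flags(row))
# -> ['likely misreads or violates visual requirements']
```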

Reference: MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents (arXiv:2601.12346).