| Rank | Model | Modality | Retrieval | Overall | Read. | Insh. | Stru. | Vef. | Con. | Cov. | Fid. | Sem. | Acc. | VQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini Deep Research (Gemini 3 Pro) | Deep Research | Agent | 49.41 | 84.53 | 89.56 | 70.86 | 35.71 | 56.17 | 52.84 | 31.29 | 41.29 | 87.54 | 28.45 |
| 2 | Gemini 3 Pro | Multimodal | Web Search | 44.68 | 58.05 | 75.39 | 49.85 | 46.43 | 37.98 | 41.85 | 6.46 | 40.69 | 80.44 | 23.15 |
| 3 | Gemini 3 Flash | Multimodal | Web Search | 44.43 | 81.22 | 90.22 | 52.00 | 45.71 | 31.95 | 35.07 | 15.42 | 36.61 | 87.31 | 18.99 |
| 4 | DeepSeek-V3.2 | Single-Modal | Offline | 43.71 | 75.37 | 87.82 | 58.16 | 19.28 | 33.34 | 45.48 | 18.77 | 42.19 | 83.85 | 12.88 |
| 5 | GPT-5 mini | Multimodal | Offline | 38.49 | 70.06 | 81.73 | 47.18 | 39.29 | 20.02 | 26.64 | 32.61 | 33.90 | 94.23 | 15.60 |
| 6 | Gemini 2.5 Flash | Multimodal | Web Search | 38.40 | 56.22 | 68.58 | 55.44 | 32.86 | 25.35 | 27.77 | 38.30 | 40.67 | 75.96 | 25.49 |
| 7 | Gemini 2.5 Pro | Multimodal | Web Search | 38.04 | 80.04 | 85.94 | 51.44 | 38.57 | 30.18 | 28.77 | 14.98 | 19.47 | 92.86 | 12.50 |
| 8 | Perplexity Sonar Deep Research | Deep Research | Agent | 37.55 | 62.29 | 64.35 | 47.80 | 27.86 | 33.12 | 41.51 | 16.68 | 50.79 | 87.75 | 21.22 |
| 9 | GPT-4.1 | Multimodal | Offline | 36.95 | 79.34 | 89.04 | 53.00 | 39.29 | 15.90 | 10.06 | 5.61 | 29.66 | 80.56 | 19.92 |
| 10 | Kimi K2 (Thinking) | Single-Modal | Offline | 36.91 | 71.34 | 77.27 | 47.34 | 17.14 | 23.54 | 24.62 | 27.20 | 42.00 | 90.00 | 9.50 |
| 11 | Grok-4 (Fast Reasoning) | Multimodal | Offline | 36.10 | 60.62 | 80.49 | 52.99 | 36.43 | 17.30 | 14.62 | 6.12 | 28.46 | 87.45 | 19.34 |
| 12 | Qwen3 235B (A22B) | Single-Modal | Offline | 36.04 | 77.56 | 85.74 | 54.05 | 17.14 | 35.60 | 45.73 | 22.98 | 20.43 | 53.09 | 4.95 |
| 13 | Qwen 3 VL 235B (A22B) | Multimodal | Offline | 35.08 | 77.01 | 86.48 | 52.21 | 43.57 | 18.34 | 15.25 | 10.68 | 30.58 | 93.52 | 16.98 |
| 14 | GPT-4.1 mini | Multimodal | Offline | 34.23 | 71.25 | 83.62 | 49.60 | 12.86 | 24.20 | 25.44 | 12.33 | 32.62 | 89.91 | 13.21 |
| 15 | Claude 4.5 Opus | Multimodal | Web Search | 33.84 | 77.81 | 83.86 | 50.70 | 35.00 | 30.64 | 41.14 | 21.97 | 21.30 | 77.21 | 14.75 |
| 16 | Claude 4.5 Haiku | Multimodal | Web Search | 33.67 | 74.60 | 81.80 | 53.22 | 28.57 | 17.90 | 14.10 | 18.56 | 25.98 | 76.90 | 11.70 |
| 17 | Claude 4.5 Sonnet | Multimodal | Web Search | 33.61 | 77.63 | 82.31 | 51.65 | 32.14 | 14.36 | 15.09 | 16.11 | 20.73 | 70.13 | 14.41 |
| 18 | GPT-5.2 | Multimodal | Offline | 32.76 | 69.75 | 83.92 | 54.31 | 46.43 | 14.00 | 1.43 | 5.30 | 12.83 | 50.00 | 9.16 |
| 19 | GPT-5.1 | Multimodal | Offline | 32.69 | 79.34 | 89.04 | 53.00 | 35.71 | 15.90 | 2.30 | 13.67 | 22.03 | 84.29 | 14.32 |
| 20 | OpenAI o3-mini | Single-Modal | Offline | 31.96 | 53.75 | 52.65 | 37.11 | 13.57 | 28.45 | 33.74 | 48.35 | 15.47 | 90.00 | 12.60 |
| 21 | Grok-3 | Multimodal | Offline | 29.89 | 75.17 | 86.13 | 52.24 | 20.00 | 12.57 | 5.79 | 2.80 | 22.18 | 68.39 | 13.89 |
| 22 | ChatGPT Deep Research (o3-mini) | Deep Research | Agent | 29.50 | 52.40 | 63.61 | 37.30 | 29.29 | 10.19 | 4.16 | 11.07 | 27.32 | 73.44 | 21.75 |
| 23 | Tongyi Deep Research (30B-A3B) | Deep Research | Agent | 29.02 | 54.27 | 62.67 | 40.07 | 12.86 | 25.99 | 30.87 | 24.25 | 20.39 | 93.33 | 20.39 |
| 24 | GPT-4o | Multimodal | Offline | 28.62 | 52.52 | 68.41 | 40.90 | 10.04 | 10.94 | 4.61 | 11.89 | 24.10 | 71.43 | 18.72 |
| 25 | GPT-4.1 nano | Multimodal | Offline | 28.07 | 49.77 | 64.82 | 37.28 | 10.79 | 18.99 | 19.86 | 24.42 | 27.02 | 76.30 | 13.04 |
All scores are on a 0–100 scale (higher is better) and are averaged across tasks.
Overall: one-number summary (0–100; higher is better). It combines three metric groups, FLAE (report quality), TRACE (citation grounding), and MOSAIC (visual grounding), weighted as Overall = 0.2·FLAE + 0.5·TRACE + 0.3·MOSAIC.
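The weighted combination above can be sketched as a small helper. The group scores passed in below are made up for illustration; they are not taken from the table.

```python
def overall(flae: float, trace: float, mosaic: float) -> float:
    """Combine the three metric-group scores (each on a 0-100 scale)
    into the Overall score using the published weights."""
    return 0.2 * flae + 0.5 * trace + 0.3 * mosaic

# Illustrative example with made-up group scores:
print(overall(80.0, 40.0, 30.0))  # 0.2*80 + 0.5*40 + 0.3*30 = 45.0
```

Because the weights sum to 1, Overall stays on the same 0–100 scale as its inputs.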
FLAE evaluates writing quality along three dimensions, reported in the table as the Read., Insh., and Stru. columns.
Why it matters: This captures how readable and complete the report is—but good writing alone doesn’t guarantee correct grounding.
TRACE checks whether cited sources truly support the claims and whether the report stays faithful to the task, including its visual requirements; its sub-scores appear in the Vef., Con., Cov., and Fid. columns.
Why it matters: High TRACE usually means claims are verifiable and citations are disciplined (fewer claims that look supported but aren't).
MOSAIC evaluates whether statements that reference images (charts, tables, diagrams, photos) match the visuals, scored in the Sem., Acc., and VQA columns.
Why it matters: A report can sound plausible yet be wrong if it misreads a chart/table or fails multi-step visual reasoning.
Each row also lists the model's modality (Multimodal, Single-Modal, or Deep Research) and its retrieval setting (Offline, Web Search, or Agent).
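As a minimal sketch of working with these fields offline, rows can be represented as plain dicts and then re-sorted or filtered; only two rows and the Overall column are reproduced here for brevity.

```python
# Two rows copied from the leaderboard (Overall column only, for brevity).
rows = [
    {"model": "Gemini 3 Pro", "modality": "Multimodal",
     "retrieval": "Web Search", "overall": 44.68},
    {"model": "DeepSeek-V3.2", "modality": "Single-Modal",
     "retrieval": "Offline", "overall": 43.71},
]

# Highest Overall first, mirroring the leaderboard's descending order.
ranked = sorted(rows, key=lambda r: r["overall"], reverse=True)
print(ranked[0]["model"])  # Gemini 3 Pro

# Keep only offline (no-retrieval) systems.
offline = [r for r in rows if r["retrieval"] == "Offline"]
print(len(offline))  # 1
```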
Reference: MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents (arXiv:2601.12346).