Provider | Reasoning & Knowledge - HLE-Full | Reasoning & Knowledge - HLE-Full (w/ tools) | Reasoning & Knowledge - AIME 2025 | Reasoning & Knowledge - HMMT 2025 (Feb) | Reasoning & Knowledge - IMO-AnswerBench | Reasoning & Knowledge - GPQA-Diamond | Reasoning & Knowledge - MMLU-Pro | Image & Video - MMMU-Pro | Image & Video - CharXiv (RQ) | Image & Video - MathVision | Image & Video - MathVista (mini) | Image & Video - ZeroBench | Image & Video - ZeroBench (w/ tools) | Image & Video - OCRBench | Image & Video - OmniDocBench 1.5 | Image & Video - InfoVQA (val) | Image & Video - SimpleVQA | Image & Video - WorldVQA | Image & Video - VideoMMMU | Image & Video - MMVU | Image & Video - MotionBench | Image & Video - VideoMME | Image & Video - LongVideoBench | Image & Video - LVBench | Coding - SWE-Bench Verified | Coding - SWE-Bench Pro | Coding - SWE-Bench Multilingual | Coding - Terminal Bench 2.0 | Coding - PaperBench | Coding - CyberGym | Coding - SciCode | Coding - OJBench (cpp) | Coding - LiveCodeBench (v6) | Long Context - Longbench v2 | Long Context - AA-LCR | Agentic Search - BrowseComp | Agentic Search - BrowseComp (w/ctx manage) | Agentic Search - BrowseComp (Agent Swarm) | Agentic Search - WideSearch (item-f1) | Agentic Search - WideSearch (item-f1 Agent Swarm) | Agentic Search - DeepSearchQA | Agentic Search - FinSearchComp T2&T3 | Agentic Search - Seal-0
Kimi K2.5 (Thinking) 30.1 50.2 96.1 95.4 81.8 87.6 87.1 78.5 77.5 84.2 90.1 9 11 92.3 88.8 83.5 83.9 47.0 70.0 77.5 61.8 72.0 54.1 54.5 71.6 49.5 56.0 40.4 61.4 32.6 68.5 80.0 77.1 60.0 93.5 60.6 74.9 78.4 68.4 73.0 58.8 62.0 46.4
GPT-5.2 (xhigh) 34.5 45.5 100 99.4 86.3 92.4 86.7* 79.5* 82.1 83.0 82.8* 9* 7* 80.7* 85.7 79.8* 83.7* 41.0* 73.4* 82.0* 61.8* 74.4 55.6* 52.5* 68.8 45.5 50.0 29.5 60.0 24.9 55.0 71.0 71.7 63.4* 92.9* 65.8 57.8 67.5 57.4 52.3 38.9
Claude 4.5 Opus (Extended Thinking) 30.8 43.2 92.8 92.9* 78.5* 87.0 89.3* 74.0 67.2* 77.1* 80.2* 3* 9* 86.5* 87.7* 76.5* 78.7* 38.7* 63.0* 73.2* 56.3* 67.6 44.2* 41.8* 74.0 48.0 44.0 35.3 57.0 22.1 60.0 75.0 74.1 50.7* 86.9* 37.0 59.2 60.9 46.7 49.8 35.9
Gemini 3 Pro (High Thinking Level) 37.5 45.8 95.0 97.3* 83.1* 91.9 90.1 81.0 81.4 86.1* 89.8* 8* 12* 90.3* 88.5 82.4* 83.4* 41.0* 76.5* 82.7* 64.1* 77.2 55.2* 56.4* 74.5 54.0 58.0 39.5 69.0 34.2 69.0 77.0 79.8 65.2* 89.4* 37.8 67.6 68.0 61.0 55.6 47.6
DeepSeek V3.2 (Thinking) 25.1† 40.8† 93.1 92.5 78.3 82.4 85.0 57.6 37.0 39.8 28.2 48.8 20.4 56.2 69.3 73.3 52.9 86.3 51.4 61.3 40.6 48.5 34.4
Claude Opus 4.6 (Adaptive) 91.3 73.9 80.8 77.8 65.4 66.6 86.8 91.3
GPT-5.3-Codex (xhigh) 56.8 77.3
Qwen3-VL-235B-A22B (Thinking) 69.3 66.1 74.6 85.8 4* 3* 87.5 82.0* 73.8 77.5 39.4 64.5 69.0 52.9 59.9 46.7 39.3
Qwen3.5-Plus 28.7 48.3 94.8 80.9 87.8 79.0 80.8 88.6 90.3 12 93.1 90.8 67.1 84.7 75.4 83.7 75.5 76.4 69.3 52.5 83.6 63.2 68.7 69.0 78.6 74.0 46.9
Legend

Notes

  • * indicates a score re-evaluated by the source authors.
  • † indicates a score evaluated on the text-only subset.
  • A missing value (null) indicates that the score was not reported in the source table.
  • Data source: Kimi K2.5 blog post.
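
The markers described in the notes can be separated from the numeric scores when the table is processed programmatically. The following is a minimal sketch in Python, assuming each cell arrives as a string such as `86.7*` or `25.1†`, or as an empty or `null` value when no score was reported. The `Score` class and the `parse_cell` function are hypothetical names introduced here for illustration; they are not part of the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    """One benchmark score plus the legend markers attached to it."""
    value: float
    reevaluated: bool = False  # cell carried a trailing "*"
    text_only: bool = False    # cell carried a trailing "†"


def parse_cell(cell: Optional[str]) -> Optional[Score]:
    """Parse one table cell; return None for a missing (null) cell."""
    if cell is None:
        return None
    text = cell.strip()
    if text == "" or text.lower() == "null":
        return None
    # Split the numeric part from any trailing legend markers.
    number = text.rstrip("*†")
    markers = text[len(number):]
    return Score(
        value=float(number),
        reevaluated="*" in markers,
        text_only="†" in markers,
    )


if __name__ == "__main__":
    for raw in ["86.7*", "25.1†", "100", None]:
        print(repr(raw), "->", parse_cell(raw))
```

Keeping the markers as separate boolean fields, rather than stripping them silently, preserves the distinction between scores taken from the original reports and scores re-evaluated by the source authors.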