Provider (scores listed in the column order below):

  • Reasoning & Knowledge: HLE-Full, HLE-Full (w/ tools), AIME 2025, HMMT 2025 (Feb), IMO-AnswerBench, GPQA-Diamond, MMLU-Pro
  • Image & Video: MMMU-Pro, CharXiv (RQ), MathVision, MathVista (mini), ZeroBench, ZeroBench (w/ tools), OCRBench, OmniDocBench 1.5, InfoVQA (val), SimpleVQA, WorldVQA, VideoMMMU, MMVU, MotionBench, VideoMME, LongVideoBench, LVBench
  • Coding: SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, Terminal Bench 2.0, PaperBench, CyberGym, SciCode, OJBench (cpp), LiveCodeBench (v6)
  • Long Context: Longbench v2, AA-LCR
  • Agentic Search: BrowseComp, BrowseComp (w/ctx manage), BrowseComp (Agent Swarm), WideSearch (item-f1), WideSearch (item-f1 Agent Swarm), DeepSearchQA, FinSearchCompT2&T3, Seal-0
Kimi K2.5 (Thinking) 30.1 50.2 96.1 95.4 81.8 87.6 87.1 78.5 77.5 84.2 90.1 9 11 92.3 88.8 83.5 83.9 47.0 70.0 77.5 61.8 72.0 54.1 54.5 71.6 49.5 56.0 40.4 61.4 32.6 68.5 80.0 77.1 60.0 93.5 60.6 74.9 78.4 68.4 73.0 58.8 62.0 46.4
GPT-5.2 (xhigh) 34.5 45.5 100 99.4 86.3 92.4 86.7* 79.5* 82.1 83.0 82.8* 9* 7* 80.7* 85.7 79.8* 83.7* 41.0* 73.4* 82.0* 61.8* 74.4 55.6* 52.5* 68.8 45.5 50.0 29.5 60.0 24.9 55.0 71.0 71.7 63.4* 92.9* 65.8 57.8 67.5 57.4 52.3 38.9
Claude 4.5 Opus (Extended Thinking) 30.8 43.2 92.8 92.9* 78.5* 87.0 89.3* 74.0 67.2* 77.1* 80.2* 3* 9* 86.5* 87.7* 76.5* 78.7* 38.7* 63.0* 73.2* 56.3* 67.6 44.2* 41.8* 74.0 48.0 44.0 35.3 57.0 22.1 60.0 75.0 74.1 50.7* 86.9* 37.0 59.2 60.9 46.7 49.8 35.9
Gemini 3 Pro (High Thinking Level) 37.5 45.8 95.0 97.3* 83.1* 91.9 90.1 81.0 81.4 86.1* 89.8* 8* 12* 90.3* 88.5 82.4* 83.4* 41.0* 76.5* 82.7* 64.1* 77.2 55.2* 56.4* 74.5 54.0 58.0 39.5 69.0 34.2 69.0 77.0 79.8 65.2* 89.4* 37.8 67.6 68.0 61.0 55.6 47.6
DeepSeek V3.2 (Thinking) 25.1† 40.8† 93.1 92.5 78.3 82.4 85.0 57.6 37.0 39.8 28.2 48.8 20.4 56.2 69.3 73.3 52.9 86.3 51.4 61.3 40.6 48.5 34.4
Claude Opus 4.6 (Adaptive) 91.3 73.9 80.8 77.8 65.4 66.6 86.8 91.3
GPT-5.3-Codex (xhigh) 56.8 77.3
Qwen3-VL-235B-A22B (Thinking) 69.3 66.1 74.6 85.8 4* 3* 87.5 82.0* 73.8 77.5 39.4 64.5 69.0 52.9 59.9 46.7 39.3
Qwen3.5-Plus 28.7 48.3 94.8 80.9 87.8 79.0 80.8 88.6 90.3 12 93.1 90.8 67.1 84.7 75.4 83.7 75.5 76.4 69.3 52.5 83.6 63.2 68.7 69.0 78.6 74.0 46.9
Notes

Legend

  • * indicates the score was re-evaluated by the original authors.
  • † indicates evaluation on the text-only subset.
  • null indicates the value was not provided in the original table.
  • Data source: the original Kimi K2.5 blog post
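Since each score row above is a flat list of values in the header's column order, regrouping a row by category is a matter of slicing by the category sizes in the header (7 Reasoning & Knowledge, 17 Image & Video, 9 Coding, 2 Long Context, 8 Agentic Search). A minimal Python sketch below does this for the Kimi K2.5 (Thinking) row, the only row with all 43 values present, and computes a naive unweighted per-category mean. The mean is for illustration only: it mixes benchmarks with very different scales (e.g., the ZeroBench scores of 9 and 11), so it is not a meaningful aggregate ranking.

```python
# Sketch: regroup the flattened Kimi K2.5 (Thinking) row by benchmark category.
# Values and category sizes are copied from the table above.

KIMI_K25 = [
    30.1, 50.2, 96.1, 95.4, 81.8, 87.6, 87.1,              # Reasoning & Knowledge (7)
    78.5, 77.5, 84.2, 90.1, 9, 11, 92.3, 88.8, 83.5, 83.9,
    47.0, 70.0, 77.5, 61.8, 72.0, 54.1, 54.5,              # Image & Video (17)
    71.6, 49.5, 56.0, 40.4, 61.4, 32.6, 68.5, 80.0, 77.1,  # Coding (9)
    60.0, 93.5,                                            # Long Context (2)
    60.6, 74.9, 78.4, 68.4, 73.0, 58.8, 62.0, 46.4,        # Agentic Search (8)
]

CATEGORY_SIZES = [
    ("Reasoning & Knowledge", 7),
    ("Image & Video", 17),
    ("Coding", 9),
    ("Long Context", 2),
    ("Agentic Search", 8),
]

def category_means(scores):
    """Slice the flat row by category size and average each slice (unweighted)."""
    means, start = {}, 0
    for name, size in CATEGORY_SIZES:
        chunk = scores[start:start + size]
        means[name] = round(sum(chunk) / size, 1)
        start += size
    assert start == len(scores)  # all 43 columns consumed exactly once
    return means

print(category_means(KIMI_K25))
```

The same slicing works for any row once its null columns are known; the sparse rows above would first need their missing cells restored from the original table.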