Arena.ai
Grok Build 0.1 ranks #15 and Grok 4.3 (High) #17 in the new Agent Arena leaderboard. Grok Build 0.1 improves meaningfully on bash capability over Grok 4.3. It is slightly less steerable and more prone to tool hallucinations, but looks to be successfully completing tasks more… https://twitter.com/arena/status/2064084859103166518/photo/1
Arena.ai
RT @ml_angelopoulos: In case you didn’t notice: Agent Arena doesn’t have a voting mechanism. So how do we calculate the scores? The answer…
Arena.ai
ICYMI: Agentic AI is now measured in the Arena. Agent Mode can handle deep research around competitive intelligence, market sizing & opportunity analysis, scientific & medical research and more. Every session shapes the Agent Arena leaderboard. Get a walkthrough of the causal… https://twitter.com/arena/status/2064021507681276234/video/1
Arena.ai
Have you tried out Agent Mode yet? Use frontier AI agents to do your real work. Your sessions feed the data that ranks them on the Agent Arena leaderboard. See details in thread to learn more about Agent Mode and Agent Arena. 👇
Arena.ai
RT @ProductHunt: arena ranked every major ai model for years. now they built the thing that actually runs them. autonomous agents fo…
Arena.ai
In the Image Arena: open-weight Text-to-Image has a clear leader, with a tight race directly behind it: - #1 Ideogram-4.0 Quality has set the pace this week with a score of 1204. @ideogram_ai - #2 Hunyuan Image 3.0 by @TencentHunyuan with a score of 1151, just +1 pt ahead of… https://twitter.com/arena/status/2062997992777609534/photo/1
Arena.ai
Mistral 3.5 by @MistralAI has been added to Arena's new Agent Mode! Put models to work on your most complex real-world tasks, and see how they perform. Your sessions will help shape the Agent Arena leaderboard. https://twitter.com/arena/status/2062965189453259017/photo/1
Arena.ai
Three new models entered the Image Arena Top 10 this past month (Text-to-Image): - #2 Reve 2.0 by @Reve (1,273), behind only GPT Image 2. - #4 MAI-Image-2.5 by @MicrosoftAI (1,253). - #9 Ideogram 4.0 Quality by @Ideogram_ai enters at #9 (1,204). And the only open-weights model in… https://twitter.com/arena/status/2062957421757452516/photo/1
Arena.ai
RT @arena: Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions o…
Arena.ai
Agentic AI is now evaluated in the Arena with Agent Mode and measured with Agent Arena. Founding Engineer Matt and Product Lead Ted show you Agent Mode in action: deep research, complex bash operations, whatever you throw at it. Every session contributes to the Agent Arena… https://twitter.com/arena/status/2062902033389322477/video/1
Arena.ai
Nemotron 3 Ultra has been added to the new Agent Mode! This latest model from @NVIDIA and other top frontier models are ready for your complex, multi-step tasks. Your sessions will help shape the new Agent Arena leaderboard. https://twitter.com/arena/status/2062678350704083305/photo/1
Arena.ai
As we launch Agent Mode on Arena today, we want to celebrate the community that brought us here. Battle Mode - where it all started - just passed 50 million votes. Thank you. https://twitter.com/arena/status/2062648091266936968/video/1
Arena.ai
Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex… https://twitter.com/arena/status/2062566749418233981/photo/1
Arena.ai
Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image… https://twitter.com/arena/status/2062565126600114484/video/1
Arena.ai
RT @elonmusk: Grok Imagine 1.5 at rank 1
Arena.ai
RT @hangg70: we made a new model for text-to-image generation and editing. the results are looking good and the leaderboard is looking stro…
Arena.ai
MiniMax M3 has landed in the Arena and has moved the Pareto frontier! Their latest model ranks #7 for Code Arena: Frontend. Scoring 1531, it is neck and neck with GLM-5.1. It moves the Pareto frontier in its price class at $0.60 input/$2.40 output per Mtoken. Congrats to the… https://twitter.com/arena/status/2062367057367519439/photo/1
Arena.ai
RT @MicrosoftAI: MAI-Image-2.5 is here — now #3 on text-to-image and #2 on image-to-image Arena leaderboards, surpassing Nano Banana Pro. L…
Arena.ai
RT @Taesung: Diffusion models are known to be very compute intensive, even more so than LLM training. Now that we reduce images into layout…
Arena.ai
RT @adammenges: I joined @reve because I believed in the vision, and now I'm so proud to share that we have the #2 image model in the world…
Arena.ai
RT @MicrosoftAI: MAI-Image-2.5 is strong where image models usually break: identity, text, lighting, backgrounds, and clean edits. Here’s w…
Arena.ai
RT @altryne: What's going on today!? So many new AI releases! @reve has been the underdog for the longest time, one of the coolest AI imag…
Arena.ai
RT @reve: Our independent research lab ranks top 2 on @arena Text-to-Image, ahead of Nano Banana 2 and GPT-Image-1.5. https://t.co/ETundult…
Arena.ai
RT @TianweiY: 1/ Our new @reve image model is now #2 on the @arena text-to-image leaderboard — behind only GPT Image 2, ahead of Nano Banan…
Arena.ai
Reve 2.0 has landed #2 in the Text-to-Image Arena! Scoring 1280, this puts the latest model above Nano Banana 2, MAI-Image-2.5, and GPT-Image-1.5-High Fidelity. This is a +125pt improvement over Reve v1.5. Congratulations to the @reve team on this major milestone! https://twitter.com/arena/status/2062261716831142025/photo/1
Arena.ai
New open model Ideogram-4.0-Quality has landed at #8 in the Text-to-Image Arena. This makes the new model by @ideogram_ai the #1 open model in that arena! Scoring 1204, this open model approaches the performance of Nano Banana Pro. Congrats to the @ideogram_ai team on this… https://twitter.com/arena/status/2062203346996605116/photo/1
Arena.ai
RT @chihyaoma: Our image editing model is out today! Better than latest Nano banana from Gemini, landing us top 2 worldwide! Another mile…
Arena.ai
MAI-Image-2.5 ranks #2 in the Image Edit Arena and advances the Pareto frontier. That means: at its price tier, no model scores higher on Arena. Congrats again to @MicrosoftAI on this release! https://twitter.com/arena/status/2061894541888962712/photo/1
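The "no model scores higher at its price tier" idea can be sketched in a few lines. This is only an illustration of a generic price-performance Pareto frontier, not Arena's actual method: the `Model` type, the `pareto_frontier` helper, and every name, price, and score in `MODELS` are hypothetical placeholders.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # hypothetical price, in arbitrary units
    score: float  # hypothetical leaderboard score


def pareto_frontier(models):
    """Keep a model only if no cheaper-or-equally-priced model scores at least as high."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on price ties, look at the higher score first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Made-up data for illustration only.
MODELS = [
    Model("model-a", 0.5, 1300),
    Model("model-b", 1.0, 1250),  # dominated by model-a (cheaper and better)
    Model("model-c", 2.0, 1401),
    Model("model-d", 4.0, 1390),  # dominated by model-c
    Model("model-e", 8.0, 1420),
]

frontier_names = [m.name for m in pareto_frontier(MODELS)]
print(frontier_names)
```

Under this definition, a new release "advances" the frontier when adding it to the list changes the returned set.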
Arena.ai
MAI-Image-2.5 from @MicrosoftAI has officially been released, landing at #2 in the Image Edit Arena (Single-Image-Edit) with a score of 1401 and advancing the Pareto frontier! This puts the model +10 pts over Nano Banana 2, Grok Imagine Image Quality and ChatGPT-Image-Latest-High… https://twitter.com/arena/status/2061887242579382660/photo/1
Arena.ai
New open model: MiniMax M3 by @MiniMax_AI is live in the Arena! Find it across Text, Vision, Document and Code Arena: Frontend. Bring your toughest prompts and vote. Scores incoming soon! https://twitter.com/arena/status/2061270380770447782/photo/1
Arena.ai
RT @jefffhj: Grok Imagine 1.5 Preview is live. Try it in the API (https://t.co/zrZiCGPZNS) and give us your feedback ;)
Arena.ai
RT @jia_xuhui: Grok Imagine is getting better and better. Our only goal is to make it genuinely useful. If we do that well, strong rankings…
Arena.ai
RT @keyang_xx: We’re back to No. 1 on the leaderboard! 🎉 After months of hard work, what a journey! It’s also the best option for fast and…
Arena.ai
Grok-Imagine-Video-1.5-Preview (720p) has landed #1 in the Image-to-Video Arena! This is a massive +52 pt improvement over Grok-Imagine-Video (720p), surpassing the best video models Seedance-2.0 and HappyHorse. Congrats to @xAI and @elonmusk on this big achievement! https://twitter.com/arena/status/2060874057130934376/photo/1
Arena.ai
Wan2.7-t2v-2026-04-25 is #3 on the Text-to-Video Arena! Congrats to the @Alibaba_Wan team on this achievement. https://twitter.com/arena/status/2060532939914682751/photo/1
Arena.ai
Arena's AI Capability Lead @petergostev runs @AnthropicAI's latest Claude Opus 4.8 through 200+ Code Arena: Frontend tests. Both thinking and non-thinking, head-to-head with past Opus variants, Gemini 3.1 Pro, 3.5 Flash, and GLM 5.1. Compare outputs across 3D scenes, game…
Arena.ai
Claude Opus 4.8 by @AnthropicAI is live in Battle Mode. Time to see how it holds up to live, real-world tasks. Bring your toughest prompts, and vote. Scores incoming. https://twitter.com/arena/status/2060050997196795997/photo/1
Arena.ai
Where is AI-assisted web development heading? We've added new categories to Code Arena: Frontend, covering 7 domains across agentic web development. Learn about the ML methodology behind it, what the shifting data tells us about how people are actually using AI to build for… https://twitter.com/arena/status/2060016362484048316/video/1
Arena.ai
RT @AlibabaGroup: Proud to see Qwen3.7‑Max debut at #4 in Code Arena, marking a significant milestone for Qwen in agentic web development.…
Arena.ai
RT @MicrosoftAI: Meet MAI-Image-2.5 - ranked third on the @arena text-to-image leaderboard. It's another great advance in quality. And with…
Arena.ai
RT @mustafasuleyman: Meet MAI-Image-2.5 - ranked third on the @arena text-to-image leaderboard. It's another great advance in quality. And…
Arena.ai
Exciting news, MAI-Image-2.5 (Preview) from @MicrosoftAI debuts at #3 in the Text-to-Image Arena with a score of 1,254 — a +72 point improvement over MAI-Image-2. A top 5 arena previously held only by @GoogleDeepMind and @OpenAI has a new lab in the mix. Congrats to the… https://twitter.com/arena/status/2059346024632820146/photo/1
Arena.ai
Qwen3.7 Max (20250517) debuts at #4 in Code Arena: Frontend - the top-ranked Chinese lab on the board, surpassing GLM-5.1 and now on par with Claude Opus 4.6 on agentic web development tasks. Huge congrats to @Alibaba_Qwen on this achievement! https://twitter.com/arena/status/2059297720079393107/photo/1
Arena.ai
New provider HiDream-01-Image by @HiDream_AI ranks #27 overall in the Text-to-Image Arena, making it the #4 open source model. Congrats to the @HiDream_AI team on the release! https://twitter.com/arena/status/2057541510334488829/photo/1
Arena.ai
5 patterns in Text Arena's price–performance Pareto frontier since 2023: 1. GPT-4-level quality is now ~500x lower cost. - From a ~$50 blended price per million tokens in 2023 to ~$0.10 today. 2. The higher-price end is both better and lower-priced since 2023. - The… https://twitter.com/arena/status/2057486887938646370/video/1
Arena.ai
5 patterns in Text Arena's price–performance Pareto frontier since 2023: 1. GPT-4-level quality is now ~500x lower cost. - From a ~$50 blended price per million tokens in 2023 to ~$0.10 today. 2. The higher-price end is both better and lower-priced since 2023. - The leading… https://twitter.com/arena/status/2057485281650147822/video/1
Arena.ai
Asked Gemini 3.5 Flash to render the Petra Treasury. It built the entire stone canyon around it - something other frontier models didn't do. Gemini also added ambient sound, which wasn’t in the prompt either. Whether you want this agentic behavior depends on what you're trying… https://twitter.com/arena/status/2057209766112547265/video/1
Arena.ai
A closer look at Gemini 3.5 Flash by @GoogleDeepMind In the Code Arena: Frontend we see sweeping gains, and a Flash model now surpasses the previous Pro variant. - vs. 3 Flash, a +70 jump overall, large improvements in every subcategory - vs. 3.1 Pro, outperforms it in every… https://twitter.com/arena/status/2056803661859479812/photo/1
Arena.ai
Gemini 3.5 Flash has landed #9 for Text and Code Arena: Frontend. Code Arena: Frontend evaluates models on agentic frontend coding tasks from real users building apps and websites (HTML and React). Scoring 1507, this is a significant +70 point improvement over Gemini-3 Flash.… https://twitter.com/arena/status/2056793176720195693/photo/1
Arena.ai
RT @Alibaba_Qwen: 🚀🚀Qwen3.7 Preview lands on Arena ! Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5…
Arena.ai
RT @Alibaba_Qwen: 🚀🚀
Arena.ai
Qwen3.7 Preview by @Alibaba_Qwen lands on Arena for Text and Vision. In Text Arena, Qwen3.7 Max Preview ranks #13 overall. Alibaba is now the #6 lab in this arena. - #7 Math - #9 Expert - #9 Software & IT - #10 Coding In Vision Arena: Qwen3.7 Plus Preview ranks #16 overall,… https://twitter.com/arena/status/2056400044862111757/photo/1
Arena.ai
In the Vision Arena, Qwen3.7 Plus Preview makes @Alibaba_Qwen the #5 lab, ranking #16 overall. https://twitter.com/arena/status/2056400046640566548/photo/1
Arena.ai
Millions of votes a week. One tagging system. Arena researchers Guanglei Song and I-Hung Hsu walk through the data pipeline behind Arena's category leaderboards: Databricks → Spark → a pluggable tagger framework calling LLMs to categorize every evaluation across our text,… https://twitter.com/arena/status/2055316670898741526/video/1
Arena.ai
According to @tryramp, @AnthropicAI just overtook OpenAI in business customers (34.4% vs 32.3% this week). In the Text Arena, that flip happened in Q4 2025. Real-world signal led enterprise adoption by ~6 months. But the picture shifts fast: Codex crossed 3M+ weekly developers… https://twitter.com/arena/status/2054995034043470317/video/1
Arena.ai
US vs China update. Stanford's AI Index put the US–China gap at 2.7%. Here's what two years of real-world use from the Text Arena shows. Gap three years ago: +278. Today: +29. @AnthropicAI's Claude Opus 4.6 Thinking vs. Baidu's @ErnieforDevs Ernie 5.1 at the top. The US… https://twitter.com/arena/status/2054969739735335190/video/1
Arena.ai
US vs China update. Stanford's AI Index put the US–China gap at 2.7%. Here's what two years of real-world use from the Text Arena shows. Gap two years ago: +278. Today: +29. @AnthropicAI's Claude Opus 4.6 Thinking vs. Baidu's @ErnieforDevs Ernie 5.1 at the top. The US has… https://twitter.com/arena/status/2054969062435021110/video/1
Arena.ai
The top 5 labs in Text Arena rankings by category show that frontier models have distinct strengths and tradeoffs. #1 @AnthropicAI, Claude Opus 4.7 - The most consistently dominant model overall, leading top-tier across nearly every major category. #2 @GoogleDeepMind, Gemini… https://twitter.com/arena/status/2054223408427372831/photo/1
Arena.ai
GPT-5.5 Instant by @OpenAI is in ChatGPT and has landed on Arena, across multiple leaderboards. Here’s how it ranks by modality: - Vision Arena: #11 overall, on par with Claude-Sonnet-4.6 - Text Arena: #18 overall, Multi-Turn #5 - Occupational: #5 Life, Physical & Social… https://twitter.com/arena/status/2052876951329919383/photo/1
Arena.ai
Introducing 7 new leaderboard views for frontend output in Code Arena. Aggregate leaderboards don’t tell the full story. "Best frontend coding model" depends on what you're building, so we built leaderboards that show exactly that. After analyzing 250,000+ Code Arena prompts,… https://twitter.com/arena/status/2052827202027426278/photo/1
Arena.ai
Ernie-5.1 by Baidu’s @ErnieforDevs has landed as #4 in the Search Arena! This makes Baidu a top 3 lab in Search performance, and the only Chinese model in the top 10 overall. Congrats to the @ErnieforDevs team on this accomplishment! https://twitter.com/arena/status/2052780949826666748/photo/1
Arena.ai
Gemma-4 lands in Vision Arena as #2 & #4 open models, and shifts the Pareto frontier! @GoogleDeepMind dominates the price-performance Pareto in Vision across both proprietary and open models. - Gemma-4-31b ranks #2 open (#20 overall) - Gemma-4-26b-a4b ranks #4 open (#26 overall)… https://twitter.com/arena/status/2052496756773093814/photo/1
Arena.ai
Code Arena's frontend leaderboard for models using visual inputs in agentic coding has turned over fast. Half the top 10 is new this month, with Claude setting the pace and older OpenAI and Gemini entries no longer in the top 10. - Claude by @AnthropicAI now takes all the top… https://twitter.com/arena/status/2052467871117418888/photo/1
Arena.ai
Have open source models closed the gap with proprietary ones? We've tracked three years of Arena data across three arenas. The short answer: mostly yes. In Text Arena, the proprietary winner had a +250 Arena lead. By early 2025, it had fallen to low double digits, and at its… https://twitter.com/arena/status/2052455463573426452/photo/1
Arena.ai
Gemma-4 lands in Code Arena: Frontend Webdev and shifts the Pareto Frontier! Among open models, Gemma-4-31b ranks #13 and Gemma-4-26b-a4b ranks #17. Congrats to @GoogleDeepMind on shifting the frontier! https://twitter.com/arena/status/2052061349312921686/photo/1
Arena.ai
GPT-5.5 Instant is now live on Text, Vision, and Document Arena. Put it to the test and vote. Scores coming soon. https://twitter.com/arena/status/2051763069882351949/photo/1
Arena.ai
Max, Arena's model router powered by 5M+ community votes, is now multimodal. Starting today, Max is the default in Direct chat across every modality: search, vision, image generation, image editing, and front-end coding with the same latency-controlled performance as the… https://twitter.com/arena/status/2051736696312754404/photo/1
Arena.ai
RT @baaadas: #3 on Image Edit. #3 on Text-to-Image. @arena The compute we did it with would surprise you. Proud of this team @LumaLabsAI…
Arena.ai
RT @gravicle: Uni-1.1 API is now live and creates a new pareto frontier - thinking images at the cost and efficiency of old school diffusio…
Arena.ai
RT @_sam_sinha_: So excited to share that Uni 1.1 is out on API and on LMArena!! A model competitive with ones from GDM+OAI for a fraction…
Arena.ai
RT @shenbokui: 🚀 UNI-1 debuts us as the best lab not named @OpenAI / @GeminiApp. Not bad for our first generation of unified image model!…
Arena.ai
Exciting news: UNI-1.1-Max and UNI-1.1 debut, making @LumaLabsAI the #3 lab in the Image Arena across both Text-to-Image and Image Edit! These are versions released without agentic search. Text-to-Image Arena - UNI-1.1-Max #6 overall (1193), +12 points over MAI-Image-2 - UNI-1.1… https://twitter.com/arena/status/2051688029522436295/photo/1
Arena.ai
Grok 4.3 by @xAI is now live in the Arena, landing across multiple leaderboards. At $1.25 / $2.50 per 1M tokens (input / output), Grok 4.3 is cheaper than Grok 4.20 (37.5% less for input, and 58.3% less for output). Here’s how it ranks by modality: - Code Arena, frontend… https://twitter.com/arena/status/2050333241388007818/photo/1
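The percentages in the Grok 4.3 post can be inverted to see what earlier prices they imply. The sketch below is plain arithmetic on the numbers quoted in the post; the helper name `previous_price` is hypothetical, and the resulting Grok 4.20 prices are derived here, not stated in the source.

```python
def previous_price(new_price: float, pct_less: float) -> float:
    """Price before a reduction, given the new price and the percentage saved."""
    return new_price / (1 - pct_less / 100)


# Figures quoted in the post: $1.25 input / $2.50 output per 1M tokens,
# described as 37.5% and 58.3% less than Grok 4.20.
implied_input = previous_price(1.25, 37.5)
implied_output = previous_price(2.50, 58.3)

print(f"implied Grok 4.20 input price:  ${implied_input:.2f} per 1M tokens")
print(f"implied Grok 4.20 output price: ${implied_output:.2f} per 1M tokens")
```

The result is consistent with a previous price of roughly $2.00 input and $6.00 output per 1M tokens, with the output figure subject to rounding in the quoted 58.3%.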
Arena.ai
Laguna XS.2 & M.1 by @poolsideai are ready in the Code Arena: Front-end. Come bring your toughest agentic webdev tasks and vote for the outputs that deliver best for your use case. Scores coming soon. https://twitter.com/arena/status/2050314664693977338/photo/1
Arena.ai
RT @XiaomiMiMo: Xiaomi MiMo-V2.5-Pro achieves multiple breakthroughs in the latest Arena rankings (Apr 26, 2026) 🔥 🏆 Text Arena (Expert) —…
Arena.ai
Grok 4.3 by @xAI is in Battle Mode in the Text, Vision, Document & Code Arena: Front-end. Come test it out with your toughest prompts. Scores coming soon! https://twitter.com/arena/status/2049992557527187794/photo/1
Arena.ai
RT @ml_angelopoulos: Excited to join Stanford CS 153 Office Hours tomorrow: Friday, May 1st at 12PM PT with @AnjneyMidha and @Mabb0tt. Li…
Arena.ai
Where do autoraters break down? Arena researchers Li Chen and I-Hung Hsu walk through how they'd build an autorater from scratch — different kinds of autoraters, training objectives, what dimensions actually matter to rate on — then get into what makes it hard in practice:… https://twitter.com/arena/status/2049938415672815756/video/1
Arena.ai
Hy3-Preview by @TencentHunyuan is top 7 among labs with open models in Text Arena. Their newest model ranks #80 overall. It is an MoE model with 295B total parameters, 21B active. Priced at $0.29 / $1.17 per 1M tokens. Congrats to the @TencentHunyuan team! https://twitter.com/arena/status/2049905966188249230/photo/1
Arena.ai
RT @Baidu_Inc: ERNIE 5.1 Preview just went live 🚀 With a lighter, more efficient architecture, it delivers strong performance at its scal…
Arena.ai
MiMo-V2.5 Pro by @XiaomiMiMo is the #11 model (#3 among open) in Code Arena: Frontend WebDev and has shifted the Pareto frontier with $1 input / $3 output per MToken. https://twitter.com/arena/status/2049582973926949116/photo/1
Arena.ai
Ernie-5.1 from @ErnieforDevs lands at #13 in Text Arena — now the #1 highest-ranked model from a Chinese lab. Strongest categories: - #9 Math - #1 Legal & Government - #4 Business, Management & Financial Ops - #7 Software & IT Services Congrats to the Baidu @ErnieforDevs team… https://twitter.com/arena/status/2049522953793274197/photo/1
Arena.ai
RT @petergostev: Note we've renamed Code Arena to Frontend Design: WebDev for these chats. I hope this is less confusing, but lmk if you ha…
Arena.ai
It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free. https://twitter.com/arena/status/2049270072934617090/photo/1
Arena.ai
MiMo-V2.5 by @XiaomiMiMo is the #11 model (#3 among open) in Code Arena for frontend design. A new MIT-licensed open source model with 1M context, it also ranks strongly as an open model in Text and Vision Arena. Code Arena: frontend webdev design - MiMo-V2.5-Pro: #3 open (#11… https://twitter.com/arena/status/2049172075143942282/photo/1
Arena.ai
RT @ml_angelopoulos: Why is GPT-5.5 lower than Claude? The answer is simple: Code Arena currently only supports frontend/web development t…
Arena.ai
GPT-5.5 xHigh is in Battle Mode in the Code Arena. Evaluate models on agentic coding tasks for front-end websites and apps. Scores coming soon! https://twitter.com/arena/status/2048846896744247468/photo/1
Arena.ai
To clarify, the Arena community evaluated GPT-5.5 with reasoning effort medium (default) and high. The best of GPT-5.5 with xHigh is still incoming! Stay tuned.
Arena.ai
In Expert Arena, GPT-5.5-High ranks #5 - trailing only Claude Opus 4.6 and 4.7. Expert Arena evaluates models on advanced expert-level prompts in the Text Arena, with a focus on real-world professional use cases. This demonstrates GPT-5.5’s strong performance on complex,… https://twitter.com/arena/status/2048808366810800259/photo/1
Arena.ai
GPT-5.5 by @OpenAI is now live in the Arena, landing across multiple leaderboards. Here’s how it ranks by modality: - Code Arena (agentic web dev): #9, a strong +50pt jump over GPT-5.4 - Document Arena (analysis & long-content reasoning): #6, on par with Sonnet 4.6 - Text… https://twitter.com/arena/status/2048794479646388732/photo/1
中文: @OpenAI 的 GPT-5.5 现已在体育馆内直播,横跨多个排行榜。 按模式排名如下: - Code Arena(代理网络开发):第9名,超过GPT-5.4,实现强劲+50分的跳转 - 文档竞技场(分析与内容推理;长内容推理):第6位,与Sonnet 4.6相当 - 文字......
Arena.ai
RT @Alibaba_Qwen: Qwen-Image-2.0-Pro is now live 🚀🚀 We’ve pushed image quality, multilingual text rendering, and instruction following to…
中文: RT @Alibaba_Qwen:Qwen-Image-2.0-Pro 现已上线 🚀🚀 我们已将图像质量、多语言文本渲染和指导内容推送至......
Arena.ai
RT @TencentHunyuan: Excited to see Hy3 preview live on @arena. Try it out and let us know what you think!
中文: RT @TencentHunyuan:很高兴在@arena上观看Hy3预览。试试看,告诉我们你的想法!
Arena.ai
Hy3 preview (295B A21B) an open source model by @TencentHunyuan is now live on Arena. Evaluate it across Text & Code Arena in Battle mode. Scores incoming soon. https://twitter.com/arena/status/2047871322647318901/photo/1
中文: Hy3 预览版(295B A21B) @TencentHun 元的开源模型现已在 Arena 上上线。 通过 Text & Code Arena 进行对战模式的评估。即将进球的分数。
Arena.ai
Let’s dive deeper into the difference between DeepSeek V4 Pro & V4 Flash by @DeepSeek_AI. - Both support 1M token context and V4 Flash Thinking shifts the price Pareto frontier. V4 Pro ranks ~30 places higher than the V4 Flash variants, but costs 12x more at launch pricing.… https://twitter.com/arena/status/2047774037204742255/photo/1
中文: 让我们更深入地探讨 @DeepSeek V4 Pro & V4 Flash 之间的区别。 - 两者都支持1M代币上下文,V4 Flash思维改变了帕雷托的价格前沿。 V4 Pro 的排名比 V4 Flash 机型高出约 30 个,但在发布时价格却高出 12 倍。
Arena.ai
GPT-5.5 by @OpenAI is now live on Arena! Evaluate it across: Text, Vision, Search, Document (document analysis and long-content reasoning), and Code Arena (agentic coding tasks like live websites and apps). Find it in Battle mode to start prompting and voting. Scores coming… https://twitter.com/arena/status/2047745981111013441/photo/1
中文: @OpenAI 的 GPT-5.5 现已在体育馆上线! 通过文本、视觉、搜索、文档(文档分析和长内容推理)以及代码领域(如实时网站和应用程序等代理编码任务)来评估。 在战斗模式中找到它,开始提示和投票。即将得分......
Arena.ai
Competition between Chinese labs is intensifying. Top 3 Open Models in Text Arena are all now sitting just below the top proprietary tier, each leading different real-world categories. All category rankings are based on overall, which includes proprietary models. #1 open… https://twitter.com/arena/status/2047714237502677405/photo/1
中文: 中国实验室之间的竞争正在加剧。Text Arena 中排名前三的开放模型目前均位于顶尖专有层级之下,每个级别均处于不同现实世界类别的领先地位。 所有类别排名均基于整体排名,包括专有模型。 排名第一的开放...
Arena.ai
RT @arena: Watch first impressions of DeepSeek V4 on Arena’s YouTube: https://youtu.be/AC2jj_jfunQ
中文: RT @arena:在 Arena 的 YouTube 上观看 DeepSeek V4 的第一印象:
Arena.ai
RT @arena: DeepSeek v4 lands in the Text Arena: - DeepSeek V4 Pro (thinking): #2 open model (#14 overall), matching Kimi-2.6 - DeepSeek V4…
中文: RT @arena:DeepSeek v4 登陆文本竞技场: - DeepSeek V4 Pro(思维模式):第2个开放型号(整体排名第14),与Kimi-2.6相匹配 - 深度寻找V4...
Arena.ai
RT @HaoningTimothy: As long as K2.5/K2.6 is multimodal, we are also making it to use (I am really amazed by how it excels at long multi-ima…
中文: RT @HaoningTimothy:只要K2.5/K2.6是多式联运的,我们也在使用它(我对它在长多岛式中的表现感到非常惊讶......
Arena.ai
RT @Kimi_Moonshot: We're the #1 open model in Vision and Document Arena!
中文: RT @Kimi_Moonshot:我们是Vision和Document Arena中排名第一的开放模特!
Arena.ai
Kimi K2.6 is the new SOTA open model in Vision and Document Arena, with solid gains since Kimi K2.5: - #1 open on Vision Arena (#15 overall), +14 over #2 Kimi K2.5 (Thinking) - #1 open on Document Arena (#8 overall), +9 over K2.5 and on par with proprietary models like Muse Spark… https://twitter.com/arena/status/2047539004816732257/photo/1
中文: Kimi K2.6 是 Vision 和 Document Arena 的新型 SOTA 开放模式,自 Kimi K2.5 以来取得了稳步增长: - 在Vision Arena(整体排名第15)上排名第一,在排名第2的Kimi K2.5上以14比14开赛(思考) - 在Document Arena(整体排名第8)上排名第一,在K2.5上排名第9,与Muse Spark等专有型号相当......
Arena.ai
DeepSeek V4 Flash Thinking at 284B parameters (13B activated) shifts the Text Pareto frontier with $0.14 input / $0.28 output per MToken. Congrats again to @DeepSeek_AI on the open model progress! https://twitter.com/arena/status/2047524055679729885/photo/1
中文: 深度寻找V4 Flash思维,采用284B参数(13B激活),以每MToken 0.14美元输入/0.28美元的输出来改变文本帕雷托边界。 再次向 @DeepSeek_AI 祝贺,关注开放模式的进展!
Arena.ai
DeepSeek v4 lands in the Text Arena: - DeepSeek V4 Pro (thinking): #2 open model (#14 overall), matching Kimi-2.6 - DeepSeek V4 Flash (thinking): #10 open model (#47 overall) Top 10 Text categories: - #1 Medicine & Healthcare (v4 Pro) - #8 Legal & Government (v4 Pro) - #8 Math… https://twitter.com/arena/status/2047518357726056502/photo/1
中文: DeepSeek v4 登陆文本竞技场: - DeepSeek V4 Pro(思维模式):第2个开放型号(整体排名第14),与Kimi-2.6相匹配 - DeepSeek V4 闪光灯(思考):第10个开放模型(总体排名第47位) 十大文本类别: - 第一名 Medicine & 医疗保健(v4 Pro) - 第8名法律与行动;政府(第4条专业) - #8 数学......
Arena.ai
Watch first impressions of DeepSeek V4 on Arena’s YouTube: https://youtu.be/AC2jj_jfunQ
中文: 在Arena的YouTube上观看DeepSeek V4的第一印象:
Arena.ai
Exciting news - DeepSeek V4 Pro is in the Arena with 1.6T parameters (49B activated) alongside V4 Flash at 284B parameters (13B activated). Both support 1M token context. It’s a major leap over DeepSeek V3.2! Code Arena: - DeepSeek V4 Pro (thinking): #3 open model (#14 overall),… https://twitter.com/arena/status/2047518354903359697/photo/1
中文: 令人振奋的消息——DeepSeek V4 Pro 采用 1.6T 参数(激活 49B),同时采用 284B 参数(13B)的 V4 Flash。两者都支持1M令牌上下文。这是对DeepSeek V3.2的一次重大飞跃! 代码竞技场: - DeepSeek V4 Pro(思考):第3个开放模式(总体排名第14)......
Arena.ai
Qwen Image 2.0 Pro 2026-04-22 lands at #9 in Text-to-Image Arena. Highlights of the latest image model from @Alibaba_Qwen: - #9 Text-to-Image - #17 Image Edit (Single Image) Top 10 in Text-to-Image categories: - #6 Portraits - #7 Photorealistic & Cinematic Imagery - #7 Art… https://twitter.com/arena/status/2047341506441380054/photo/1
中文: Qwen Image 2.0 Pro 2026-04-22 将于 #9 发布,位于 Text-to-Image Arena。 最新图片模型的亮点来自 @Alibaba_Qwen: - #9 文本到图像 - #17 图像编辑(文字图片) 文本到图像类别中的前十名: - #6 肖像 - 第7名 摄影与摄影;电影摄影 - #7 艺术......
Arena.ai
RT @BytePlusGlobal: Dreamina Seedance 2.0 just ranked #1 across all three Video Arenas by @arena, leading the way in the latest generative…
中文: RT @BytePlusGlobal:Dreamina Seedance 2.0 在 @arena 的三大视频竞技场中排名第一,在最新的生成式中处于领先地位。
Arena.ai
RT @BytePlusGlobal: Proud to see Dreamina Seedance 2.0 recognized on the @arena leaderboard 🏆 Securing the #1 spot in Text-to-Video, Ima…
中文: RT @BytePlusGlobal:看到Dreamina Seedance 2.0在@arena排行榜上获得认可,我们感到自豪🏆 在 Text-to-Video 中获得第一位的位置,Ima...
Arena.ai
RT @jack_w_rae: Nice debut from Muse Spark in the Agentic Coding arena. Ranking ahead of GPT 5.4 and Gemini models, and behind the Opus se…
中文: RT @jack_w_rae:来自Muse Spark在Agentic Coding体育馆的精彩首秀。 排在GPT 5.4和双子座车型的前列,落后于Opus se...
Arena.ai
Kimi-K2.6 is now live in the Arena, and it’s a big improvement over Kimi-K2.5: - #2 open model in Code Arena (#6 overall), on par with Claude Sonnet 4.6 - #1 open model in Vision Arena (#15 overall) - #1 open model in Document Arena (#8 overall) - #2 open model in Text Arena (#24… https://twitter.com/arena/status/2047073519851438345/photo/1
中文: 基米-K2.6 现在位于体育馆,与 Kimi-K2.5 相比有了很大改进: - Code Arena 中排名第 2 的开放模式(总体排名第 6 位),与 Claude Sonnet 4.6 相当 - 视觉竞技场中排名第一的开放模式(总体排名第15位) - 文件竞技场中排名第一的开放模式(总体排名第8位) - 文本竞技场中的第二款开放模式(第24页)
Arena.ai
Muse Spark debuts at #7 in the Code Arena - making @AIatMeta the #3 lab right behind @AnthropicAI’s Claude Sonnet 4.6 and @Zai_org’s GLM-5.1, surpassing Gemini-3.1-Pro and GPT-5.4. Code Arena evaluates agentic coding on real-world tasks - building live websites and apps, ranked… https://twitter.com/arena/status/2047018936110334316/photo/1
中文: 缪斯·斯帕克在《代码》中首次亮相第7名,使@AIatMeta成为仅次于@AnthropicAI的Claude Sonnet 4.6和@Zai_org的GLM-5.1的#3实验室,超过了Gemini-3.1-Pro和GPT-5.4。 Code Arena 评估实际任务中的专业编程——建立实时网站和应用程序,排名排名......
Arena.ai
MiMo-V2.5 by @XiaomiMiMo is now live on Arena. Evaluate it across Text, Vision & Code Arena - Pro versions available specifically in Text & Code. Start prompting and voting in Battle mode. Scores incoming. https://twitter.com/arena/status/2047013664142893286/photo/1
中文: @XiaomiMiMo 的 MiMo-V2.5 现已在体育馆上线。 通过 Text、Vision & Code Arena - Pro 版本进行评估,具体版本提供 Text & Code。 在战斗模式下开始提示和投票。进球入场。
Arena.ai
GPT-Image-2 had a 93% win rate in Image Arena. Arena rankings come from blind, pairwise battles where voters pick between two anonymized image outputs for the same prompt. GPT-Image-2 from @OpenAI was preferred 93% of the time, resulting in a record-breaking +242 point leap… https://twitter.com/arena/status/2046996779418423431/photo/1
中文: GPT-Image-2在Image Arena的胜率达到93%。 竞技场排名源于盲目的配对对抗,选民在两个匿名图像输出之间选择相同的提示。 93% 的 @OpenAI 使用 GPT-Image-2 被青睐,导致 +242 分的破纪录......
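Note on the numbers above: a minimal sketch of how a rating gap relates to a head-to-head win rate, assuming an Elo-style logistic curve with base 10 and a 400-point scale. That scale is an assumption, not something the post states, and the function names are hypothetical. The sketch does not reproduce Arena's actual fitting procedure and ignores ties.

```python
import math


def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Probability that the higher-rated model wins one pairwise battle.

    Assumes an Elo-style logistic curve (base 10, 400-point scale by default).
    Both the curve and the scale are assumptions made for illustration.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def gap_for_win_rate(win_rate: float, scale: float = 400.0) -> float:
    """Inverse of expected_win_rate: the rating gap implied by a win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return scale * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # Gap between #1 and #2 quoted in the post.
    print(f"win rate at +242: {expected_win_rate(242):.3f}")
    # Gap that a 93% head-to-head win rate would imply under the same curve.
    print(f"gap at 93%: {gap_for_win_rate(0.93):.0f}")
```

Under these assumptions a 242-point gap corresponds to roughly an 80% head-to-head win rate against the #2 model, while a 93% win rate against a single opponent would imply a gap of roughly 449 points. One reading consistent with both figures is that the 93% is measured across battles against many opponents, most of them rated well below #2.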
Arena.ai
RT @benedictk__: frog eats banana life at sota is good https://twitter.com/benedictk__/status/2046708457819041842/photo/1
中文: RT @benedictk_:青蛙吃香蕉 sota 的生活很好
Arena.ai
RT @nickaturley: good!
中文: RT @nickaturley:很好!
Arena.ai
RT @petergostev: GPT Image 2 + Codex: or how to make Codex not suck at UI. Step 1: Generate a UI image (native in Codex) Step 2: Get Codex…
中文: RT @petergostev:GPT Image 2 + Codex:或如何使Codex在用户界面上不看。 步骤1:生成一个UI镜像(原在Codex中) 步骤2:获取法典......
Arena.ai
RT @npew: Good model!
中文: RT @npew:优秀的模特!
Arena.ai
RT @adele__li: SOTA models are rare. Clear wins are even rarer. ImageGen 2.0 doesn’t just edge ahead, it’s in its own league
中文: RT @adele__li:SOTA 模型很少见。明确的胜利更加罕见。ImageGen 2.0 不仅领先,而且处于其自身的联盟中
Arena.ai
RT @BoyuanChen0: This is what I’ve been cooking in the past 4 months . GPT Image 2 is over a massive 240 elo jump over the second place mod…
中文: RT @BoyuanChen0:这是我过去四个月一直在烹饪的。GPT Image 2 的超大240 elo 跳转了第二名模组......
Arena.ai
RT @gabeeegoooh: wow
Arena.ai
Arena Trends: Text-to-Image, Jan 2026 – Apr 2026 For most of the year, @GoogleDeepMind and @OpenAI traded the top spot within a tight margin - GPT-Image vs. Nano Banana - with the rest of the field clustered below 1,200. Today, GPT-Image-2 breaks away with a score of 1,512, 242… https://twitter.com/arena/status/2046690103515648061/video/1
中文: 竞技场趋势:文字对图像,2026年1月 – 2026年4月 今年大部分时间里,@GoogleDeepMind 和 @OpenAI 以微弱优势位居榜首位置——GPT-Image 与 vs。纳米香蕉——其余部分聚类低于1200。 今天,GPT-Image-2 以 1,512 分、242 分离场......
Arena.ai
Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards! A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date. - #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search… https://twitter.com/arena/status/2046670703311884548/photo/1
中文: 令人振奋的消息——@OpenAI 的 GPT-Image-2 已在所有 Image Arena 排行榜上排名第一! 一次零零失球的横扫,在“从文字到图像”中以创纪录的242分领先优势,这是迄今为止我们见过的最大差距。 - #1 文本对图像(1512),+242 以上,第2位(纳诺-巴纳纳-2,带网页搜索)...
Arena.ai
Dreamina Seedance-2.0 is now #1 on Video Edit Arena with a score of 1,362. It is now #1 across all three Video Arenas: Text-to-Video, Image-to-Video, and Video Edit! In Video Edit: - +60 points ahead of #2 Happyhorse 1.0 - +103 points ahead of #3 Grok Imagine Video Video… https://twitter.com/arena/status/2046653987580264563/photo/1
中文: Dreamina Seedance-2.0 现已在视频编辑体育馆排名第一,评分为 1362 分。现在在三个视频竞技场中排名第一:从文字到视频,从图像到视频,视频编辑! 视频编辑: - 比#2 Happyhorse 1.0领先60分 - 比#3 Grok Imagine Video 领先 +103 分 视频......
Arena.ai
More on Claude Opus 4.7: the Thinking variant from @AnthropicAI takes #1 in Code Arena! This is +27 points over Opus-4.6 Thinking and +40 over the next non-Anthropic model, GLM-5.1 (#5). - Thinking also takes #1 on the React leaderboard, surpassing Non-Thinking. The Code Arena… https://twitter.com/arena/status/2046631013011599540/photo/1
中文: 更多关于克劳德·奥普斯 4.7 的内容:来自 @AntropicAI 的“Thinking”版本在 Code Arena 中排名第一! 比 Opus-4.6 Thinking 和 +40 的 GLM-5.1(第 5 名)高出 +27 分。 - Thinking 在 React 排行榜上也排名第一,超过了《非思考》。 代码竞技场......
Arena.ai
RT @Alibaba_Qwen: Keep improving 🚀🚀 @arena
中文: RT @Alibaba_Qwen:不断提升 🚀🚀 @arena
Arena.ai
RT @Kimi_Moonshot: 🤗
Arena.ai
Kimi K2.6 by @Kimi_Moonshot is now live on Arena. Evaluate it in Battle Mode across Text, Vision, Code, Image-to-WebDev, and Document Arena! Scores incoming - start prompting and get voting. https://twitter.com/arena/status/2046324093704970604/photo/1
中文: @Kimi_Moonshot 的 Kimi K2.6 现已在体育馆现场直播。 在文本、视觉、代码、从图像到WebDev和文档领域的对战模式下进行评估! 评分入选——开始提示并获取投票。
Arena.ai
Qwen3.6 Plus lands at #7 in Code Arena with a score of 1476 - up +16 points since the Preview. The new score also moves @AlibabaGroup to #3 lab in Code Arena. In the Text Arena, Qwen3.6 Plus lands at #36, a +13 point improvement since Preview. Congrats to the Qwen team on the… https://twitter.com/arena/status/2046268995163258958/photo/1
中文: Qwen3.6 Plus在Code Arena排名第7位,得分为1476分,比预赛以来的得分上升了16分。 新评分还将@AlibabaGroup 移至 Code Arena 的 #3 实验室。 在文本竞技场中,Qwen3.6 Plus 的评分为第36位,自预赛以来提升了13分。 祝贺Qwen团队的......
Arena.ai
Claude Opus 4.7 from @AnthropicAI takes #1 in Vision & Document Arena! In Document Arena: Opus 4.7 lands +4 points over Opus-4.6 and +45 over the next non-Anthropic model, GPT-5.4 (#6). This is a huge ~70 pt lead over Muse Spark and Gemini-3.1-Pro. Real world research work like… https://twitter.com/arena/status/2046224760657658239/photo/1
中文: 来自@AntropicAI的克劳德·奥普斯 4.7 在 Vision & 文件体育馆排名第一! 在文件竞技场中: Opus 4.7 比 Opus 4.6 和 +45 比 下一个非 Antropic 模型 GPT-5.4(第 6 道)获得 +4 分。领先于Muse Spark和Gemini-3.1-Pro,非常大,领先约70分。 现实世界的研究工作内容如......
Arena.ai
Claude Opus 4.7 by @AnthropicAI advances the price-performance Pareto frontier in both Code and Text Arena! This makes Claude Opus 4.7 now the only model from a US lab that remains on the Pareto frontier for Code Arena. https://twitter.com/arena/status/2045206342173086156/photo/1
中文: @AntropicAI 的克劳德·奥普斯 4.7 在 Code 和 Text Arena 中均推进了价格表现的帕雷托前沿市场! 这使得克劳德·奥普斯4.7成为目前唯一一位留在帕雷托前沿的美国实验室用于Code Arena的模型。
Arena.ai
Let’s dig into how @AnthropicAI's Claude has progressed with Opus 4.7. Opus 4.7 (Thinking) outperforms Opus 4.6 (Thinking) on some key dimensions, including: - Overall (#1 vs #2) - Expert (#1 vs #3) - Creative Writing (#2 vs #3) However, there are several categories where Opus… https://twitter.com/arena/status/2045194638630560104/photo/1
中文: 让我们深入探讨一下@AntropicAI 的克劳德在 Opus 4.7 中的进展。 Opus 4.7(Thinking)在某些关键维度上优于Opus 4.6(Thinking),包括: - 总体(第1名 vs #2) - 专家(第1名 vs #3) - 创意写作(第2名 vs #3) 然而,Opus 有多个类别......
Arena.ai
Exciting news - Claude Opus 4.7 from @AnthropicAI takes #1 in Code Arena! +37 points over Opus-4.6 and +46 over the next non-Anthropic model, GLM-5.1 (#4). Massive ~130 pts lead over GPT-5.4 and Gemini-3.1-Pro. #1 on both React and HTML leaderboards. Code Arena evaluates… https://twitter.com/arena/status/2045177492936532029/photo/1
中文: 令人振奋的消息——来自@AnthropicAI的克劳德·奥普斯在《代码竞技场》中排名第一! 比Opus-4.6和+46领先下一个非人类模型GLM-5.1(第4位),得分+37。以约130分的优势领先于GPT-5.4和Gemini-3.1-Pro。 React 和 HTML 排行榜上排名第一。代码竞技场评估......
Arena.ai
HappyHorse-1.0 ranks #2 for both Text-to-Video and Image-to-Video Arena. - #2 Text-to-Video: Scores 1444, +69 points over #3 Veo-3.1 with audio. - #2 Image-to-Video: Scores 1444, +23 points over #3 Grok-Imagine-Video-720p This puts HappyHorse-1.0 in the top 2 for all 3 Video… https://twitter.com/arena/status/2044977389185482998/photo/1
中文: HappyHorse-1.0 在文本到视频和图像到视频竞技场中均排名第二。 - #2 文本对视频:音频评分为 1444,比第 3 号 Veo-3.1 得分 +69 分。 - #2 图像对视频:比#3 Grok-Imagine-Video-720p 得分 1444 分,+23 分 这使得HappyHorse-1.0在全部3个视频中排名前2位......
Arena.ai
Claude Opus 4.7 by @AnthropicAI is now live on Arena. Evaluate it in Battle Mode across Text, Vision, Code, Image-to-WebDev, and Document Arena - thinking and non-thinking versions both available. Scores incoming. Start prompting and voting. https://twitter.com/arena/status/2044794924072317362/photo/1
中文: @AntropicAI 的 Claude Opus 4.7 现已在体育馆上线。 在文本、视觉、代码、图像到网页导航和文档领域的对战模式下进行评估——这两个版本都可用,具有思维和无思维模式。 得分进入。开始提示和投票。
Arena.ai
A new leaderboard has arrived: Image to WebDev. It ranks models based on their ability to generate websites based on screenshots and images. Who’s in the top 10: - #1-3 @Anthropic takes the lead with Claude 4.6 (Sonnet and Opus) - #4-6 @GoogleDeepMind is right behind with Gemini… https://twitter.com/arena/status/2044480481790726161/photo/1
中文: 新的排行榜已经到来:图片来源:WebDev。 它根据模型基于屏幕截图和图像生成网站的能力进行排名。 谁排在前十位: - 排名第1胜3负 @Anthoropic 以 Claude 4.6 领先(Sonnet 和 Opus) - #4-6 @GoogleDeepMind 紧随其后,支持 Gemini...
Arena.ai
Document Arena update: four new models are reshaping the top ranks - including two open models! - #1 Claude Opus 4.6 Thinking is new, keeping @AnthropicAI in the top 3 - #8 Kimi-K2.5 Thinking by @Kimi_Moonshot now the best open model (Modified MIT) - #10 Gemma-4-31b by… https://twitter.com/arena/status/2044437193205395458/photo/1
中文: 文件竞技场更新:四款新车型正在重塑顶尖行列,其中包括两款开放模式! - 排名第一的克劳德·奥普斯 4.6 思考是新的,保持@AnthropicAI 排在前三 - #8 Kimi-K2.5 Thinking,由 @Kimi_Moonshot 现为最佳开放模式(修改后的 MIT) - 第10名:Gemma-4-31b,来源:
Arena.ai
RT @AlibabaGroup: Thrilled to see HappyHorse‑1.0 land #1 in the Video Edit Arena! 🎉@HappyHorseATH #AlibabaAI #HappyHorse
中文: RT @阿里巴巴集团:很高兴看到HappyHorse1.0在视频编辑体育馆排名第一!🎉@HappyHorseATH #阿里巴巴 #快乐马
Arena.ai
New video model HappyHorse-1.0 by Alibaba-ATH debuts at #1 in Video Edit Arena. It scores 1299, leading Grok Imagine Video by +42 points and Kling o3 Pro by +48 points. Video editing is an emerging frontier capability for video models, and only a small number of models support… https://twitter.com/arena/status/2044260620317667644/photo/1
中文: 全新视频模特HappyHorse-1.0由阿里巴巴-ATH推出,首播于视频编辑体育馆首播。 得分为1299分,领先Grok Image Video +42分,Kling o3 Pro获得+48分。 视频编辑是视频模型的新兴前沿功能,仅支持少量模型......
Arena.ai
RT @ml_angelopoulos: Cool to see this work from @Jsjcl293905 , @davidsimchilevi, and @WillWeiSun deriving efficiency bounds and new estimat…
中文: RT @ml_angelopoulos:很高兴看到 @Jsjcl293905、@davidsimchilevi 和 @WillWeiSun 获取效率限制和新的估计......
Arena.ai
New Eval mode: Battles in Direct We sample two random anonymous models during Direct chats - enabling pairwise comparison beyond turn 1. Why this matters: • Evaluates under longer context + multi-turn dependency • Captures failure modes: drift, consistency, recovery • Closer… https://twitter.com/arena/status/2044096836114493609/video/1
中文: 新的《Eval》模式:直接战斗 我们在直接聊天时对两个随机的匿名模型进行采样——使第一回合后的对比更新。 这为什么很重要: • 在更长的上下文和多转依赖下进行评估 • 捕捉故障模式:漂移、一致性、恢复 • 更近......
Arena.ai
RT @jietang: welcome to give it a try. hmmmm... indeed too many users and short of GPUs....
中文: RT @jietang:欢迎尝试一下。嗯......确实用户太多,缺乏 GPU 的用户。
Arena.ai
How do verifiable benchmarks and human preference work together to evaluate LLMs? The Arena team shares their perspective. https://twitter.com/arena/status/2042973873944301730/video/1
中文: 可验证的基准和人类偏好如何协同评估LLM? 竞技场团队分享他们的观点。
Arena.ai
Meta is back in the Arena! Muse Spark debuts as a top frontier model across both Text and Vision: - Text Arena: #3 tied with Gemini-3.1-Pro and Claude-Opus-4.6 - Vision Arena: #2 tied with Claude-Opus-4.6 This marks Meta’s first major release since early 2025. Highlights: - #4… https://twitter.com/arena/status/2042726806038680019/photo/1
中文: Meta 回到了竞技场! 缪斯·斯帕克作为顶级前沿模特在《文本》和《视觉》中首次亮相: - 文字竞技场:第3名与双子座-3.1-Pro和克劳德-奥普斯-4.6并列 - 视觉竞技场:与克劳德-奥普斯-4.6并列第二 这标志着Meta自2025年初以来首次发布重大版本。 亮点: - #4...
Arena.ai
With GLM-5.1, https://z.ai/ maintains the #1 open model rank in Code Arena and is now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5, GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier. https://twitter.com/arena/status/2042643933768151485/photo/1
中文: 在 GLM-5.1 中, 在 Code Arena 中保持第一的开放模式排名,目前排名在首位的 20 分以内,表现优于 Claude Sonnet 4.6、Opus 4.5、GPT-5.4 High 和 Gemini-3.1 Pro。 开放模型如今在前沿领域具有竞争力。
Arena.ai
With GLM-5.1 @Zai_org maintains the #1 open model rank in Code Arena at 1530. It’s now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5 (Thinking), GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier. For this… https://twitter.com/arena/status/2042634075274645577/photo/1
中文: 在 1530 年,在 Code Arena 中,GLM-5.1 @Zai_org 保持了排名第一的开放模式排名。 目前排名在榜首的积分范围内,表现在Claude Sonnet 4.6、Opus 4.5(Thinking)、GPT-5.4 High和Gemini-3.1 Pro之外。 开放模型如今在前沿领域具有竞争力。 为此......
Arena.ai
GLM-5.1 by @Zai_org is now #3 in Code Arena - surpassing Gemini 3.1 and GPT-5.4, and now on par with Claude Sonnet 4.6. The first frontier level open model to break into the top 3. It’s a major +90 point jump over GLM-5, and +100 over Kimi K2.5 Thinking. Huge congrats to… https://twitter.com/arena/status/2042611135434891592/photo/1
中文: @Zai_org 的 GLM-5.1 现已在 Code Arena 中排名第三,超过了 Gemini 3.1 和 GPT-5.4,与 Claude Sonnet 4.6 相位。 首个突破前列的前沿型开放模式。比GLM-5领先+90,比Kimi K2.5 Thinking高出100分。 向......表示热烈祝贺......
Arena.ai
How much better is Dreamina Seedance 2.0? This visualization shows it clearly as it sits in a tier of its own. Dreamina Seedance 2.0 by Bytedance delivers a major leap in Text-to-Video performance, outperforming the nearest competitor (Veo-3.1-1080p) by +79 points. It also… https://twitter.com/arena/status/2041969275196567780/photo/1
中文: Dreamina Seedance 2.0 有多好?这种可视化在它处于自身层面时清晰显示。 字节跳动的梦幻种子2.0在文本到视频性能方面实现了重大飞跃,超过了最接近的竞争对手(Veo-3.1-1080p)+79分。 也......
Arena.ai
Today at @HumanXCo: CEO and co-founder @ml_angelopoulos & Chairman and co-founder @istoica05 take the stage at The Loop Theater at 1:30pm PT with @CristinaCriddle of @FT to answer a question the whole industry is asking: Who's winning the race to reliability? - Crowdsourced… https://twitter.com/arena/status/2041910061224882513/photo/1
中文: 今天在@HumanXCo上:首席执行官兼联合创始人@ml_angelopos &董事长兼联合创始人@istoica05于下午1点30分在The Loop剧院与@FT的@CristinaCriddle共同登台,回答整个行业都在提问的问题: 谁赢得了可靠性竞赛? - 众包......
Arena.ai
ICYMI: we released the FULL history of Arena leaderboard data as a public dataset, nearly 3 years of rankings across 10 Arenas, dozens of categories, and 700+ models. Hear more about it from @cthorrez and @petergostev on our YouTube channel: https://www.youtube.com/watch?v=QbpW77m90kw
中文: ICYMI:我们发布了Arena排行榜数据的完整历史,作为公共数据集,在10个Arenas、数十个类别以及700多个模型中排名近3年。 请访问我们的YouTube频道,通过@cthorrez和@petergostev了解更多信息:
Arena.ai
Dreamina Seedance 2.0 has landed #1 across the Video Arena for both Text-to-Video and Image-to-Video. This is the score for the 720p variant. Text-to-Video: - #1 model scoring 1450, +79pts over #2 Veo 3.1 1080p - This is a +191pt jump since Seedance-v1.5-Pro Image-to-Video: -… https://twitter.com/arena/status/2041713742485045590/photo/1
中文: Dreamina Seedance 2.0 已在视频竞技场上排名第一,内容包括视频和视频图像。这是720p变异株的分数。 视频文本: - 排名第一的型号,评分为1450分,超过第2名Veo 3.1 1080p - 自 Seedance-v1.5-Pro 以来,这一增长 +191pt 视频图像: - . . .
Arena.ai
Let’s deep dive into GLM model improvement by @Zai_org over the past 3 generations: 5.1, 5 and 4.7. GLM-5 was an improvement over 4.7 in similar ways, but 5.1 appears as a more rounded model with a few trade-offs. GLM-5.1 is currently the #1 open model in the Text Arena.… https://twitter.com/arena/status/2041650737206456529/photo/1
中文: 让我们深入探讨过去三代中由@Zai_org改进的GLM模型:5.1、5和4.7。 GLM-5在类似方面比4.7有所改善,但5.1似乎是一个更全面的模型,并存在一些权衡。GLM-5.1 目前是 Text Arena 中排名第一的开放模式。
Arena.ai
Check out the GLM-5.1 first impressions with Peter on our YouTube https://www.youtube.com/watch?v=f11tVBXWr2g
中文: 通过我们的 观看与 Peter 的 GLM-5.1 第一印象
Arena.ai
Come check out GLM 5.1 in Code Arena for agentic web development tasks using tools. Don’t forget to vote, Code Arena scores are coming up next! https://arena.ai/code
中文: 快来查看 Code Arena 中的 GLM 5.1 使用工具进行代理网页开发任务。别忘了投票,接下来将进行代码竞技场比分!
Arena.ai
GLM-5.1 by @Zai_org just launched in the Text Arena, and is now the #1 open model. It outperforms the next best open model, its predecessor, GLM-5, by +11 points and +15 over Kimi K2.5 Thinking. It shows strength in: - #1 open model in Longer Query (#4 overall) - #1 open model… https://twitter.com/arena/status/2041641149677629783/photo/1
中文: @Zai_org 推出的 GLM-5.1 刚刚在 Text Arena 上发布,现已推出第一开放模式。 其前身GLM-5指数+11分,比Kimi K2.5 Thinking领先+15,表现优于前身。 它在以下内容中显示出力量: - 长查询中排名第一的开放模型(总体排名第4) - #1 开放模式......
Arena.ai
Tomorrow at @HumanXCo, our co-founder and CEO @ml_angelopoulos sits down with @andykonwinski for a press Q&A at 10am PT in the Media Lounge, then joins co-founder and Chairman @istoica05 for a presentation and conversation with @CristinaCriddle of @FT at 1:30pm PT in The Loop… https://twitter.com/arena/status/2041574831045669012/photo/1
中文: 明天上午10点,在@HumanXCo,我们的联合创始人兼首席执行官@ml_angelopos与@andykonwinski在媒体休息室的新闻发布会上 Q&A 会合,随后与联合创始人兼董事长 @istoica05 于下午1点30分在 The Loop 中与 @FT 的 @CristinaCriddle 进行演示和交谈。
Arena.ai
A new open model has entered the Arena! GLM-5.1 by @Zai_org is now ready for your prompts in the Text and Code Arena. Come vote and let's see how it stacks up! https://twitter.com/arena/status/2041554549488685370/photo/1
中文: 新的开放模式已进入竞技场! @Zai_org 的 GLM-5.1 现已在文本和代码竞技场中准备好迎接您的提示。来投票,看看它是怎么叠加的!
Arena.ai
New eval mode: Battles in Direct anonymously surfaces a second model mid-conversation during your direct chat for comparison. Longer context windows mean evaluation happens deeper in flow, leading to more decisive voting and bringing Arena closer to real-world use. We're…
中文: 新的椭圆形模式:在直接聊天中,直接对战会以匿名方式显示第二个模型的中置对话。 时间较长意味着评估的进行得更深,从而导致投票更加果断,并使Arena更接近现实世界的使用。 我们是......
Arena.ai
Code Arena can handle image inputs for agentic web dev tasks, reasoning through multi-step problems and using tools along the way. Watch how it works with @aryanvichare10 in this clip. Find a link to the full walkthrough on how Code Arena creates sites and apps from images in… https://twitter.com/arena/status/2040145898060304635/video/1
中文: Code Arena 可以处理用于代理网页开发任务的图像输入、多步骤问题的推理以及沿途使用工具。 观看此视频中与 @aryanvichare10 的配合。查找有关 Code Arena 如何通过图片创建网站和应用程序的完整链接......
Arena.ai
Gemma 4 31B shifts the Pareto frontier, scoring +30 Arena points above similarly priced models like DeepSeek 3.2. Its position on the Pareto frontier is based on early pricing indicators from third parties. https://twitter.com/arena/status/2040128319719670101/photo/1
中文: 杰玛4号31B改变了帕雷托的前沿,比DeepSeek 3.2等价格相似的球衣获得了+30个竞技场积分。其在帕雷托前沿的立场基于第三方的早期定价指标。
Arena.ai
RT @demishassabis: Gemma 4 outperforms models over 10x their size! (note the x-axis is log scale!) https://twitter.com/demishassabis/status/2040067244349063326/photo/1
中文: RT @demishassabis:Gemma 4 的尺寸优于其尺寸的 10 倍!注意,x 轴是日志刻度!
Arena.ai
Qwen 3.6 Plus is ready for your real-world use cases in the Text and Code Arena! In the Code Arena, you can compare models on real-world, agentic web development tasks—generating HTML or React apps that you can immediately share, or download. Get testing and don't forget to… https://twitter.com/arena/status/2040082258833645605/photo/1
中文: Qwen 3.6 Plus 已在文本和代码竞技场中为您的实际应用案例做好准备! 在代码领域,您可以比较真实世界中的模型、代理式网页开发任务——生成可立即共享或下载的 HTML 或 React 应用程序。 获取测试,别忘了......
Arena.ai
Let’s look at how the open model Gemma has progressed across its last three versions. - Gemma 4 ranks 100 places above Gemma 3 - Gemma 3 ranks 87 above Gemma 2 All three models from @GoogleDeepMind are roughly the same size (31B, 27B, 27B), and these gains came only 9 and 13… https://twitter.com/arena/status/2039848959301361716/photo/1
中文: 让我们来看看开放型型号Gemma在其最后三个版本中的进展。 - 杰玛4级排名超过吉马3位100位 - 杰玛3号排名高于杰玛2级87 @GoogleDeepMind 的三款机型大小大致相同(31B、27B、27B),且涨幅仅为9和13。
Arena.ai
We're releasing the full history of Arena leaderboard data as a public dataset, nearly 3 years of rankings across 10 Arenas, dozens of categories, and hundreds of models. Optimized to empower analysis and unlock new insights across modalities and over time. Check out some… https://twitter.com/arena/status/2039796686953087183/photo/1
中文: 我们将发布Arena排行榜数据的完整历史,作为公共数据集,在10个Arenas、数十个类别以及数百个模型中排名近3年。 优化以增强分析能力,并在各种模式及时间之间释放新的洞察。查看一些内容......
Arena.ai
Gemma 4 by @GoogleDeepMind debuts at 3rd and 6th on the open source leaderboard, making it the #1 ranked US open source model. By total parameter count, Gemma 4 31B is 24× smaller than GLM-5 and 34× smaller than Kimi-K2.5-Thinking, delivering comparable performance at a… https://twitter.com/arena/status/2039782449648214247/photo/1
中文: @GoogleDeepMind 的《Gemma 4》在开源排行榜上排名第三、第六,成为美国排名第一的开源机型。 按总参数统计,Gemma 4 31B 比 GLM-5 小 24 × 且比 Kimi-K2.5 Thinking 小 34 × , 可在 实现类似性能
Arena.ai
RT @Google: Gemma 4 is our most capable open model family yet: 🔵 Four versatile sizes 🔵 Up to 256K context window 🔵 Native function-callin…
中文: RT @Google:Gemma 4 是我们迄今为止最有能力的开放式模型系列: 🔵 四种多功能尺寸 🔵 最高可达 256K 上下文窗口 🔵 原生函数卡林......
Arena.ai
RT @GoogleAI: Today, we’re launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemi…
中文: RT @GoogleAI:今天,我们将推出迄今为止最智能的开放模型 Gemma 4。采用与Gemi相同的突破性技术制造......
Arena.ai
RT @OfficialLoganK: Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable o…
中文: RT @OfficialLoganK:推出Gemma 4系列开放式重量(Apache 2.0授权版)型号,可字节(Bute)最有功能...
Arena.ai
Gemma-4-31B is now live in Text Arena - ranking #3 among open models (#27 overall), matching much larger models at 10× smaller scale! A significant jump from Gemma-3-27B (+87 pts). Highlights: - #3 open (#27 overall), on par with the best open models Kimi-K2.5, Qwen-3.5-397b -… https://twitter.com/arena/status/2039739427715735645/photo/1
中文: Gemma-4-31B 现已在 Text Arena 上线,在开放模式中排名第 3 位(总体排名第 27 位),与规模小 10 倍的大号型号相媲美!从Gemma-3-27B(+87分)大幅跃升。 亮点: - 第3个开放型号(总体排名第27位),与最佳开式型号Kimi-K2.5、Qwen-3.5-397b相当 - . . .
Arena.ai
RT @Alibaba_Qwen: #8 in Coding in Code Arena overall,#2 Lab in Code Arena on the React leaderboard! !👏👏 This is a great testament to our l…
中文: RT @Alibaba_Qwen:代码竞技场中第8名,代码竞技场中的#2实验室在React排行榜上排名!!👏👏 这是我们的一个伟大证明......
Arena.ai
RT @mustafasuleyman: Three models. Three top-tier results. All shipped within just a few months by the @MicrosoftAI team. - MAI-Transcribe-…
中文: RT @mustafasuleyman:三位模特。三个顶级结果。所有货物均由 @MicrosoftAI 团队在短短几个月内发货。 - MAI-转录
Arena.ai
RT @Alibaba_Qwen: 🚀🚀Let's go!!
中文: RT @阿里巴巴_Qwen:🚀🚀 我们走吧!
Arena.ai
Qwen 3.6 Plus Preview is the #2 lab for the React leaderboard in Code Arena which ranks models based on agentic workflows involving multi-step reasoning, tool use, and multi-file apps. https://twitter.com/arena/status/2039723549976678779/photo/1
中文: Qwen 3.6 Plus Preview 是 Code Arena 中 React 排行榜的第二号实验室,该实验室基于基于多种步骤推理、工具使用和多文件应用的代理工作流程对模型进行排名。
Arena.ai
Qwen 3.6 Plus Preview is now #8 in Code Arena overall. This makes @Alibaba_Qwen the #2 lab in Code Arena on the React leaderboard, as it demonstrates strength in agentic coding tasks involving multi-step reasoning, tool use and multi-file apps. Congrats to the @Alibaba_Qwen… https://twitter.com/arena/status/2039723547569144187/photo/1
中文: Qwen 3.6 Plus Preview 现在在 Code Arena 中排名第 8 位。 这使得@Alibaba_Qwen成为 React 排行榜上 Code Arena 中的 #2 实验室,因为它展示了在涉及多步骤推理、工具使用和多文件应用程序的代理编程任务方面的优势。 恭喜 @Alibaba_Qwen......
Arena.ai
Qwen 3.6 Plus Preview is now #8 in Code Arena overall. This makes @Alibaba_Qwen the #2 lab in Code Arena on the React leaderboard, as it demonstrates strength in agentic coding tasks involving multi-step reasoning, tool use and multi-file apps. https://twitter.com/arena/status/2039722566689198349/photo/1
中文: Qwen 3.6 Plus Preview 现在在 Code Arena 中排名第 8 位。 这使得@Alibaba_Qwen成为 React 排行榜上 Code Arena 中的 #2 实验室,因为它展示了在涉及多步骤推理、工具使用和多文件应用的代理编程任务中的优势。
Arena.ai
GLM-5V-Turbo is now live in Vision Arena. Test its ability to reason over visual inputs using your real-world prompts. Don't forget to vote so we can see how it stacks up. https://twitter.com/arena/status/2039400189178556814/photo/1
中文: GLM-5V-Turbo 现居 Vision Arena。 使用实际提示测试其对视觉输入进行推理的能力。别忘了投票,以便我们了解它是如何叠加的。
Arena.ai
April is here. Here’s what changed at the frontier last month with Arena. New Arenas, leaderboard shifts, and product updates across Document, Video, Text, and Code - all grounded in real-world evaluation. Catch up on what changed and why it matters ↓
中文: 四月来了。上个月阿雷纳球场在边境发生了哪些变化。 新场馆、排行榜轮班以及文档、视频、文本和代码中的产品更新,全部基于现实世界的评估。 了解哪些变化以及为何重要 ↓
Arena.ai
We’ve added Pareto frontier charts to the leaderboard. Now available across: Text, Vision, Search, Document, and Code Arena. The Pareto frontier curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens… https://twitter.com/arena/status/2039377186432618885/photo/1
中文: 我们已经将帕雷托前沿排行榜列入了排行榜。 现已提供: 文本、视觉、搜索、文档和代码竞技场。 帕雷托前沿曲线展示了哪些模型在性能水平上效率最高(按竞技场评分),而每100万个代币的混合价格中表现最为高......
Arena.ai
We’ve added Pareto frontier charts to the leaderboard. Now available across: Text, Vision, Search, Document, and Code Arena. The Pareto curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens (3:1… https://twitter.com/arena/status/2039372539349323849/photo/1
中文: 我们已在排行榜中加入帕累托前沿图表。 现已覆盖:文本、视觉、搜索、文档和代码竞技场。 帕累托曲线展示了哪些模型在其性能水平(按 Arena 分数)下,相对于每 100 万 token 的混合价格最为高效(3:1......
Arena.ai
How did the Top 10 in Text Arena change in the last month? Let's take a look. Claude Opus 4.6 models by @AnthropicAI remained on top - with new entrants Gemini-3.1 Pro by @GoogleDeepMind, GPT-5.4 High by @OpenAI and Grok-4.20 (Reasoning) by @xAI landing on 3rd, 6th and 7th… https://twitter.com/arena/status/2039081680515011022/photo/1
中文: 上个月 Text Arena 前十名发生了哪些变化?让我们来看看。 @AnthropicAI 的 Claude Opus 4.6 系列模型仍居榜首,新上榜的 @GoogleDeepMind 的 Gemini-3.1 Pro、@OpenAI 的 GPT-5.4 High 和 @xAI 的 Grok-4.20 (Reasoning) 分别位列第 3、第 6 和第 7......
Arena.ai
Grok 4.20 Multi-agent Beta Reasoning has landed on Arena leaderboards! - #7 for Search Arena - #11 for Text Arena - #22 for Vision Arena Arena rankings reflect real-world usage and can be broken down further by expert-level prompts and occupational domains. When we look at… https://twitter.com/arena/status/2039072419500179854/photo/1
中文: Grok 4.20 Multi-agent Beta Reasoning 已登上 Arena 排行榜! - Search Arena 第 7 名 - Text Arena 第 11 名 - Vision Arena 第 22 名 Arena 排名反映真实使用情况,并可按专家级提示和职业领域进一步细分。 当我们查看......
Arena.ai
RT @ml_angelopoulos: Our team at @arena is solving one of the most important problems in AI: Evaluation. We have some of the top researcher…
中文: RT @ml_angelopoulos:我们在 @arena 的团队正在解决人工智能领域最重要的问题之一:评估。我们拥有一些顶尖的研究者......
Arena.ai
Most people have heard of big model smell, but what about pristine pre-training smell? Evan and Derry seem to think it exists. https://twitter.com/arena/status/2037994222578712663/video/1
中文: 大多数人听说过大模型气味,但原始的预训练气味又如何呢? 埃文和德里似乎认为它存在。
Arena.ai
Across real-world use in the Text Arena, GPT-5.4 High variants (Regular, Mini, Nano) by @OpenAI do indeed behave as scaled versions of the same model, validating that pricing differences reflect efficiency, and not fundamentally different capabilities. https://twitter.com/arena/status/2037654358519906399/photo/1
中文: 在 Text Arena 的真实使用中,@OpenAI 的 GPT-5.4 High 各版本(Regular、Mini、Nano)确实表现为同一模型的不同规模版本,这证实了价格差异反映的是效率,而非本质上不同的能力。
Arena.ai
You can smell a big model. Not the parameter count. Not the benchmark score. It's that feeling when something is actually reasoning. Not just pattern matching. We call it "big model smell." https://twitter.com/arena/status/2037607507510821107/video/1
中文: 你可以闻到一个大模型的气味。不是参数量。不是基准分数。而是当某个东西真正在推理、而不仅仅是模式匹配时的那种感觉。 我们称之为“大模型气味”。
Arena.ai
Are open source models catching up to proprietary models? We’ve looked back at 3 years of Arena’s data to show how the race has evolved. For comparison, we’ve taken the top 20% of the models and uncovered the following: - Before mid 2024: The gap was between 100-150 points - In… https://twitter.com/arena/status/2037584085997216100/video/1
中文: 开源模型是否正在追上专有模型?我们回顾了 Arena 三年的数据,以展示这场竞赛的演变。 为便于比较,我们选取了排名前 20% 的模型,并发现: - 2024 年年中之前:差距在 100 到 150 分之间 - ......
Arena.ai
GPT-5.4 came with a variety of models across cost and performance. In the Text Arena: - GPT-5.4-Mini-High ranks #22 overall at $0.75 / $4.50, which is similar pricing to top open models - GPT-5.4-Nano ranks #88, with even more efficient pricing at $0.20 / $1.25 Mini-High… https://twitter.com/arena/status/2037563653537583162/photo/1
中文: GPT-5.4 推出了覆盖不同成本和性能的多款模型。在 Text Arena 中: - GPT-5.4-Mini-High 总体排名第 22,价格为 $0.75 / $4.50,与顶级开放模型的定价相近 - GPT-5.4-Nano 排名第 88,价格更为高效,为 $0.20 / $1.25 Mini-High......
Arena.ai
Gemini 3.1 Pro Grounding has landed #2 in the Search Arena. This places three Gemini models in the top 7 for Search, more than any other lab. Congrats to @GoogleDeepMind on this achievement! https://twitter.com/arena/status/2037246509255983210/photo/1
中文: Gemini 3.1 Pro Grounding 在 Search Arena 中位列第 2。 这使得搜索榜前 7 名中有三个 Gemini 模型,超过其他任何实验室。 恭喜 @GoogleDeepMind 取得这一成就!
Arena.ai
Arena rankings measure model quality using large-scale human preference data. While style (length, tone, format) can influence individual votes, you can control for these factors to better isolate true capability. Our co-founder and CEO @ml_angelopoulos explains how in this… https://twitter.com/arena/status/2036893299861242130/video/1
中文: Arena 排名利用大规模的人类偏好数据来衡量模型质量。虽然风格(长度、语气、格式)会影响个别投票,但可以对这些因素加以控制,从而更好地分离出真实能力。 我们的联合创始人兼首席执行官 @ml_angelopoulos 在此解释了具体做法......
Arena.ai
RT @felicis: You can't trust AI you can't measure. @Arena is building the infrastructure to fix that - and we've believed since seed. New…
中文: RT @felicis:你无法信任无法衡量的人工智能。@Arena 正在建设基础设施来解决这个问题,而我们从种子轮起就一直看好。 新的......
Arena.ai
In this clip, co-founder and CEO @ml_angelopoulos talks about the scaling law in vote prediction with prompt-level data. Learn more about the math behind Arena's leaderboards on YouTube. Link in thread 🧵👇 https://twitter.com/arena/status/2036154529364975997/video/1
中文: 在这段视频片段中,联合创始人兼首席执行官 @ml_angelopoulos 谈到了基于提示级数据的投票预测中的缩放定律。 在 YouTube 上进一步了解 Arena 排行榜背后的数学原理。链接见帖子 🧵👇
Arena.ai
MiMo V2 Pro has landed as a top 6 lab for Code Arena, and top 10 for Arena Expert. Highlights - top 6 lab, #13 in Code Arena for agentic webdev tasks - #10 for Arena Expert - top 20 for Life, Physical, & Social Science and Business, Management, & Financial Ops occupational… https://twitter.com/arena/status/2035068569063690289/photo/1
中文: MiMo V2 Pro 已跻身 Code Arena 前 6 名实验室,并进入 Arena Expert 前 10 名。 亮点 - 前 6 名实验室,在 Code Arena 的智能体网页开发任务中排名第 13 - Arena Expert 第 10 名 - 前 20 名:生命、物理与社会科学,以及商业、管理与财务运营职业......
Arena.ai
MiMo V2 Omni by @XiaomiMiMo is ready for you in the Vision Arena! It’s set up to test its reasoning capabilities over visual inputs. Come find it in Battle Mode and vote, we'll see how it stacks up. https://twitter.com/arena/status/2035016588668350792/photo/1
中文: @XiaomiMiMo 的 MiMo V2 Omni 已在 Vision Arena 等你体验! 它已就绪,可测试其对视觉输入的推理能力。快来对战模式(Battle Mode)中找到它并投票,我们将看看它的表现如何。
Arena.ai
RT @Xudong_Lin_AI: Proud of our team that makes the huge leap happen compared to last version but this is just the start. Better models are…
中文: RT @Xudong_Lin_AI:为我们的团队感到自豪,相比上一版本实现了巨大飞跃,但这只是开始。更好的模型......
Arena.ai
RT @AlibabaGroup: Proud moment! 😎 Qwen 3.5 Max Preview is bringing the heat! ✅ #3 Math ✅ Top 10 Arena Expert ✅ Top 15 Overall Big thanks t…
中文: RT @AlibabaGroup:自豪时刻!😎 Qwen 3.5 Max Preview 火力全开! ✅ 数学第 3 名 ✅ Arena Expert 前 10 ✅ 总体前 15 非常感谢......
Arena.ai
GPT-5.4 Mini High is available in the Code Arena. Set up with the Codex Harness, @OpenAI’s latest model is ready for your real-world, agentic web development tasks. Come test it out and we'll see how it stacks up soon. https://twitter.com/arena/status/2034710747738165452/photo/1
中文: GPT-5.4 Mini High 现已在 Code Arena 上线。 @OpenAI 的最新模型已通过 Codex Harness 完成配置,可处理你真实的智能体网页开发任务。 快来测试,我们很快就会看到它的表现如何。
Arena.ai
Grok 4.20 Beta Reasoning makes @xAI a top 5 lab in Vision Arena. Scoring 1240, this model ranks #11 across all Vision models today. Congrats to the @xAI team for this milestone! https://twitter.com/arena/status/2034676243212484736/photo/1
中文: Grok 4.20 Beta Reasoning 使 @xAI 成为 Vision Arena 排名前五的实验室。 该模型得分 1240,在目前所有视觉模型中排名第 11。 祝贺 @xAI 团队达成这一里程碑!
Arena.ai
RT @MicrosoftAI: Meet MAI‑Image‑2. Built with creatives, for real creative work. Ranked #5 on @arena’s text‑to‑image leaderboard. Available…
中文: RT @MicrosoftAI:认识一下 MAI-Image-2。与创意人士共同打造,为真正的创意工作而生。在 @arena 的文生图排行榜上排名第 5。现已可用......
Arena.ai
RT @mustafasuleyman: Our new image generator MAI-Image-2 is out! Available now on MAI Playground for everything from lifelike realism to de…
中文: RT @mustafasuleyman:我们的新图像生成器 MAI-Image-2 现已发布!现已在 MAI Playground 上线,适用于从逼真写实到......
Arena.ai
Let’s dive deeper into the massive improvements between MAI-Image-2 vs. MAI-Image-1 by @MicrosoftAI. MAI-Image-2 shows significant gains across all sub-categories for Text-to-Image: Gains across all 7 sub-categories in order of magnitude: - Text Rendering (+115 pts) -… https://twitter.com/arena/status/2034661341370384447/photo/1
中文: 让我们深入看看 @MicrosoftAI 的 MAI-Image-2 相比 MAI-Image-1 的巨大改进。 MAI-Image-2 在文生图的所有子类别中都有显著提升: 全部 7 个子类别的提升,按幅度排序: - 文本渲染(+115 分) - ......
Arena.ai
MAI-Image-2 debuts at #5 in the Image Arena! Highlights: - #5 in Text-to-Image overall - #5 for 3D Imaging & Modeling, Cartoon, Anime & Fantasy, Photorealistic & Cinematic Imagery, Art and Portraits - #6 for Product, Branding & Commercial Design Congrats to the @MicrosoftAI… https://twitter.com/arena/status/2034660389284360585/photo/1
中文: MAI-Image-2 首次亮相即位列 Image Arena 第 5 名! 亮点: - 文生图总体第 5 名 - 3D 成像与建模、卡通、动漫与奇幻、写实与电影影像、艺术和肖像类别第 5 名 - 产品、品牌与商业设计第 6 名 恭喜 @MicrosoftAI......
Arena.ai
RT @Alibaba_Qwen: Pretty proud of this one! 😎 Qwen 3.5 Max Preview just hit #3 in Math, Top 10 in Arena Expert, and Top 15 overall! We're…
中文: RT @Alibaba_Qwen:为这个感到非常自豪!😎 Qwen 3.5 Max Preview 刚刚在数学方面排名第 3,在 Arena Expert 中进入前 10,总体进入前 15! 我们......
Arena.ai
With the preview of Qwen 3.5 Max Preview by @Alibaba_Qwen, we’re looking back at past Qwen Max variants to see how far it has progressed. Where Qwen 3.5 Max sees the largest gains vs. Qwen 3 Max: - Text Overall (+45pts) - Creative Writing (+57pts) - Math (+49pts) -… https://twitter.com/arena/status/2034658045113065603/photo/1
中文: 随着 @Alibaba_Qwen 发布 Qwen 3.5 Max Preview 预览版,我们回顾以往的 Qwen Max 版本,看看它进步了多少。 Qwen 3.5 Max 相比 Qwen 3 Max 提升最大的方面: - 文本总体(+45 分) - 创意写作(+57 分) - 数学(+49 分) - ......
Arena.ai
Qwen 3.5 Max Preview has landed in top 10 for Arena Expert and top 15 for Text Arena. It shows particular strength in Math. Highlights: - #3 Math - #10 Expert - #15 Text Arena - Top 20 for Writing, Literature & Language, Life, Physical, & Social Science, Entertainment, Sports,… https://twitter.com/arena/status/2034653740465336407/photo/1
中文: Qwen 3.5 Max Preview 已进入 Arena Expert 前 10 名和 Text Arena 前 15 名。 它在数学方面表现尤为突出。 亮点: - 数学第 3 名 - Expert 第 10 名 - Text Arena 第 15 名 - 前 20 名:写作、文学与语言;生命、物理与社会科学;娱乐、体育......
Arena.ai
MiniMax M2.7 is ranked #8 in Code Arena. It’s also the most cost-efficient of the top 10 at $0.30 / $1.20 per MToken. Congrats to the team at @MiniMax_AI 👏 https://twitter.com/arena/status/2034397085022462451/photo/1
中文: MiniMax M2.7 在 Code Arena 中排名第 8。 它也是前 10 名中最具成本效益的模型,价格为每百万 token $0.30 / $1.20。 祝贺 @MiniMax_AI 团队 👏
Arena.ai
When LLMs are unreliable, developers build scaffolding: Retries. Judges. Multi-step pipelines. When LLMs become reliable, you just prompt and ship. That’s been the real unlock over the last year. We break down what changed with AI capability expert @petergostev on YouTube:…
中文: 当 LLM 不可靠时,开发者会搭建脚手架:重试、评判器、多步骤流水线。 当 LLM 变得可靠时,只需写好提示就能上线。 这才是过去一年里真正的突破。 我们与人工智能能力专家 @petergostev 在 YouTube 上解析发生了哪些变化:......
Arena.ai
The key to building a leaderboard that can’t be gamed is starting with the right structural foundations: neutrality and rigorous methodology. Arena scores are powered by a constant stream of real-world prompts from millions of users across the globe comparing responses from the… https://twitter.com/arena/status/2034330545954713876/video/1
中文: 要构建一个无法被刷榜的排行榜,关键在于从正确的结构性基础出发:中立性和严谨的方法论。 Arena 分数来自全球数百万用户源源不断的真实提示,他们对比来自......的回复
Arena.ai
MiniMax M2.7 - the latest from @MiniMax_AI is ready for you in the Text and Code Arena! Let's see how it stacks up to real-world use. In Text Arena, we'll soon be able to compare its performance across multiple key categories like: Math, Coding, Creative Writing, Expert and… https://twitter.com/arena/status/2034300510086496329/photo/1
中文: MiniMax M2.7 是 @MiniMax_AI 的最新模型,现已在 Text Arena 和 Code Arena 等你体验! 让我们看看它在真实使用中的表现如何。 在 Text Arena 中,我们很快就能比较它在多个关键类别中的表现,例如:数学、编程、创意写作、专家和......
Arena.ai
Did you know? We’re funding independent research in AI evaluation and measurement—up to $50k per project. The Q1 deadline to apply for Arena’s Academic Partnerships Program is March 31. https://twitter.com/arena/status/2034294095150215182/photo/1
中文: 你知道吗?我们正在资助人工智能评估与测量领域的独立研究,每个项目高达5万美元。 申请Arena学术合作项目的第一季度截止日期为3月31日。
Arena.ai
GPT 5.4 Mini and Nano by @OpenAI are available in the Text and Vision Arena! Check them out and don't forget to vote, we'll see how they stack up on the leaderboards. https://twitter.com/arena/status/2033994249264599321/photo/1
中文: @OpenAI 的 GPT 5.4 Mini 和 Nano 现已在 Text Arena 和 Vision Arena 上线! 快去试试,别忘了投票,我们将看看它们在排行榜上的表现如何。
Arena.ai
Today we’re launching the Video Edit Arena to evaluate the frontier capability of video models! - #1 Grok-Imagine-Video, @xAI - #2 Kling-o3-pro, @Kling_ai - #3 Kling-o1-pro, @Kling_ai - #4 Gen4-aleph, @Runwayml The leaderboard is powered by thousands of real-world community… https://twitter.com/arena/status/2033981066873319610/photo/1
中文: 今天我们推出 Video Edit Arena,用于评估视频模型的前沿能力! - #1 Grok-Imagine-Video,@xAI - #2 Kling-o3-pro,@Kling_ai - #3 Kling-o1-pro,@Kling_ai - #4 Gen4-aleph,@Runwayml 该排行榜基于数千个来自真实世界的社区......
Arena.ai
Customize your Arena leaderboard. Everyone's real-world use for AI differs. Select the columns and data that matters most to you: - Rank Spread - Model Organization - License - Total Votes - Price ($/MToken) - Max Context https://twitter.com/arena/status/2033951859136938202/video/1
中文: 自定义你的 Arena 排行榜。 每个人对人工智能的实际用途各不相同。选择对你最重要的列和数据: - 排名区间(Rank Spread) - 模型所属机构 - 许可证 - 总票数 - 价格($/MToken) - 最大上下文
Arena.ai
Can AI tell when a question is total nonsense, or does it just make up an answer? @petergostev tested 80 models with nonsense questions. Some pushed back. Others confidently invented fake metrics and kept going. All of them were ranked on the "BS Bench". One surprise: thinking… https://twitter.com/arena/status/2033710089983660448/video/1
中文: 人工智能能分辨出一个问题完全是胡说八道吗,还是会直接编造答案? @petergostev 用无意义的问题测试了 80 个模型。有些模型提出了质疑。另一些则自信地编造虚假指标并继续作答。所有模型都在“BS Bench”上进行了排名。 一个意外发现:思考......
Arena.ai
How often are users unhappy with the answers that top AI models give? This can give us a real world view of how the frontier shifts throughout time. We have looked back to 2023 and traced back how often users rated both responses in Battle Mode to be bad (limited to Top 25… https://twitter.com/arena/status/2033668985095647256/photo/1
中文: 用户对顶级人工智能模型给出的答案感到不满的频率有多高?这可以让我们从真实世界的视角了解前沿如何随时间变化。 我们回溯到 2023 年,追踪了用户在对战模式(Battle Mode)中将两个回复都评为差的频率(仅限前 25......
Arena.ai
Grok 4.20 Beta Reasoning has landed #7 for Text Arena & #28 for Code Arena. The model is on par with DeepSeek-v3.2-thinking and Qwen3.5-122b-a10b in Code Arena's agentic webdev tasks. More Highlights: - #7 in Text Arena overall tied with GPT-5.4-high - top 10 in Math,… https://twitter.com/arena/status/2033652123800588419/photo/1
中文: Grok 4.20 Beta Reasoning 已登上 Text Arena 第 7 名和 Code Arena 第 28 名。 该模型在 Code Arena 的智能体网页开发任务中与 DeepSeek-v3.2-thinking 和 Qwen3.5-122b-a10b 表现相当。 更多亮点: - Text Arena 总体第 7 名,与 GPT-5.4-high 并列 - 数学前 10 名......
Arena.ai
Arena leaderboards now include Price and Context. - Price is shown as input / output cost per 1M tokens, and context shows the maximum context window. Compare Arena scores based on what matters for your use case. https://twitter.com/arena/status/2032579517177540787/video/1
中文: Arena 排行榜现在包含价格(Price)和上下文(Context)。 - 价格以每 100 万 token 的输入 / 输出成本显示,上下文显示最大上下文窗口。 根据对你的使用场景重要的因素来比较 Arena 分数。
Arena.ai
RT @nbrichtova: Once upon a time we dropped an anonymous model on Arena. 🍌🍌🍌 It quickly became the most-voted model in the platform’s hist…
中文: RT @nbrichtova:很久以前,我们在 Arena 上投放了一个匿名模型。🍌🍌🍌 它迅速成为该平台历史上得票最多的模型......
Arena.ai
An anonymous image model appeared on Arena on Aug 12, 2025 and quickly became the most-voted model in Arena's history. The codename: Nano Banana. It was later revealed to be built on Google Gemini and publicly released on Aug 26, 2025. We sat down with Lead Engineer Yue to… https://twitter.com/arena/status/2032198835997720886/video/1
中文: 一个匿名图像模型于 2025 年 8 月 12 日出现在 Arena 上,并迅速成为 Arena 历史上得票最多的模型。 代号:Nano Banana。 后来揭晓它基于 Google Gemini 构建,并于 2025 年 8 月 26 日公开发布。 我们与首席工程师 Yue 坐下来......
Arena.ai
GPT-5.4-high has landed in the Code Arena top 6. Set up with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: - top 6 in WebDev overall - #6 for Multi-File React - top 10 for Single-File HTML https://twitter.com/arena/status/2032126328842117612/photo/1
中文: GPT-5.4-high 已进入 Code Arena 前 6 名。 @OpenAI 的最新模型已通过 Codex Harness 完成配置,在真实网页开发任务中与 Gemini 3.1 Pro Preview 表现相当。 亮点: - WebDev 总体前 6 名 - 多文件 React 第 6 名 - 单文件 HTML 前 10 名
Arena.ai
GPT-5.4 and GPT-5.4-High by @OpenAI both sit in the top 5 for Arena Expert where rankings are based only on expert-level prompts. But beyond that similarity, let's take a closer look at specific domains and categories. Where 5.4-High has the most gains over 5.4: -… https://twitter.com/arena/status/2031849889249030583/photo/1
中文: @OpenAI 的 GPT-5.4 和 GPT-5.4-High 均位居 Arena Expert 前五名,该排名仅基于专家级提示。 但除了这一相似之处,让我们更仔细地看看具体的领域和类别。 5.4-High 相比 5.4 提升最大的方面: - ......
Arena.ai
With Search AI, the real challenge isn’t retrieval - it’s reasoning about which sources to trust and how to incorporate them. Deep dive on how AI search actually works with our Research Engineer Logan in thread ↓
中文: 对于搜索类人工智能,真正的挑战不在于检索,而在于推断哪些来源值得信任以及如何整合它们。 我们的研究工程师 Logan 在帖子中深入解析人工智能搜索的实际工作原理 ↓
Arena.ai
GPT-5.4 by @OpenAI lands tied #2 on Document Arena and in top 5 for Arena Expert. Document Arena Highlight: - #2 tied with Claude Sonnet 4.6 Text Arena Highlights: - #5 for Arena Expert - top 10 in Business, Management, & Financial Ops and Writing, Literature, & Language… https://twitter.com/arena/status/2031826221756268710/photo/1
中文: @OpenAI 的 GPT-5.4 在 Document Arena 并列第 2,并进入 Arena Expert 前 5 名。 Document Arena 亮点: - 与 Claude Sonnet 4.6 并列第 2 Text Arena 亮点: - Arena Expert 第 5 名 - 前 10 名:商业、管理与财务运营,以及写作、文学与语言......
Nemotron 3 Super by @NVIDIAAI ranks #37 among open models on Expert Arena. Super places in the top 50 open models in Text Arena and expands the Nemotron 3 family of open models for agentic AI applications. The family includes Nano, Super, and Ultra. Nano was released last… https://twitter.com/arena/status/2031763981963284680/photo/1
中文: @NVIDIAAI 的 Nemotron 3 Super 在 Expert Arena 的开放模型中排名第 37。 Super 位列 Text Arena 开放模型前 50 名,并扩充了面向智能体人工智能应用的 Nemotron 3 开放模型系列。 该系列包括 Nano、Super 和 Ultra。Nano 于上......发布
First impressions of GPT-5.4-High by @OpenAI with AI Capabilities Lead @petergostev. How does it compare to GPT-5.4-Medium? Find out on Arena's YouTube: https://www.youtube.com/watch?v=4T9_deFRI30
中文: AI 能力负责人 @petergostev 对 @OpenAI 的 GPT-5.4-High 的初步印象。 它与 GPT-5.4-Medium 相比如何? 请在 Arena 的 YouTube 频道观看:https://www.youtube.com/watch?v=4T9_deFRI30
RT @ml_angelopoulos: Cool to see @arena on here 💪
中文: RT @ml_angelopoulos:很高兴在这里看到 @arena 💪
Claude Sonnet 4.6 lands at #2 on Document Arena. The top three models for document analysis and long-form reasoning are now all from @AnthropicAI. - #1 Opus 4.6 - #2 Sonnet 4.6 - #3 Opus 4.5 Rankings are all powered by anonymous side-by-side evaluations on user-uploaded PDFs… https://twitter.com/arena/status/2031012090681663717/photo/1
中文: Claude Sonnet 4.6 在 Document Arena 位列第 2。文档分析和长篇推理的前三名模型现在全部来自 @AnthropicAI。 - #1 Opus 4.6 - #2 Sonnet 4.6 - #3 Opus 4.5 排名均基于对用户上传 PDF 的匿名并排评估......
PixVerse V5.6 by @PixVerse_ has landed in the Video Arena top 15. - #15 for Image-to-Video - #15 for Text-to-Video https://twitter.com/arena/status/2030076606757359751/photo/1
中文: @PixVerse_ 的 PixVerse V5.6 已进入 Video Arena 前 15 名。 - 图生视频第 15 名 - 文生视频第 15 名
We’re still waiting to see what the community thinks about GPT-5.4 in Code Arena. We took it for a spin, now it’s your turn. https://twitter.com/arena/status/2030049237908787274/video/1
中文: 我们仍在等待社区对 Code Arena 中 GPT-5.4 的看法。我们已经试用过了,现在轮到你了。
GPT-5.4 High by @OpenAI has landed in the Text Arena top 10. Let’s dig into why. Overall the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across quite a large number of categories. Below are where it has made the largest gains:… https://twitter.com/arena/status/2030018716440924225/photo/1
中文: @OpenAI 的 GPT-5.4 High 已进入 Text Arena 前 10 名。让我们深入探讨一下原因。 总体而言,最新模型比之前的 GPT-5.2-High 更全面,在相当多的类别中都有显著改进。以下是它提升最大的方面:......
GPT-5.4-high is now in the Text Arena, tied with Gemini-3-Pro. Highlights: - Top 3 in Creative Writing, and top 10 in Instruction Following, Hard Prompts. - Top 6 for Occupational categories: Writing, Literature & Language, Entertainment, Sports & Media, Business, Management &… https://twitter.com/arena/status/2029648008602857694/photo/1
中文: GPT-5.4-high 现已进入 Text Arena,与 Gemini-3-Pro 并列。 亮点: - 创意写作前 3 名,指令遵循、高难度提示前 10 名。 - 职业类别前 6 名:写作、文学与语言,娱乐、体育与媒体,商业、管理与......
AI needs better evaluations. Today we’re announcing Arena’s Academic Partnerships Program to fund independent academic research in AI evaluation and measurement. ▫️Up to $50K/project. Q1 Deadline: March 31, 2026. See more in thread for details and how to apply 👇 https://twitter.com/arena/status/2021268433619374336/photo/1
中文: 人工智能需要更好的评估。 今天,我们宣布将推出Arena的学术合作项目,以资助人工智能评估与测量领域的独立学术研究。 ▫ 项目金额高达 5 万美元。第一季度截止日期:2026年3月31日。 详情请查看更多内容 👇