Arena.ai
Meta is back in the Arena! Muse Spark debuts as a top frontier model across both Text and Vision: - Text Arena: #3 tied with Gemini-3.1-Pro and Claude-Opus-4.6 - Vision Arena: #2 tied with Claude-Opus-4.6 This marks Meta’s first major release since early 2025. Highlights: - #4… https://twitter.com/arena/status/2042726806038680019/photo/1
Arena.ai
With GLM-5.1, https://z.ai/ maintains the #1 open model rank in Code Arena and is now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5, GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier. https://twitter.com/arena/status/2042643933768151485/photo/1
Arena.ai
With GLM-5.1 @Zai_org maintains the #1 open model rank in Code Arena at 1530. It’s now within ~20 points of the top overall while outperforming Claude Sonnet 4.6, Opus 4.5 (Thinking), GPT-5.4 High, and Gemini-3.1 Pro. Open models are now competitive at the frontier. For this… https://twitter.com/arena/status/2042634075274645577/photo/1
Arena.ai
GLM-5.1 by @Zai_org is now #3 in Code Arena - surpassing Gemini 3.1 and GPT-5.4, and now on par with Claude Sonnet 4.6. The first frontier level open model to break into the top 3. It’s a major +90 point jump over GLM-5, and +100 over Kimi K2.5 Thinking. Huge congrats to… https://twitter.com/arena/status/2042611135434891592/photo/1
Arena.ai
How much better is Dreamina Seedance 2.0? This visualization shows it clearly as it sits in a tier of its own. Dreamina Seedance 2.0 by Bytedance delivers a major leap in Text-to-Video performance, outperforming the nearest competitor (Veo-3.1-1080p) by +79 points. It also… https://twitter.com/arena/status/2041969275196567780/photo/1
Arena.ai
Today at @HumanXCo: CEO and co-founder @ml_angelopoulos & Chairman and co-founder @istoica05 take the stage at The Loop Theater at 1:30pm PT with @CristinaCriddle of @FT to answer a question the whole industry is asking: Who's winning the race to reliability? - Crowdsourced… https://twitter.com/arena/status/2041910061224882513/photo/1
Arena.ai
ICYMI: we released the FULL history of Arena leaderboard data as a public dataset, nearly 3 years of rankings across 10 Arenas, dozens of categories, and 700+ models. Hear more about it from @cthorrez and @petergostev on our YouTube channel: https://www.youtube.com/watch?v=QbpW77m90kw
Arena.ai
Dreamina Seedance 2.0 has landed #1 across the Video Arena for both Text-to-Video and Image-to-Video. This is the score for the 720p variant. Text-to-Video: - #1 model scoring 1450, +79pts over #2 Veo 3.1 1080p - This is a +191pt jump since Seedance-v1.5-Pro Image-to-Video: -… https://twitter.com/arena/status/2041713742485045590/photo/1
Arena.ai
Let’s deep dive into GLM model improvement by @Zai_org over the past 3 generations: 5.1, 5 and 4.7. GLM-5 was an improvement over 4.7 in similar ways, but 5.1 appears as a more rounded model with a few trade-offs. GLM-5.1 is currently the #1 open model in the Text Arena.… https://twitter.com/arena/status/2041650737206456529/photo/1
Arena.ai
Check out the GLM-5.1 first impressions with Peter on our YouTube https://www.youtube.com/watch?v=f11tVBXWr2g
Arena.ai
Come check out GLM 5.1 in Code Arena for agentic web development tasks using tools. Don’t forget to vote, Code Arena scores are coming up next! https://arena.ai/code
Arena.ai
GLM-5.1 by @Zai_org just launched in the Text Arena, and is now the #1 open model. It outperforms the next best open model, its predecessor, GLM-5, by +11 points and +15 over Kimi K2.5 Thinking. It shows strength in: - #1 open model in Longer Query (#4 overall) - #1 open model… https://twitter.com/arena/status/2041641149677629783/photo/1
Arena.ai
Tomorrow at @HumanXCo, our co-founder and CEO @ml_angelopoulos sits down with @andykonwinski for a press Q&A at 10am PT in the Media Lounge, then joins co-founder and Chairman @istoica05 for a presentation and conversation with @CristinaCriddle of @FT at 1:30pm PT in The Loop… https://twitter.com/arena/status/2041574831045669012/photo/1
Arena.ai
A new open model has entered the Arena! GLM-5.1 by @Zai_org is now ready for your prompts in the Text and Code Arena. Come vote and let's see how it stacks up! https://twitter.com/arena/status/2041554549488685370/photo/1
Arena.ai
New eval mode: Battles in Direct anonymously surfaces a second model mid-conversation during your direct chat for comparison. Longer context windows mean evaluation happens deeper in flow, leading to more decisive voting and bringing Arena closer to real-world use. We're…
Arena.ai
Code Arena can handle image inputs for agentic web dev tasks, reasoning through multi-step problems and using tools along the way. Watch how it works with @aryanvichare10 in this clip. Find a link to the full walkthrough on how Code Arena creates sites and apps from images in… https://twitter.com/arena/status/2040145898060304635/video/1
Arena.ai
Gemma 4 31B shifts the Pareto frontier, scoring +30 Arena points above similarly priced models like DeepSeek 3.2. Its position on the Pareto frontier is based on early pricing indicators from third parties. https://twitter.com/arena/status/2040128319719670101/photo/1
Arena.ai
RT @demishassabis: Gemma 4 outperforms models over 10x their size! (note the x-axis is log scale!) https://twitter.com/demishassabis/status/2040067244349063326/photo/1
Arena.ai
Qwen 3.6 Plus is ready for your real-world use cases in the Text and Code Arena! In the Code Arena, you can compare models on real-world, agentic web development tasks—generating HTML or React apps that you can immediately share, or download. Get testing and don't forget to… https://twitter.com/arena/status/2040082258833645605/photo/1
Arena.ai
Let’s look at how the open model Gemma has progressed across its last three versions. - Gemma 4 ranks 100 places above Gemma 3 - Gemma 3 ranks 87 above Gemma 2 All three models from @GoogleDeepMind are roughly the same size (31B, 27B, 27B), and these gains came only 9 and 13… https://twitter.com/arena/status/2039848959301361716/photo/1
Arena.ai
We're releasing the full history of Arena leaderboard data as a public dataset, nearly 3 years of rankings across 10 Arenas, dozens of categories, and hundreds of models. Optimized to empower analysis and unlock new insights across modalities and over time. Check out some… https://twitter.com/arena/status/2039796686953087183/photo/1
Arena.ai
Gemma 4 by @GoogleDeepMind debuts at 3rd and 6th on the open source leaderboard, making it the #1 ranked US open source model. By total parameter count, Gemma 4 31B is 24× smaller than GLM-5 and 34× smaller than Kimi-K2.5-Thinking, delivering comparable performance at a… https://twitter.com/arena/status/2039782449648214247/photo/1
Arena.ai
RT @Google: Gemma 4 is our most capable open model family yet: 🔵 Four versatile sizes 🔵 Up to 256K context window 🔵 Native function-callin…
Arena.ai
RT @GoogleAI: Today, we’re launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemi…
Arena.ai
RT @OfficialLoganK: Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable o…
Arena.ai
Gemma-4-31B is now live in Text Arena - ranking #3 among open models (#27 overall), matching much larger models at 10× smaller scale! A significant jump from Gemma-3-27B (+87 pts). Highlights: - #3 open (#27 overall), on par with the best open models Kimi-K2.5, Qwen-3.5-397b -… https://twitter.com/arena/status/2039739427715735645/photo/1
Arena.ai
RT @Alibaba_Qwen: #8 in Coding in Code Arena overall,#2 Lab in Code Arena on the React leaderboard! !👏👏 This is a great testament to our l…
Arena.ai
RT @mustafasuleyman: Three models. Three top-tier results. All shipped within just a few months by the @MicrosoftAI team. - MAI-Transcribe-…
Arena.ai
RT @Alibaba_Qwen: 🚀🚀Let's go!!
Arena.ai
Qwen 3.6 Plus Preview is the #2 lab for the React leaderboard in Code Arena which ranks models based on agentic workflows involving multi-step reasoning, tool use, and multi-file apps. https://twitter.com/arena/status/2039723549976678779/photo/1
Arena.ai
Qwen 3.6 Plus Preview is now #8 in Code Arena overall. This makes @Alibaba_Qwen the #2 lab in Code Arena on the React leaderboard, as it demonstrates strength in agentic coding tasks involving multi-step reasoning, tool use and multi-file apps. Congrats to the @Alibaba_Qwen… https://twitter.com/arena/status/2039723547569144187/photo/1
Arena.ai
GLM-5V-Turbo is now live in Vision Arena. Test its ability to reason over visual inputs using your real-world prompts. Don't forget to vote so we can see how it stacks up. https://twitter.com/arena/status/2039400189178556814/photo/1
Arena.ai
April is here. Here’s what changed at the frontier last month with Arena. New Arenas, leaderboard shifts, and product updates across Document, Video, Text, and Code - all grounded in real-world evaluation. Catch up on what changed and why it matters ↓
Arena.ai
We’ve added Pareto frontier charts to the leaderboard. Now available across: Text, Vision, Search, Document, and Code Arena. The Pareto curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens (3:1… https://twitter.com/arena/status/2039372539349323849/photo/1
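The idea behind a price-versus-score Pareto frontier can be sketched in a few lines of Python. This is a minimal illustration, not Arena's actual implementation: the `Model` type, the `pareto_frontier` helper, and every model name, price, and score below are hypothetical, and the blended price is taken as a given input rather than computed.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str     # hypothetical model name
    price: float  # blended price in $ per 1M tokens (made-up value)
    score: float  # Arena-style score (made-up value)


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep the models that no cheaper-or-equally-priced model matches or beats on score."""
    frontier: list[Model] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, see the higher score first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Hypothetical data for illustration only.
models = [
    Model("model-a", 0.5, 1300),
    Model("model-b", 1.0, 1280),   # pricier than model-a and lower scoring: dominated
    Model("model-c", 2.0, 1400),
    Model("model-d", 5.0, 1390),   # pricier than model-c and lower scoring: dominated
    Model("model-e", 10.0, 1450),
]

print([m.name for m in pareto_frontier(models)])
# ['model-a', 'model-c', 'model-e']
```

A model stays on the frontier only if paying less would always mean accepting a lower score, which is what "most efficient at their level of performance" amounts to in this sketch.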
Arena.ai
How did the Top 10 in Text Arena change in the last month? Let's take a look. Claude Opus 4.6 models by @AnthropicAI remained on top - with new entrants Gemini-3.1 Pro by @GoogleDeepMind, GPT-5.4 High by @OpenAI and Grok-4.20 (Reasoning) by @xAI landing on 3rd, 6th and 7th… https://twitter.com/arena/status/2039081680515011022/photo/1
Arena.ai
Grok 4.20 Multi-agent Beta Reasoning has landed on Arena leaderboards! - #7 for Search Arena - #11 for Text Arena - #22 for Vision Arena Arena rankings reflect real-world usage and can be broken down further by expert-level prompts and occupational domains. When we look at… https://twitter.com/arena/status/2039072419500179854/photo/1
Arena.ai
RT @ml_angelopoulos: Our team at @arena is solving one of the most important problems in AI: Evaluation. We have some of the top researcher…
Arena.ai
Most people have heard of big model smell, but what about pristine pre-training smell? Evan and Derry seem to think it exists. https://twitter.com/arena/status/2037994222578712663/video/1
Arena.ai
Across real-world use in the Text Arena, GPT-5.4 High variants (Regular, Mini, Nano) by @OpenAI do indeed behave as scaled versions of the same model, validating that pricing differences reflect efficiency, and not fundamentally different capabilities. https://twitter.com/arena/status/2037654358519906399/photo/1
Arena.ai
You can smell a big model. Not the parameter count. Not the benchmark score. It's that feeling when something is actually reasoning. Not just pattern matching. We call it "big model smell." https://twitter.com/arena/status/2037607507510821107/video/1
Arena.ai
Are open source models catching up to proprietary models? We’ve looked back at 3 years of Arena’s data to show how the race has evolved. For comparison, we’ve taken the top 20% of the models and uncovered the following: - Before mid 2024: The gap was between 100-150 points - In… https://twitter.com/arena/status/2037584085997216100/video/1
Arena.ai
GPT-5.4 came with a variety of models across cost and performance. In the Text Arena: - GPT-5.4-Mini-High ranks #22 overall at $0.75 / $4.50, which is similar pricing to top open models - GPT-5.4-Nano ranks #88, with even more efficient pricing at $0.20 / $1.25 Mini-High… https://twitter.com/arena/status/2037563653537583162/photo/1
Arena.ai
Gemini 3.1 Pro Grounding has landed #2 in the Search Arena. This places three Gemini models in the top 7 for Search, more than any other lab. Congrats to @GoogleDeepMind on this achievement! https://twitter.com/arena/status/2037246509255983210/photo/1
Arena.ai
Arena rankings measure model quality using large-scale human preference data. While style (length, tone, format) can influence individual votes, you can control for these factors to better isolate true capability. Our co-founder and CEO @ml_angelopoulos explains how in this… https://twitter.com/arena/status/2036893299861242130/video/1
Arena.ai
RT @felicis: You can't trust AI you can't measure. @Arena is building the infrastructure to fix that - and we've believed since seed. New…
Arena.ai
In this clip, co-founder and CEO @ml_angelopoulos talks about the scaling law in vote prediction with prompt-level data. Learn more about the math behind Arena's leaderboards on YouTube. Link in thread 🧵👇 https://twitter.com/arena/status/2036154529364975997/video/1
Arena.ai
MiMo V2 Pro has landed as a top 6 lab for Code Arena, and top 10 for Arena Expert. Highlights - top 6 lab, #13 in Code Arena for agentic webdev tasks - #10 for Arena Expert - top 20 for Life, Physical, & Social Science and Business, Management, & Financial Ops occupational… https://twitter.com/arena/status/2035068569063690289/photo/1
Arena.ai
MiMo V2 Omni by @XiaomiMiMo is ready for you in the Vision Arena! It’s set up to test its reasoning capabilities over visual inputs. Come find it in Battle Mode and vote, we'll see how it stacks up. https://twitter.com/arena/status/2035016588668350792/photo/1
Arena.ai
RT @Xudong_Lin_AI: Proud of our team that makes the huge leap happen compared to last version but this is just the start. Better models are…
Arena.ai
RT @AlibabaGroup: Proud moment! 😎 Qwen 3.5 Max Preview is bringing the heat! ✅ #3 Math ✅ Top 10 Arena Expert ✅ Top 15 Overall Big thanks t…
Arena.ai
GPT-5.4 Mini High is available in the Code Arena. Set up with the Codex Harness, @OpenAI’s latest model is ready for your real-world, agentic web development tasks. Come test it out and we'll see how it stacks up soon. https://twitter.com/arena/status/2034710747738165452/photo/1
Arena.ai
Grok 4.20 Beta Reasoning makes @xAI a top 5 lab in Vision Arena. Scoring 1240, this model ranks #11 across all Vision models today. Congrats to the @xAI team for this milestone! https://twitter.com/arena/status/2034676243212484736/photo/1
Arena.ai
RT @MicrosoftAI: Meet MAI‑Image‑2. Built with creatives, for real creative work. Ranked #5 on @arena’s text‑to‑image leaderboard. Available…
Arena.ai
RT @mustafasuleyman: Our new image generator MAI-Image-2 is out! Available now on MAI Playground for everything from lifelike realism to de…
Arena.ai
Let’s dive deeper into the massive improvements from MAI-Image-1 to MAI-Image-2 by @MicrosoftAI. MAI-Image-2 shows significant gains across all sub-categories for Text-to-Image: Gains across all 7 sub-categories in order of magnitude: - Text Rendering (+115 pts) -… https://twitter.com/arena/status/2034661341370384447/photo/1
Arena.ai
MAI-Image-2 debuts at #5 in the Image Arena! Highlights: - #5 in Text-to-Image overall - #5 for 3D Imaging & Modeling, Cartoon, Anime & Fantasy, Photorealistic & Cinematic Imagery, Art and Portraits - #6 for Product, Branding & Commercial Design Congrats to the @MicrosoftAI… https://twitter.com/arena/status/2034660389284360585/photo/1
Arena.ai
RT @Alibaba_Qwen: Pretty proud of this one! 😎 Qwen 3.5 Max Preview just hit #3 in Math, Top 10 in Arena Expert, and Top 15 overall! We're…
Arena.ai
With the release of Qwen 3.5 Max Preview by @Alibaba_Qwen, we’re looking back at past Qwen Max variants to see how far it has progressed. Where Qwen 3.5 Max sees the largest gains vs. Qwen 3 Max: - Text Overall (+45pts) - Creative Writing (+57pts) - Math (+49pts) -… https://twitter.com/arena/status/2034658045113065603/photo/1
Arena.ai
Qwen 3.5 Max Preview has landed in top 10 for Arena Expert and top 15 for Text Arena. It shows particular strength in Math. Highlights: - #3 Math - #10 Expert - #15 Text Arena - Top 20 for Writing, Literature & Language, Life, Physical, & Social Science, Entertainment, Sports,… https://twitter.com/arena/status/2034653740465336407/photo/1
Arena.ai
MiniMax M2.7 is ranked #8 in Code Arena. It’s also the most cost-efficient of the top 10 at $0.30 / $1.20 per MToken. Congrats to the team at @MiniMax_AI 👏 https://twitter.com/arena/status/2034397085022462451/photo/1
Arena.ai
When LLMs are unreliable, developers build scaffolding: Retries. Judges. Multi-step pipelines. When LLMs become reliable, you just prompt and ship. That’s been the real unlock over the last year. We break down what changed with AI capability expert @petergostev on YouTube:…
Arena.ai
The key to building a leaderboard that can’t be gamed is starting with the right structural foundations: neutrality and rigorous methodology. Arena scores are powered by a constant stream of real-world prompts from millions of users across the globe comparing responses from the… https://twitter.com/arena/status/2034330545954713876/video/1
Arena.ai
MiniMax M2.7 - the latest from @MiniMax_AI is ready for you in the Text and Code Arena! Let's see how it stacks up to real-world use. In Text Arena, we'll soon be able to compare its performance across multiple key categories like: Math, Coding, Creative Writing, Expert and… https://twitter.com/arena/status/2034300510086496329/photo/1
Arena.ai
Did you know? We’re funding independent research in AI evaluation and measurement—up to $50k per project. The Q1 deadline to apply for Arena’s Academic Partnerships Program is March 31. https://twitter.com/arena/status/2034294095150215182/photo/1
Arena.ai
GPT 5.4 Mini and Nano by @OpenAI are available in the Text and Vision Arena! Check them out and don't forget to vote, we'll see how they stack up on the leaderboards. https://twitter.com/arena/status/2033994249264599321/photo/1
Arena.ai
Today we’re launching the Video Edit Arena to evaluate the frontier capability of video models! - #1 Grok-Imagine-Video, @xAI - #2 Kling-o3-pro, @Kling_ai - #3 Kling-o1-pro, @Kling_ai - #4 Gen4-aleph, @Runwayml The leaderboard is powered by thousands of real-world community… https://twitter.com/arena/status/2033981066873319610/photo/1
Arena.ai
Customize your Arena leaderboard. Everyone's real-world use for AI differs. Select the columns and data that matters most to you: - Rank Spread - Model Organization - License - Total Votes - Price ($/MToken) - Max Context https://twitter.com/arena/status/2033951859136938202/video/1
Arena.ai
Can AI tell when a question is total nonsense, or does it just make up an answer? @petergostev tested 80 models with nonsense questions. Some pushed back. Others confidently invented fake metrics and kept going. All of them were ranked on the "BS Bench". One surprise: thinking… https://twitter.com/arena/status/2033710089983660448/video/1
Arena.ai
How often are users unhappy with the answers that top AI models give? This can give us a real world view of how the frontier shifts throughout time. We have looked back to 2023 and traced back how often users rated both responses in Battle Mode to be bad (limited to Top 25… https://twitter.com/arena/status/2033668985095647256/photo/1
Arena.ai
Grok 4.20 Beta Reasoning has landed #7 for Text Arena & #28 for Code Arena. The model is on par with DeepSeek-v3.2-thinking and Qwen3.5-122b-a10b in Code Arena's agentic webdev tasks. More Highlights: - #7 in Text Arena overall tied with GPT-5.4-high - top 10 in Math,… https://twitter.com/arena/status/2033652123800588419/photo/1
Arena.ai
Arena leaderboards now include Price and Context. - Price is shown as input / output cost per 1M tokens, and context shows the maximum context window. Compare Arena scores based on what matters for your use case. https://twitter.com/arena/status/2032579517177540787/video/1
Arena.ai
RT @nbrichtova: Once upon a time we dropped an anonymous model on Arena. 🍌🍌🍌 It quickly became the most-voted model in the platform’s hist…
Arena.ai
An anonymous image model appeared on Arena on Aug 12, 2025 and quickly became the most-voted model in Arena's history. The codename: Nano Banana. It was later revealed to be built on Google Gemini and publicly released on Aug 26, 2025. We sat down with Lead Engineer Yue to… https://twitter.com/arena/status/2032198835997720886/video/1
Arena.ai
GPT-5.4-high has landed in the Code Arena top 6. Set up with the Codex Harness, @OpenAI’s latest model is on par with Gemini 3.1 Pro Preview for real-world web development tasks. Highlights: - top 6 in WebDev overall - #6 for Multi-File React - top 10 for Single-File HTML https://twitter.com/arena/status/2032126328842117612/photo/1
Arena.ai
GPT-5.4 and GPT-5.4-High by @OpenAI both sit in the top 5 for Arena Expert where rankings are based only on expert-level prompts. But beyond that similarity, let's take a closer look at specific domains and categories. Where 5.4-High has the most gains over 5.4: -… https://twitter.com/arena/status/2031849889249030583/photo/1
Arena.ai
With Search AI, the real challenge isn’t retrieval - it’s reasoning about which sources to trust and how to incorporate them. Deep dive on how AI search actually works with our Research Engineer Logan in thread ↓
Arena.ai
GPT-5.4 by @OpenAI lands tied #2 on Document Arena and in top 5 for Arena Expert. Document Arena Highlight: - #2 tied with Claude Sonnet 4.6 Text Arena Highlights: - #5 for Arena Expert - top 10 in Business, Management, & Financial Ops and Writing, Literature, & Language… https://twitter.com/arena/status/2031826221756268710/photo/1
Nemotron 3 Super by @NVIDIAAI ranks #37 among open models on Expert Arena. Super places in the top 50 open models in Text Arena and expands the Nemotron 3 family of open models for agentic AI applications. The family includes Nano, Super, and Ultra. Nano was released last… https://twitter.com/arena/status/2031763981963284680/photo/1
First impressions of GPT-5.4-High by @OpenAI with AI Capabilities Lead @petergostev. How does it compare to GPT-5.4-Medium? Find out on Arena's YouTube: https://www.youtube.com/watch?v=4T9_deFRI30
RT @ml_angelopoulos: Cool to see @arena on here 💪
Claude Sonnet 4.6 lands at #2 on Document Arena. The top three models for document analysis and long-form reasoning are now all from @AnthropicAI. - #1 Opus 4.6 - #2 Sonnet 4.6 - #3 Opus 4.5 Rankings are all powered by anonymous side-by-side evaluations on user-uploaded PDFs… https://twitter.com/arena/status/2031012090681663717/photo/1
PixVerse V5.6 by @PixVerse_ has landed in the Video Arena top 15. - #15 for Image-to-Video - #15 for Text-to-Video https://twitter.com/arena/status/2030076606757359751/photo/1
We’re still waiting to see what the community thinks about GPT-5.4 in Code Arena. We took it for a spin, now it’s your turn. https://twitter.com/arena/status/2030049237908787274/video/1
GPT-5.4 High by @OpenAI has landed in the Text Arena top 10. Let’s dig into why. Overall the latest model is much more rounded than the previous GPT-5.2-High, with significant improvements across quite a large number of categories. Below are where it has made the largest gains:… https://twitter.com/arena/status/2030018716440924225/photo/1
GPT-5.4-high is now in the Text Arena, tied with Gemini-3-Pro. Highlights: - Top 3 in Creative Writing, and top 10 in Instruction Following, Hard Prompts. - Top 6 for Occupational categories: Writing, Literature & Language, Entertainment, Sports & Media, Business, Management &… https://twitter.com/arena/status/2029648008602857694/photo/1
AI needs better evaluations. Today we’re announcing Arena’s Academic Partnerships Program to fund independent academic research in AI evaluation and measurement. ▫️Up to $50K/project. Q1 Deadline: March 31, 2026. See more in thread for details and how to apply 👇 https://twitter.com/arena/status/2021268433619374336/photo/1