In case you didn’t notice: Agent Arena doesn’t have a voting mechanism. So how do we calculate the scores? The answer is causal inference. Agents are multi-stage systems where the orchestrator and harness work together to produce the end result. We developed a method called… https://twitter.com/ml_angelopoulos/status/2064028763697127844/photo/1
中文: 如果你没有注意到:Agent Arena 没有投票机制。那么我们如何计算分数呢?答案是因果推断。智能体是多阶段系统,其中编排器(orchestrator)和 harness 协同工作以产生最终结果。我们开发了一种名为......
Excited to join Stanford CS 153 Office Hours tomorrow: Friday, May 1st at 12PM PT with @AnjneyMidha and @Mabb0tt. Live Q&A. Come say hi - link in thread.
中文: 很高兴明天参加斯坦福 CS 153 Office Hours:5月1日星期五中午12点(太平洋时间),与 @AnjneyMidha 和 @Mabb0tt 一起。现场问答。欢迎来打个招呼,链接见推文串。
Why is GPT-5.5 lower than Claude? The answer is simple: Code Arena currently only supports frontend/web development tasks, where GPT-5.5 is weakest. Full-stack app development and GitHub integration will land in a couple months. Next time we'll be clearer that this leaderboard… https://twitter.com/ml_angelopoulos/status/2048888792438939707/photo/1
中文: 为什么 GPT-5.5 比 Claude 低?答案很简单:Code Arena 目前仅支持前端/网页开发任务,而这正是 GPT-5.5 最弱的领域。全栈应用开发和 GitHub 集成将在几个月内上线。下次我们会更清楚地说明这个排行榜......
Cool to see this work from @Jsjcl293905, @davidsimchilevi, and @WillWeiSun deriving efficiency bounds and new estimators for model rankings built on Arena's human preference data. There’s a lot of foundational work to be done to continue to improve the statistical foundations of…
中文: 很高兴看到 @Jsjcl293905、@davidsimchilevi 和 @WillWeiSun 的作品,其使用基于 Arena 的人类偏好数据,为模型排名提供了效率限制和新的估算器。需要做大量基础工作,以继续改善......的统计基础
Our team at @arena is solving one of the most important problems in AI: Evaluation. We have some of the top researchers and engineers in the world working together to measure intelligence in the real world and build the infrastructure to ensure it is deployed reliably. Since the… https://twitter.com/ml_angelopoulos/status/2038660899171619279/photo/1
中文: 我们@arena 的团队正在解决人工智能领域最重要的问题之一:评估。我们拥有全球顶尖的研究人员和工程师,共同测量现实世界中的智能,并建设基础设施,以确保其部署可靠。 自从......起:
Cool to see @arena on here 💪
中文: 很高兴在这里看到 @arena 💪