stevibe
I built a macOS app for benchmarking local LLMs. 6 test suites. Multiple providers. One workspace. Open source. There are hundreds of local models now. New ones every week. How do you actually pick one? Leaderboards test for general ability. But if you're building an agent… https://twitter.com/stevibe/status/2043736487917981701/video/1
stevibe
So we know Gemma 4 is good at tool calling, but what about web coding? I threw 4 UI screenshots at three Gemma 4 models and said: rebuild this. One shot, no hand-holding, just image in, code out.
Model lineup:
- E4B
- 26B A4B (MoE)
- 31B Dense
(skipped the E2B this round)
Let… https://twitter.com/stevibe/status/2040039108748177706/video/1
stevibe
Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I… https://twitter.com/stevibe/status/2036809734611988818/video/1
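For readers wondering what a setup like "scenarios, tools, mocked responses, temperature 0" might look like in code, here is a minimal sketch. It is not the framework from the post: every name in it (`ToolCall`, `Scenario`, `MOCK_TOOLS`, `run_suite`, `toy_model`) is hypothetical, and the two toy tools and scenarios stand in for the real 12 tools and 15 scenarios, which the post does not list. The model is assumed to be any callable that takes a prompt, the available tool names, and a temperature, and returns one tool call.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    """A tool invocation requested by the model (hypothetical shape)."""
    name: str
    args: dict[str, Any]


@dataclass
class Scenario:
    """One test case: a prompt plus the tool call we expect back."""
    prompt: str
    expected: ToolCall


# Mocked tools: canned, deterministic responses with no network access.
# These two are illustrative placeholders, not tools from the post.
MOCK_TOOLS: dict[str, Callable[..., Any]] = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
    "add": lambda a, b: {"sum": a + b},
}

# Assumed model interface: (prompt, tool names, temperature) -> ToolCall.
Model = Callable[[str, list[str], float], ToolCall]


def run_suite(model: Model, scenarios: list[Scenario],
              temperature: float = 0.0) -> float:
    """Return the fraction of scenarios where the model's call matches."""
    passed = 0
    for scenario in scenarios:
        call = model(scenario.prompt, sorted(MOCK_TOOLS), temperature)
        tool = MOCK_TOOLS.get(call.name)
        if tool is None:
            continue  # the model asked for a tool that does not exist
        try:
            tool(**call.args)  # run the mock to validate the arguments
        except TypeError:
            continue  # wrong or missing argument names
        if call == scenario.expected:
            passed += 1
    return passed / len(scenarios)


def toy_model(prompt: str, tools: list[str], temperature: float) -> ToolCall:
    """Stand-in for a real LLM client; always answers the same way."""
    if "weather" in prompt:
        return ToolCall("get_weather", {"city": "Paris"})
    return ToolCall("add", {"a": 1, "b": 2})


SCENARIOS = [
    Scenario("What's the weather in Paris?",
             ToolCall("get_weather", {"city": "Paris"})),
    Scenario("Add 2 and 3.", ToolCall("add", {"a": 2, "b": 3})),
]

if __name__ == "__main__":
    print(f"pass rate: {run_suite(toy_model, SCENARIOS):.0%}")
```

Swapping `toy_model` for a function that calls a local inference server would turn this into a real comparison. Exact-match scoring is a deliberately strict choice here; a real harness would probably need looser argument matching.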