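Stanford's CooperBench Study: Two AI Agents Working Together Underperform a Single Model; Social Coordination Intelligence Emerges as an Industry Bottleneck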

Key Summary

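Using the CooperBench benchmark, Stanford researchers ran a dedicated experiment on coordination between AI coding agents and reached a counterintuitive conclusion: two AI models working jointly complete markedly less of a task than a single model working alone. The study was presented at an ICLR workshop. The experiment comprises more than 650 engineering tasks spanning four programming languages (Python, TypeScript, Go, and Rust), deliberately built around collaboration scenarios that are highly prone to code conflicts. Both agents were allowed to edit code, run commands, and message each other in real time.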

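The tests found that leading coding agents lose nearly half their performance when paired up. The shortfall is concentrated in medium-difficulty tasks, which is also the range where, in theory, multi-model collaboration should most readily pay off. Opening a communication channel between the agents did not improve collaboration. Although the agents converse fluently in English, they cannot negotiate coordination at the level of code location and semantics. Recurring breakdowns include ignoring a partner's risk warnings, overwriting the partner's code, unproductive chatter, and failing to honor agreed task commitments.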

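The study states that prompt optimization cannot cure these coordination defects; the underlying problem is a lack of training in social interaction. The team proposes several remedies: adding training objectives that reward collaborative gains, establishing commitment verification and contract-style constraints, performing regular checks on code merges, and enriching the communication channel with multimodal means of coordination. The researchers conclude that current large models have mastered language only as text and have not learned how language functions in social collaboration. The research was funded by Stanford HAI.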
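To make the idea of commitment verification concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' mechanism: the `Region` type, the `find_violations` function, the agent names, and the convention that each agent claims line ranges up front are all assumptions introduced here. Each agent declares the regions it has agreed to own, and after the agents finish, every actual edit is checked against the partner's claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of lines in one file (hypothetical helper type)."""
    path: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive

    def overlaps(self, other: "Region") -> bool:
        # Two regions overlap only if they are in the same file
        # and their line ranges intersect.
        return (
            self.path == other.path
            and self.start <= other.end
            and other.start <= self.end
        )


def find_violations(claims, edits):
    """Return (agent, edited_region, owner) for each edit that touches
    a region claimed by a different agent.

    claims: dict mapping agent name -> list of Region it agreed to own
    edits:  list of (agent name, Region actually modified)
    """
    violations = []
    for agent, edited in edits:
        for owner, claimed in claims.items():
            if owner == agent:
                continue  # editing your own claimed region is fine
            if any(edited.overlaps(region) for region in claimed):
                violations.append((agent, edited, owner))
    return violations


# Hypothetical example: two agents split one file at line 40.
claims = {
    "agent_a": [Region("app.py", 1, 40)],
    "agent_b": [Region("app.py", 41, 80)],
}
edits = [
    ("agent_a", Region("app.py", 10, 20)),  # stays inside its own claim
    ("agent_b", Region("app.py", 35, 45)),  # spills into agent_a's claim
]
violations = find_violations(claims, edits)
for agent, edited, owner in violations:
    print(f"{agent} edited {edited.path}:{edited.start}-{edited.end}, "
          f"which overlaps a region claimed by {owner}")
```

A check like this only catches positional overlap. It does not address the semantic disagreements the study describes, where two edits in different places are still incompatible.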

Excerpts from the Original

It’s the curse of coordination. A single model is better than two agents sharing the work. Today’s best coding agents lose nearly half their capability when paired up to share work.

The team built more than 650 real-world software engineering tasks across four coding languages designed to trigger coordination conflicts. Agents were allowed to edit code, run commands and message each other in real time, yet mutual communication brought almost no improvement to final results.

Agents often ignore critical warnings and overwrite partners’ code despite explicit reminders. Language fluency frequently masks coordination failures rather than fixing them. Researchers argue prompt tuning cannot fix the gap; AI needs targeted training for social collaboration intelligence.