
The Model is the Product

There was a lot of speculation over the past years about what the next cycle of AI development could be. Agents? Reasoners? Actual multimodality?

I think it's time to call it: the model is the product.

All current factors in research and market development push in this direction.

  • Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model at remotely affordable pricing.
  • Opinionated training is working much better than expected. The combination of reinforcement learning and reasoning means that models are suddenly learning tasks. It's not machine learning, it's not a base model either, it's a secret third thing. It's even tiny models getting suddenly scary good at math. It's coding models no longer just generating code but managing an entire code base by themselves. It's Claude playing Pokemon with very poor contextual information and no dedicated training.
  • Inference costs are in free fall. The recent optimizations from DeepSeek mean that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. There is nowhere near this level of demand. The economics of selling tokens no longer work for model providers: they have to move higher up in the value chain.

This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.

Shapes of models to come.

Over the past weeks, we have seen two prime examples of this new generation of models as a product: OpenAI's DeepResearch and Claude Sonnet 3.7.

I've read a lot of misunderstandings about DeepResearch, which isn't helped by the multiplication of open and closed clones. OpenAI has not built a wrapper on top of O3. They have trained an entirely new model, able to perform search internally, without any external calls, prompts or orchestration:

The model learned the core browsing capabilities (searching, clicking, scrolling, interpreting files) (…) and how to reason to synthesize a large number of websites to find specific pieces of information or write comprehensive reports through reinforcement learning training on these browsing tasks.

DeepResearch is not a standard LLM, nor a standard chatbot. It's a new form of research language model, explicitly designed to perform search tasks end to end. The difference is immediately striking to everyone using it seriously: the model generates lengthy reports with a consistent structure and an underlying source-analysis process. In comparison, as Hanchung Lee underlined, all the other DeepSearch products, including the Perplexity and Google variants, are just your usual models with a few twists:

Google’s Gemini and Perplexity’s chat assistants also offer “Deep Research” features, but neither has published any literature on how they optimized their models or systems for the task or any substantial quantitative evaluations (…) We will make an assumption that the fine-tuning work done is non-substantial.

Anthropic has been laying out their current vision ever more clearly. In December, they introduced a controversial but, to my mind, correct definition of agent models. Similarly to DeepSearch, an agent has to perform the targeted tasks internally: agents "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks".

What most agent startups are currently building is not agents but workflows, that is, "systems where LLMs and tools are orchestrated through predefined code paths." Workflows may still bring some value, especially for vertical adaptations. Yet, to anyone currently working in the big labs, it's strikingly obvious that all major progress in autonomous systems will come from redesigning the models in the first place.
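To make the distinction concrete, here is a minimal runnable sketch, not real lab code: every name (`call_model`, `fake_search`) is a hypothetical stub standing in for a model API and a retrieval tool. The workflow hard-codes its control flow; the agent loop only supplies tools and lets the model decide what to do and when to stop.

```python
def fake_search(query: str) -> list[str]:
    """Stub retrieval tool (hypothetical)."""
    return [f"doc about {query} #1", f"doc about {query} #2"]

def call_model(prompt: str) -> str:
    """Stub LLM. A real system would call a model provider here."""
    if prompt.startswith("ACT") and "observation:" not in prompt:
        return "TOOL search"  # the model elects to use a tool
    return "final answer"

# Workflow: LLM calls orchestrated through a predefined code path.
# The sequence of steps is fixed in code, whatever the query is.
def run_workflow(query: str) -> str:
    docs = fake_search(query)                       # step 1, always
    digest = call_model(f"summarize: {docs}")       # step 2, always
    return call_model(f"answer {query!r} given {digest}")

# Agent: the model dynamically directs its own process and tool usage.
def run_agent(query: str, max_steps: int = 4) -> str:
    transcript = f"ACT on: {query}"
    for _ in range(max_steps):
        decision = call_model(transcript)
        if decision.startswith("TOOL search"):
            transcript += f" observation: {fake_search(query)}"
        else:
            return decision                         # model chose to stop
    return call_model(f"answer now: {transcript}")
```

The point of the sketch: in the workflow, all the intelligence about sequencing lives in the orchestration code; in the agent, it has to live in the model, which is precisely why the big labs see it as a training problem.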

We had a very concrete demonstration of this with the release of Claude 3.7, a model primarily trained with complex code use cases in mind. All the workflow adaptations like Devin got a major boost on SWE benchmarks.

To give another example at a much smaller scale: at Pleias we're currently working on automating RAG. Current RAG systems are a lot of interconnected yet brittle workflows: routing, chunking, reranking, query interpretation, query expansion, source contextualization, search engineering. With the evolving training tech stack, there is real potential to bundle all these processes into two separate yet interconnected models, one for data preparation and the other for search/retrieval/report generation. This requires an elaborate synthetic pipeline and entirely new reward functions for reinforcement learning. Actual training, actual research.
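As a toy illustration of this bundling, with all names hypothetical and every stage reduced to a stub: the first function chains the brittle stages listed above through explicit code, while the bundled version collapses them into two model calls, one for preparation and one for retrieval plus report writing.

```python
# --- Today: many interconnected, brittle stages -------------------------
def route(q: str) -> str:
    return "technical" if "how" in q else "general"

def chunk(text: str, size: int = 40) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def expand(q: str) -> list[str]:
    return [q, q + " best practices"]

def rerank(chunks: list[str], q: str) -> list[str]:
    return sorted(chunks, key=lambda c: -sum(w in c for w in q.split()))

def rag_pipeline(query: str, corpus: str) -> str:
    _ = route(query)                       # routing
    chunks = chunk(corpus)                 # chunking
    queries = expand(query)                # query expansion
    ranked = rerank(chunks, queries[0])    # reranking
    return f"report from {len(ranked)} chunks"

# --- Bundled: two trained models (stubbed) ------------------------------
def prep_model(corpus: str) -> list[str]:
    """One model trained to do chunking/contextualization end to end."""
    return chunk(corpus)

def search_model(query: str, index: list[str]) -> str:
    """One model trained to do retrieval + report writing end to end."""
    return f"report from {len(index)} chunks"

def rag_bundled(query: str, corpus: str) -> str:
    return search_model(query, prep_model(corpus))
```

The brittleness argument in code form: every intermediate function in `rag_pipeline` is a seam that can silently break, whereas the bundled version moves those seams inside the training objective.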

What all this means in practice: displacing complexity. Training anticipates a wide range of actions and edge cases, so that deployment becomes much simpler. But in this process most of the value is now created and, likely in the end, captured by the model trainer. In short, Claude aims to disrupt and replace current workflows like this basic "agent" system from LlamaIndex:

[Image: LlamaIndex basic agent]

With this:

[Image: Claude agent]

Training or being trained on.

To reassert: the big labs are not advancing with a hidden agenda. While they can be opaque at times, they are laying it all out in the open: they will bundle, they will go up the application layer, and they will attempt to capture most of the value there. And the commercial consequences are quite clear. Naveen Rao, the Gen AI VP of Databricks, phrased it quite well:

all closed AI model providers will stop selling APIs in the next 2-3 years. Only open models will be available via APIs (…) Closed model providers are trying to build non-commodity capabilities and they need great UIs to deliver those. It's not just a model anymore, but an app with a UI for a purpose.

So what is happening right now is just a lot of denial. The honeymoon period between model providers and wrappers is over. Things could evolve in several directions:

  • Claude Code and DeepSearch are early technical and product experiments in this direction. You will notice that DeepSearch is not available through an API, only used to create value for the premium subscriptions. Claude Code is a minimalistic terminal integration. Weirdly enough, while Claude 3.7 works perfectly in Claude Code, Cursor struggles with it, and I've already seen several high-end users cancel their subscriptions as a result. Actual LLM agents don't care about pre-existing workflows: they replace them.
  • The most high-profile wrappers are now scrambling to become hybrid AI training companies. They do have some training capacities, though very little advertised. One of Cursor's main assets is their small autocompletion model. WindSurf has their internal cheap code model, Codium. Perplexity has always relied on home-grown classifiers for routing and recently pivoted to training their own DeepSeek variant for search purposes.
  • For smaller wrappers, not much will change, except a likely increased reliance on agnostic inference providers if the big labs entirely let go of this market. I also expect to see much more focus on UI, which is still dramatically underestimated, as even the more generalist models are likely to bundle common deployment tasks, especially for RAG.

In short, the dilemma for most successful wrappers is simple: training or being trained on. What they are doing right now is both free market research for the big labs and even, since all outputs are ultimately generated through model providers, free data design and generation.

What will happen afterwards is anyone's guess. Successful wrappers do have the advantage of knowing their vertical well and accumulating a lot of precious user feedback. Yet, in my experience, it's easier to go down from the model layer to the application layer than to build entirely new training capacities from scratch. Wrappers may not have been helped by their investors either. From what I overheard, there is such a negative polarization against training that they almost have to hide what is going to be their most critical value: neither Cursor's small model nor Codium is properly documented at this moment.

Reinforcement learning was not priced in.

This brings me to the actually painful part: currently, all AI investments are correlated. Funds are operating under the following assumptions:

  • The real value lies exclusively in an application layer independent from the model layer, which is best positioned to disrupt existing markets.
  • Model providers will only sell tokens at an ever-lowering price, making wrappers in turn more profitable.
  • Wrapping closed models will satisfy all existing demands, even in regulated sectors with long-lasting concerns over external dependencies.
  • Building any training capacity is just a waste of time. This covers not only pre-training but all forms of training.

I'm afraid this increasingly looks like an adventurous bet and an actual market failure to accurately price the latest technical developments, especially in RL. In the current economic ecosystem, venture funds are meant to find uncorrelated investments. They will not beat the S&P 500, but that's not what larger institutional investors are looking for: they want to bundle risks, to ensure that in a bad year at least some things will work out. Model training is a textbook-perfect example of this: lots of potential for disruption in a context where most western economies are on course for a recession. And yet model trainers can't raise, or at least not in the usual way. Prime Intellect is one of the few new western AI training companies with a clear potential to become a frontier lab. Yet, despite their achievements, including the training of the first decentralized LLM, they struggled to raise more than your usual wrapper.

Beyond that, aside from the big labs, the current training ecosystem is very tiny. You can count all these companies on your fingers: Prime Intellect, Moondream, Arcee, Nous, Pleias, Jina, the HuggingFace pretraining team (actually tiny)… Along with a few more academic actors (Allen AI, Eleuther…), they build and support most of the current open infrastructure for training. In Europe, I know that at least 7-8 LLM projects will integrate the Common Corpus and some of the pretraining tools we developed at Pleias — and the rest will be fineweb, and likely post-training instruction sets from Nous or Arcee.

There is something deeply wrong in the current funding environment. Even OpenAI senses it now. Lately, there has been some palpable irritation at the lack of "vertical RL" in the current Silicon Valley startup landscape. I believe the message comes straight from Sam Altman and will likely result in some adjustment in the next YC batch, but it points to a larger shift: soon the big labs' select partners won't be API customers but associated contractors involved in the earlier training stages.

If the model is the product, you cannot necessarily build it alone. Search and code are easy low-hanging fruit: they have been the major use cases for two years, the market is nearly mature, and you can ship a new Cursor in a few months. Many of the most lucrative AI use cases of the future are not at this advanced stage of development — typically, think about all these rule-based systems that still run most of the world economy… Small dedicated teams with cross-expertise and a high level of focus may be best positioned to tackle this — eventually becoming potential acquihires once the initial groundwork is done. We could see the same pipeline on the UI side: some preferred partners getting exclusive API access to closed specialized models, provided they get on the road to business acquisition.

I haven't mentioned DeepSeek nor the Chinese labs so far. Simply because DeepSeek is already one step further: not the model as a product, but the model as a universal infrastructure layer. Like OpenAI and Anthropic, Liang Wenfeng lays out his plans in the open:

We believe that the current stage is an explosion of technological innovation, not an explosion of applications (…) If a complete upstream and downstream industrial ecosystem is formed, then there is no need for us to make applications ourselves. Of course, there is no obstacle for us to make applications if needed, but research and technological innovation will always be our first priority.

At this stage, working only on applications is like "fighting the next war with the last war's generals". I'm afraid we're at the point where many in the west are not even aware the last war is over.