
A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT, an LLM agent built using OpenAI's Assistants API and designed specifically to evaluate other LLM applications. What started as an experimental idea quickly turned into a prototype we were eager to ship, as users reported that JudgementalGPT gave more accurate and reliable results than other state-of-the-art LLM-based evaluation approaches such as G-Eval.
Understandably, given that Confident AI is the world's first open-source evaluation infrastructure for LLMs, many demanded more transparency into how JudgementalGPT was built after our initial public release:
I thought it's all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.
So here you go, dear anonymous internet stranger, this article is dedicated to you.
Limitations of LLM-based Evaluations
The authors of G-Eval state that:
Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.
For those who don't already know, G-Eval is a framework that uses Large Language Models (LLMs) with chain-of-thought (CoT) processing to evaluate the quality of generated texts in a form-filling paradigm. If you've ever tried implementing a version of your own, you'll quickly find that using LLMs for evaluation presents its own set of problems:
- Unreliability - although G-Eval uses a low-precision grading scale (1–5), which makes scores easier to interpret, those scores can vary a lot even under identical evaluation conditions. This variability comes from an intermediate step in G-Eval that dynamically generates the steps used for later evaluation, which increases the stochasticity of evaluation scores (and is also why providing an initial seed value doesn't help).
- Inaccuracy - for certain tasks, one digit usually dominates (e.g., 3 on a grading scale of 1–5 using gpt-3.5-turbo). One way around this is to take the output token probabilities from the LLM, normalize them, and use their weighted summation as the final score (see the sketch after this list). Unfortunately, this isn't an option if you're using OpenAI's GPT models as an evaluator, since they deprecated the logprobs parameter a few months ago.
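To make the weighted-summation workaround concrete, here's a minimal sketch. It assumes you have an evaluator model that still exposes log probabilities for its output tokens; the `score_logprobs` values below are made up purely for illustration.

```python
import math

def weighted_score(score_logprobs: dict[int, float]) -> float:
    """Normalize the probabilities over candidate scores and return
    their weighted sum, smoothing out single-digit dominance."""
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p / total for score, p in probs.items())

# Hypothetical log probabilities for the score tokens "1" through "5":
score_logprobs = {1: -4.1, 2: -2.3, 3: -0.4, 4: -1.6, 5: -3.8}
print(round(weighted_score(score_logprobs), 2))  # ~3.11, instead of a flat 3
```

Instead of collapsing onto the single most likely digit, the final score becomes a continuous value over the whole grading scale, which is both finer-grained and more stable across runs.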
In fact, another paper that explored LLM-as-a-judge pointed out that using LLMs as evaluators is flawed in several ways. For example, GPT-4 gives preferential treatment to self-generated outputs, is not very good at math (but neither am I), and is prone to verbosity bias, meaning it favors longer, verbose responses over shorter, more accurate alternatives. (An initial study showed that GPT-4 exhibits verbosity bias 8.75% of the time.)
Can you see how this becomes a problem if you're trying to evaluate a summarization task?
OpenAI Assistants offers a workaround to existing problems
Here's a surprise: JudgementalGPT isn't composed of one evaluator built using the new OpenAI Assistants API, but multiple. That's right, behind the scenes, JudgementalGPT is a proxy for multiple assistants that perform different evaluations depending on the evaluation task at hand. Here are the problems JudgementalGPT was designed to solve:
- Bias - we're still experimenting with this (another reason for closed-sourcing JudgementalGPT!), but assistants can write and execute code using the code interpreter tool, which means that, with a bit of prompt engineering, an assistant can handle tasks that are more prone to logical fallacies, such as asserting coding or math problems, or tasks that call for factuality rather than preferential treatment of its own outputs.
- Reliability - since we no longer require LLMs to dynamically generate CoTs/evaluation steps, we can enforce a set of rules for specific evaluation tasks. In other words, because we've pre-defined multiple sets of evaluation steps based on the evaluation task at hand, we've removed the biggest parameter contributing to stochasticity.
- Accuracy - having a set of pre-defined evaluation steps for different tasks also means we can provide more guidance based on what we as humans actually expect from each evaluator, and quickly iterate on the implementation based on user feedback.
Another insight we gained while integrating G-Eval into our open-source project DeepEval was that LLM-generated evaluation steps tend to be arbitrary and generally don't provide useful guidance for evaluation. Some of you might also wonder what happens when JudgementalGPT can't find a suitable evaluator for a particular evaluation task. For this edge case, we default back to G-Eval. Here's a quick architecture diagram of how JudgementalGPT works:
[Architecture diagram: JudgementalGPT routes each evaluation task to a task-specific assistant, defaulting back to G-Eval when no suitable evaluator exists.]
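Since we're not open-sourcing the actual implementation, here's a minimal sketch of the routing pattern described above. The evaluation step lists and the `run_assistant`/`g_eval` helpers are illustrative stand-ins, not Confident AI's real code:

```python
# A minimal sketch of JudgementalGPT's proxy pattern: route each
# evaluation task to an assistant with pre-defined evaluation steps,
# and default back to G-Eval when no suitable evaluator exists.
# Everything here is illustrative, not the production implementation.

EVALUATION_STEPS: dict[str, list[str]] = {
    "summarization": [
        "Check that every claim in the output is supported by the input text.",
        "Check that the output covers the input's main points.",
        "Do not reward length; penalize padding.",
    ],
    "code-correctness": [
        "Use the code interpreter to write and run a test for the output code.",
        "Score on whether the test passes, not on how the code looks.",
    ],
}

def run_assistant(steps: list[str], llm_input: str, llm_output: str) -> float:
    """Hypothetical call to an OpenAI assistant that follows the given
    pre-defined steps (e.g., via the Assistants API with the code
    interpreter tool enabled). Stubbed out for illustration."""
    raise NotImplementedError

def g_eval(llm_input: str, llm_output: str) -> float:
    """Hypothetical G-Eval fallback that dynamically generates its own
    evaluation steps. Stubbed out for illustration."""
    raise NotImplementedError

def evaluate(task: str, llm_input: str, llm_output: str) -> float:
    steps = EVALUATION_STEPS.get(task)
    if steps is None:
        # Edge case: no suitable evaluator, so default back to G-Eval.
        return g_eval(llm_input, llm_output)
    # Fixed, human-written steps remove the biggest source of stochasticity.
    return run_assistant(steps, llm_input, llm_output)
```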
As I was writing this article, I discovered a recent paper introducing Prometheus, "a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied", which likewise requires evaluation steps to be explicitly defined.
Still, problems with LLM-based evaluation linger
One unresolved issue is the accuracy challenge stemming from the predominance of a single digit in evaluation scores. In theory, this phenomenon isn't exclusive to older models and is likely to affect newer versions like gpt-4-1106-preview as well, so I'm keeping an open mind about how it might affect JudgementalGPT. We're really looking forward to more research that'll either back up what we think or give us a whole new perspective; either way, I'm all ears.
Lastly, there can still be intricacies involved in defining our own set of evaluators. For example, just as G-Eval isn't a one-size-fits-all solution, neither is a summarization or relevancy evaluator. Any metric that is open to interpretation is guaranteed to disappoint users who expect something different (click here to learn everything about LLM evaluation metrics). For now, the best solution is to have users clearly define their evaluation criteria, ridding the LLM of any evaluation ambiguity; the sketch below shows what that can look like.
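As an example, here's what explicitly defining evaluation criteria can look like with the GEval metric in our open-source project DeepEval. Treat this as a sketch: the exact signature may differ across versions, and the step wording is just one possible rubric.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Pin the evaluation steps down yourself instead of letting the LLM
# generate them dynamically, removing ambiguity (and stochasticity).
summarization_metric = GEval(
    name="Summarization",
    evaluation_steps=[
        "Check that every claim in the actual output is supported by the input.",
        "Check that the actual output covers the input's main points.",
        "Do not reward longer outputs; penalize padding.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="<the original document>",
    actual_output="<the generated summary>",
)
summarization_metric.measure(test_case)
print(summarization_metric.score, summarization_metric.reason)
```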
Conclusion
At the end of the day, there's no one-size-fits-all solution for LLM-based evaluations, which is why engineers/data scientists are frequently disappointed by non-human evaluation scores. However, by defining specific and concise evaluation steps for different use cases, LLMs are able to navigate ambiguity better, as they are provided more guidance into what a human might expect for different evaluation criteria.
P.S. By now, those of you who read between the lines will probably know that the key to building a better evaluator is tailoring it to specific use cases, and that OpenAI's new Assistants API, along with its code interpreter functionality, is merely the icing on the cake (and a good marketing strategy!).
So, dear anonymous internet stranger, I hope you're satisfied, and till next time.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an "aha!" moment, who knows?