
Using Reinforcement Learning for
Load Testing of Video Games

Rosalia Tufano, SEART @ Software Institute, Università della Svizzera italiana, Switzerland
Simone Scalabrino, STAKE Lab, University of Molise, Italy
Luca Pascarella, SEART @ Software Institute, Università della Svizzera italiana, Switzerland
Emad Aghajani, SEART @ Software Institute, Università della Svizzera italiana, Switzerland
Rocco Oliveto, STAKE Lab, University of Molise, Italy
Gabriele Bavota, SEART @ Software Institute, Università della Svizzera italiana, Switzerland
Abstract.

Different from what happens for most types of software systems, testing video games has largely remained a manual activity performed by human testers. This is mostly due to the continuous and intelligent user interaction video games require. Recently, reinforcement learning (RL) has been exploited to partially automate functional testing. RL enables training smart agents that can even achieve super-human performance in playing games, thus being suitable to explore them looking for bugs. We investigate the possibility of using RL for load testing video games. Indeed, the goal of game testing is not only to identify functional bugs, but also to examine the game’s performance, such as its ability to avoid lags and keep a minimum number of frames per second (FPS) when high-demanding 3D scenes are shown on screen. We define a methodology employing RL to train an agent able to play the game as a human while also trying to identify areas of the game resulting in a drop of FPS. We demonstrate the feasibility of our approach on three games. Two of them are used as proof-of-concept, by injecting artificial performance bugs. The third one is an open-source 3D game that we load test using the trained agent showing its potential to identify areas of the game resulting in lower FPS.

copyright: acmcopyright
journalyear: 2022
conference: 44th International Conference on Software Engineering (ICSE '22), May 21–29, 2022, Pittsburgh, PA, USA
price: 15.00
doi: 10.1145/3510003.3510625
isbn: 978-1-4503-9221-1/22/05

1. Introduction

The video game market is expected to exceed $200 billion in value in 2023 (mar, [n.d.]). In such a competitive market, releasing high-quality games and, consequently, ensuring a great user experience, is fundamental. However, the unique characteristics of video games (from hereon, games) make their quality assurance process extremely challenging. Indeed, besides inheriting the complexity of software systems, games development and maintenance require a diverse set of skills covered by many stakeholders, including graphic designers, story writers, developers, AI (Artificial Intelligence) experts, etc.

Also, games can hardly benefit from testing automation techniques (Pascarella et al., 2018), since even just exploring the total space available in a given game level requires an intelligent interaction with the game itself. For example, in a racing game, identifying a bug that manifests when the finish line is crossed requires a player able to successfully drive the car for the whole track (i.e., requires the ability to drive the car). Thus, random exploration is not a viable option here.

Therefore, it comes without surprise that game testing is largely a manual process. Zheng et al. (Zheng et al., 2019) report that 30 human testers were employed for testing one of the games used in their study. Also, the challenges in testing games have been stressed by Lin et al. (Lin et al., 2016), who showed that 80% of the 50 popular games they studied have been subject to urgent updates.

To support developers with game testing, researchers have proposed several techniques. These include approaches to test the stability of game servers (e.g., by generating high packet loads) (Jung et al., 2005; Bum Hyun Lim et al., 2006; Cho et al., 2010), model-based testing (Iftikhar et al., 2015) using domain modeling for representing the game and UML state machines for behavioral modeling, as well as techniques specifically designed for testing board games (Smith et al., 2009; De Mesentier Silva et al., 2017). When looking at recent techniques aimed at proposing more general testing frameworks, those exploiting Reinforcement Learning (RL) are on the rise. This is due to the remarkable results achieved by RL-based techniques in playing games with super-human performance reported in the literature (Baker et al., 2019; AI, 2019; Hessel et al., 2018; Mnih et al., 2013; Mnih et al., 2015; Vinyals et al., 2017).

RL is a machine learning technique aimed at training smart agents able to interact with a given environment (e.g., a game) and to take decisions to achieve a goal (e.g., win the game). RL is based on the simple idea of trial and error: The agent performs actions in the environment (of which it only has a partial representation) and receives a reward that allows it to assess its past actions/behavior with respect to the desired goal.

Recently, researchers started using RL not only to play games but also to test them and, in general, to improve their quality. The common idea behind these approaches is to reduce the human effort in playtesting (i.e., the process of testing a new game to look for bugs before releasing it to the market) using intelligent agents. RL-based agents have been used to help game designers, for example, in balancing crucial parameters of the game (e.g., power-up item effects, characters abilities) (Zhao et al., 2019; Pfau et al., 2020; Zook et al., 2014) and in testing the game difficulty (Gudmundsson et al., 2018; Stahlke et al., 2020). Also, RL-based agents have been used to look for bugs in games (Pfau et al., 2017; Bergdahl et al., 2020; Zheng et al., 2019; Ariyurek et al., 2021).

While agents are usually trained to play a game with the goal of winning, the aforementioned works trained the agent to not only advance in the game but also to explore it to search for bugs. For example, Ariyurek et al. (Ariyurek et al., 2021) combine RL and Monte Carlo Tree Search (MCTS) to find issues in the behavior of a game, given its design constraints and game scenario graph (provided by the game developer). The ICARUS framework (Pfau et al., 2017) is able to identify crashes and blocker bugs (e.g., the game gets stuck for a certain amount of time) while the agent is playing. Similarly, the approach by Zheng et al. (Zheng et al., 2019), also exploiting RL, can identify bugs spotted by the agent during training (e.g., crashes).

While these approaches pioneered the use of RL for game testing, they are mostly aimed at testing functional (e.g., finding crashes) or design-related (e.g., level design) aspects. However, these are not the only types of bug developers look for in playtesting.

In a recent survey, Politowski et al. (Politowski et al., 2021) reported that for two out of the five games they considered (i.e., League of Legends by Riot and Sea of Thieves by Rare) developers partially automated game performance checks (e.g., frame-rate). Similarly, Naughty Dog used specialized profiling tools (https://youtu.be/yH5MgEbBOps?t=3494) for finding which parts of a given scene caused a drop in the number of frames per second (FPS) in The Last of Us. Truelove et al. (Truelove et al., 2021) report that game developers agree that Implementation response problems (among which, performance-related ones) may severely impact the game experience. Also, Li et al. (Li et al., 2021) observed that players frequently complain about performance issues in game reviews.

Significance of research contribution. Despite such a strong evidence about the importance of detecting performance issues in video games, to the best of our knowledge no previous work introduced automated approaches for load testing video games. We present RELINE (Reinforcement lEarning for Load testINg gamEs), an approach exploiting RL to train agents able to play a given game while trying to load test it with the goal of minimizing its FPS. The agent is trained using a reward function enclosing two objectives: The first aims at teaching the agent how to advance in (and possibly win) the game. The second rewards the agent when it manages to identify areas of the game exhibiting low FPS. The output of RELINE is a report showing to developers the identified areas in the game being negative outliers in terms of FPS, accompanied by videos of the gameplays exhibiting the issue. Such “reports” can simplify the identification and reproduction of performance issues, that are often reported in open-source 3D games (see e.g., (dwa, [n.d.]d; 3dc, [n.d.]; geo, [n.d.]; dwa, [n.d.]b)) and that, in some cases, are challenging to reproduce (see e.g., (dwa, [n.d.]a, [n.d.]e)), even requiring special instructions for their reporting (dwa, [n.d.]c).

We experiment RELINE with three games. The first two are simple 2D games that we use as a proof-of-concept. In particular, we injected in the games artificial “performance bugs” (Delgado-Pérez et al., 2021) to check whether the agent is able to spot them. We show that the agent trained using RELINE can identify the injected bugs more often than (i) a random agent, and (ii) a RL-based agent only trained to play the game. Then, we use RELINE to load test an open-source 3D game (supertuxkart, [n.d.]), showing its ability to identify areas of the game being negative outliers in terms of FPS.

Code and data from our study are publicly available (Tufano, 2021).

Figure 1. RELINE overview

2. RL to Load Test Video Games

In this section we explain, from an abstract perspective, the idea behind RELINE. We describe in the study designs how we instantiated RELINE to the different games we experiment with (e.g., details about the adopted RL models).

RELINE requires three main components: the game to load test, a RL model, representing the agent that must learn how to play the game while load testing it, and a reward function, used to reward the agent so that it can evaluate the worth of its actions for reaching the desired goal (i.e., playing while load testing). The RL model is trained through the 4-step loop depicted in Fig. 1 (see the circled numbers). The continuous lines represent steps performed at each iteration of the loop, while the dashed ones are only performed after a first iteration has been run (i.e., after the agent performed at least one action in the game). When the first episode (i.e., a run of the game) of the training starts (step 1), at each time step $\tau$ the game provides its state $s_\tau$. Such a state can be, for example, a set of frames or a numerical vector representing what is happening in the game (e.g., the agent's position). The RL model takes as input $s_\tau$ (step 2) and provides as output the action $a_\tau$ to perform in the game (step 3). When the agent has no experience in playing the game at the start of the training, the weights of the neural network in the RL model are randomly initialized, producing random actions. The action $a_\tau$ is executed in the game (step 4), which, in turn, generates the subsequent state $s_{\tau+1}$.

After the first iteration (i.e., after having received at least one $a_\tau$), the game also produces, at each iteration, the data needed to compute the reward function. In RELINE we collect (i) the information needed to assess how well the agent is playing the game (e.g., time since the episode started and the episode score), and (ii) the FPS at time $\tau$. It is required that the game developer instruments the game and provides APIs through which RELINE can acquire such pieces of information. We assume that this requires a minor effort.

The reward function aims at training an agent that is able to (i) play the game, thanks to the information indicating how well the agent is playing, and (ii) identify low-FPS areas, thanks to the information about the FPS. The output of the reward function is a number representing the reward obtained by the agent at time $\tau$. In RELINE, the reward function for a given action is composed of two sub-functions: A game reward function, depending on how good the action is in the game ($\mathit{rg}_\tau$), and a performance reward function, depending on how the action impacts the performance ($\mathit{rp}_\tau$).

We combine such functions in $r_\tau = \mathit{rg}_\tau + \mathit{rp}_\tau$. The game reward function clearly depends on the game under test: A function designed for a racing game likely makes no sense for a role-playing game. In general, defining the reward function for learning to play should be performed by considering (i) what the goal of the game is (e.g., drive on a track), and (ii) which information the game provides about the "successful behavior of the player" (e.g., is there a score?). Even if less intuitive, the performance reward function is game-dependent as well: Assuming a tiny FPS drop (e.g., -1%), the reward can be small for a role-playing game, in which it likely does not affect the whole experience, while it should be high for an action game, in which it could even cause the (unfair) player's defeat. Unlike the game reward function, however, we expect only minor changes to be required to adapt the performance reward function to a different video game (i.e., tuning of the thresholds to use).

The state $s_\tau$, the action $a_\tau$, and the reward $r_\tau$ are then stored in an experience buffer. When enough experience has been accumulated, it is used to update the network weights. How experience is stored and used to update the network depends on the used RL model.

The episode ends when a final state is reached. Again, the definition of the final state depends on the game, and it could be based on a timeout (e.g., each episode lasts at most 90 seconds) or on a specific condition that must be met (e.g., the agent crosses the finish line). Once the episode ends, the game is reinitialized and the loop restarts. The training is performed for a number of episodes sufficient to observe a convergence in the total reward achieved by an agent during an episode (e.g., if the trained agent obtains a reward of 100 for ten consecutive episodes the training is stopped).
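To make the loop concrete, the sketch below wires the four steps and the combined reward together in Python. The `game` and `agent` objects, as well as the two reward helpers, are hypothetical placeholders standing in for the game APIs and the RL model described above, not an actual RELINE implementation.

```python
# Minimal sketch of the RELINE training loop (Section 2). The `game` and `agent`
# objects and the two reward helpers are hypothetical placeholders.

def game_reward(info):
    # depends on the game under test (e.g., score obtained in the last step)
    return info["score_delta"]

def performance_reward(fps, fps_threshold=30):
    # rewards the agent when it reaches an area with a low frame rate
    return 50 if fps < fps_threshold else 0

def train(game, agent, episodes):
    for _ in range(episodes):
        state = game.reset()                                  # step 1: a new episode starts
        done = False
        while not done:
            action = agent.select_action(state)               # steps 2-3: state -> action
            next_state, info, fps, done = game.step(action)   # step 4: action executed
            reward = game_reward(info) + performance_reward(fps)   # r = rg + rp
            agent.store_experience(state, action, reward, next_state, done)
            agent.maybe_update()    # update the network when enough experience is stored
            state = next_state
```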

3. Preliminary Study: Injecting Artificial Performance Issues

This preliminary study aims at assessing the ability of RELINE in identifying artificial “performance bugs” (Delgado-Pérez et al., 2021) we simulate in two 2D games. It is important to highlight that the goal of this study is only to demonstrate the applicability of RELINE for load testing games as a proof-of-concept. A case study on a 3D open-source game is presented in Section 4.

Figure 2. Screenshots of the games used in the preliminary study (Section 3): (a) CartPole and (b) MsPacman, and in the case study (Section 4): (c) SuperTuxKart.

3.1. Study Design

We select two 2D games, CartPole (Car, [n.d.]) and MsPacman (Pac, [n.d.]). The former — Fig. 2-(a) — is a dynamic system in which an unbalanced pole is attached to a moving cart, and the player must move the cart to balance the pole and keep it in a vertical position.

The player loses if the pole is more than 12 degrees from vertical or the cart moves too far from the center. The latter — Fig. 2-(b) — is the classic Pac-Man game in which the goal is to eat all dots without touching the ghosts. Both games employ simple 2D graphics which bound the player’s possible moves in only one (e.g., left and right, for CartPole) or two (e.g., left, right, up, and down, for MsPacman) dimensions. This is one of the reasons we selected these games for assessing whether a RL-based agent that learned how to play them can also be trained to look for artificial “performance bugs” we injected. Also, both games are integrated in the popular Gym Python toolkit (Gym, [n.d.]) developed by OpenAI (Brockman et al., 2016).

Gym can be used for developing and comparing RL-based agents in playing games. It acts as a middle layer between the environment (the game) and the agent (a virtual player). In particular, Gym collects and executes actions (e.g., go left, go right) generated by the agent and returns to it the new state of the environment (i.e., screenshots) with additional information such as the score in the episode. Gym comes with a set of integrated arcade games including the two we used in this preliminary study.
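As an illustration of this middle layer, a minimal Gym interaction loop with a random agent is sketched below; the environment id and the exact tuple returned by `step` vary across Gym versions, so the unpacking may need to be adapted.

```python
import gym

# A random agent interacting with one of Gym's integrated environments.
# Environment ids and the tuple returned by step() differ across Gym versions.
env = gym.make("CartPole-v1")

state = env.reset()
done = False
total_score = 0
while not done:
    action = env.action_space.sample()            # pick any legal action at random
    state, reward, done, info = env.step(action)  # newer Gym versions also return `truncated`
    total_score += reward
env.close()
print("episode score:", total_score)
```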

3.1.1. Bug Injection

We injected two artificial “performance bugs” in CartPole and four in MsPacman. The idea behind them is simple: When the agent visits specific areas for the first time during a game, the bugs reveal themselves (simulation of heavy resource loading). A natural way of achieving this goal would have been to introduce the bugs in the source code of the game and to implement the logic to spot FPS drops in the agent accordingly. This, however, would have slowed down the training of the agent. Therefore, we chose to use a more practically sound approach, inspired by the simulation of Heavy-Weight Operation (HWO) operator for performance mutation testing (Delgado-Pérez et al., 2021): We directly assume that the agents observe the bugs when they visit the designated areas and act accordingly.

In CartPole, the agent can only move on the $x$ axis (i.e., left or right). When the game starts, the agent is in position $x=0$ (i.e., center of the axis) and it can change its position towards positive (by moving right) or negative (left) $x$ values. The two bugs we injected manifest when $x \in [-0.50, -0.45]$ and $x \in [0.45, 0.50]$ (dashed lines in Fig. 2-(a)). We use intervals rather than specific values (e.g., -0.45) because the position of the agent is a float: if it moves to position -0.450001, we want to reward it during the training for having found the injected bug. Concerning MsPacman, we assume that a performance bug manifests when the agent enters the four gates indicated by the white arrows in Fig. 2-(b).
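For illustration, the membership test for the two CartPole bugs can be written as a simple interval check on the cart position; the interval bounds are those given above, while the function and variable names are ours.

```python
# Interval check for the two artificial bugs injected in CartPole: a bug "reveals
# itself" only the first time the cart position x enters one of the two intervals.
BUGGY_INTERVALS = [(-0.50, -0.45), (0.45, 0.50)]

def spotted_bug(x, already_found):
    """Return the index of a newly spotted injected bug, or None."""
    for idx, (low, high) in enumerate(BUGGY_INTERVALS):
        if low <= x <= high and idx not in already_found:
            already_found.add(idx)
            return idx
    return None
```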

As detailed in Section 3.1.4, we assess the extent to which RELINE is able to identify the bugs we injected while playing the games. To have a baseline, we compare its results with those of a RL-based agent only trained to play each of the two games (from hereon, rl-baseline), and with a random agent. Since RELINE will be trained with the goal of identifying the bugs (details follow), we expect it to adapt its behavior to not only successfully play the game, but to also exercise more often the “buggy” areas of the games.

3.1.2. Learning to Play: RL Models and Game Reward Functions

We trained the rl-baseline agent (i.e., the one only trained to learn how to play) for CartPole using the cross-entropy method (Rubinstein and Kroese, 2004) as RL model. We choose this method because, despite its simplicity, it has been shown to be effective in applications of RL to small environments such as CartPole (Lapan, 2018).

The core of the cross-entropy method is a feedforward neural network (FNN) that takes as input the state of the game and provides as output the action to perform. The state of the game for CartPole is a vector of dimension 4 containing information about the $x$ coordinate of the pole's center of mass, the pole's speed, its angle with respect to the platform, and its angular speed. There are two possible actions: go right, go left. Once initialized with random weights, the agent (i.e., the FNN) starts playing while retaining the experience acquired in each episode: The experience is represented by the state, the action, and the reward obtained during each step of the episode.

The goal is to keep the pole in balance as long as possible or until the maximum length of an episode (that we set to 1,000 steps) is reached. The game reward function is defined so that the agent receives a +1 reward for each step it manages to keep the pole balanced. The total score achieved is also saved at the end of each episode. After $n=16$ consecutive episodes the agent stops playing, selects the $m=11$ (70%) episodes having the highest score, and uses the experience in those episodes to update the weights of the FNN ($n$ and $m$ have been set according to (Lapan, 2018)).
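A minimal sketch of one iteration of this procedure is shown below, assuming a hypothetical `play_episode` helper that plays one episode with the current network and returns its total score together with the observed (state, action) pairs.

```python
import numpy as np
import torch
import torch.nn as nn

# One training iteration of the cross-entropy method for CartPole (a sketch).
net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))  # 4 state values, 2 actions
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_iteration(play_episode, n=16, keep_ratio=0.7):
    episodes = [play_episode(net) for _ in range(n)]          # (total_score, [(state, action), ...])
    episodes.sort(key=lambda ep: ep[0], reverse=True)
    elite = episodes[: int(n * keep_ratio)]                   # the m = 11 best-scoring episodes
    states = torch.tensor(np.array([s for _, steps in elite for s, _ in steps]),
                          dtype=torch.float32)
    actions = torch.tensor([a for _, steps in elite for _, a in steps])
    optimizer.zero_grad()
    loss = loss_fn(net(states), actions)                      # imitate the elite episodes
    loss.backward()
    optimizer.step()
```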

Instead, we trained the rl-baseline agent for MsPacman using a Deep Q Network (DQN) (Mnih et al., 2013). In our context, a DQN is a Convolutional Neural Network (CNN) that takes as input a set of contiguous screenshots of the game (in our case 4, as done in previous works (Mnih et al., 2013; Mnih et al., 2015)) representing the state of the game and returns, for each possible action defined in the game (five in this case: go up, go right, go down, go left, do nothing), a value indicating the expected reward for the action given the current state (Q value). The multiple screenshots are needed to provide more information to the model about what is happening in the game (e.g., in which direction the agent is moving). The goal of the DQN is the same as the FNN: selecting the best action to perform to maximize the reward given the current state. Differently from the previous model, the DQN is updated not on entire episodes but by randomly batching "experience instances" among 10k steps saved during the most recent episodes. An "experience instance" is saved after each step $\tau$, and is represented by the quadruple ($s_{\tau-1}$, $a_\tau$, $s_\tau$, $r_\tau$), where $s_{\tau-1}$ is the input state, $a_\tau$ is the action selected by the agent, $s_\tau$ is the resulting state obtained by running $a_\tau$ in $s_{\tau-1}$, and $r_\tau$ is the received reward.

The CNN is initialized with random weights, and the agent starts playing while retaining the experience of each step. When enough experience instances have been collected (10k in our implementation (Lapan, 2018)), the CNN starts updating at each step selecting a random batch of experience instances. The reward function for MsPacman provides a +1 reward every time the agent eats one of the dots and a 0 reward otherwise.
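The experience replay described above can be sketched as follows; the `done` flag is a common addition not mentioned in the text, and the default batch size is illustrative.

```python
import random
from collections import deque

# Sketch of the experience replay used by the DQN: quadruples (s_{t-1}, a_t, s_t, r_t)
# are stored in a bounded buffer and random batches are drawn to update the network.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, prev_state, action, next_state, reward, done):
        self.buffer.append((prev_state, action, next_state, reward, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```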

3.1.3. Instantiating RELINE: Performance Reward Functions

To train RELINE to play while looking for the injected bugs, we use a simple performance reward function: In both the games, we give a reward of +50 every time the agent, during an episode, spots one of the injected artificial bugs. As previously mentioned, the bugs reveal themselves only the first time the agent visits each buggy position; this means that the performance-based reward is given at most twice for CartPole and four times for MsPacman.
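Building on the interval check sketched in Section 3.1.1, the per-step reward used in this preliminary study could look as follows; `spotted_bug` and the bookkeeping set are the hypothetical helpers introduced there.

```python
# Per-step reward when training RELINE in the preliminary study: the game reward
# of Section 3.1.2 plus a +50 bonus the first time an injected bug area is visited
# (at most 2 bonuses per episode in CartPole, 4 in MsPacman).
def reline_step_reward(game_reward, agent_position, found_bugs):
    bonus = 50 if spotted_bug(agent_position, found_bugs) is not None else 0
    return game_reward + bonus
```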

3.1.4. Data Collection and Analysis

We compare RELINE against the two previously mentioned baselines: rl-baseline and the random agent. Both RELINE and rl-baseline have been trained for 3,200 episodes on CartPole and 1,000 on MsPacman. The different numbers are due to differences in the games and in the RL model we exploited. In both cases, we used a number of episodes sufficient for rl-baseline to learn how to play (i.e., we observed a convergence in the score achieved by the agent in the episodes).

Once trained, the agents have been run on both games for additional 1,000 episodes, storing the performance bugs they managed to identify in each episode. Since different trainings could result in models playing the game following different strategies, we repeated this process ten times. This means that we trained 10 different models for both RELINE and rl-baseline and, then, we used each of the 10 models to play additional 1,000 episodes collecting the spotted performance bugs. Similarly, we executed the random agent 10 times for 1,000 episodes each. In this case, no training was needed.

We report descriptive statistics (mean, median, and standard deviation) of the number of performance bugs identified in the 1,000 played episodes by the three approaches. A high number of episodes in which an approach can spot the injected bugs indicates its ability to look for performance bugs while playing the game.

Table 1. Number of episodes (out of 1,000) in which RELINE, rl-baseline, and the random agent identify the injected bugs.

Game      #Injected  #Bugs      RELINE                  rl-baseline             random agent
          Bugs       found      mean  median  stdev     mean  median  stdev     mean  median  stdev
CartPole  2          1          965   984     47        715   706     107       12    11      4
                     2          102   47      177       5     1       7         0     0       0
MsPacman  4          1          971   989     59        700   680     228       24    23      5
                     2          966   985     63        356   343     169       17    16      3
                     3          914   941     87        114   80      90        1     1       1
                     4          879   907     106       25    23      17        1     1       1

3.2. Preliminary Study Results

Table 1 shows for each of the two games (CartPole and MsPacman) the number $k$ of artificial bugs we injected and, for each of the three techniques (i.e., RELINE, rl-baseline, and the random agent), descriptive statistics of the number of episodes (out of 1,000) in which they managed to identify at least $n$ of the injected bugs, with $n$ going from 1 to $k$ at steps of 1.

For both games, it is easy to see that the random agent is rarely able to identify the bugs. Indeed, this agent plays without any strategy and is able to identify bugs only by chance in a few episodes out of the 1,000 it plays. This is also due to the fact that the random agent quickly loses the played episodes due to its inability to play the game. This confirms that such approaches are not suitable for testing video games.

Concerning CartPole, both RELINE and rl-baseline are able to spot at least one of the two bugs in several of the 1,000 episodes. The median is 984 for RELINE and 706 for rl-baseline. The success of rl-baseline is easily explained by the characteristics of CartPole: Considering where we injected the bugs (see Fig. 2-(a)), by playing the game it is likely to discover at least one bug (e.g., if the player tends to move towards left, the bug on the left will be found). What is instead unlikely to happen by chance is finding both bugs within the same episode. We found that it is quite challenging, even for a human player, to move the cart first towards one side (e.g., left) and, then, towards the other side (right) without losing due to the pole moving more than 12 degrees from vertical. As can be seen in Table 1, RELINE succeeds in this, on average, in 102 episodes out of 1,000 (median 47), as compared to the 5 (median 1) of rl-baseline. This indicates that RELINE is pushed by the reward function to explore the game looking for the injected bugs, even if this makes playing the game more challenging. Similar results have been achieved on MsPacman.

In this case, the DQN is effective in allowing RELINE to play while exercising the points in the game in which we injected the bugs. Indeed, on average, RELINE was able to spot all four injected bugs in 879 out of the 1,000 played episodes (median=907), while rl-baseline could achieve such a result only in 25 episodes.

Summary of the Preliminary Study. RELINE allows obtaining agents able not only to effectively play a game but also to spot performance issues. Compared to rl-baseline, the main advantage of RELINE is that it identifies bugs more frequently while playing.

4. Case Study: Load Testing an Open Source Game

We run a case study to experiment the capability of RELINE in load testing an open-source 3D game. Differently from our preliminary study (Section 3), we do not inject artificial bugs. Instead, we aim at finding parts of the game resulting in FPS drops.

4.1. Study Design

For this study, we use a 3D kart racing game named SuperTuxKart (supertuxkart, [n.d.]), see Fig. 2-(c). This game has been selected for the following reasons. First, we wanted a 3D game in which, as compared to a 2D game, FPS drops are more likely because of the more complex rendering procedures. Second, SuperTuxKart is a popular open-source project that counts, at the time of writing, over 3k stars on GitHub. Third, an open-source wrapper is available that simplifies the implementation of agents for SuperTuxKart (PyS, [n.d.]).

The existence of a wrapper like the one we used is crucial since it allows, for example, advancing in the game frame by frame (thus simplifying the generation of the inputs to the RL model), executing actions (e.g., throttle or brake), and acquiring game internals (e.g., kart centering, distance to the finish line). Also, using this wrapper, it is possible to compute the time needed by the game to render each frame and, consequently, calculate the FPS. Finally, the wrapper allows simplifying the graphics (e.g., removing particle effects, like rain, that could make the training more challenging).

4.1.1. Learning to Play: RL Models and Game Reward Functions

The training of the rl-baseline agent has been performed using the DQN model previously applied in MsPacman.

We use the previously mentioned PySuperTuxKart (PyS, [n.d.]) to make the agent interact with the game. For the sake of speeding up the training, the screenshots extracted from the game have been resized to 200x150 pixels and converted to grayscale before they are provided as input to the model. Moreover, as previously done for MsPacman, multiple (four) screenshots are fed to the model at each step. Thus, the representation of the state of the game provided to the model is a $4 \times 200 \times 150$ tensor. The details of the model and its implementation are available in our replication package (Tufano, 2021).
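A possible preprocessing of the raw screenshots into such a tensor is sketched below; the use of PIL and the normalization step are our assumptions, and whether 200x150 refers to width x height or the reverse is an implementation detail.

```python
import numpy as np
from PIL import Image

def preprocess(frame):
    # Convert a raw screenshot (H x W x 3 array) to a grayscale 200x150 image;
    # the normalization to [0, 1] is our addition.
    img = Image.fromarray(frame).convert("L").resize((200, 150))
    return np.asarray(img, dtype=np.float32) / 255.0

def stack_state(last_four_frames):
    # Stack the four most recent frames into the tensor fed to the DQN.
    return np.stack([preprocess(f) for f in last_four_frames], axis=0)
```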

A critical part of the learning process is the definition of the game reward function. Being SuperTuxKart a racing game, an option could have been to penalize the agent for each additional step required to finish the game. Consequently, to maximize the final score, the agent would have been pushed to reduce the number of steps and, therefore, to drive as fast as possible towards the finish line. However, considering the non-trivial size of the game space, such a reward function would have required a long training time. Thus, we took advantage of the information that can be extracted from the game to help the agent in the learning process.

SuperTuxKart provides two coordinates indicating where the agent is in the game: path_done and centering.

The former indicates the path traversed by the agent from the starting line of the track, while the latter represents the distance of the agent from the center of the track. In particular, centering equals 0 if the agent is at the center of the track, and it moves away from zero as the agent moves to either side: going towards right results in positive values of the centering value, going left in negative values. We indicate these coordinates with $x$ (centering) and $y$ (path_done), and we define $\delta_y$ as the path traversed by the agent in a specific step: Given $y_i$ the value for path_done at step $i$, we compute $\delta_y$ as $y_i - y_{i-1}$. Basically, $\delta_y$ measures how fast the agent is advancing towards the finish line.

Given $x$ and $\delta_y$ for a given step $i$, we compute the reward function as follows:

$$\mathit{rg}_i = \begin{cases} -1 & \text{if } |x| > \theta \\ \max(\min(\delta_y, M), 0) & \text{otherwise} \end{cases}$$

First, if the agent goes too far from the center of the track ($|x| > \theta$), we penalize it with a negative reward. Otherwise, if the agent is close to the center ($|x| \leq \theta$), we can have two scenarios: (i) if it is not moving towards the finish line ($\delta_y \leq 0$), we do not give any reward (the minimum reward is 0); (ii) if it is moving in the right direction ($\delta_y > 0$), we give a reward proportional to the speed at which it is advancing ($\delta_y$), up to a maximum of $M$.

In our experimental setup, we set $\theta = 20$ because it roughly represents the double of $|x|$ when the agent approaches the sides of the road in the level, and $M = 10$ as it is the same maximum reward also given by the performance reward function, as we explain below. Finally, we reward the agent when it crosses the finish line with an additional +1,000 bonus.
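A direct translation of this game reward function into Python, with the thresholds above, could look as follows (the variable names are ours).

```python
THETA = 20   # maximum accepted distance from the track center
M = 10       # cap on the per-step progress reward

# Game reward rg_i for SuperTuxKart, with x = centering and
# delta_y = path_done_i - path_done_{i-1} (progress towards the finish line).
def game_reward(x, delta_y):
    if abs(x) > THETA:
        return -1                        # the kart went too far from the center of the track
    return max(min(delta_y, M), 0)       # reward progress, never negative, capped at M
```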

4.1.2. Instantiating RELINE: Performance Reward Function

To define the performance reward function of RELINE for SuperTuxKart, the first step to perform is to define a way to reliably capture the FPS of the game during the training. In this way, we can reward the agent when it manages to identify low-FPS points. As previously said, we use PySuperTuxKart to interact with the game and such a framework keeps the game frozen while the other instructions of RELINE (e.g., the identification of the action to execute) are run. Since the framework runs the game in the same process in which we run RELINE and since we do not use threads, we can safely use a simple method for computing the time needed to render the four frames: We get the system time before ($T_{\mathit{before}}$) and after ($T_{\mathit{after}}$) we trigger the rendering of the frames and we compute the time needed at step $i$ as $\mathit{rT}_i = T_{\mathit{after}} - T_{\mathit{before}}$. Such a value is negatively correlated with the FPS (higher rendering time means lower FPS).

The performance reward function we use is the following:

$$\mathit{rp}_i = \begin{cases} 10 & \text{if } |x| \leq \theta \land \mathit{rT}_i > t \\ 0 & \text{otherwise} \end{cases}$$

We give a performance-based reward of 10 when the agent takes more than $t$ milliseconds to render the frames at a given point (causing an FPS drop). We explain the tuning of $t$ later. We do not give such a reward when $|x| > \theta$ (the kart is far from the center) since we want the agent to spot issues in positions that are likely to be explored by real players (i.e., reasonably close to the track).

Finally, in RELINE we do not give a fixed +1,000 bonus reward when the agent crosses the finish line but we assign a bonus computed as $10 \times \sum_{i=1}^{\mathit{steps}} \mathit{rp}_i$, i.e., proportional to the total performance-based reward accumulated by the agent in the episode. This is done to push the agent to visit more low-FPS points during an episode.
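The sketch below illustrates how the rendering time $\mathit{rT}_i$ and the performance reward could be computed; `render_four_frames` is a hypothetical stand-in for the PySuperTuxKart calls that advance the game, and the wall-clock measurement relies on the synchronous execution described above.

```python
import time

T_LOW_FPS_MS = 18.36   # threshold t in milliseconds, tuned as described in Section 4.1.3

def timed_render(render_four_frames):
    # The game is advanced synchronously, so wall-clock time around the rendering of
    # the four frames approximates rT_i (in milliseconds).
    before = time.perf_counter()
    render_four_frames()
    return (time.perf_counter() - before) * 1000.0

def performance_reward(x, rT_i, theta=20, t=T_LOW_FPS_MS):
    # +10 only when the kart is reasonably close to the track center and the
    # rendering time exceeds the low-FPS threshold.
    return 10 if abs(x) <= theta and rT_i > t else 0
```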

4.1.3. Data Collection and Analysis

As done in our preliminary study, we compare RELINE with rl-baseline (i.e., the agent only trained to play the game) and with a random agent.

Training rl-baseline and RELINE. While we used different reward functions for the two RL agents, we applied the same training process for both of them. We trained each model for 2,300 episodes, with one episode having a maximum duration of 90 seconds or ending when the agent crosses the finish line of the racing track (the agent is required to perform a single lap). We set the 90 seconds limit since we observed that, by manually playing the game, ~70 seconds are sufficient to complete a lap. The 2,300 episodes threshold has been defined by computing the average reward obtained by the two agents every 100 episodes and by observing when a plateau was reached by both agents. We found 2,300 episodes to be a good compromise for both agents (graphs plotting the reward function are available in the replication package (Tufano, 2021)).

The trained rl-baseline agent has been used to define the threshold $t$ needed for the RELINE's reward function (i.e., for identifying when the agent found a low-FPS point and should be rewarded).

In particular, once trained, we run rl-baseline for 300 episodes, storing the time needed by the game to render the subsequent four frames after every action recommended by the model (since we wanted to measure the frames rendering time in a standard scenario in which the agent was driving the kart, we stopped an episode if the agent got stuck against some obstacle). This resulted in a total of 48,825 data points $s_{\mathit{FPS}}$, representing the standard FPS of the game in a scenario in which the player is only focused on completing the race as fast as possible.

Starting from the 48,825 $s_{\mathit{FPS}}$ data points collected in the 300 episodes played by the trained rl-baseline agent, we apply the five-$\sigma$ rule (Grafarend, 2006) to compute a threshold able to identify outliers. The five-$\sigma$ rule states that in a normal distribution (such as $s_{\mathit{FPS}}$) 99.99% of observed data points lie within five standard deviations from the mean. Thus, anything above this value can be considered as an outlier in terms of milliseconds needed to render the frames. For this reason, we compute $t_b = \mathit{mean}(s_{\mathit{FPS}}) + 5 \times \mathit{sd}(s_{\mathit{FPS}})$ as a candidate base threshold to identify low-FPS points. However, $t_b$ cannot be directly used as the $t$ value of our reward function. Indeed, we observed that the time needed for rendering frames during the RELINE's training is slightly higher as compared to the time needed when the trained rl-baseline agent is used to play the game. This is due to the fact that the load on the server (and in particular on the GPU) is higher during training. To overcome this issue, we perform the following steps.

At the beginning of the training, we run 100 warmup episodes in which we collect the time needed to render the four frames after each action performed by the agent. Then, we compute the first ($Q^{\mathit{tr}}_1$) and the third ($Q^{\mathit{tr}}_3$) quartile of the obtained distribution and compare them to the $Q_1$ and $Q_3$ of the distribution obtained in the 300 episodes used to define $t_b$ (i.e., those played by the trained rl-baseline agent). During the warmup episodes, the agent selects the action to perform almost randomly (it still has to learn): Therefore, it would not be able to explore a substantial area of the game (i.e., of the racing track), thus not providing a distribution of timings comparable with the ones obtained by the trained rl-baseline agent that played the 300 episodes. For this reason, during the 100 warmup episodes of the training, the action to perform is not chosen by the agent currently under training, but by the trained rl-baseline agent (i.e., the same used in the 300 episodes). This does not in any way impact the load on the server, which remains the one we have during the training of RELINE, since the only change is that we ask for the action to perform to the rl-baseline agent rather than to the one under training. However, the whole training procedure (e.g., capturing the frames and updating the network) stays the same.

We compute the additional "cost" brought by the training in rendering the frames during the game using the formula $\delta = \max(Q^{\mathit{tr}}_1 - Q_1, Q^{\mathit{tr}}_3 - Q_3)$. We use the first and third quartiles since they represent the boundaries of the central part of the distribution, i.e., they should be quite representative of the values in it. We took as $\delta$ the maximum of the two differences to be more conservative in assigning rewards when the agent identifies low-FPS points. The final value $t$ we use in our reward function when training RELINE to load test SuperTuxKart is defined as $t = t_b + \delta = 18.36$ (we thus identify as low-FPS points the ones in which the FPS is lower than 218; such a number is still very high, more than enough for any human player, in practice, but note that we run the game using high-performance hardware and, most importantly, with the lowest graphic settings, so the equivalent in normal conditions would be much lower).

Thus, if RELINE is able, during the training, to identify a point in the game requiring more than $t$ milliseconds to render four frames, then it receives a reward as explained in Section 4.1.2.
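Putting the previous paragraphs together, the threshold could be derived as in the following sketch, where `baseline_ms` are the rendering times collected with the trained rl-baseline agent and `warmup_ms` those collected during the 100 warmup episodes (both hypothetical variable names).

```python
import numpy as np

def low_fps_threshold(baseline_ms, warmup_ms):
    # Five-sigma outlier bound on the rendering times collected with rl-baseline...
    t_b = np.mean(baseline_ms) + 5 * np.std(baseline_ms)
    # ...shifted by the extra rendering cost observed during the warmup episodes
    # of the RELINE training (compared quartile by quartile).
    q1, q3 = np.percentile(baseline_ms, [25, 75])
    q1_tr, q3_tr = np.percentile(warmup_ms, [25, 75])
    delta = max(q1_tr - q1, q3_tr - q3)
    return t_b + delta        # t = t_b + delta (18.36 ms in our setup)
```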

The training of rl-baseline took ~3 hours, while RELINE requires substantially more time due to the fact that, after each step performed by the agent, we collect and store information about the time needed to render the frames (this is done millions of times). This pushed the training of RELINE up to ~30 hours.

Reliability of Time Measurements. It is important to clarify that the FPS of the game can be impacted by the hardware specifications and the current load of the machine running it. In other words, running the same game on two different machines or on the same machine in two different moments can result in variations of the FPS. For this reason, all the experiments have been performed on the same server, equipped with 2 x 64 Core AMD 2.25GHz CPUs, 512GB DDR4 3200MHz RAM, and an nVidia Tesla V100S 32GB GPU. Also, the process running the training of the agents or the collection of the 48,825 sFPSsubscript𝑠𝐹𝑃𝑆s_{FPS} with the trained rl-baseline agent was the only process running on the machine besides those handled by the operating system (Ubuntu 20.04). On top of that, the process was always run using the chrt --rr 1 option, that in Linux maximizes the priority of the process, reducing the likelihood of interruptions.

Despite these precautions, it is still possible to observe variations in the FPS that are not due to issues in the game, but to external factors (e.g., changes in the load of the machine). To verify the reliability of the collected FPS data, we ran a constant agent always performing the same actions in the game for 300 episodes. The set of actions has been extracted from one of the episodes played by the rl-baseline agent in which it successfully concluded the race. Then, we plotted the time needed by the game to render the four frames following each action made by the agent. Since we are playing exactly the same episode 300 times, we expect to observe the same FPS trend in each game. If this is the case, it means that the way we are measuring the FPS is reliable enough to reward the agent when low-FPS points are identified.

Figure 3. Rendering times for 300 episodes (same actions).

Fig. 3 shows the achieved results: The $y$-axis represents the milliseconds needed to render four frames in response to an agent's action ($x$-axis) performed in a specific part of the game. While, as expected, small variations are possible, the overall trend is quite stable: Points of the game requiring a longer time to render frames consistently show up across the 300 episodes, resulting in a clear trend. We also computed Spearman's correlation (Spearman, 1904) pairwise across the 300 distributions, adjusting the obtained $p$-values using Holm's correction (Holm, 1979).

We found all correlations to be statistically significant (adjusted $p$-values < 0.05) with a minimum $\rho$ = 0.77 (strong correlation) and a median $\rho$ = 0.91 (very strong correlation). This confirms the common FPS trends across the 300 episodes.
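
Such a check can be reproduced along the following lines, assuming the 300 per-episode series of rendering times are available as equal-length lists; the helper below relies on scipy for Spearman's correlation and on statsmodels for Holm's correction.

```python
from itertools import combinations
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def pairwise_spearman(episodes):
    """episodes: 300 equal-length lists of rendering times (ms per action)."""
    rhos, pvals = [], []
    for a, b in combinations(episodes, 2):   # all pairs of episodes
        rho, p = spearmanr(a, b)
        rhos.append(rho)
        pvals.append(p)
    # Adjust the p-values for multiple comparisons with Holm's correction.
    _, adj_pvals, _, _ = multipletests(pvals, method="holm")
    return rhos, adj_pvals
```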

Running the Three Techniques to Spot Low-FPS Areas. After the 2,300 training episodes, we assume that both RL-based agents learned how to play the game, and that RELINE also learned how to spot low-FPS points. Then, as also done in our preliminary study, we train both agents for an additional 1,000 episodes, storing the time needed to render the frames at every single point they explored during each episode (where a point is represented by its coordinates, i.e., centering = $x$ and path_done = $y$). We do the same with the random agent.

Data Analysis. The output of each of the three agents is a list of points with the milliseconds each of them required to render the subsequent frames. Since each agent played 1,000 episodes, it is possible that the same point is covered several times by an agent, with slightly different FPS observed (as previously explained, small variations in FPS are possible and expected across different episodes). We classify as low-FPS points those that required more than $t$ milliseconds to render the four subsequent frames in more than 50% of the times they have been covered by an agent.

This means that, if across the 1,000 episodes a point $p$ is exercised 100 times by an agent, the threshold $t$ must be exceeded at least 51 times to consider $p$ as a low-FPS point. In practice, a developer using RELINE for identifying low-FPS points could use a higher threshold to increase the reliability of the findings. However, for the sake of this empirical study, we decided to be conservative.
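
Assuming the data collected for an agent is available as a mapping from each point (centering, path_done) to the list of rendering times observed whenever that point was exercised, the majority-based classification described above could be implemented as in the sketch below; the function also returns the confidence of each flagged point, i.e., the fraction of times it exceeded the threshold.

```python
def low_fps_points(times_per_point, t=18.36, min_ratio=0.5):
    """Flag the points whose rendering time exceeds t in more than
    min_ratio of the times they were exercised by the agent.

    times_per_point: dict mapping (centering, path_done) to the list of
    milliseconds needed to render the four frames at each visit.
    """
    flagged = {}
    for point, times in times_per_point.items():
        over = sum(1 for ms in times if ms > t)
        confidence = over / len(times)
        if confidence > min_ratio:          # e.g., at least 51 out of 100
            flagged[point] = confidence
    return flagged
```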

Figure 4. Results of the study: (a) reports the distributions of timings for the low-FPS points with summary statistics, while (b) and (c) depict the path_done and centering coordinates at which such points were observed, respectively.

Then, we compare the characteristics of the low-FPS points identified by the three approaches. Specifically, we analyze: (i) how many different low-FPS points each approach identified; (ii) the number of times each low-FPS point has been exercised by each agent in the 1,000 episodes; and (iii) the confidence of the identified points (i.e., the percentage of times an exercised point resulted in low FPS). Given the low-FPS points identified by each agent, we draw violin plots showing the distribution of timings needed to render the frames when the agent exercised them (the higher the timings, the lower the FPS). We compare these distributions using the Mann-Whitney test (Conover, 1998), adjusting the $p$-values with Holm's correction (Holm, 1979). We also estimate the magnitude of the differences using Cliff's Delta ($d$), a non-parametric effect size measure for ordinal data (Grissom and Kim, 2005). We follow well-established guidelines to interpret the effect size: negligible for $|d| < 0.10$, small for $0.10 \leq |d| < 0.33$, medium for $0.33 \leq |d| < 0.474$, and large for $|d| \geq 0.474$ (Grissom and Kim, 2005).
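
As an illustration of how this comparison can be carried out, the sketch below uses scipy for the two-sided Mann-Whitney test and computes Cliff's Delta directly (scipy does not provide it), labeling the magnitude with the thresholds reported above; the resulting $p$-values would then be adjusted with Holm's correction as in the earlier sketch.

```python
from scipy.stats import mannwhitneyu

def cliffs_delta(x, y):
    """Cliff's Delta: P(X > Y) - P(X < Y) over all pairs (naive O(n*m))."""
    gt = sum(1 for a in x for b in y if a > b)
    lt = sum(1 for a in x for b in y if a < b)
    return (gt - lt) / (len(x) * len(y))

def magnitude(d):
    d = abs(d)
    if d < 0.10:
        return "Negligible"
    if d < 0.33:
        return "Small"
    if d < 0.474:
        return "Medium"
    return "Large"

def compare_distributions(times_a, times_b):
    """Mann-Whitney test plus Cliff's Delta between two timing distributions."""
    _, p = mannwhitneyu(times_a, times_b, alternative="two-sided")
    d = cliffs_delta(times_a, times_b)
    return p, d, magnitude(d)
```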

4.2. Study Results

Fig. 4 summarizes the main findings of our case study. Fig. 4-(a) shows the distribution of the time needed to render the game frames (i.e., our proxy for FPS) for four groups of points. The first violin plot on the left (i.e., Regular FPS) shows the timing for points that have never resulted in a drop of FPS in any of the 3,000 episodes played by the three agents (1,000 each). These serve as a baseline to better interpret the low-FPS points exercised by the agents. The other three violin plots show the distributions of timing for the low-FPS points identified by RELINE (blue), rl-baseline (green), and the random agent (red).

Below each violin plot we report the number of low-FPS points identified by each agent and descriptive statistics (average, median, min, max) of the confidence for the low-FPS points. A 100% confidence means that every time a low-FPS point has been exercised in the 1,000 episodes played by the agent, it required more than $t = 18.36$ milliseconds to render the subsequent frames. The $t$ threshold is represented by the red horizontal line. On average, RELINE exercised each low-FPS point 89 times in the 1,000 episodes, against the 210 of rl-baseline and the 829 of the random agent (the same point can be exercised multiple times in an episode).

RELINE identified 173 low-FPS points, as compared to the 33 of rl-baseline and the 90 of the random agent. The confidence is similar for RELINE (median = 99%) and rl-baseline (median = 94%), while it is lower for the random agent (median = 76%). Thus, the low-FPS points identified by the two RL-based agents are, overall, quite reliable. Concerning the number of low-FPS points identified, RELINE identifies more points than rl-baseline (173 vs 33). This is expected, since it has the explicit goal of load testing the game. However, what could be surprising at first sight is the high number of low-FPS points identified by the random agent (90). Fig. 4-(b) and Fig. 4-(c) help in interpreting this finding.

Fig. 4-(b) plots the path_done ($y$ coordinate) of each low-FPS point identified by each agent, using the same color scheme as the violin plots (e.g., blue corresponds to RELINE).

If multiple points fall at the same coordinate (i.e., same path_done but different centering), they are shown with a red border. The scale of path_done has been normalized between 0 and 100, where 0 corresponds to the starting line of the track and 100 to its finish line. Similarly, Fig. 4-(c) plots the centering ($x$ coordinate) of the low-FPS points. The line at 0 represents the center of the track, while the continuous lines at positions ∼-18 and ∼18 depict the limits of the track. Finally, the dashed lines represent the area of the game we asked RELINE to explore: based on our reward function, we penalize the agent for going outside the [-20, +20] range that, normalized, corresponds to ∼[-36, +36]. rl-baseline is also penalized outside of this area.

As expected, the random agent is not able to advance in the game: The low-FPS points it identifies are all placed near the starting line (red dots in Fig. 4-(b)). This indicates that a random agent can be used to exercise a specific part of a game, but it is not able to explore the game as a player would do. This is also confirmed by the red dots in Fig. 4-(c), with the random agent exploring areas of the game far from the track that a human player is unlikely to explore. It is also worth noting that in SuperTuxKart each episode lasts (based on our setting) 90 seconds if the agent does not cross the finish line. However, as shown in our preliminary study, in other games such as MsPacman a random agent could quickly lose an episode without having the chance to explore the game at all.

The low-FPS points identified by RELINE (blue dots) and by rl-baseline (green) are instead closer to the track and, as for RELINE, they are within or very close to the area of the game we ask it to explore (see the dashed lines in Fig. 4-(c)). Thus, by customizing the reward function, it is possible to define the area of the game relevant for load testing.

Looking at Fig. 4-(b), we can see that RELINE is also able to identify low-FPS points in different areas of the game, although with a concentration close to the beginning and the end of the game. It is difficult to explain the reason for such a result, but we hypothesize two possible explanations.

First, it is possible that the “central” part of the game simply features fewer low-FPS areas. This would also be confirmed by the fact that rl-baseline only found one low-FPS point in that part of the game. Also, the training and the reward function could have driven RELINE to explore the start and the end of the game more. The starting part is certainly the most explored since, at the beginning of the training, the agent is basically a random agent. Thus, it mostly collects experience about low-FPS points found at the beginning of the game since, similarly to the random agent, it is not able to advance in the game. It is important to remember that the data in Fig. 4 only refers to the 1,000 games played by RELINE after the 2,300 training games, so the random exploration done at the beginning of the training is not included in Fig. 4. However, once the agent learns several low-FPS points at the start of the game, it can exercise them again and again to get a higher reward.

Concerning the end of the game, we set a maximum duration of 90 seconds for each game, but we know that a well-trained agent can complete the lap in ∼70 seconds. It is possible that the agent used the remaining time to better explore the last part of the game before crossing the finish line, thus finding a higher number of low-FPS points in that area. Additional training runs, possibly with a different reward function, are needed to better explain this finding.

Concerning the violin plots in Fig. 4-(a), we can see that RELINE and rl-baseline exhibit a similar distribution, with RELINE being able to identify some stronger low-FPS points (i.e., longer times to render frames). All distributions have, as expected, the median above the $t$ threshold, with RELINE's being the highest (24.54, vs 21.69 for rl-baseline and 19.39 for the random agent). The highest value of the distributions is 65.92 (60.7 FPS) for RELINE, against 44.81 (89.3 FPS) for rl-baseline and 50.73 (78.8 FPS) for the random agent. Remember that all these values represent the milliseconds needed to render the four frames following an action performed by the agents.

Table 2. Results of Mann-Whitney test (adjusted p𝑝p-value) and Cliff’s Delta (d𝑑d) when comparing the distributions of rendering times — boldface indicates higher times.
Test                          adj. p-value    Cliff's Delta (d)
RELINE vs rl-baseline         <0.001          0.34 (Medium)
RELINE vs random agent        <0.001          0.36 (Medium)
rl-baseline vs random agent   <0.001          0.16 (Small)

Table 2 shows the results of the statistical comparisons among the three distributions. In each test, the approach reported in boldface is the one identifying stronger low-FPS points (i.e., more extreme points requiring longer rendering times for their frames). The adjusted $p$-values report a significant difference ($p$-value < 0.001) in favor of RELINE against both rl-baseline and the random agent (in both cases, with a medium effect size). Thus, the low-FPS points identified by RELINE tend to require longer times to render frames. Fig. 2-(c) shows an example of a low-FPS point identified by RELINE: Crashing against the sheep results in a drop of FPS.

Finally, it is worth commenting on the overlap of the low-FPS points identified by the three agents. RELINE and rl-baseline found 14 low-FPS points in common (i.e., same $x$ and $y$ coordinates), while the overlap is of 11 points between RELINE and the random agent, and of 10 between rl-baseline and the random agent. The most interesting finding of this analysis is that rl-baseline was able to identify only 19 points missed by RELINE, while the latter found 159 points missed by rl-baseline. This supports the role played by the reward function in pushing RELINE to look for low-FPS points.

Summary of the Case Study. RELINE is the best approach for finding low-FPS points in SuperTuxKart. A random agent is not able to spot issues that require playing skills, and rl-baseline only finds a small portion of the low-FPS points.

5. Threats to Validity

Threats to Construct Validity. The main threats to the construct validity of our study are related to the process we adopted in our case study (Section 4) to identify low-FPS points. Based on our experiments, and in particular on the findings reported in Fig. 3, our methodology should be reliable enough to identify variations in FPS. Still, some level of noise can be expected, and for this reason all our analyses have been run at least 300 times, while 1,000 episodes were played by each of the experimented approaches.

Concerning our preliminary study (Section 3), it is clear that the bugs we injected are not representative of real performance bugs in the subject games. However, they are inspired by a performance mutation operator defined in the literature (Delgado-Pérez et al., 2021). Our preliminary study only serves as a proof-of-concept to verify whether, by modifying the reward function, an RL-based agent would adapt its behavior to look for bugs while playing the game.

Threats to Internal Validity. In our case study, to ease the training we did not use the “real” game, but its wrapped version, i.e., PySuperTuxKart (PyS, [n.d.]). While the core game is the same, the version we adopted does not contain the latest updates and it includes additional Python code that may affect the rendering time. We assume that such an overhead is constant across frames, since the wrapper simply triggers the frame rendering operation in the core game. Besides, we forced the game to run with the lowest graphics settings to speed up rendering: For example, we excluded dynamic lighting, anti-aliasing, and shadows. Therefore, the low-FPS points found in PySuperTuxKart may be irrelevant in the original game or with other graphic settings. Also, we applied the five-$\sigma$ rule to set the threshold defining what a low-FPS point is. The threshold we set might not be indicative of relevant performance issues.

Still, the goal of our study was to show that, once specific requirements are set (e.g., the threshold $t$, the area to explore, etc.), the agent is able to adapt, trying to maximize its reward. Thus, we do not expect changes in the threshold to invalidate our findings.

Threats to Conclusion Validity. In our data analysis we used appropriate statistical procedures, also adopting $p$-value adjustment when multiple tests were used within the same analysis.

Threats to External Validity. Besides the proof-of-concept study we presented in Section 3, our empirical evaluation of RELINE includes a single game. This does not allow us to generalize our findings. The reasons for such a choice lie in the high effort we experienced as researchers in (i) building the pipeline to interact with the game, (ii) finding and experimenting with a reliable way to capture the FPS, and (iii) defining a meaningful reward function that allowed the agent to successfully play the game in the first place and, then, to also spot low-FPS points. These steps were a long trial-and-error process, with the most time-consuming part being the training runs needed to test the different reward functions we experimented with before converging towards the ones presented in this paper. Indeed, testing a new version of a reward function required at least one week of work with the hardware at our disposal (including implementation, training, and data analysis).

This was also due to the impossibility of using multiple machines or of running multiple processes in parallel on the same server. Indeed, as explained, using the exact same environment to run all our experiments was a study requirement. It is worth noting that, because of similar issues, other state-of-the-art approaches targeting different game properties were evaluated on a single game as well (see, e.g., (Zook et al., 2014; Pfau et al., 2020; Bergdahl et al., 2020; Wu et al., 2020)). We believe that instantiating RELINE on a new game would be much easier when collaborating with the game developers. While this would only slightly simplify the definition of a meaningful reward function, the original developers of the game could easily provide, through APIs, all the information needed by RELINE (including, e.g., the FPS), cutting away weeks of work.

6. Related Work

Three recent studies (Politowski et al., 2021; Truelove et al., 2021; Li et al., 2021) suggest that finding performance issues in video games is a relevant problem, according to both game developers (Politowski et al., 2021; Truelove et al., 2021) and players (Li et al., 2021). Nevertheless, to the best of our knowledge, no previous work introduced automated approaches for load testing video games. Therefore, in this section, we discuss some important works on the quality assurance of video games in general. We first introduce the approaches defined in the literature for training agents able to automatically play and win a game. Then, we show how such approaches are used for play-testing for (i) finding functional issues and (ii) assessing game/level design (e.g., finding unbalanced levels or mechanics).

6.1. Training Agents to Play

Reinforcement Learning (RL) is widely used to train agents able to automatically play video games. Mnih et al. (Mnih et al., 2013; Mnih et al., 2015) presented the first approach based on high-dimensional sensory input (i.e., raw pixels from the game screen) able to automatically learn how to play a game. The authors used a Convolutional Neural Network (CNN) trained with a variant of Q-learning to train their agent. The proposed approach is able to surpass human expert testers in playing some games from the Atari 2600 benchmark.

Vinyals et al. (Vinyals et al., 2017) introduced SC2LE, a RL environment based on the game StarCraft II that simplifies the development of specialized agents for a multi-agent environment.

Hessel et al. (Hessel et al., 2018) analyzed six extensions of the DQN algorithm for RL and reported the combinations that achieve the best results in terms of training time on the Atari 2600 benchmark.

Baker et al. (Baker et al., 2019) explored the use of RL in a multi-agent environment (i.e., the hide and seek game). They report that agents create self-supervised autocurricula (Leibo et al., 2019), i.e., curricula naturally emerging from competition and cooperation. As a result, the authors found evidence of strategy learning not guided by direct incentives.

Berner et al. (AI, 2019) reported that state-of-the-art RL techniques were successfully used in OpenAI Five to train an agent able to play Dota 2 and to defeat the world champion (Team OG) in 2019. Finally, De Mesentier Silva et al. (De Mesentier Silva et al., 2017) reported that AI agents could be easily trained to explore the states of a board game (Ticket to Ride), performing automated play-testing.

6.2. Testing of Video Games

Functional testing of video games aims at finding unexpected behaviors in a game. Defining the test oracle, i.e., determining if a specific game behavior is defective, is not trivial. Several categories of test oracles were identified to determine if a bug was found: crash (the game stops working) (Pfau et al., 2017; Zheng et al., 2019), stuck (the agent cannot win the game) (Pfau et al., 2017; Zheng et al., 2019), game balance (game too easy or too hard) (Zheng et al., 2019), logical (an invalid state is reached) (Zheng et al., 2019), and user experience bugs (related to graphics and sound, e.g., glitches) (Pfau et al., 2017; Zheng et al., 2019). While heuristics can be used to find possible crash-, stuck-, and game-balance-related bugs (Zheng et al., 2019), logical and user-experience bugs may require the developers to manually define an oracle.

Iftikhar et al. (Iftikhar et al., 2015) proposed a model-based testing approach to automatically perform black-box testing of platform games. More recent approaches mostly rely on RL.

Pfau et al. (Pfau et al., 2017) introduced ICARUS, a framework for autonomous play-testing aimed at finding bugs. ICARUS supports the fully automated detection of crash and stuck bugs, while it also provides semi-supervised support for user-experience bugs.

Zheng et al. (Zheng et al., 2019) used Deep Reinforcement Learning (DRL) in their approach, Wuji. Wuji balances the aim of winning the game and exploring the space to find crash, stuck, game balance, and logical bugs in three video games (one simple, Block Maze, and two commercial, L10 and NSH).

Bergdahl et al. (Bergdahl et al., 2020) defined a DRL-based method that provides support for continuous actions (e.g., mouse or game-pads) and evaluated it on a first-person shooter game.

Wu et al. (Wu et al., 2020) used RL to automatically perform regression testing, i.e., to compare the game behaviors in different versions of a game. They experimented with such an approach on a Massively Multiplayer Online Role-Playing Game (MMORPG).

Ariyurek et al. (Ariyurek et al., 2021) experimented with RL and Monte Carlo Tree Search (MCTS) to define both synthetic agents, trained in a completely automated manner, and human-like agents, trained on trajectories used by human testers.

Finally, Ahumada and Bergel (Ahumada and Bergel, 2020) proposed an approach based on genetic algorithms to reproduce bugs in video games by reconstructing the correct sequence of actions that lead to the desired faulty state of the game.

6.3. Game- and Level-Design Assessment

One of the main goals of a video game is to provide pleasant gameplay to the player. Assessing the game balance and other aspects related to game- and level-design is, therefore, of primary importance.

For this reason, previous work defined several approaches for automatically finding game- and level-design issues in video games. Zook et al. (Zook et al., 2014) proposed an approach based on Active Learning (AL) to help designers perform low-level parameter tuning. They experimented with such an approach on a shoot ’em up game.

Gudmundsson et al. (Gudmundsson et al., 2018) introduced an approach based on Deep Learning to learn human-like play-testing from player data. They used a CNN to automatically predict the most natural next action a player would take, aiming to estimate the difficulty of levels in Candy Crush Saga and Candy Crush Soda Saga.

Zhao et al. (Zhao et al., 2019) report four case studies in which they experiment with the use of human-like agents trained with RL to predict player interactions with the game and to highlight possible game-design issues. On a similar note, Pfau et al. (Pfau et al., 2020) used deep player behavioral models to represent a specific player population for Aion, an MMORPG. They used such models to estimate the game balance and showed that they can be used to tune it.

Finally, Stahlke et al. (Stahlke et al., 2020) defined PathOS, a tool aimed at helping developers to simulate players’ interaction with a specific game level, to understand the impact of small design changes.

7. Conclusions and Future Work

We presented RELINE, an approach that uses RL to load test video games. RELINE can be instantiated on different games using different RL models and reward functions.

Our proof-of-concept study performed on two subject systems shows the feasibility of our approach: Given a reward function able to reward the agent when artificial performance bugs are identified, the agent adapts its behavior to play the game while looking for those bugs.

We performed a case study on a real 3D racing game, SuperTuxKart, showing the ability of RELINE to identify areas resulting in FPS drops. As compared to a classic RL agent only trained to play the game, RELINE is able to identify a substantially higher number of low-FPS points (173 vs 33).

Despite the encouraging results, there are many aspects that deserve a deeper investigation and from which our future research agenda stems. First, we plan additional tests on SuperTuxKart to better understand how the agent reacts to changes in the reward function (e.g., is it possible to find more low-FPS points in the central part of the game?). Also, with longer training times it should be possible to train an agent able to play more challenging versions of this game featuring additional 3D effects (e.g., rainy conditions), possibly allowing it to find new low-FPS points. We also plan to instantiate RELINE on other game genres (e.g., role-playing games), possibly by cooperating with their developers.

In our replication package (Tufano, 2021), we release the code implementing the models used in our study and the raw data of our experiments.

Acknowledgment

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851720). Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.

References

  • 3dc ([n.d.]) [n.d.]. 3D.City - Performance Issue 42. https://github.com/lo-th/3d.city/issues/42.
  • Car ([n.d.]) [n.d.]. CartPole. https://gym.openai.com/envs/CartPole-v0/.
  • dwa ([n.d.]a) [n.d.]a. Dwarfcorp - Performance Issue 583. https://github.com/Blecki/dwarfcorp/issues/583.
  • dwa ([n.d.]b) [n.d.]b. Dwarfcorp - Performance Issue 64. https://github.com/Blecki/dwarfcorp/issues/64.
  • dwa ([n.d.]c) [n.d.]c. Dwarfcorp - Performance Issue 711. https://github.com/Blecki/dwarfcorp/issues/711.
  • dwa ([n.d.]d) [n.d.]d. Dwarfcorp - Performance Issue 904. https://github.com/Blecki/dwarfcorp/issues/904.
  • dwa ([n.d.]e) [n.d.]e. Dwarfcorp - Performance Issue 966. https://github.com/Blecki/dwarfcorp/issues/966.
  • geo ([n.d.]) [n.d.]. Geostrike - Performance Issue 214. https://github.com/Webiks/GeoStrike/issues/214.
  • Gym ([n.d.]) [n.d.]. Gym. https://gym.openai.com/.
  • Pac ([n.d.]) [n.d.]. MsPacman. https://gym.openai.com/envs/MsPacman-v0/.
  • PyS ([n.d.]) [n.d.]. PySuperTuxKart. https://github.com/supertuxkart/stk-code.
  • mar ([n.d.]) [n.d.]. VIDEO GAMES: INDUSTRY TRENDS, MONETISATION STRATEGIES & MARKET SIZE 2020-2025. https://www.juniperresearch.com/researchstore/content-digital-media/video-games-market-report.
  • Ahumada and Bergel (2020) Tomás Ahumada and Alexandre Bergel. 2020. Reproducing Bugs in Video Games using Genetic Algorithms. In 2020 IEEE Games, Multimedia, Animation and Multiple Realities Conference (GMAX). IEEE, 1–6.
  • AI (2019) Open AI. 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019).
  • Ariyurek et al. (2021) S. Ariyurek, A. Betin-Can, and E. Surer. 2021. Automated Video Game Testing Using Synthetic and Humanlike Agents. IEEE Transactions on Games 13, 1 (2021), 50–67. https://doi.org/10.1109/TG.2019.2947597
  • Baker et al. (2019) Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. 2019. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528 (2019).
  • Bergdahl et al. (2020) J. Bergdahl, C. Gordillo, K. Tollmar, and L. Gisslén. 2020. Augmenting Automated Game Testing with Deep Reinforcement Learning. In 2020 IEEE Conference on Games (CoG). 600–603. https://doi.org/10.1109/CoG47356.2020.9231552
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540
  • Bum Hyun Lim et al. (2006) Bum Hyun Lim, Jin Ryong Kim, and Kwang Hyun Shim. 2006. A load testing architecture for networked virtual environment. In 2006 8th International Conference Advanced Communication Technology, Vol. 1. 5 pp.–848. https://doi.org/10.1109/ICACT.2006.206095
  • Cho et al. (2010) C. Cho, D. Lee, K. Sohn, C. Park, and J. Kang. 2010. Scenario-Based Approach for Blackbox Load Testing of Online Game Servers. In 2010 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. 259–265. https://doi.org/10.1109/CyberC.2010.54
  • Conover (1998) W. J. Conover. 1998. Practical Nonparametric Statistics (3rd edition ed.). Wiley.
  • De Mesentier Silva et al. (2017) Fernando De Mesentier Silva, Scott Lee, Julian Togelius, and Andy Nealen. 2017. AI as evaluator: Search driven playtesting of modern board games. In WS-17-01 (AAAI Workshop - Technical Report). AI Access Foundation, 959–966. 31st AAAI Conference on Artificial Intelligence, AAAI 2017.
  • Delgado-Pérez et al. (2021) Pedro Delgado-Pérez, Ana Belén Sánchez, Sergio Segura, and Inmaculada Medina-Bulo. 2021. Performance mutation testing. Software Testing, Verification and Reliability 31, 5 (2021). https://doi.org/10.1002/stvr.1728
  • Grafarend (2006) E.W. Grafarend. 2006. Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models. Walter de Gruyter. https://books.google.ch/books?id=uHW2wAEACAAJ
  • Grissom and Kim (2005) Robert J. Grissom and John J. Kim. 2005. Effect sizes for research: A broad practical approach (2nd edition ed.). Lawrence Earlbaum Associates.
  • Gudmundsson et al. (2018) Stefan Freyr Gudmundsson, Philipp Eisen, Erik Poromaa, Alex Nodet, Sami Purmonen, Bartlomiej Kozakowski, Richard Meurling, and Lele Cao. 2018. Human-like playtesting with deep learning. In 2018 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 1–8.
  • Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Holm (1979) Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics (1979), 65–70.
  • Iftikhar et al. (2015) S. Iftikhar, M. Z. Iqbal, M. U. Khan, and W. Mahmood. 2015. An automated model based testing approach for platform games. In 2015 ACM/IEEE 18th International Conference on Model Driven Engineering Languages and Systems (MODELS). 426–435. https://doi.org/10.1109/MODELS.2015.7338274
  • Jung et al. (2005) YungWoo Jung, Bum-Hyun Lim, Kwang-Hyun Sim, HunJoo Lee, IlKyu Park, JaeYong Chung, and Jihong Lee. 2005. VENUS: The Online Game Simulator Using Massively Virtual Clients. In Systems Modeling and Simulation: Theory and Applications. 589–596.
  • Lapan (2018) Maxim Lapan. 2018. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing.
  • Leibo et al. (2019) Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. 2019. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742 (2019).
  • Li et al. (2021) Xiaozhou Li, Zheying Zhang, and Kostas Stefanidis. 2021. A data-driven approach for video game playability analysis based on players’ reviews. Information 12, 3 (2021), 129.
  • Lin et al. (2016) Dayi Lin, C. Bezemer, and A. Hassan. 2016. Studying the urgent updates of popular games on the Steam platform. Empirical Software Engineering 22 (2016), 2095–2126.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. nature 518, 7540 (2015), 529–533.
  • Pascarella et al. (2018) Luca Pascarella, Fabio Palomba, Massimiliano Di Penta, and Alberto Bacchelli. 2018. How is video game development different from software development in open source?. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, Andy Zaidman, Yasutaka Kamei, and Emily Hill (Eds.). ACM, 392–402.
  • Pfau et al. (2020) Johannes Pfau, Antonios Liapis, Georg Volkmar, Georgios N Yannakakis, and Rainer Malaka. 2020. Dungeons & replicants: automated game balancing via deep player behavior modeling. In 2020 IEEE Conference on Games (CoG). IEEE, 431–438.
  • Pfau et al. (2017) Johannes Pfau, Jan David Smeddinck, and Rainer Malaka. 2017. Automated Game Testing with ICARUS: Intelligent Completion of Adventure Riddles via Unsupervised Solving. In Extended Abstracts Publication of the Annual Symposium on Computer-Human Interaction in Play (CHI PLAY ’17 Extended Abstracts). 153–164.
  • Politowski et al. (2021) Cristiano Politowski, Fabio Petrillo, and Yann-Gäel Guéhéneuc. 2021. A Survey of Video Game Testing. arXiv preprint arXiv:2103.06431 (2021).
  • Rubinstein and Kroese (2004) Reuven Y. Rubinstein and Dirk P. Kroese. 2004. The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics). Springer-Verlag.
  • Smith et al. (2009) Adam M. Smith, Mark J. Nelson, and Michael Mateas. 2009. Computational Support for Play Testing Game Sketches. In Proceedings of the Fifth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE’09). AAAI Press, 167–172.
  • Spearman (1904) C. Spearman. 1904. The Proof and Measurement of Association Between Two Things. American Journal of Psychology 15 (1904), 88–103.
  • Stahlke et al. (2020) Samantha N. Stahlke, Atiya Nova, and Pejman Mirza-Babaei. 2020. Artificial Players in the Design Process: Developing an Automated Testing Tool for Game Level and World Design. Proceedings of the Annual Symposium on Computer-Human Interaction in Play (2020).
  • supertuxkart ([n.d.]) supertuxkart. [n.d.]. https://github.com/supertuxkart.
  • Truelove et al. (2021) Andrew Truelove, Eduardo Santana de Almeida, and Iftekhar Ahmed. 2021. We’ll Fix It in Post: What Do Bug Fixes in Video Game Update Notes Tell Us?. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 736–747.
  • Tufano (2021) Rosalia Tufano. 2021. https://github.com/RosaliaTufano/rlgameauthors.
  • Vinyals et al. (2017) Oriol Vinyals, Timo Ewalds, Sergey Bartunov, P. Georgiev, A. S. Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, J. Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. V. Hasselt, D. Silver, T. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, D. Lawrence, Anders Ekermo, J. Repp, and Rodney Tsing. 2017. StarCraft II: A New Challenge for Reinforcement Learning. ArXiv abs/1708.04782 (2017).
  • Wu et al. (2020) Yuechen Wu, Yingfeng Chen, Xiaofei Xie, Bing Yu, Changjie Fan, and Lei Ma. 2020. Regression Testing of Massively Multiplayer Online Role-Playing Games. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 692–696.
  • Zhao et al. (2019) Yunqi Zhao, Igor Borovikov, Ahmad Beirami, Jason Rupert, Caedmon Somers, Jesse Harder, Fernando de Mesentier Silva, John Kolen, Jervis Pinto, Reza Pourabolghasem, Harold Chaput, James Pestrak, Mohsen Sardari, Long Lin, Navid Aghdaie, and Kazi A. Zaman. 2019. Winning Isn’t Everything: Training Human-Like Agents for Playtesting and Game AI. CoRR abs/1903.10545 (2019). http://arxiv.org/abs/1903.10545
  • Zheng et al. (2019) Y. Zheng, X. Xie, T. Su, L. Ma, J. Hao, Z. Meng, Y. Liu, R. Shen, Y. Chen, and C. Fan. 2019. Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 772–784.
  • Zook et al. (2014) Alexander Zook, Eric Fruchter, and Mark O. Riedl. 2014. Automatic playtesting for game parameter tuning via active learning. ArXiv abs/1908.01417 (2014).