
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

 

Anonymous Authors1 


Abstract

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference with up to $3.93\times$ throughput increase, up to $75\%$ latency reduction, and up to $80\%$ GPU memory saving with down to $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.

footnotetext: 1Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author <anon.email@domain.com>.

1 Introduction

Recently, rapid advances in large models with striking performance have surprised the community in several areas, such as vision (Ramesh et al., 2022; Kirillov et al., 2023; Saharia et al., 2022), language (Brown et al., 2020; OpenAI, 2023; Smith et al., 2022), decision making (Yang et al., 2023), and robotics (Vemprala et al., 2023). For example, GPT-4 has demonstrated capability that is comparable to or even exceeds human-level understanding on several tasks (OpenAI, 2023), and DALL·E 2 can generate astonishingly high-quality images. The outstanding performance of large models heavily relies on their enormous number of parameters, as described by the scaling law (Kaplan et al., 2020). Broadly speaking, the scaling law asserts that as the model size increases, various characteristics such as training loss, test performance, and the amount of required data exhibit predictable scaling behaviors.

Figure 1: Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer, only a limited number of experts are activated for inference.

Mixture-of-Experts (MoE), a classical model architecture, enjoys an advantage that naturally fits the era of large models: MoE can improve a model's performance by drastically increasing the number of parameters while incurring only little computational overhead. Although the number of parameters involved in the forward pass of an MoE model remains almost unchanged, research (Fedus et al., 2022) suggests that augmenting parameter counts using the MoE architecture still conforms to the scaling law. Encouraged by this advantage, many MoE-based large models have been proposed and have achieved overwhelming performance in computer vision (Li et al., 2023a; Riquelme et al., 2021; Xue et al., 2022) and natural language processing (Shazeer et al., 2017; Fedus et al., 2022). Specifically, the Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017) layer scales LSTM models to 137 billion parameters, improving model capacity by $1000\times$ with only a marginal increase in computational overhead. Switch Transformers (Fedus et al., 2022) scale to 1.6 trillion parameters with the same perplexity as T5-XXL (Raffel et al., 2020) while achieving a $4\times$ speedup during inference. However, the success of MoE comes with sacrifices in effective GPU memory utilization: the model occupies a large amount of memory while only a small fraction of the resident parameters is effective for inference of the current batch. Fig. 1 depicts the architecture of MoE-based transformers, where only a small portion of experts is activated in each MoE layer during each inference.

Further, with the trend of model scaling, we have observed a substantial gap between the memory demands of large models and the memory capacity of GPUs. For instance, in the past three years, the number of parameters in state-of-the-art models has scaled from 175 billion in GPT-3 (Brown et al., 2020) to 1.76 trillion in the newly announced GPT-4 (OpenAI, 2023), an over $10\times$ increase. Contrarily, the memory capacity of high-end GPUs remains around 80GB (Choquette, 2023), and commodity GPUs are still limited to 48GB or even less. This growing discrepancy motivates techniques to improve memory utilization efficiency. Thus, we seek to answer a compelling research question:

How to serve large Mixture-of-Experts models in an efficient and scalable manner under constrained memory?

Previous efforts have studied the efficiency problem of MoE models to some extent. Deepspeed-MoE (Rajbhandari et al., 2022) optimizes the MoE module in the Deepspeed framework for efficient grouping and scheduling. A later version of the work (Aminabadi et al., 2022) focuses on optimizing inference efficiency with optimized computation kernels and careful coordination of communication and parallelism. Tutel (Hwang et al., 2023) enables adaptive parallelism and pipelining at runtime. However, these methods only focus on optimizing device-to-device communication and ignore data-awareness, not to mention exploiting data-awareness to improve efficiency during inference. Data-awareness refers to a design where the technique or strategy is determined based on the incoming data. Our proposed framework embraces data-awareness, which brings three advantages. Firstly, data-awareness can squeeze out sparsity, leading to a further increase in memory efficiency compared to previous methods. Secondly, data-awareness preserves the structure crucial for a sample's unique features, better maintaining the model's performance. Thirdly, data-awareness offers better adaptability since the framework varies according to the data distribution.

Table 1: Comparison of SiDA and Baseline Methods. This table delineates the capabilities of various methods in terms of data-awareness, effective GPU memory utilization, and inference speed on large MoE models. SiDA excels in its data-aware approach with high effective GPU memory utilization and high inference speed on large MoE models.

Methods    | Data-aware | Effective GPU memory utilization | Inference speed on large MoE
Standard   | ✗          | low                              | slow
Deepspeed  | ✗          | medium                           | slow
Tutel      | ✗          | medium                           | slow
SiDA       | ✓          | Extremely high                   | Extremely high

In this paper, we present an efficient inference system, SiDA (Sparsity-inspired Data-Aware), for serving large MoE models. Noticing that modern server CPUs support terabytes (TB) of main memory, dwarfing GPU capacity, SiDA dynamically leverages both main memory and GPU memory by exploiting sparsity in MoE models in a data-aware manner. Table 1 summarizes the comparison between SiDA and the baselines. Specifically, SiDA contains two threads that run in parallel, an inference thread and a hash-building thread. The hash-building thread exploits the sparsity of expert activation in a data-aware manner; its core is a network-based hash function. Specifically, the hash function is an offline-trained predictor that predicts the experts to be activated. In this work, we employ an LSTM (Hochreiter & Schmidhuber, 1997) with sparse attention and truncated knowledge distillation to boost the performance of the hash function. The inference thread offloads inactivated experts predicted by the hash-building thread to maximize effective GPU memory utilization. Besides, SiDA also brings a significant speedup during inference.

Our contributions are summarized as follows:

  • To the best of our knowledge, SiDA is the first sparsity-inspired data-aware system for efficient and scalable inference on large MoE models.

  • We propose an offline training strategy to build a data-aware hash function deployed in SiDA that replaces the router function in MoE layers. Our design boosts the throughput of MoE models by up to $3.93\times$ and reduces latency to as low as $25\%$ of the baseline.

  • Our offloading scheme achieves up to $80\%$ GPU memory saving with less than a 1% performance drop. Our hash function achieves up to $99\%$ prediction accuracy on expert activation.

The paper is organized as follows: Section 2 introduces the background and motivation. Section 3 is devoted to the framework of SiDA. Section 4 presents our experimental results. Sections 5, 6, and 7 are devoted to related work, discussion, and conclusion, respectively.

2 Background and Motivation

We introduce the background and motivation for SiDA in this section. For notation, we use $a, {\bm{a}}, {\mathbf{a}}, {\bm{A}}, {\mathbb{A}}$ to denote a scalar, vector, random vector variable, matrix, and set, respectively. We use $[K]$ to denote $\{1, 2, \ldots, K\}$.

2.1 Mixture of Experts

Since the first proposal of Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994), different MoE models have been proposed based on various expert models, for example, hidden Markov models (Jordan et al., 1996), Gaussian processes (Tresp, 2000), and support vector machines (Collobert et al., 2001). With the rise of deep learning, Eigen et al. proposed the use of several sets of routers and experts to build a stacked model, namely Deep MoE (Eigen et al., 2013).

A MoE layer consists of a router function, denoted as $h(\cdot;{\bm{W}}_r)$, followed by $K$ experts in parallel, denoted as $\{f_i(\cdot;{\bm{\theta}}_i)\}_{i=1}^{K}$. Usually, the router function is set as a linear function, i.e., $h({\mathbf{x}};{\bm{W}}_r)={\bm{W}}_r^{\top}{\mathbf{x}}$ where ${\bm{W}}_r\in{\mathbb{R}}^{d\times K}$ for input ${\mathbf{x}}\in{\mathbb{R}}^{d}$, and experts are multi-layer perceptrons (MLPs) with a non-linear activation function (Chen et al., 2022; Fedus et al., 2022; Shazeer et al., 2017). The output of a MoE layer takes the form:

$$M({\mathbf{x}};{\bm{W}}_r,{\bm{\theta}}_1,\ldots,{\bm{\theta}}_K)=\sum_{i\in{\mathbb{I}}}\alpha_i({\mathbf{x}})\,f_i({\mathbf{x}};{\bm{\theta}}_i),$$ (1)

where ${\mathbb{I}}$ contains the selected indices of experts and the scaling factor $\alpha_i$ is defined as

$$\alpha_i({\mathbf{x}})=\frac{\exp\{{\bm{W}}_r[:,i]^{\top}{\mathbf{x}}\}}{\sum_{j=1}^{K}\exp\{{\bm{W}}_r[:,j]^{\top}{\mathbf{x}}\}}.$$

Different selection mechanisms for ${\mathbb{I}}$ lead to different models. The soft-routing model (Jordan & Jacobs, 1994) selects all experts, i.e., ${\mathbb{I}}=[K]$, which leads to high computational overhead. The switch-routing model (Fedus et al., 2022) selects the top-$1$ expert, i.e., ${\mathbb{I}}=\operatorname*{arg\,max}_{i\in[K]}\alpha_i(\cdot)$, introducing little extra computational overhead.
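To make the routing concrete, the following is a minimal PyTorch-style sketch of a switch-routing (top-1) MoE layer following Eq. 1; the module layout and dimension names are illustrative assumptions rather than the actual Switch Transformer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoELayer(nn.Module):
    """Minimal top-1 (switch-routing) MoE layer following Eq. 1."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)  # W_r
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, K)
        alpha = F.softmax(logits, dim=-1)       # scaling factors alpha_i(x)
        top_alpha, top_idx = alpha.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for k, expert in enumerate(self.experts):
            mask = top_idx == k                 # tokens routed to expert k
            if mask.any():
                out[mask] = top_alpha[mask, None] * expert(x[mask])
        return out
```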

2.2 Low Effective Utilization of GPU Memory

Figure 2: Memory Efficiency of Switch Transformers on SST2. The x-axis represents the length of the sentence and the bars record the counts of sentences of the corresponding length. The line represents the effective memory utilization of the Switch Transformer on SST2 with varied sentence length. Utilization as low as 5% can be observed for large models.

Encouraged by the advantage of MoE-based large models that drastically increasing the number of parameters incurs little computational overhead, many large-scale architectures have been proposed, such as the Sparsely-Gated MoE (Shazeer et al., 2017), GShard (Lepikhin et al., 2020), and Switch Transformers (Fedus et al., 2022). Specifically, the Sparsely-Gated MoE proposes a trainable router function to determine the expert to be activated for each sample, which makes it possible to build very large MoE-based models as it improves computational efficiency by a large margin compared to soft routing that selects all experts. The Sparsely-Gated MoE scales LSTM models to 137 billion parameters, achieving outstanding performance. Switch Transformers, the most widely used transformer-based large MoE models, convert T5 models (Raffel et al., 2020) to their MoE versions. All Switch Transformers outperform their dense foundation models with the same FLOPs.

In our study, we found that large MoE models do not utilize GPUs efficiently. As shown in Eq. 1, we denote an expert as activated if $i\in{\mathbb{I}}$. Inactivated experts remain idle in the forward pass; they occupy a large amount of GPU memory while contributing nothing, leading to low effective GPU memory utilization, where effective GPU memory refers to the memory storing parameters that are effective for the forward pass of the model. To quantitatively analyze GPU memory utilization, we provide a summary of Switch Transformers' model size and MoE layer size in Table 2. It shows that for all Switch Transformers, especially the large ones, MoE layers occupy a large portion of GPU memory, while most of the parameters of the MoE layers are idle during one forward pass. To ascertain the amount of ineffective GPU memory, we feed samples from the SST2 dataset to Switch Transformers and record the corresponding effective memory utilization rates. The results are depicted in Fig. 2. For large Switch Transformers such as Switch-base-128 and Switch-base-256, the ineffective GPU memory for short sentences is around 24GB and 50GB, respectively. Even for the longest sentences with 80 tokens, the ineffective GPU memory is around 20GB and 46GB, respectively. Our method, SiDA, can save all ineffective GPU memory, outperforming baselines by a large margin. Further results on GPU memory reduction across datasets can be found in Section 4.
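As a rough aid to interpreting Fig. 2, effective memory utilization can be estimated from the set of experts a batch actually activates; the sketch below assumes per-expert and non-MoE parameter counts are known and is not tied to any particular framework.

```python
def effective_memory_utilization(num_moe_layers: int,
                                 experts_per_layer: int,
                                 params_per_expert: int,
                                 non_moe_params: int,
                                 activated_experts: dict) -> float:
    """Fraction of resident parameters actually used in one forward pass.

    activated_experts maps a MoE layer index to the set of expert indices
    that the current batch activates in that layer.
    """
    total = non_moe_params + num_moe_layers * experts_per_layer * params_per_expert
    effective = non_moe_params + sum(
        len(activated_experts.get(layer, set())) * params_per_expert
        for layer in range(num_moe_layers)
    )
    return effective / total
```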

Table 2: Memory Occupation of Switch Transformers. This table highlights the allocation of parameters in gigabytes (GB) for different models. MoE parameters dominate memory usage, especially in larger models. In contrast, mainstream GPUs peak at 48GB, with many at 24GB, while mobile GPUs range from 4GB to 12GB.

Model           | Model (GB) | MoE (GB) | Percentage (%)
Switch-base-8   | 2.298      | 1.7932   | 78.03
Switch-base-64  | 14.112     | 13.608   | 96.42
Switch-base-128 | 27.614     | 27.11    | 98.17
Switch-base-256 | 54.62      | 54.114   | 99.07

2.3 High Expert Selection Overhead

Figure 3: Expert Selection Overhead on SST2. The bars depict the percentage breakdown between expert selection overhead and total inference latency. Up to 74% of the time on Switch-base-256 is occupied by expert selection. Notably, the share of expert selection overhead grows as model size increases.

Apart from the low effective GPU memory utilization, we also observe a high overhead for expert selection in the feedforward pass of MoE. Specifically, in all baseline implementations of MoE models, a non-negligible amount of time is consumed in selecting the most suitable experts. We conduct experiments on SST2 with multiple MoE models and provide profiling results of averaged inference time and expert selection overhead in Fig. 3. The expert selection process consumes nearly $75\%$ of the total inference time for Switch-base-256, making it a bottleneck for inference latency. Notably, the overhead associated with expert selection escalates with the scale of the model, further emphasizing the imperative of addressing this bottleneck in inference efficiency.
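One way to obtain such a breakdown is to time the router modules separately from the full forward pass; the sketch below uses forward hooks and wall-clock timing with hypothetical `model` and `router_modules` arguments, rather than the exact profiling harness used here.

```python
import time
import torch

def _sync():
    # make GPU work visible to wall-clock timers
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def expert_selection_fraction(model, router_modules, batch, repeats: int = 10) -> float:
    """Rough fraction of inference time spent in the routers (expert selection)."""
    router_time = 0.0

    def pre_hook(module, inputs):
        _sync()
        module._t0 = time.perf_counter()

    def post_hook(module, inputs, output):
        nonlocal router_time
        _sync()
        router_time += time.perf_counter() - module._t0

    handles = [h for r in router_modules
               for h in (r.register_forward_pre_hook(pre_hook),
                         r.register_forward_hook(post_hook))]
    _sync()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(repeats):
            model(**batch)
    _sync()
    total = time.perf_counter() - start
    for h in handles:
        h.remove()
    return router_time / total
```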

2.4 Sparse Activation of Experts in Large MoE Models

Figure 4: Expert Activation in Switch Transformers on SST2. The x-axis denotes sentence length, with bars illustrating the counts of sentences of given lengths. The line depicts the ratio of idle experts. Notably, Switch-base-256 and Switch-base-128 activate less than 20% and 40% of their experts, respectively.

The sparse selection of experts is one of the critical observations that motivate SiDA. Our observation verifies that only a small portion of experts will be activated during inference.

For each token, the router function will select either top-$K$ (Shazeer et al., 2017) or top-$1$ (Fedus et al., 2022) experts, inducing token-level expert activation sparsity. However, the sparsity over sentences, typically with 512 or 768 tokens, remains elusive. Moreover, in the training stage, an expert load-balancing loss must be applied, which forces the router to assign an almost equal number of tokens to each expert; otherwise, the router's outputs will collapse to a few experts, leading to capacity degradation (Chen et al., 2022).

We test Switch Transformers with different numbers of experts on the SST2 dataset and report the sentence-level sparsity in Fig. 4. Our observation verifies that the sparse activation pattern still exists at the sentence level for large MoE models such as Switch-base-128 and Switch-base-256. As shown in the figure, fewer than $40\%$ and $20\%$ of experts are activated for Switch-base-128 and Switch-base-256, respectively. Even for the longest sentences with around 80 tokens, the ratio of idle experts is still higher than $70\%$ for Switch-base-128 and $80\%$ for Switch-base-256.
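The sentence-level sparsity in Fig. 4 can be measured by collecting, for every MoE layer, the set of experts hit by any token of the sentence. A small sketch under the assumption that per-layer router logits are available:

```python
import torch

def idle_expert_ratio(router_logits_per_layer, num_experts: int) -> float:
    """router_logits_per_layer[l] has shape (num_tokens, num_experts)."""
    idle = 0
    for logits in router_logits_per_layer:
        activated = torch.unique(logits.argmax(dim=-1))  # experts hit by any token
        idle += num_experts - activated.numel()
    return idle / (num_experts * len(router_logits_per_layer))
```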

3 SiDA

3.1 Overview: workflow

Figure 5: Overview of SiDA. SiDA contains two threads, the inference thread and the hash-building thread, that run concurrently. As each batch ${\mathbb{X}}_j$ arrives, the hash-building thread constructs the expert hash table ${\mathbb{H}}_j$ and queues it. In tandem, the inference thread processes the preceding batch ${\mathbb{X}}_i$, dynamically managing experts in MoE layers based on the hash table ${\mathbb{H}}_i$.

We introduce a novel framework, Sparsity-inspired Data-Aware (SiDA), for efficient inference of large MoE models; an overview is shown in Fig. 5. SiDA contains two parallel threads that run simultaneously, namely the inference thread and the hash-building thread. Consider a sequence of incoming batches: batch ${\mathbb{X}}_j$ is fed to the hash-building thread to build the hash table ${\mathbb{H}}_j$ storing expert activation patterns for batch ${\mathbb{X}}_j$, which is then pushed to the hash table queue. At the same time, the inference thread handles the preceding batch ${\mathbb{X}}_i$ and performs dynamic offloading on MoE layers based on the hash table ${\mathbb{H}}_i$.

Hash-building thread. The hash-building thread consists of two components, a hash function and a hash table queue. For each incoming batch (1-a), the hash function determines the experts to be activated for each token at each layer and the corresponding scaling factor $\alpha$ (1-b). The predictions are stored in the hash table ${\mathbb{H}}_j$ for the batch ${\mathbb{X}}_j$ and pushed to the hash table queue (1-c). The hash function can be a predefined hash function if the MoE model is trained with the Hash layer (Roller et al., 2021). More commonly, for MoE models using trained router functions, such as Switch Transformers, the hash function is trained offline. We propose hash-function training techniques dedicated to modern MoE models, which are introduced in later sections.

Inference thread. The inference thread performs two tasks: it dynamically loads activated experts and offloads inactivated experts according to the hash table built by the hash-building thread, and it uses the SiDA MoE layers to run inference on input batches. Specifically, for each incoming batch ${\mathbb{X}}_i$ (2-a), the inference thread first pops the hash table ${\mathbb{H}}_i$ from the hash table queue (2-b) and remains idle if ${\mathbb{H}}_i$ is not found. Notably, in practice, the inference thread takes longer to run inference on a batch than the hash-building thread takes to build a hash table for a batch. As a result, the inference thread never idles except at the very beginning. With the popped hash table ${\mathbb{H}}_i$, the next step is to dynamically load and offload experts. Based on the GPU memory budget and the expert activation pattern of the current batch, the inference thread loads activated experts to the GPU and offloads inactivated experts to RAM (2-c). A first-in-first-out (FIFO) scheme is applied to experts if no memory budget remains. The dynamic loading for a MoE layer is performed right after inference on the previous batch finishes, following the pipeline parallelism mechanism (Huang et al., 2019). Note that, in our system, all routers are offloaded to main memory and do not participate in the forward pass. Lastly, the incoming batch ${\mathbb{X}}_i$ is forwarded using the SiDA MoE layers specific to ${\mathbb{X}}_i$ (2-d).
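The interaction between the two threads can be summarized by the following schematic sketch; `hash_fn`, `load_offload_experts`, and `model_forward` are hypothetical helpers standing in for SiDA's actual components.

```python
import queue
import threading

hash_table_queue = queue.Queue()

def hash_building_thread(batches, hash_fn):
    for batch in batches:
        # (1-a)/(1-b): predict activated experts and scaling factors per token/layer
        hash_table = hash_fn(batch)
        # (1-c): make the table available to the inference thread
        hash_table_queue.put((batch, hash_table))

def inference_thread(load_offload_experts, model_forward, num_batches):
    for _ in range(num_batches):
        # (2-a)/(2-b): block until the hash table of the next batch is ready
        batch, hash_table = hash_table_queue.get()
        # (2-c): move predicted-active experts to GPU, offload the rest to RAM
        load_offload_experts(hash_table)
        # (2-d): run the forward pass with only the resident experts
        model_forward(batch, hash_table)

# Example usage (helpers are placeholders):
# threading.Thread(target=hash_building_thread, args=(batches, hash_fn)).start()
# threading.Thread(target=inference_thread, args=(loader, forward, len(batches))).start()
```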

3.2 Design challenges

In the design of SiDA, we spot three key challenges.

Challenge 1: How to efficiently obtain, beforehand, the experts that are to be offloaded? Given the observation that experts are activated sparsely, it is trivial to save GPU memory by offloading inactivated experts to RAM. However, this naïve implementation sacrifices latency since expert activation patterns are inaccessible without the output of the router functions. Moving experts between CPU and GPU after each router function incurs large overheads as it breaks the forwarding pipeline. We propose to use an offline-trained hash function to acquire the expert activation pattern before inference starts for each batch. Furthermore, we design the hash function to run independently of model inference and build a hash-building thread running in parallel with the inference thread to meet the efficiency requirements. By employing the hash-building thread, SiDA achieves outstanding latency compared to baselines since expert selection, dynamic offloading, and inference all run in parallel.

Challenge 2: How to leverage the sparse cross-embedding dependency of expert activation to design a lightweight offline-trained hash function? Considering the inference efficiency and the GPU memory consumption of the system, the hash function must be a lightweight predictor. However, simple predictors can hardly capture the contextual information of the sequence and are easily distracted. Hence, it becomes crucial to force the predictor to focus on the critical information. We empirically verify that there exists a sparse cross-embedding dependency of expert activation, i.e., only a limited number of embeddings in the sequence jointly affect expert activation. This sparse cross-embedding dependency sheds light on why lightweight predictors can succeed. However, it is impractical and inefficient to enumerate all possible combinations to find the cross-embedding dependency for every token. In response to this challenge, we propose a sparse attention mechanism on top of an LSTM that forces the predictor to focus on the most important embeddings automatically.

Challenge 3: How to improve expert selection accuracy and approximate the scaling factor simultaneously? The hash function needs to determine not only the expert activation but also the scaling factor $\alpha$ in Eq. 1. As the scaling factor is derived from the SoftMax logits output by the model, it is natural to apply knowledge distillation (KD), setting the router functions as teacher models and the hash function as the student model. However, it is impossible for the hash function to approximate the scaling factor distribution over all experts via KD due to its limited capacity. To solve this challenge, we propose truncated knowledge distillation (TKD), where the KD loss is computed over the top-$T$ experts. However, TKD alone cannot guarantee adequate prediction accuracy, so we further add a cross-entropy loss to boost the prediction accuracy.

We introduce how SiDA deals with each challenge in detail in the following sections.

3.3 Data-Aware and Efficient Expert Activation Prediction

SiDA proposes a data-aware solution to efficiently obtain, beforehand, the experts to be offloaded. Specifically, we propose to use a trained hash function that takes the sequence of embeddings as input and predicts all the activated experts for each token in the sequence. SiDA, augmented by data-aware expert activation prediction, enjoys two advantages while compromising little model performance, with a drop of less than $1\%$. Firstly, the system can acquire the activation pattern of each sample beforehand and perform dynamic loading and offloading according to the GPU memory budget without interrupting the inference process. Secondly, since the hash function determines the expert activation across all the MoE layers for a sample independently of the inference, the system can build the hash tables in a hash-building thread running in parallel with the inference thread. By doing so, we remove the overhead caused by expert selection from the inference time, which boosts throughput by up to $3.93\times$.

Previous works have also proposed improvements to the router function of MoE, such as the Hash layer (Roller et al., 2021) and the Base layer (Lewis et al., 2021). SiDA is orthogonal to these router functions as they can be accommodated in the hash-building thread. For MoE models with trained routers, we propose to train an LSTM with sparse attention as the hash function, boosted by our truncated knowledge distillation, as detailed in the following sections.

3.4 LSTM with Sparse Attention

3.4.1 Sparse cross-embedding dependency on expert activation

Figure 6: Visualization of Eq. 2 over different $p$ and $c$.

Figure 7: Cross-embedding Dependency for Expert Activation of Switch-base-128 on C4. (a) Token dependency. (b) Position dependency. The x-axis shows the proportion of corruption, while the y-axis represents the empirical probability of expert activation change. Over 100 random embedding positions are examined, with the average trend displayed.

In the MoE layer, each word embedding is fed to the router function to decide which expert to activate for inference of the token. However, expert activation does not solely depend on the embedding corresponding to the token, because of the self-attention layer before each MoE layer (shown in Fig. 1), where the word embeddings are mixed together. Because of the positional embedding, the positions of tokens also affect expert activation. While the process by which embeddings collectively influence expert activation is complex, we identify a sparse cross-embedding dependency of expert activation, indicating that only a limited number of other tokens and positions are critical to the expert activation of the current token.

Suppose a sequence of length $L$, and let $c_i$ denote the number of critical tokens for the token at position $i$. We define the critical tokens as tokens in the sequence other than the selected $i$-th token whose changes lead to a change in the expert activation of the $i$-th token. In order to empirically verify that $c_i$ is a small number for all $i$, we consider finding a combinatorial equation involving $c_i$ and quantities we can measure. Consider selecting a set of tokens from the sequence excluding the $i$-th token; the probability that the set contains a critical token is formulated as below:

$$\mathbb{E}[\hat{p}_i]=1-\frac{\binom{L-1-c_i}{\lfloor pL\rfloor}}{\binom{L-1}{\lfloor pL\rfloor}}.$$ (2)

where $\lfloor pL\rfloor$ denotes the size of the set and $p$ denotes the portion of the sequence that is selected. Note that the probability that the selected set of tokens contains a critical token is equal to the probability that the $i$-th token's expert activation changes, denoted as $\hat{p}_i$, if we change all selected tokens in the set. We denote the process of changing the tokens in a sequence as 'corruption.' Given Eq. 2, $p$ and $\hat{p}$ are quantities that we can empirically acquire; that is, by randomly selecting a portion $p$ of tokens, we can empirically measure the probability that the $i$-th token's expert activation changes. We show in Fig. 6 the relation between $c$ and $\hat{p}$ under different $p$.
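Eq. 2 can also be inverted numerically: for a measured pair $(p, \hat{p})$, one can scan small candidate values of $c$ and keep the one whose predicted change probability is closest. A short sketch:

```python
from math import comb, floor

def expected_change_prob(L: int, c: int, p: float) -> float:
    """E[p_hat] from Eq. 2: probability that a random corruption of floor(p*L)
    tokens (excluding token i) hits at least one of the c critical tokens."""
    m = floor(p * L)
    return 1.0 - comb(L - 1 - c, m) / comb(L - 1, m)

def best_c(L: int, p: float, p_hat: float, c_max: int = 20) -> int:
    """Candidate c whose predicted change probability best matches the measured p_hat."""
    return min(range(c_max + 1),
               key=lambda c: abs(expected_change_prob(L, c, p) - p_hat))
```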

Empirically, to study the token dependency of the token at position $i$, the corruption is executed by randomly modifying a fraction $p$ of chosen tokens from $[L]\setminus\{i\}$ to values distinct from their original values and from the $i$-th token. To examine the position dependency for the $i$-th token, the corruption involves randomly choosing a fraction $p$ of positions from $[L]\setminus\{i\}$ and swapping the token positions. We use the English division of the C4 dataset (Raffel et al., 2020) to measure the probability that the $i$-th token's expert activation changes under different $p$, depicted in Fig. 7. We set the length $L=512$ and truncate or pad sentences that are not of length 512. We randomly test over 100 word embedding positions (i.e., 100 $i$'s) on Switch-base-128 and plot all of them in Fig. 7 with the average trend shown. Fig. 7(a) and Fig. 7(b) show the cross-embedding dependency on tokens and positions, respectively. Only a large portion of corruption leads to a high chance of expert activation change, which demonstrates that most of the other tokens do not have an impact on the expert activation of the current token.

By combining Fig. 6 and Fig. 7, we can read off the best approximation of $c_i$ based on different pairs of $(p, \hat{p})$ in Fig. 7, where we find that the best approximation of $\hat{c}$ ranges from $1$ to $4$, demonstrating the sparse cross-embedding dependency.

3.4.2 Design of the hash function

The design of the hash function must satisfy the following conditions: (1) it must capture the sequential information, (2) it must be lightweight to preserve efficiency, and (3) it must be able to extract and focus on the critical embeddings automatically. We adopt a 2-layer LSTM followed by a fully connected layer to satisfy the first two conditions. Further, we add one fully connected layer to compress the embedding dimension. To achieve the third condition, we adopt a sparse attention mechanism with the SparseMax activation (Martins & Astudillo, 2016).

Attention mechanism. The attention mechanism was first proposed in Bahdanau et al. (2015) and has proven influential in the realm of deep learning. It was proposed to allow the decoder to focus on different parts of the input, resolving the problem of the encoder having to encode the entire sentence. Given a query ${\bm{q}}$ and a set of key-value pairs $({\bm{k}}, {\bm{v}})$, the attention mechanism computes a weighted sum of values based on the similarity of the query to the keys. Formally, the attention weights ${\bm{w}}$ and the output ${\bm{o}}$ are computed as ${\bm{o}}=\sum_i w_i {\bm{v}}_i$ with

$$w_i=\frac{\exp(\text{score}({\bm{q}},{\bm{k}}_i))}{\sum_j \exp(\text{score}({\bm{q}},{\bm{k}}_j))},$$

where $\text{score}({\bm{q}},{\bm{k}})$ is a function that calculates the similarity between the query and a key. One common choice for the score is the dot product of the query and key.

We append one attention layer right after the LSTM layer, where the key, value, and query are all set as the output sequence from the LSTM. Consequently, each embedding becomes a weighted sum of the sequence with weights proportional to the similarity between two vectors. The attention mechanism allows the predictor to pay different amounts of attention to different embeddings. However, the naive attention mechanism cannot impose a sparse focus, so we further apply the SparseMax activation over ${\bm{w}}$.

SparseMax activation. In contrast to the SoftMax activation, which produces a dense distribution, i.e., non-zero probabilities assigned to all classes or positions, SparseMax produces a sparse distribution, where zero probability is assigned to many positions. We apply the SparseMax activation over the attention weights ${\bm{w}}$ to obtain a sparse attention mechanism. Given an input vector ${\bm{w}}\in\mathbb{R}^{L}$, the SparseMax transformation is defined as:

$$\text{SparseMax}({\bm{w}})=\operatorname*{argmin}_{{\bm{u}}\in\Delta^{L-1}}\left\|{\bm{u}}-{\bm{w}}\right\|_2^2,$$

where $\Delta^{L-1}$ denotes the $(L-1)$-dimensional simplex, i.e.,

$$\Delta^{L-1}=\Big\{{\bm{u}}\in\mathbb{R}^{L}\;\Big|\;{\bm{u}}\geq 0,\;\sum_{i=1}^{L}u_i=1\Big\}.$$
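For reference, the sketch below implements SparseMax via the standard sorting-based Euclidean projection onto the simplex (Martins & Astudillo, 2016); it is a generic implementation, not SiDA's code.

```python
import torch

def sparsemax(w: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Project scores onto the probability simplex (Martins & Astudillo, 2016).
    Unlike softmax, many entries of the output are exactly zero."""
    w_sorted, _ = torch.sort(w, dim=dim, descending=True)
    k = torch.arange(1, w.size(dim) + 1, device=w.device, dtype=w.dtype)
    shape = [1] * w.dim()
    shape[dim] = -1
    k = k.view(shape)                                  # broadcastable rank index 1..L
    cssv = w_sorted.cumsum(dim) - 1.0                  # cumulative sums minus 1
    support = (k * w_sorted > cssv).to(w.dtype)        # positions kept in the support
    k_support = support.sum(dim=dim, keepdim=True)     # size of the support
    tau = cssv.gather(dim, k_support.long() - 1) / k_support
    return torch.clamp(w - tau, min=0.0)
```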

Although expert selection is affected by other tokens in the sequence, the current token is always the most crucial for its own expert selection. Hence, we adopt a residual connection (He et al., 2016) right before the final fully connected layer to boost performance.
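Putting the pieces together, the following is a hedged sketch of the hash-function predictor described above: an embedding-compression layer, a 2-layer LSTM, a self-attention layer whose weights pass through the sparsemax sketched above, a residual connection, and a per-token head over the experts of one MoE layer. Layer sizes and the single-layer head are illustrative assumptions.

```python
import torch
import torch.nn as nn
# uses the sparsemax() function sketched above

class ExpertHashFunction(nn.Module):
    """Lightweight predictor of per-token expert activation for one MoE layer."""

    def __init__(self, d_embed: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.compress = nn.Linear(d_embed, d_hidden)   # compress the embedding dimension
        self.lstm = nn.LSTM(d_hidden, d_hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(d_hidden, num_experts)   # logits over experts

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_embed)
        h = self.compress(embeddings)
        seq, _ = self.lstm(h)                          # (batch, seq_len, d_hidden)
        # self-attention: query, key, and value are all the LSTM output sequence
        scores = seq @ seq.transpose(1, 2) / seq.size(-1) ** 0.5
        attn = sparsemax(scores, dim=-1)               # sparse focus over the sequence
        context = attn @ seq
        # residual connection: the current token remains the dominant signal
        return self.head(context + seq)                # (batch, seq_len, num_experts)
```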

3.5 Truncated Knowledge Distillation

The hash function of SiDA is required to predict both the expert to be activated and the corresponding scaling factor $\alpha$. Knowledge distillation (KD) (Hinton et al., 2015), which aims to minimize the distance between the logits of the teacher and student models, would appear to be the best training strategy for our hash function. However, the capacity of our hash function, a 2-layer LSTM, is far smaller than that of the MoE model. The predictor cannot fully capture the behavior of the logits of the router functions in the MoE model, and the naïve usage of KD greatly harms the performance of the system.

We propose truncated KD (TKD) to tackle this challenge. Different from traditional KD, truncated KD only considers positions with the top-$T$ SoftMax logits, which helps the hash function focus more on predicting the scaling factors of experts with a higher chance of being activated. Notably, a large $T$ provides a smooth ground truth for the hash function, while a small $T$ forces the hash function to focus on fewer experts. Further, we add a cross-entropy loss to ensure prediction accuracy. The training objective is $\lambda\mathcal{L}_{\text{CE}}+\mathcal{L}_{\text{TKD}}(T)$.
我们提出了截断知识蒸馏(TKD)来应对这一挑战。与传统的知识蒸馏不同,截断知识蒸馏仅考虑排名前 T𝑇Titalic_T 的SoftMax逻辑值,这有助于哈希函数更加专注于预测有更高激活几率的专家的缩放因子。值得注意的是,较大的 T𝑇Titalic_T 可以为哈希函数提供一个平滑的真实值,而较小的 T𝑇Titalic_T 则迫使哈希函数更加专注于较少的专家。此外,我们增加了交叉熵损失以确保预测的准确性。训练目标是 λCE+TKD(T)𝜆subscriptCEsubscriptTKD𝑇\lambda\mathcal{L}_{\text{CE}}+\mathcal{L}_{\text{TKD}}(T)italic_λ caligraphic_L start_POSTSUBSCRIPT CE end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT TKD end_POSTSUBSCRIPT ( italic_T )
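A minimal PyTorch sketch of this objective, assuming the distillation term is a mean-squared distance over the teacher's top-$T$ positions (the exact distance used for $\mathcal{L}_{\text{TKD}}$ may differ); `student_logits` and `teacher_logits` are per-token router logits and `labels` are the teacher's top-1 expert indices:

```python
import torch
import torch.nn.functional as F

def truncated_kd_loss(student_logits, teacher_logits, labels, T=30, lam=0.005):
    """Training objective lambda * L_CE + L_TKD(T) for the hash function."""
    ce = F.cross_entropy(student_logits, labels)          # hard expert labels

    T = min(T, teacher_logits.size(-1))                   # guard for few-expert models
    top_idx = teacher_logits.topk(T, dim=-1).indices      # teacher's top-T expert positions
    t_probs = F.softmax(teacher_logits, dim=-1).gather(-1, top_idx)
    s_probs = F.softmax(student_logits, dim=-1).gather(-1, top_idx)
    tkd = F.mse_loss(s_probs, t_probs)                    # match only the truncated positions

    return lam * ce + tkd
```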

4 Experiment

We extensively evaluate SiDA on different datasets. Specifically, we first show the GPU memory reduction ratio of SiDA, demonstrating memory savings of up to 80%. We then report the throughput and latency of SiDA and the baselines, where SiDA achieves up to 3.93× higher throughput with a performance degradation down to less than 1%. Our hash function achieves a prediction accuracy of up to 99%. Also, SiDA achieves the best efficiency under different GPU memory budgets.

Implementation. We implement the proposed SiDA framework atop the readily available Switch Transformer implementation in transformers (Wolf et al., 2019), albeit not without substantial additional engineering effort. Enabling performant slice extraction poses challenges, as the MoE must maintain fine-grained associations between experts and hash table slices across layers and iterations. We optimize the parallel invocation of experts through meticulous inter-thread coordination, as naive parallelism introduces serious race conditions. The SiDA manager tackles intricate scheduling across the main inference thread and the concurrent prediction thread, synchronizing via a shared queue that demands careful contention management. The main thread must then judiciously merge predictor outputs with the model state to orchestrate expert device placement, avoiding costly overheads such as GPU-CPU data transfers.
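A simplified sketch of this two-thread coordination (the queue-based layout is inferred from the description above; `predictor`, `load_experts_to_gpu`, and the batch interface are hypothetical stand-ins, not SiDA's actual API):

```python
import queue
import threading

# The hash-building thread pushes per-batch expert predictions into a shared
# queue; the main inference thread consumes them to place the predicted
# experts on the GPU before running the model on that batch.
prediction_queue: "queue.Queue" = queue.Queue(maxsize=8)

def hash_building_loop(predictor, batches):
    for batch_id, batch in enumerate(batches):
        experts = predictor(batch)                 # predicted expert ids per MoE layer
        prediction_queue.put((batch_id, experts))  # blocks if the queue is full

def inference_loop(model, batches):
    outputs = []
    for batch_id, batch in enumerate(batches):
        pred_id, experts = prediction_queue.get()  # wait for the matching prediction
        assert pred_id == batch_id                 # batches are consumed in order
        model.load_experts_to_gpu(experts)         # bring only these experts onto the GPU
        outputs.append(model(batch))
    return outputs

def serve(model, predictor, batches):
    batches = list(batches)
    t = threading.Thread(target=hash_building_loop, args=(predictor, batches), daemon=True)
    t.start()
    return inference_loop(model, batches)
```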

Setup. We select three baselines, namely Standard, DeepSpeed, and Tutel. The Standard baseline refers to the standard inference of the model. DeepSpeed refers to the DeepSpeed-inference implementation (Aminabadi et al., 2022) of the model, and Tutel (Hwang et al., 2023) is designed for MoE models by enabling adaptive parallelism. We select three datasets from GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019): SST2 and MRPC from GLUE for short and mid-length sentences, and MultiRC from SuperGLUE for long sentences. We run most experiments on a server with an A100 80GB GPU and 64 Intel(R) Xeon(R) Platinum 8358 CPUs @ 2.60GHz. We investigate Switch-base-8, Switch-base-64, Switch-base-128, and Switch-base-256 for efficiency, where the number indicates the number of experts in each MoE layer of the Switch Transformer. We select Switch-base-8 and Switch-base-128 to fine-tune on the selected datasets as representatives for the accuracy analysis, considering representativeness and limited resources. The hash function in the hash-building thread is trained on the training set of each dataset with the true hash table and evaluated on the corresponding test set.

Evaluation metrics. We follow the standard evaluation metrics for SST2, MRPC, and MultiRC (Raffel et al., 2020), i.e., classification accuracy for SST2 and F1 score for MRPC and MultiRC. Further, we evaluate the fidelity of SiDA, i.e., how much of the performance is preserved compared to the baselines. We refer to the prediction accuracy of our hash function on expert activation as the hash hits rate.

Hyperparameters. We use the AdamW (Loshchilov & Hutter, 2019) optimizer for fine-tuning the Switch Transformers and training the hash function. We set the batch size to 1 when measuring latency and memory usage to eliminate the influence of the batch size. For truncated KD, we select $T=30$ with a learning rate of 5e-5, a batch size of 64, and $\lambda=0.005$, and train until convergence. For fine-tuning the Switch Transformers, we set the learning rate to 5e-5 and fine-tune for at most 16,000 steps. When evaluating SiDA, we select the top-1 expert from the hash function for SST2 and the top-3 experts for MRPC and MultiRC.
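For reference, the settings above gathered in one place (the dictionary keys are illustrative, not the paper's actual configuration names):

```python
# Values taken directly from the text; names are illustrative only.
HASH_FN_TRAINING = dict(optimizer="AdamW", lr=5e-5, batch_size=64,
                        kd_top_T=30, lambda_ce=0.005)
FINE_TUNING = dict(optimizer="AdamW", lr=5e-5, max_steps=16000)
EVAL_TOP_K = {"SST2": 1, "MRPC": 3, "MultiRC": 3}   # experts taken from the hash function
LATENCY_MEASUREMENT = dict(batch_size=1)            # isolates latency from batching effects
```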

4.1 GPU Memory Saving

Figure 8: GPU Memory Reduction Rate by SiDA for Switch Transformers Across Datasets. SiDA achieves over 60% and 80% reduction on SST2 and MRPC for Switch-base-128 and Switch-base-256, respectively. In MultiRC, with sentence lengths of 200-500, memory reductions of over 40% for Switch-base-256 and 20% for Switch-base-128 are noted.

We report the GPU memory savings in Fig. 8. For short sentences in SST2, SiDA achieves over 80% GPU memory reduction. For samples in MRPC, whose lengths cluster between 50 and 80, the GPU memory reduction remains substantial, yielding savings of 6.28GB and 19.84GB of GPU memory for Switch-base-128 and Switch-base-256, respectively. Furthermore, even when processing long paragraphs in MultiRC with lengths ranging from 200 to 500, the GPU memory reduction rate remains over 40% and 20%, saving 4.52GB for Switch-base-128 and 9.92GB for Switch-base-256.

4.2 Latency and Throughput

Figure 9: Throughput of Different Methods for Switch Transformers Across Datasets. SiDA achieves outstanding throughput for large MoE models on all three datasets with various sentence lengths and comparable results for small MoE models. Specifically, SiDA achieves 2.60× and 3.93× more throughput on SST2, 2.52× and 3.83× more on MRPC, and 1.26× and 1.57× more throughput on MultiRC for Switch-base-128 and Switch-base-256, respectively.
Figure 10: Comparison of Inference Latency Across Different Methods. SiDA consistently outperforms baselines, especially evident on the Switch-base-256 model with latency reduced down to 28%. Notably, improvements are more pronounced as sentence lengths decrease.

Apart from the GPU memory savings, SiDA also achieves substantial efficiency gains in terms of throughput and latency (see Fig. 9). Specifically, SiDA exceeds the average throughput of the baselines by 2.60× and 3.93× for large MoE models such as Switch-base-128 and Switch-base-256 on SST2. Even on MultiRC, which contains long sentences, SiDA exceeds the average throughput of the baselines by 1.26× on Switch-base-128 and 1.57× on Switch-base-256.

We also investigate the inference latency of SiDA and the baselines (see Fig. 10). For large MoE models such as Switch-base-128 and Switch-base-256, SiDA reduces the inference latency to 25% on SST2 and MRPC and to 60% on MultiRC. These improvements come from our design of the hash-building thread, which removes the expert selection overhead.

4.3 Efficiency under Limited GPU Memory Budgets

Figure 11: Throughput Efficiency Relative to GPU Memory Budget. SiDA’s advantage is particularly pronounced in constrained GPU memory scenarios, showcasing its superior efficiency by offloading experts compared to the conventional model parallelism, here denoted as ’Standard’.

We investigate the efficiency under different GPU memory budgets with different offloading methods on Switch-base-128 and Switch-base-256, since large MoE models are more resource-sensitive. Under a limited GPU memory budget, SiDA offloads and caches inactive experts in a first-in-first-out manner, while all other baselines implement model parallelism, where only the layers required for inference are kept on the GPU. The results of throughput versus GPU memory budget are shown in Fig. 11. SiDA achieves better throughput under all GPU memory budgets across all datasets, demonstrating that SiDA employs a better offloading strategy under limited GPU memory budgets.
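A minimal sketch of such a first-in-first-out expert cache (the `load` and `evict` callbacks stand in for the real GPU-CPU transfers and are assumptions, not SiDA's actual API):

```python
from collections import OrderedDict

class FIFOExpertCache:
    """Keep at most `capacity` experts resident on the GPU; evict the oldest
    cached expert when a new one must be brought in."""
    def __init__(self, capacity, load, evict):
        self.capacity = capacity
        self.load, self.evict = load, evict
        self.cache = OrderedDict()          # insertion order = arrival order

    def fetch(self, expert_id):
        if expert_id in self.cache:         # already resident on the GPU
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            old_id, old_expert = self.cache.popitem(last=False)  # FIFO eviction
            self.evict(old_id, old_expert)  # offload back to main memory
        self.cache[expert_id] = self.load(expert_id)  # bring the expert onto the GPU
        return self.cache[expert_id]
```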

4.4 Fidelity Analysis

Table 3: Evaluation of SiDA's Performance Preservation. SiDA retains as much as 99% of the performance on the Switch-base-8 model and maintains over 95% on the Switch-base-128 model, resulting in a performance drop down to less than 1%.
Backbone        | Metric    | SST2   | MRPC   | MultiRC
Switch-base-8   | Finetuned | 92.20  | 89.14  | 56.70
Switch-base-8   | SiDA      | 90.59  | 86.91  | 56.11
Switch-base-8   | Fidelity  | 98.25% | 97.49% | 98.95%
Switch-base-128 | Finetuned | 93.57  | 89.66  | 59.95
Switch-base-128 | SiDA      | 87.04  | 83.01  | 55.49
Switch-base-128 | Fidelity  | 93.02% | 92.59% | 92.56%

We conduct a fidelity analysis to check how much performance SiDA can preserve. As Table 3 shows, SiDA preserves up to nearly 99% of the accuracy on Switch-base-8, leading to a performance degradation down to less than 1%. For Switch-base-128, the fidelity is up to 96%, leading to a performance loss down to 3%. Our results demonstrate the superiority of SiDA, which achieves low inference latency and low GPU memory occupation with a negligible loss in the model's performance.
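As a concrete check of how the fidelity values are computed (SiDA's score divided by the fine-tuned baseline's score), using the Switch-base-8 SST2 numbers from Table 3:

```python
# Fidelity = SiDA score / fine-tuned score, e.g. Switch-base-8 on SST2 (Table 3).
fidelity = 90.59 / 92.20
print(f"{fidelity:.2%}")   # -> 98.25%
```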

4.5 Hash Hits Rate

Table 4: Top-3 Hash Hits Rate. SiDA demonstrates exemplary accuracy on expert activation prediction, up to over 99%, across various models.
Backbone        | SST2   | MRPC   | MultiRC
Switch-base-8   | 99.00% | 97.41% | 91.74%
Switch-base-128 | 98.78% | 98.65% | 90.49%

SiDA adopts a predictor to predict the experts to be activated for each token. We investigate the accuracy of the predictor in the hash-building thread, which we refer to as the hash hits rate. Results can be found in Table 4, where we report top-3 accuracy. Even for very long sentences, such as those in the MultiRC dataset, the hash hits rate reaches over 90%.
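A small sketch of how such a top-k hash hits rate can be computed, assuming per-token predictor logits and the router's true top-1 expert indices (shapes and names are illustrative):

```python
import torch

def topk_hash_hit_rate(pred_logits: torch.Tensor, true_expert: torch.Tensor, k: int = 3) -> float:
    """Fraction of tokens whose true routed expert appears in the predictor's
    top-k prediction. pred_logits: (num_tokens, num_experts); true_expert: (num_tokens,)."""
    k = min(k, pred_logits.size(-1))                    # guard for few-expert models
    topk = pred_logits.topk(k, dim=-1).indices          # (num_tokens, k)
    hits = (topk == true_expert.unsqueeze(-1)).any(-1)  # token-level hit flags
    return hits.float().mean().item()
```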

5 Related Work

With the rise of LLM, efficient serving for large models has become a hot topic. Much research has been done by adopting classical model compression methods, such as knowledge distillation (Fu et al., 2023; Li et al., 2023b; Tan et al., 2023; Wang et al., 2023; Wu et al., 2023; Gu et al., 2023; Zhou et al., 2023; Yuan et al., 2023a), quantization (Chee et al., 2023; Frantar et al., 2022; Lin et al., 2023; Cheng et al., 2023; Liu et al., 2023a; b; Shang et al., 2023; Shao et al., 2023; Xiao et al., 2023; Yuan et al., 2023b), and pruning (Frantar & Alistarh, 2023; Ji et al., 2023; Ma et al., 2023; Sun et al., 2023; Xia et al., 2023; Li et al., 2023c). Further, others have been exploring more efficient network architectures (Del Corro et al., 2023; Liu et al., 2023c; Miao et al., 2023; Jiang et al., 2023b; Ning et al., 2023; Spector & Re, 2023; Xu et al., 2023). Besides, some have tackled the efficiency problem from a data perspective by performing text compression (Chevalier et al., 2023; Ge et al., 2023; Valmeekam et al., 2023; Jiang et al., 2023a). However, these works are not specifically designed for MoE models and ignore the sparse expert activation patterns. SiDA exploits the expert activation patterns to achieve efficient inference. Furthermore, SiDA is orthogonal to methods such as quantization and pruning, which can be applied to the activated experts’ networks.

We notice several concurrent works that are specifically designed for efficient MoE-based model inference (Huang et al., 2023; Kong et al., 2023; Yi et al., 2023). However, SiDA is orthogonal to these works, which focus on designing better scheduling for caching experts. SiDA explores a data-aware path that predicts the experts to be activated. The data-aware approach and the caching scheduling can be combined to achieve better efficiency.

6 Discussion

Enhanced Hierarchical Offloading. While SiDA offers offloading capabilities between main memory and GPU memory, its limitations are defined by the storage capacity of the main memory. This poses challenges, especially when deploying massive models like Switch-c-2048 with almost 5TB of parameters. A logical progression would be to introduce a layered offloading mechanism that fluidly transfers experts between GPU memory, main memory, and SSD storage. Such an advanced hierarchical approach in SiDA would make it adept at handling models of any magnitude.

Optimized Hash Graph for Expert Activation Storage. Currently, SiDA utilizes an LSTM model to function as its hash system. It’s evident that the expert activation is conditionally contingent upon the activation patterns observed in preceding MoE layers. To enhance efficiency, an ideal hash function could be designed as a graph. This graph would capture and store these conditional dependencies, enabling rapid and effective extraction of expert activation.

7 Conclusion

In summary, this paper presents SiDA, a novel data-aware method that adeptly addresses the challenges posed by the memory constraints of GPUs when serving expansive models, specifically leveraging the sparsity inherent in MoE architectures. Further, SiDA deploys an offline trained hash function running in the hash-building thread, which alleviates the expert selection overhead by a large margin. Through judicious utilization of both main and GPU memory, SiDA offers a promising route for serving large MoE models under limited GPU budgets with nearly zero performance setbacks.

References

  • Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE, 2022.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. Quip: 2-bit quantization of large language models with guarantees. arXiv preprint arXiv:2307.13304, 2023.
  • Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. Towards understanding the mixture-of-experts layer in deep learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=MaYzugDmQV.
  • Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., and Lv, K. Optimize weight rounding via signed gradient descent for the quantization of llms. arXiv preprint arXiv:2309.05516, 2023.
  • Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023.
  • Choquette, J. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro, 2023.
  • Collobert, R., Bengio, S., and Bengio, Y. A parallel mixture of svms for very large scale problems. Advances in Neural Information Processing Systems, 14, 2001.
  • Del Corro, L., Del Giorno, A., Agarwal, S., Yu, B., Awadallah, A., and Mukherjee, S. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628, 2023.
  • Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
  • Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
  • Frantar, E. and Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. 2023.
  • Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  • Fu, Y., Peng, H., Ou, L., Sabharwal, A., and Khot, T. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  • Ge, T., Hu, J., Wang, X., Chen, S.-Q., and Wei, F. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.
  • Gu, Y., Dong, L., Wei, F., and Huang, M. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang, H., Ardalani, N., Sun, A., Ke, L., Lee, H.-H. S., Sridhar, A., Bhosale, S., Wu, C.-J., and Lee, B. Towards moe deployment: Mitigating inefficiencies in mixture-of-expert (moe) inference. arXiv preprint arXiv:2303.06182, 2023.
  • Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
  • Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., Wang, Z., Salas, R., Jose, J., Ram, P., et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 5, 2023.
  • Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
  • Ji, Y., Cao, Y., and Liu, J. Pruning large language models via accuracy predictor. arXiv preprint arXiv:2309.09507, 2023.
  • Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023a.
  • Jiang, Y., He, Q., Zhuang, X., Wu, Z., Wang, K., Zhao, W., and Yang, G. Recyclegpt: An autoregressive language model with recyclable module. arXiv preprint arXiv:2308.03421, 2023b.
  • Jordan, M., Ghahramani, Z., and Saul, L. Hidden markov decision trees. Advances in neural information processing systems, 9, 1996.
  • Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kong, R., Li, Y., Feng, Q., Wang, W., Kong, L., and Liu, Y. Serving moe models on resource-constrained edge devices via dynamic expert swapping. arXiv preprint arXiv:2308.15030, 2023.
  • Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  • Lewis, M., Bhosale, S., Dettmers, T., Goyal, N., and Zettlemoyer, L. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pp. 6265–6274. PMLR, 2021.
  • Li, B., Shen, Y., Yang, J., Wang, Y., Ren, J., Che, T., Zhang, J., and Liu, Z. Sparse mixture-of-experts are domain generalizable learners. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=RecZ9nB9Q4.
  • Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K.-W., and Choi, Y. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint arXiv:2306.14050, 2023b.
  • Li, Y., Yu, Y., Zhang, Q., Liang, C., He, P., Chen, W., and Zhao, T. Losparse: Structured compression of large language models based on low-rank and sparse approximation. arXiv preprint arXiv:2306.11222, 2023c.
  • Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  • Liu, J., Gong, R., Wei, X., Dong, Z., Cai, J., and Zhuang, B. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041, 2023a.
  • Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023b.
  • Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Re, C., et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pp. 22137–22176. PMLR, 2023c.
  • Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Ma, X., Fang, G., and Wang, X. Llm-pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.
  • Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International conference on machine learning, pp. 1614–1623. PMLR, 2016.
  • Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023.
  • Ning, X., Lin, Z., Zhou, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
  • OpenAI. Gpt-4 technical report, 2023.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International Conference on Machine Learning, pp. 18332–18346. PMLR, 2022.
  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
  • Roller, S., Sukhbaatar, S., Szlam, A., and Weston, J. E. Hash layers for large sparse models. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=lMgDDWb1ULW.
  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Shang, Y., Yuan, Z., Wu, Q., and Dong, Z. Pb-llm: Partially binarized large language models. arXiv preprint arXiv:2310.00034, 2023.
  • Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
  • Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
  • Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
  • Spector, B. and Re, C. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.
  • Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
  • Tan, S., Tam, W. L., Wang, Y., Gong, W., Zhao, S., Zhang, P., and Tang, J. [industry] gkd: A general knowledge distillation framework for large-scale pre-trained language model. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
  • Tresp, V. Mixtures of gaussian processes. Advances in neural information processing systems, 13, 2000.
  • Valmeekam, C. S. K., Narayanan, K., Kalathil, D., Chamberland, J.-F., and Shakkottai, S. Llmzip: Lossless text compression using large language models. arXiv preprint arXiv:2306.04050, 2023.
  • Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
  • Wang, P., Wang, Z., Li, Z., Gao, Y., Yin, B., and Ren, X. Scott: Self-consistent chain-of-thought distillation. arXiv preprint arXiv:2305.01879, 2023.
  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Wu, M., Waheed, A., Zhang, C., Abdul-Mageed, M., and Aji, A. F. Lamini-lm: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402, 2023.
  • Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., and Song, S. L. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023.
  • Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
  • Xu, M., Xu, Y. L., and Mandic, D. P. Tensorgpt: Efficient compression of the embedding layer in llms based on the tensor-train decomposition. arXiv preprint arXiv:2307.00526, 2023.
  • Xue, F., Shi, Z., Wei, F., Lou, Y., Liu, Y., and You, Y. Go wider instead of deeper. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8779–8787, 2022.
  • Yang, H., Yue, S., and He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
  • Yi, R., Guo, L., Wei, S., Zhou, A., Wang, S., and Xu, M. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint arXiv:2308.14352, 2023.
  • Yuan, S., Chen, J., Fu, Z., Ge, X., Shah, S., Jankowski, C. R., Yang, D., and Xiao, Y. Distilling script knowledge from large language models for constrained language planning. arXiv preprint arXiv:2305.05252, 2023a.
  • Yuan, Z., Niu, L., Liu, J., Liu, W., Wang, X., Shang, Y., Sun, G., Wu, Q., Wu, J., and Wu, B. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023b.
  • Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. Distillspec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023.