SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Zhixu Du $^{1}$ Shiyu Li $^{1}$ Yuhao Wu $^{1}$ Xiangyu Jiang $^{2}$ Jingwei Sun $^{1}$ Qilin Zheng $^{1}$ Yongkai Wu $^{2}$ Ang Li $^{3}$ Hai "Helen" Li $^{1}$ Yiran Chen $^{1}$

Abstract

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. To address this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference, with up to a $3.93 \times$ throughput increase, up to $75\%$ latency reduction, and up to $80\%$ GPU memory saving, with as little as a $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.

1 INTRODUCTION

Recently, rapid advances in large models have delivered striking performance in several areas, such as vision (Ramesh et al., 2022; Kirillov et al., 2023; Saharia et al., 2022), language (Brown et al., 2020; OpenAI, 2023; Smith et al., 2022), decision making (Yang et al., 2023), and robotics (Vemprala et al., 2023). For example, GPT-4 has demonstrated capabilities that are comparable to, or even exceed, human-level understanding on several tasks (OpenAI, 2023), and DALL-E 2 can generate astonishingly high-quality images. The outstanding performance of large models relies heavily on their enormous number of parameters, in line with the scaling law (Kaplan et al., 2020). Broadly speaking, the scaling law asserts that as the model size increases, various characteristics such as training loss, test performance, and the amount of required data exhibit predictable scaling behaviors.

Mixture-of-Experts (MoE), a classical model architecture, enjoys an advantage that naturally fits the era of large models. MoE can improve the model's performance by drastically increasing the number of parameters while incurring only little computational overhead. Although the number of parameters involved in the forward pass of an MoE model remains almost unchanged, research (Fedus et al., 2022) suggests that augmenting parameter counts using the MoE architecture still conforms to the scaling law. Encouraged by this advantage, many MoE-based large models have been proposed and have achieved outstanding performance in computer vision (Li et al., 2023a; Riquelme et al., 2021; Xue et al., 2022) and natural language processing (Shazeer et al., 2017; Fedus et al., 2022). Specifically, the Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017) layer scales LSTM models to 137 billion parameters, which improves the model capacity by $1000 \times$ with a marginal increase in computational overhead. Switch Transformers (Fedus et al., 2022) scale to 1.6 trillion parameters with the same perplexity as T5-XXL (Raffel et al., 2020) while achieving a $4 \times$ speedup during inference. However, the success of MoE comes with sacrifices in effective GPU memory utilization: MoE models occupy a large amount of memory, while only a small fraction of the parameters residing in memory are effective for inference on the current batch. Fig. 1 depicts the architecture of MoE-based transformers, where only a small portion of experts are activated in each MoE layer during each inference.

Figure 1. Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer, only a limited number of experts are activated for inference.

Further, with the trend of model scaling, we have observed a substantial gap between the memory demands of large models and the memory capacity of GPUs. For instance, in the past three years, the number of parameters in state-of-the-art models has scaled from 175 billion in GPT-3 (Brown et al., 2020) to 1.76 trillion in the newly announced GPT-4 (OpenAI, 2023), an increase of over $10 \times$. In contrast, the memory capacity of high-end GPUs remains around 80GB (Choquette, 2023), and commodity GPUs are still limited to 48GB or even less. This growing discrepancy motivates techniques to improve memory utilization efficiency. Thus, we seek to answer a compelling research question:

How to serve large Mixture-of-Experts models in an efficient and scalable manner under constrained memory?

Previous efforts have studied the efficiency problem of MoE models to some extent. Deepspeed-MoE (Rajbhandari et al., 2022) optimizes the MoE module in the Deepspeed framework for efficient grouping and scheduling. A later version of the work (Aminabadi et al., 2022) focuses on optimizing inference efficiency with optimized computation kernels and careful coordination of communication and parallelism. Tutel (Hwang et al., 2023) enables adaptive parallelism and pipelining at runtime. However, these methods focus only on optimizing device-to-device communication and ignore data-awareness, not to mention exploiting it to improve efficiency during inference. Data-awareness refers to a design in which the technique or strategy is determined based on the incoming data. Our proposed framework embraces data-awareness, which brings three advantages. Firstly, data-awareness can exploit the sparsity more fully, leading to a further increase in memory efficiency compared to previous methods. Secondly, data-awareness preserves the structure crucial for a sample's unique features, better maintaining the model's performance. Thirdly, data-awareness offers better adaptability, since the framework varies according to the data distribution.

Table 1. Comparison of SiDA and Baseline Methods. This table delineates the capabilities of various methods in terms of data-awareness, effective GPU memory utilization, and inference speed on large MoE models. SiDA excels in its data-aware approach, with high effective GPU memory utilization and high inference speed on large MoE models.

Methods	Data-aware	Effective GPU memory utilization	Inference speed on large MoE
Standard	✗	low	slow
Deepspeed	✗	medium	slow
Tutel	✗	medium	slow
SiDA	✓	Extremely high	Extremely high

In this paper, we present an efficient inference system, i.e., SiDA (Sparsity-inspired Data-Aware), for serving large MoE models. Noting that modern server CPUs support terabytes (TB) of main memory, dwarfing GPU capacity, SiDA dynamically leverages both main memory and GPU memory by exploiting sparsity in MoE models in a data-aware manner. We summarize the comparison between SiDA and baselines in Table 1. SiDA contains two threads that run in parallel, an inference thread and a hash-building thread. The hash-building thread exploits the sparsity of expert activation in a data-aware manner, and its core is a network-based hash function. Specifically, the hash function is an offline-trained predictor that predicts the experts to be activated. In this work, we employ an LSTM (Hochreiter & Schmidhuber, 1997) with sparse attention and a truncated knowledge distillation to boost the performance of the hash function. The inference thread offloads the experts that the hash-building thread predicts to be inactive, maximizing effective GPU memory utilization. Besides, SiDA also brings significant speedup during inference.

Our contributions are summarized as follows:

• To the best of our knowledge, SiDA is the first sparsity-inspired data-aware system for efficient and scalable inference on large MoE models.
• We propose an offline training strategy to build a data-aware hash function deployed in SiDA that replaces the router function in MoE layers. Our design boosts the throughput of MoE models by up to $3.93 \times$ and reduces the latency to as low as $25\%$.
• Our offloading scheme achieves up to $80\%$ GPU memory saving with less than $1\%$ performance drop. Our hash function can achieve up to $99\%$ prediction accuracy on expert activation.

The paper is organized as follows. In Section 2, we introduce the background and motivation. Section 3 is devoted to the framework of SiDA. In Section 4, we present our experimental results. Sections 5, 6, and 7 are devoted to related works, discussions, and conclusions, respectively.

2 BACKGROUND AND MOTIVATION

We introduce the background and motivation for SiDA in this section. For notation, we use $a$, $\boldsymbol{a}$, $\mathbf{a}$, $\boldsymbol{A}$, and $\mathcal{A}$ to denote a scalar, vector, random vector variable, matrix, and set, respectively. We use $[K]$ to denote $\{1, 2, \dots, K\}$.

2.1 Mixture of Experts

Since the first proposal of Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994), different MoE models have been proposed based on various expert models, for example, hidden Markov models (Jordan et al., 1996), Gaussian processes (Tresp, 2000), and support vector machines (Collobert et al., 2001). With the rise of deep learning, Eigen et al. proposed the use of several sets of routers and experts to build a stacked model, namely Deep MoE (Eigen et al., 2013).

An MoE layer consists of a router function, denoted as $h(\cdot; W_{r})$, followed by $K$ experts in parallel, denoted as $\{f_{i}(\cdot; \theta_{i})\}_{i=1}^{K}$. Usually, the router function is set as a linear function, i.e., $h(x; W_{r}) = W_{r}^{\top} x$ where $W_{r} \in \mathbb{R}^{d \times K}$ for input $x \in \mathbb{R}^{d}$, and experts are multi-layer perceptrons (MLPs) with a non-linear activation function (Chen et al., 2022; Fedus et al., 2022; Shazeer et al., 2017). The output of an MoE layer takes the form:

$$M(x; W_{r}, \theta_{1}, \dots, \theta_{K}) = \sum_{i \in I} \alpha_{i}(x) f_{i}(x; \theta_{i}), \tag{1}$$

where $I$ contains the selected indices of experts and the scaling factor $\alpha_{i}$ is defined as

$$\alpha_{i}(x) = \frac{\exp\{W_{r}[:, i]^{\top} x\}}{\sum_{j=1}^{K} \exp\{W_{r}[:, j]^{\top} x\}}.$$

Different selection mechanisms for $I$ lead to different models. The soft-routing model (Jordan & Jacobs, 1994) selects all experts, i.e., $I = [K]$, which leads to high computational overhead. The switch-routing model (Fedus et al., 2022) selects the top-1 expert, i.e., $I = \arg\max_{i \in [K]} \alpha_{i}(\cdot)$, introducing little extra computational overhead.
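To make Eq. 1 and the two routing choices concrete, the following is a minimal pure-Python sketch rather than any particular library's implementation. The function names `router_scores` and `moe_forward`, the toy experts (simple scalings standing in for MLPs), and the small router matrix are all illustrative assumptions.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def router_scores(x: Sequence[float], W_r: Sequence[Sequence[float]]) -> Vector:
    """Softmax scaling factors alpha_i(x) of the linear router h(x) = W_r^T x.

    W_r is stored as d rows of K columns, matching W_r in R^{d x K}.
    """
    K = len(W_r[0])
    logits = [sum(W_r[r][i] * x[r] for r in range(len(x))) for i in range(K)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Sequence[float],
    W_r: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: Optional[int] = None,
) -> Vector:
    """Eq. 1: sum over the selected experts I of alpha_i(x) * f_i(x).

    top_k=None selects all experts (soft routing); top_k=1 is switch routing.
    """
    alpha = router_scores(x, W_r)
    order = sorted(range(len(experts)), key=lambda i: alpha[i], reverse=True)
    selected = order if top_k is None else order[:top_k]
    out = [0.0] * len(x)
    for i in selected:  # only the selected experts are evaluated
        y = experts[i](x)
        out = [o + alpha[i] * y_j for o, y_j in zip(out, y)]
    return out


# Toy experts standing in for MLPs: each one scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
W_r = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # d = 2, K = 3
x = [1.0, 1.0]

alpha = router_scores(x, W_r)
switch_out = moe_forward(x, W_r, experts, top_k=1)  # evaluates one expert
soft_out = moe_forward(x, W_r, experts)             # evaluates all experts
```

With switch routing only one expert function is called per token, which is why the remaining experts sit idle in memory for that token.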

2.2 Low Effective Utilization of GPU Memory

Encouraged by the advantage of MoE-based large models, namely that drastically increasing the number of parameters incurs little computational overhead, many large-scale architectures have been proposed, such as the Sparsely-Gated MoE (Shazeer et al., 2017), GShard (Lepikhin et al., 2020), and Switch Transformers (Fedus et al., 2022). Specifically, the Sparsely-Gated MoE proposes a trainable router function to determine the experts to be activated for each sample, which makes it possible to build very large MoE-based models, as it improves computational efficiency by a large margin compared to soft routing, which selects all experts. The Sparsely-Gated MoE scales LSTM models to 137 billion parameters, achieving outstanding performance. Switch Transformers, the most widely used transformer-based large MoE, convert T5 models (Raffel et al., 2020) to their MoE versions. All Switch Transformers outperform their dense foundation models with the same FLOPs.

Figure 2. Memory Efficiency of Switch Transformers on SST2. The $x$-axis represents the sentence length, and the bars record the number of sentences of each length. The lines represent the effective memory utilization of Switch Transformers on SST2 across sentence lengths. Utilization as low as $5\%$ can be observed for large models.

In our study, we found that large MoE models do not utilize GPU memory efficiently. Following Eq. 1, we denote expert $i$ as activated if $i \in I$. Effective GPU memory refers to the memory storing parameters that are effective for the forward pass of the model. Inactivated experts occupy a large amount of GPU memory while remaining idle in the forward pass, leading to low effective GPU memory utilization. To quantitatively analyze GPU memory utilization, we summarize the model sizes and MoE layer sizes of Switch Transformers in Table 2. For all Switch Transformers, especially the large ones, MoE layers occupy a large portion of GPU memory. Meanwhile, most of the parameters of the MoE layers are idle during one forward pass. To ascertain the amount of ineffective GPU memory, we feed samples from the SST2 dataset to Switch Transformers and record the corresponding effective memory utilization rates. The results are depicted in Fig. 2. For large Switch Transformers such as Switch-base-128 and Switch-base-256, the ineffective GPU memory for short sentences is around 24GB and 50GB, respectively. Even for the longest sentences with 80 tokens, the ineffective GPU memory is around 20GB and 46GB, respectively. Our method, SiDA, can save all ineffective GPU memory, outperforming baselines by a large margin. Further results on GPU memory reduction across datasets can be found in Section 4.

Table 2. Memory Occupation of Switch Transformers. This table highlights the allocation of parameters in gigabytes (GB) for different models. MoE parameters dominate memory usage, especially in larger models. In contrast, mainstream GPUs peak at 48GB, with many at 24GB, while mobile GPUs range from 4GB to 12GB.

	Model (GB)	MoE (GB)	Percentage (%)
Switch-base-8	2.298	1.7932	78.03
Switch-base-64	14.112	13.608	96.42
Switch-base-128	27.614	27.11	98.17
Switch-base-256	54.62	54.114	99.07
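The relation between the sizes in Table 2 and effective memory utilization can be sketched with a few lines of arithmetic. The sizes below are copied from Table 2; the helper names and the simplifying assumption that all MoE memory belongs to equally sized experts, with the same fraction activated in every layer, are ours and are not taken from the paper's measurement procedure.

```python
from typing import Dict, Tuple

# Sizes in GB, copied from Table 2: (total model, MoE layers).
SWITCH_SIZES_GB: Dict[str, Tuple[float, float]] = {
    "Switch-base-8": (2.298, 1.7932),
    "Switch-base-64": (14.112, 13.608),
    "Switch-base-128": (27.614, 27.11),
    "Switch-base-256": (54.62, 54.114),
}


def moe_share(model: str) -> float:
    """Fraction of the model's memory taken by MoE layers."""
    total, moe = SWITCH_SIZES_GB[model]
    return moe / total


def effective_utilization(model: str, active_fraction: float) -> float:
    """Share of resident memory that takes part in a forward pass.

    Assumption: every expert has the same size and `active_fraction` of the
    experts is activated in every MoE layer.
    """
    total, moe = SWITCH_SIZES_GB[model]
    dense = total - moe  # non-MoE parameters are always effective
    return (dense + active_fraction * moe) / total


def ineffective_gb(model: str, active_fraction: float) -> float:
    """Memory in GB held by experts that stay idle (same assumption)."""
    _, moe = SWITCH_SIZES_GB[model]
    return (1.0 - active_fraction) * moe
```

For example, with a hypothetical active fraction of 0.2, `ineffective_gb("Switch-base-256", 0.2)` evaluates to roughly 43.3 GB, which is the amount an offloading scheme could move to main memory under these assumptions.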

Figure 3. Expert Selection Overhead on SST2. The bars depict the percentage breakdown of total inference latency, including the expert selection overhead. Up to $74\%$ of the time on Switch-base-256 is occupied by expert selection. Notably, the share of expert selection overhead grows as model size increases.

2.3 High Expert Selection Overhead

Apart from the low effective GPU memory utilization, we also observed a high overhead of expert selection in the forward pass of MoE. Specifically, in all baseline implementations of MoE models, a non-negligible amount of time is consumed in selecting the most suitable experts. We conduct experiments on SST2 with multiple MoE models and provide the profiling results of averaged inference time and expert selection overhead in Fig. 3. The expert selection process consumes nearly $75\%$ of the total inference time for Switch-base-256, making it a bottleneck for inference latency. Notably, the overhead associated with expert selection escalates with the scale of the model, further emphasizing the need to address this bottleneck.

2.4 Sparse Activation of Experts in Large MoE Models

The sparse selection of experts is one of the critical observations that motivate SiDA. Our observation verifies that only a small portion of experts will be activated during inference.

For each token, the router function selects either the top-$K$ (Shazeer et al., 2017) or the top-1 (Fedus et al., 2022) experts, inducing token-level expert activation sparsity. However, the sparsity at the level of sentences, typically with 512 or 768 tokens, remains unclear. Moreover, in the training stage, an expert load-balancing loss must be applied, which forces the router to assign an almost equal number of tokens to each expert. Otherwise, the router's outputs will collapse to a few experts, leading to capacity degradation (Chen et al., 2022).

Figure 4. Expert Activation in Switch Transformers on SST2. The $x$-axis denotes sentence length, with bars illustrating the number of sentences of each length. The lines depict the ratio of idle experts. Notably, Switch-base-256 and Switch-base-128 activate less than $20\%$ and $40\%$ of their experts, respectively.

We test Switch Transformers with different numbers of experts on the SST2 dataset and report the sentence-level sparsity in Fig. 4. Our observation verifies that the sparse activation pattern still exists at the sentence level for large MoE models such as Switch-base-128 and Switch-base-256. As shown in the figure, less than $40\%$ and $20\%$ of the experts are activated for Switch-base-128 and Switch-base-256, respectively. Even for the longest sentences with around 80 tokens, the ratio of idle experts is still higher than $70\%$ for Switch-base-128 and $80\%$ for Switch-base-256.
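The sentence-level sparsity discussed above can be measured as the fraction of experts in one MoE layer that no token of the sentence selects. The sketch below assumes top-1 (switch) routing and uses hypothetical helper names and toy router logits; it is not the paper's profiling code.

```python
from typing import List, Sequence


def top1_expert(logits: Sequence[float]) -> int:
    """Index of the expert a switch (top-1) router picks for one token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def idle_expert_ratio(token_logits: Sequence[Sequence[float]]) -> float:
    """Fraction of experts in one layer that no token in the sentence activates."""
    num_experts = len(token_logits[0])
    activated = {top1_expert(logits) for logits in token_logits}
    return 1.0 - len(activated) / num_experts


# Toy sentence: 4 tokens routed over 8 experts, picking experts 2, 2, 5, and 0.
sentence: List[List[float]] = [
    [1.0 if e == pick else 0.0 for e in range(8)] for pick in (2, 2, 5, 0)
]
ratio = idle_expert_ratio(sentence)  # 3 of 8 experts are activated
```

Under top-1 routing, a sentence of $n$ tokens can activate at most $\min(n, K)$ experts in a layer, so an 80-token sentence necessarily leaves at least $1 - 80/256 \approx 69\%$ of the experts in a 256-expert layer idle, before any overlap between tokens is taken into account.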

3 SIDA

3.1 Overview: workflow

We introduce a novel framework, Sparsity-inspired Data-Aware (SiDA), for efficient inference of large MoE models; an overview is shown in Fig. 5. SiDA contains two threads that run in parallel, namely the inference thread and the hash-building thread. Consider a sequence of incoming batches. Batch $X_{j}$ is fed to the hash-building thread to build the hash table $H_{j}$, which stores the expert activation patterns for $X_{j}$ and is pushed to the hash table queue. At the same time, the inference thread handles the preceding batch $X_{i}$ and performs dynamic offloading on MoE layers based on the hash table $H_{i}$.
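The two-thread workflow can be sketched as a producer-consumer pair connected by a queue, using only Python's standard `threading` and `queue` modules. This is an illustration of the control flow, not SiDA's implementation: `toy_hash_function` is a placeholder for the offline-trained predictor, the hash table here stores only expert indices (the scaling factors are omitted), and the inference step is reduced to recording the table it would act on.

```python
import queue
import threading
from typing import Dict, List, Optional, Set, Tuple

HashTable = Dict[int, Set[int]]  # MoE layer index -> experts predicted active


def toy_hash_function(batch: List[int], num_layers: int, num_experts: int) -> HashTable:
    """Placeholder for the offline-trained predictor (purely illustrative)."""
    return {
        layer: {(token + layer) % num_experts for token in batch}
        for layer in range(num_layers)
    }


def hash_building_thread(batches, table_queue, num_layers, num_experts) -> None:
    """Producer: builds H_j for each batch X_j and pushes it to the queue."""
    for j, batch in enumerate(batches):
        table_queue.put((j, toy_hash_function(batch, num_layers, num_experts)))
    table_queue.put(None)  # sentinel: no more batches


def inference_thread(batches, table_queue, results) -> None:
    """Consumer: pops H_i and handles batch X_i."""
    while True:
        item: Optional[Tuple[int, HashTable]] = table_queue.get()  # idles until ready
        if item is None:
            break
        i, table = item
        # A real system would load/offload experts according to `table`
        # and then run the forward pass on batches[i].
        results.append((i, {layer: sorted(e) for layer, e in table.items()}))


def run_pipeline(batches: List[List[int]], num_layers: int = 2, num_experts: int = 8):
    table_queue: "queue.Queue" = queue.Queue()
    results: list = []
    builder = threading.Thread(
        target=hash_building_thread,
        args=(batches, table_queue, num_layers, num_experts),
    )
    worker = threading.Thread(
        target=inference_thread, args=(batches, table_queue, results)
    )
    builder.start()
    worker.start()
    builder.join()
    worker.join()
    return results


pipeline_results = run_pipeline([[0, 1], [3, 3, 4]])
```

Because the queue is first-in-first-out and there is a single producer and a single consumer, hash tables reach the inference thread in batch order, and the blocking `get` models the inference thread idling when no table is available yet.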

Hash-building thread. The hash-building thread consists of two components, a hash function and a hash table queue. For each incoming batch (①-a), the hash function determines the experts to be activated for each token at each layer and the corresponding scaling factor $\alpha$ (①-b). The predictions are stored in the hash table $H_{j}$ for the batch $X_{j}$ and pushed to the hash table queue (①-c). The hash function can be a predefined hash function if the MoE model is trained with the Hash layer (Roller et al., 2021). More commonly, for MoE models using trained router functions, such as Switch Transformers, the hash function is trained offline. We propose hash function training techniques dedicated to modern MoE models, which will be introduced in later sections.

Figure 5. Overview of SiDA. SiDA contains two threads, the inference thread and the hash-building thread, that run concurrently. As each batch $X_{j}$ arrives, the hash-building thread constructs the expert hash table $H_{j}$ and queues it. In tandem, the inference thread processes the preceding batch $X_{i}$, dynamically managing experts in MoE layers based on the hash table $H_{i}$.

Inference thread. The inference thread performs two tasks: it dynamically loads activated experts and offloads inactivated experts according to the hash table built by the hash-building thread, and it uses the SiDA MoE layers to run inference on input batches. Specifically, for each incoming batch $X_{i}$ (②-a), the inference thread first pops the hash table $H_{i}$ from the hash table queue (②-b) and remains idle if $H_{i}$ is not found. Notably, in practice, the inference thread takes longer to run inference on a batch than the hash-building thread takes to build the hash table for a batch. As a result, the inference thread never idles except at the very beginning. With the popped hash table $H_{i}$, the next step is to dynamically load and offload experts. Based on the GPU memory budget and the expert activation pattern of the current batch, the inference thread loads activated experts to the GPU and offloads inactivated experts to RAM (②-c). A first-in-first-out (FIFO) scheme is applied to experts if no memory budget remains. The dynamic loading task of an MoE layer is performed right after inference on the previous batch finishes, following the pipeline parallelism mechanism (Huang et al., 2019). Note that, in our system, all routers are offloaded to the main memory and do not participate in the forward pass. Lastly, the incoming batch $X_{i}$ is forwarded using the SiDA MoE layers specific to $X_{i}$ (②-d).
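One way to read the loading and offloading step is as a budgeted cache of resident experts with FIFO eviction. The sketch below is our interpretation under stated assumptions, not SiDA's code: the class name `FifoExpertCache` is hypothetical, the budget is counted in number of experts rather than bytes, idle experts stay resident while the budget allows, and experts needed by the current batch are never chosen as eviction victims.

```python
from collections import OrderedDict
from typing import Iterable, List, Tuple

Key = Tuple[int, int]  # (MoE layer index, expert index)


class FifoExpertCache:
    """Tracks which experts are resident on the GPU under a fixed budget."""

    def __init__(self, budget: int) -> None:
        self.budget = budget  # assumed unit: number of resident experts
        self.resident: "OrderedDict[Key, None]" = OrderedDict()

    def prepare(self, needed: Iterable[Key]) -> Tuple[List[Key], List[Key]]:
        """Make every `needed` expert resident before the forward pass.

        Returns (loaded, offloaded) so a caller could issue the host-to-device
        and device-to-host copies.
        """
        needed = list(dict.fromkeys(needed))  # de-duplicate, keep order
        if len(needed) > self.budget:
            raise ValueError("batch activates more experts than the budget allows")
        needed_set = set(needed)
        loaded: List[Key] = []
        offloaded: List[Key] = []
        for key in needed:
            if key in self.resident:
                continue  # already on the GPU
            while len(self.resident) >= self.budget:
                # FIFO: evict the oldest resident expert this batch does not need.
                victim = next(k for k in self.resident if k not in needed_set)
                del self.resident[victim]
                offloaded.append(victim)
            self.resident[key] = None
            loaded.append(key)
        return loaded, offloaded
```

A caller would pass the (layer, expert) pairs listed in the popped hash table to `prepare` and then copy the returned experts between main memory and the GPU before running the batch.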

3.2 Design challenges

In the design of SiDA, we spot three key challenges.

Challenge 1: How to efficiently obtain experts that are to be offloaded beforehand? Given the observation that experts are activated sparsely, it is straightforward to save GPU memory by offloading inactivated experts to RAM. However, this naive implementation sacrifices latency, since expert activation patterns are inaccessible without the output of the router functions. Moving experts between CPU and GPU after each router function incurs large overheads, as it breaks the forwarding pipeline. We propose to use an offline-trained hash function to acquire the expert activation pattern before inference starts for each batch. Furthermore, we design the hash function to run independently of model inference and build a hash-building thread that runs in parallel with the inference thread to meet the efficiency requirements. By employing the hash-building thread, SiDA achieves outstanding latency compared to baselines, since expert selection, dynamic offloading, and inference all run in parallel.

Challenge 2: How to leverage sparse cross-embedding dependency on expert activation to design a lightweight offline-trained hash function? Considering the inference efficiency and the GPU memory consumption of the system, the hash function must be a lightweight predictor. However, simple predictors can hardly capture the contextual information of the sequence and can be easily distracted. Hence, it becomes crucial to force the predictor to focus on critical information. We empirically verify that there exists a sparse cross-embedding dependency on expert activation, i.e., a limited number of embeddings in the sequence jointly affect expert activation. This sparse cross-embedding dependency sheds light on the success of lightweight predictors. However, it is impractical and inefficient to exhaustively examine all possible outcomes to find the cross-embedding dependency for every token. In response to this challenge, we propose a sparse attention mechanism on LSTM that forces the predictor to focus on the most important embeddings automatically.

Challenge 3: How to improve the expert selection accuracy and approximate the scaling factor simultaneously? The hash function needs to determine not only the expert activation but also the scaling factor $\alpha$ in Eq. 1. As the scaling factor is derived from the SoftMax logits output by the model, it is natural to apply knowledge distillation (KD), setting the router functions as teacher models and the hash function as the student model. However, the limited capacity of the hash function makes it infeasible to approximate the scaling factor distribution over all experts by KD. To solve this challenge, we propose to use a truncated knowledge distillation (TKD), where the KD loss is computed over the top-$T$ experts. However, TKD alone cannot guarantee adequate prediction accuracy, so we further add a cross-entropy loss to boost the prediction accuracy.
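A loss of the kind described in Challenge 3 can be sketched as follows. The text above only states that the KD term is computed over the top-$T$ experts and that a cross-entropy term is added; everything else in this sketch is our assumption: both distributions are renormalized over the teacher's top-$T$ experts, the KD term is $\mathrm{KL}(\text{teacher} \,\|\, \text{student})$, the cross-entropy target is the teacher's top-1 expert, and the two terms are combined by a weighted sum with a hypothetical weight `ce_weight`.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def tkd_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    top_t: int,
    ce_weight: float = 1.0,
) -> float:
    """Truncated KD over the teacher's top-T experts plus cross-entropy."""
    K = len(teacher_logits)
    top = sorted(range(K), key=lambda i: teacher_logits[i], reverse=True)[:top_t]
    log_p = log_softmax([teacher_logits[i] for i in top])  # truncated teacher
    log_q = log_softmax([student_logits[i] for i in top])  # truncated student
    kd = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Cross-entropy against the teacher's top-1 expert over all K experts.
    ce = -log_softmax(student_logits)[top[0]]
    return kd + ce_weight * ce


teacher = [2.0, 1.0, 0.0, -1.0]          # router logits for one token (toy values)
matched = tkd_loss(teacher, teacher, top_t=2)
mismatched = tkd_loss(teacher, [0.0, 1.0, 2.0, -1.0], top_t=2)
```

The truncation means the student is not penalized by the KD term for how it distributes probability among experts outside the top-$T$, while the cross-entropy term still pushes it to rank the correct expert first.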

We introduce how SiDA deals with each challenge in detail in the following sections.

3.3 Data-Aware and Efficient Expert Activation Prediction

SiDA proposes a data-aware solution to efficiently obtain the experts to be offloaded beforehand. Specifically, we propose to use a trained hash function that takes the sequence of embeddings as input and predicts all the activated experts for each token in the sequence. SiDA, augmented by the data-aware expert activation prediction, enjoys two advantages while incurring a loss in model performance of less than $1\%$. Firstly, the system can acquire the activation pattern of each sample beforehand and perform dynamic loading and offloading according to the GPU memory budget without interrupting the inference process. Secondly, since the hash function determines the expert activation across all the MoE layers for a sample independently of the inference, the system can build the hash table in a hash-building thread running in parallel with the inference thread. By doing this, we can remove the overhead caused by expert selection from the inference time, which boosts the throughput by up to $3.93 \times$.

Previous works have also been proposed to improve the router function of MoE, such as the Hash layer (Roller et al., 2021) and the Base layer (Lewis et al., 2021). SiDA is orthogonal to these router functions as they can be accommodated in the hash-building thread. For MoE models with trained routers, we propose to train an LSTM as the hash function with the sparse attention boosted with our truncated knowledge distillation, detailed in the following sections.

3.4 LSTM with Sparse Attention

3.4.1 Sparse cross-embedding dependency on expert activation

In the MoE layer, each word embedding will be fed to the router function to decide which expert to activate for inference of the token. However, the expert activation does not solely depend on the embedding corresponding to the token due to the self-attention layer before each MoE layer (shown in Fig. 1), where the word embeddings are mixed together. Because of the positional embedding, the position of tokens will also affect the expert activation. While the process by which embeddings collectively influence expert activation is complex, we identify a sparse cross-embedding dependency on expert activation, indicating that only a limited number of other tokens and positions are critical to the expert activation for the current token.

Figure 6. Visualization of Eq. 2 over Different $p$ and $c$.

(a) Token dependency.

(b) Position dependency.

Figure 7. Cross-embedding Dependency for Expert Activation on Switch-base-128 on C4. The $x$-axis shows the proportion of corruption, while the $y$-axis represents the empirical probability of expert activation change. Over 100 random embedding positions are examined, with the average trend displayed.

Suppose a sequence of length $L$, and let $c_{i}$ denote the number of critical tokens for the token at position $i$. We define the critical tokens as tokens in the sequence other than the selected $i$-th token, whose changes lead to a change in expert activation of the $i$-th token. In order to empirically verify that $c_{i}$ is a small number for all $i$, we look for a combinatorial equation involving $c_{i}$ and quantities we can measure. Consider selecting a set of tokens from the sequence excluding the $i$-th token. The probability that the set contains a critical token is formulated as below:

$$E[\hat{p}_{i}] = 1 - \frac{\binom{L - 1 - c_{i}}{\lfloor p L \rfloor}}{\binom{L - 1}{\lfloor p L \rfloor}} \tag{2}$$

where $\lfloor p L \rfloor$ denotes the size of the set and $p$ denotes the portion of selection over the sequence. Note that the probability that the selected set of tokens contains a critical token is equal to the probability that the $i$-th token's expert activation changes, denoted as $\hat{p}_{i}$, if we change all selected tokens in the set. We denote the process of changing the tokens in a sequence as 'corruption.' Given Eq. 2, $p$ and $\hat{p}$ are quantities that we can empirically acquire; that is, by randomly selecting a portion $p$ of tokens, we can empirically measure the probability that the $i$-th token's expert activation changes. We show in Fig. 6 the relation between $c$ and $\hat{p}$ under different $p$.
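The right-hand side of Eq. 2 is simple to evaluate numerically, which is one way curves like those in Fig. 6 could be reproduced. The following is a minimal sketch, not the authors' code; the function name `expected_change_prob` and the sample values of $L$, $p$, and $c$ are assumptions chosen for illustration.

```python
from math import comb, floor


def expected_change_prob(L: int, p: float, c: int) -> float:
    """Right-hand side of Eq. 2 for sequence length L, portion p, and c critical tokens."""
    k = floor(p * L)  # size of the corrupted set
    if not 0 <= k <= L - 1:
        raise ValueError("floor(p * L) must lie between 0 and L - 1")
    # comb(n, k) is 0 when k > n, so the result is 1 once the set must hit a critical token.
    return 1.0 - comb(L - 1 - c, k) / comb(L - 1, k)


if __name__ == "__main__":
    # Hypothetical grid of c and p values, printed one row per c.
    for c in (1, 2, 4, 8):
        row = [round(expected_change_prob(512, p, c), 3) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
        print(f"c={c}: {row}")
```

For $c = 1$ the expression reduces to $\lfloor pL \rfloor / (L - 1)$, which gives a quick sanity check on the implementation.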

Empirically, to study the token dependency of the token at position $i$, the corruption is executed by randomly modifying a fraction $p$ of chosen tokens from $[L] \setminus \{i\}$ to values distinct from their original values and from the $i$-th token. To examine the position dependency for the $i$-th token, the corruption instead involves randomly choosing a fraction $p$ of positions from $[L] \setminus \{i\}$ and swapping the token positions. We use the English division of the C4 dataset (Raffel et al., 2020) to measure the probability that the $i$-th token's expert activation changes under different $p$, depicted in Fig. 7. We set the length $L = 512$ and truncate or pad sentences that are not of length 512. We randomly test over 100 word embedding positions (i.e., 100 values of $i$) on Switch-base-128 and plot all of them in Fig. 7 with the average trend shown. Fig. 7a and Fig. 7b show the cross-embedding dependency on tokens and positions, respectively. Only a large portion of corruption leads to a high chance of expert activation change, which demonstrates that most of the other tokens do not have an impact on the expert activation of the current token.
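The two corruption procedures and the empirical estimate of $\hat{p}_{i}$ can be sketched as follows. This is an illustrative sketch only: the helper names, the integer token ids, and the `route` callable (a stand-in for running the MoE model and reading off the expert chosen for token $i$) are all assumptions, not part of the original implementation.

```python
import random


def pick_positions(L, i, p, rng):
    """Sample floor(p * L) distinct positions from [L] minus {i}."""
    candidates = [j for j in range(L) if j != i]
    return rng.sample(candidates, int(p * L))


def corrupt_tokens(tokens, i, p, vocab_size, rng):
    """Token corruption: overwrite the sampled positions with new token ids."""
    out = list(tokens)
    for j in pick_positions(len(tokens), i, p, rng):
        new_id = rng.randrange(vocab_size)
        # Resample until the id differs from the original token and from token i.
        while new_id == tokens[j] or new_id == tokens[i]:
            new_id = rng.randrange(vocab_size)
        out[j] = new_id
    return out


def corrupt_positions(tokens, i, p, rng):
    """Position corruption: permute the tokens found at the sampled positions."""
    positions = pick_positions(len(tokens), i, p, rng)
    targets = positions[:]
    rng.shuffle(targets)
    out = list(tokens)
    for src, dst in zip(positions, targets):
        out[dst] = tokens[src]
    return out


def estimate_change_prob(tokens, i, route, corrupt, trials, rng):
    """Fraction of trials in which corruption changes the expert chosen for token i.

    `route(tokens, i)` is a hypothetical callable returning the expert id for token i;
    `corrupt(tokens, rng)` is one of the corruption procedures above with i and p bound.
    """
    baseline = route(tokens, i)
    changed = sum(route(corrupt(tokens, rng), i) != baseline for _ in range(trials))
    return changed / trials
```

A call such as `estimate_change_prob(tokens, i, route, lambda t, r: corrupt_tokens(t, i, 0.5, vocab_size, r), 100, random.Random(0))` would yield one point of a curve like those in Fig. 7a, under the assumption that `route` wraps the real model.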

By combining Fig. 6 and Fig. 7, we can read off the best approximation of $c_{i}$ from the different pairs $(p, \hat{p})$ in Fig. 7. We find that the best approximation $\hat{c}$ ranges from 1 to 4, demonstrating the sparse cross-embedding dependency.
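Reading $c_{i}$ off a measured pair $(p, \hat{p})$ amounts to inverting Eq. 2. The sketch below does this by a brute-force search over candidate values of $c$. The function name `estimate_critical_count` and the search bound `c_max` are assumptions, and the paper does not state that the approximation was obtained this way.

```python
from math import comb, floor


def estimate_critical_count(L: int, p: float, p_hat: float, c_max: int = 32) -> int:
    """Return the c in [0, c_max] whose Eq. 2 prediction is closest to the measured p_hat."""
    k = floor(p * L)  # size of the corrupted set; assumes 1 <= k <= L - 1 and c_max <= L - 1

    def predicted(c: int) -> float:
        return 1.0 - comb(L - 1 - c, k) / comb(L - 1, k)

    return min(range(c_max + 1), key=lambda c: abs(predicted(c) - p_hat))
```

Because the Eq. 2 prediction increases monotonically with $c$ for a fixed $p$, the closest match is unique whenever $\lfloor pL \rfloor \geq 1$.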

3.4.2 Design of the hash function

The design of the hash function must satisfy the following conditions: (1) be able to capture the sequential information, (2) be lightweight to preserve efficiency, and (3) be able to extract and focus on the critical embeddings automatically. We adopt a 2-layer LSTM followed by a fully connected layer to satisfy the first two conditions. Further, we add one fully connected layer to compress the embedding dimension. To satisfy the third condition, we adopt the sparse attention mechanism with the SparseMax activation (Martins & Astudillo, 2016).

Attention mechanism. The attention mechanism was first proposed in (Bahdanau et al., 2015) and has proven influential in deep learning. It was proposed to allow the decoder to focus on different parts of the input, resolving the problem that the encoder encodes the entire sentence. Given a query $q$ and a set of key-value pairs $(k, v)$, the attention mechanism computes a weighted sum of values based on the similarity of the query to the keys. Formally, the attention weights $w$ and the output $o$ are computed as $o = \sum_{i} w_{i} v_{i}$ with

$$w_{i} = \frac{\exp(\mathrm{score}(q, k_{i}))}{\sum_{j} \exp(\mathrm{score}(q, k_{j}))}$$

where $\mathrm{score}(q, k)$ is a function that calculates the similarity between the query and a key. One common choice for score is the dot product of the query and key.
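As a concrete illustration of the two formulas above, the following sketch computes dot-product attention for a single query using plain Python lists. The function names are illustrative assumptions, and a practical implementation would use a tensor library instead.

```python
import math


def softmax(scores):
    """Numerically stable SoftMax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Return (w, o): attention weights and the weighted sum of the value vectors."""
    w = softmax([dot(query, k) for k in keys])  # score(q, k) is the dot product here
    dim = len(values[0])
    o = [sum(w[i] * values[i][d] for i in range(len(values))) for d in range(dim)]
    return w, o
```

Subtracting the maximum score before exponentiating leaves the weights unchanged and avoids overflow for large scores.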

We append one attention layer right after the LSTM layer where the key, value, and query are all set as the output sequence from LSTM. Consequently, each embedding will be a weighted sum of the sequence with weights proportional to the similarity between two vectors. The attention mechanism allows the predictor to pay different attention to different embeddings. However, the naive attention mechanism cannot impose a sparse focus. We further apply the SparseMax activation over $w$.

SparseMax activation. In contrast to the SoftMax activation, which provides a dense distribution, that is, non-zero probabilities assigned to all classes or positions, the SparseMax provides a sparse distribution, where zero probability is assigned to many positions. We apply the SparseMax activation over the attention weights $w$ to obtain a sparse attention mechanism. Given an input vector $w \in \mathbb{R}^{L}$, the SparseMax transformation is defined as:

$$\mathrm{SparseMax}(w) = \operatorname*{argmin}_{u \in \Delta^{L - 1}} \| u - w \|_{2}^{2}$$

where $\Delta^{L - 1}$ denotes the $(L - 1)$-dimensional simplex, i.e.,

$$\Delta^{L - 1} = \left\{ u \in \mathbb{R}^{L} \mid u \geq 0, \sum_{i = 1}^{L} u_{i} = 1 \right\}$$
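The projection onto the simplex in the definition above has a closed-form solution based on sorting, as described by Martins & Astudillo (2016). The following is a minimal pure-Python sketch of that procedure. It is meant only to illustrate the definition and is not taken from the SiDA code base.

```python
def sparsemax(w):
    """Euclidean projection of w onto the probability simplex (sort-based closed form)."""
    z = sorted(w, reverse=True)
    cumsum = 0.0
    k_star, support_sum = 0, 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        # Position k belongs to the support while 1 + k * z_(k) exceeds the running sum.
        if 1.0 + k * z_k > cumsum:
            k_star, support_sum = k, cumsum
    tau = (support_sum - 1.0) / k_star  # threshold subtracted from every coordinate
    return [max(x - tau, 0.0) for x in w]
```

For example, `sparsemax([2.0, 1.0, 0.1])` puts all mass on the first coordinate, whereas SoftMax would assign a non-zero weight to every coordinate.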

Although the expert selection is affected by other tokens in the sequence, the current token is always the most crucial for expert selection. Hence, we adopt the residual connection (He et al., 2016) right before the final fully connected layer to boost the performance.

3.5 Truncated Knowledge Distillation

The hash function of SiDA is required to predict the expert to be activated and the corresponding scaling factor $\alpha$. Knowledge distillation (KD) (Hinton et al., 2015), which aims to minimize the distance of logits between the teacher and student model, should be the best training strategy for our hash function. However, the capacity of our hash function, a 2-layer LSTM, is far smaller than that of the MoE model. The predictor cannot fully capture the behavior of the logits of the router functions in the MoE model. The naive usage of KD greatly harms the performance of the system.

We propose Truncated KD (TKD) to tackle the challenge. Different from the traditional KD, the truncated KD only considers the positions with the top-$T$ SoftMax logits, which helps the hash function focus more on predicting the scaling factor for experts with a higher chance of being activated. Notably, a large $T$ can provide a smooth ground truth for the hash function, while a small $T$ forces the hash function to focus on fewer experts. Further, we add the cross-entropy loss to ensure the prediction accuracy. The training objective is $\lambda L_{CE} + L_{TKD}(T)$.
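The objective $\lambda L_{CE} + L_{TKD}(T)$ can be sketched for a single token as below. The text does not give the exact form of $L_{TKD}$, so this sketch assumes a KL divergence between teacher and student distributions renormalized over the teacher's top-$T$ experts, and it assumes the cross-entropy target is the teacher's top-1 expert. Both choices, and all function names, are illustrative assumptions rather than the authors' definition.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def tkd_loss(teacher_logits, student_logits, T):
    """Assumed form: KL(teacher || student) restricted to the teacher's top-T experts."""
    order = sorted(range(len(teacher_logits)), key=lambda j: teacher_logits[j], reverse=True)
    top = order[:T]
    p = softmax([teacher_logits[j] for j in top])  # renormalized teacher distribution
    q = softmax([student_logits[j] for j in top])  # renormalized student distribution
    return sum(p_j * math.log(p_j / q_j) for p_j, q_j in zip(p, q) if p_j > 0.0)


def cross_entropy(student_logits, target):
    """Standard cross-entropy of the student distribution against one target expert."""
    return -math.log(softmax(student_logits)[target])


def objective(teacher_logits, student_logits, T, lam):
    """lam * L_CE + L_TKD(T), with the teacher's top-1 expert as the assumed CE target."""
    target = max(range(len(teacher_logits)), key=lambda j: teacher_logits[j])
    return lam * cross_entropy(student_logits, target) + tkd_loss(teacher_logits, student_logits, T)
```

Under this assumed form, $T = 1$ makes the distillation term vanish, so only the cross-entropy term trains the predictor. This is consistent with the remark that a small $T$ concentrates the hash function on fewer experts.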

4 EXPERIMENT

We extensively evaluate SiDA on different datasets. Specifically, we first show the GPU memory reduction ratio of SiDA, demonstrating a memory saving of up to $80\%$. We then report the throughput and latency of SiDA and baselines, where SiDA achieves up to $3.93\times$ improvement in throughput with little performance degradation, down to less than $1\%$. Our hash function achieves a prediction accuracy of up to $99\%$. Also, SiDA achieves the best efficiency under different GPU memory budgets.

Implementation. We implement the proposed SiDA framework atop the readily available Switch Transformer implementation in transformers (Wolf et al., 2019), albeit not without substantial additional engineering effort. Enabling performant slice extraction poses challenges, as the MoE must maintain fine-grained associations between experts and hash table slices across layers and iterations. We optimize the parallel invocation of experts through meticulous inter-thread coordination, as naive parallelism introduces serious race conditions. The SiDA manager tackles intricate scheduling across the main training thread and the concurrent prediction thread, synchronizing via a shared queue that demands careful contention management. The main thread must then judiciously merge predictor outputs with the model state to orchestrate expert device placement, avoiding costly overheads like GPU-CPU data transfers.
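The coordination between the prediction thread and the main thread through a shared queue can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: `predict_experts` and `run_inference` are hypothetical callables standing in for the hash function and the MoE forward pass, and the actual SiDA manager additionally handles expert placement on devices.

```python
import queue
import threading


def hash_building_thread(batches, predict_experts, out_queue):
    """Producer: predict expert activations ahead of inference and enqueue them."""
    for batch_id, batch in enumerate(batches):
        out_queue.put((batch_id, batch, predict_experts(batch)))
    out_queue.put(None)  # sentinel: no more batches


def inference_thread(in_queue, run_inference, results):
    """Consumer: run inference using the already predicted expert activations."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        batch_id, batch, experts = item
        results.append((batch_id, run_inference(batch, experts)))


def serve(batches, predict_experts, run_inference, max_pending=4):
    """Run both threads; the bounded queue keeps the predictor at most max_pending ahead."""
    shared = queue.Queue(maxsize=max_pending)
    results = []
    producer = threading.Thread(target=hash_building_thread, args=(batches, predict_experts, shared))
    consumer = threading.Thread(target=inference_thread, args=(shared, run_inference, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

`queue.Queue` is thread-safe, so no explicit lock is needed in this sketch. Bounding its size is one simple way to limit how many predicted hash tables are held in memory at once.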

Setup. We select three baselines, namely Standard, Deepspeed, and Tutel. The Standard baseline refers to the standard inference of the model. Deepspeed refers to the Deepspeed implementation (Aminabadi et al., 2022) of the model, and Tutel (Hwang et al., 2023) is designed for MoE models by enabling adaptive parallelism. We select three datasets from GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). Specifically, we select SST2 and MRPC from GLUE for short sentences and mid-length sentences, and MultiRC from SuperGLUE for long sentences. We run most of the experiments on a server with an A100 80GB GPU and 64 Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz CPUs. We investigate Switch-base-8, Switch-base-64, Switch-base-128, and Switch-base-256 on efficiency, where the number indicates the number of experts in each MoE layer in the Switch Transformer. Considering representativeness and limited resources, we select Switch-base-8 and Switch-base-128 to fine-tune on the selected datasets as the representatives for the accuracy analysis. Our hash function in the hash-building thread is trained on the training set of each dataset with the true hash table and evaluated on the test set of the dataset.

Evaluation metrics. We follow standard evaluation metrics for SST2, MRPC, and MultiRC (Raffel et al., 2020), i.e., classification accuracy for SST2 and F1 score for MRPC and MultiRC. Further, we evaluate the fidelity of SiDA, which refers to how much performance can be preserved compared to baselines. We refer to the prediction accuracy of our hash function on the expert activation as the hash hits rate.

Hyperparameters. We use the AdamW (Loshchilov & Hutter, 2019) optimizer for fine-tuning the Switch Transformers and training the hash function. We set the batch size to 1 when measuring the latency and memory usage to eliminate the disturbance of the batch size. We select $T = 30$ in the truncated KD with learning rate $5 \times 10^{-5}$, batch size 64, and $\lambda = 0.005$, and train to convergence. For fine-tuning Switch Transformers, we set the learning rate to $5 \times 10^{-5}$ and fine-tune for at most 16000 steps. We select the top-1 expert from the hash function for SST2 and the top-3 experts for MRPC and MultiRC when evaluating SiDA.

4.1 GPU Memory Saving

We report the GPU memory saving in Fig. 8. For short sentences in SST2, SiDA can achieve over $80\%$ GPU memory reduction. For samples in MRPC whose lengths are clustered between 50 and 80, the GPU memory reduction remains substantial, yielding savings of 6.28 GB and 19.84 GB of GPU memory for Switch-base-128 and Switch-base-256, respectively. Furthermore, even when processing long paragraphs in MultiRC with lengths ranging from 200 to 500, the rate of GPU memory reduction remains over $40\%$ and $20\%$, leading to a saving of 4.52 GB for Switch-base-128 and 9.92 GB for Switch-base-256.

4.2 Latency and Throughput

Apart from the GPU memory saving, SiDA also achieves substantial efficiency gains in terms of throughput and latency (see Fig. 9). Specifically, on SST2, SiDA exceeds the average throughput of the baselines by $2.60\times$ and $3.93\times$ for large MoE models such as Switch-base-128 and Switch-base-256. Even for MultiRC, which contains long sentences, SiDA exceeds the average throughput of the baselines by $1.26\times$ on Switch-base-128 and $1.57\times$ on Switch-base-256.

We also investigate the inference latency of SiDA and baselines (see Fig. 10). For large MoE models such as Switch-base-128 and Switch-base-256, SiDA reduces the inference latency to $25\%$ on SST2 and MRPC and to $60\%$ on MultiRC. The improvements come from our design of the hash-building thread, which removes the expert selection overhead.

Figure 8. GPU Memory Reduction Rate by SiDA for Switch Transformers Across Datasets. SiDA achieves over $60\%$ and $80\%$ reduction on SST2 and MRPC for Switch-base-128 and Switch-base-256, respectively. And in MultiRC, with sentence lengths of 200-500, memory reductions of over $40\%$ for Switch-base-256 and $20\%$ for Switch-base-128 are noted.

Figure 9. Throughput of Different Methods for Switch Transformers Across Datasets. SiDA achieves outstanding throughput for large MoE models on all three datasets with various sentence lengths and comparable results for small MoE models. Specifically, SiDA achieves $2.60\times$, $3.93\times$ more throughput on SST2, $2.52\times$, $3.83\times$

4.3 Efficiency under Limited GPU Memory Budgets

We investigate the efficiency under different GPU memory budgets with different offloading methods on Switch-base-128 and Switch-base-256, since large MoE models are more resource-sensitive. Under a limited GPU memory budget, SiDA offloads and caches inactivated experts in a first-in-first-out manner, while all other baselines implement model parallelism, where only the layers required for inference are kept on the GPU. The results of throughput versus GPU memory budgets are shown in Fig. 11. SiDA achieves better throughput under all GPU memory budgets across all datasets, demonstrating that SiDA employs a better offloading strategy under limited GPU memory budgets.
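A first-in-first-out expert cache under a fixed GPU budget can be sketched as follows. The class name, the budget expressed as a number of expert slots, and the rule that experts needed by the current batch are never evicted are illustrative assumptions. The sketch only tracks which experts are resident and does not move any tensors.

```python
from collections import OrderedDict


class FifoExpertCache:
    """Tracks which experts are resident on the GPU, evicting in load order (FIFO)."""

    def __init__(self, capacity: int):
        self.capacity = capacity       # assumed budget: number of expert slots on the GPU
        self.resident = OrderedDict()  # expert id -> True, ordered by load time
        self.loads = 0                 # number of host-to-GPU loads performed

    def ensure_loaded(self, expert_ids):
        """Make every predicted expert resident, offloading the oldest idle ones if needed."""
        needed = set(expert_ids)
        if len(needed) > self.capacity:
            raise ValueError("batch activates more experts than the GPU budget allows")
        for expert in sorted(needed):
            if expert in self.resident:
                continue  # already on the GPU, no transfer needed
            if len(self.resident) >= self.capacity:
                # Offload the earliest-loaded expert that the current batch does not need.
                victim = next(e for e in self.resident if e not in needed)
                del self.resident[victim]
            self.resident[expert] = True
            self.loads += 1
        return list(self.resident)
```

Unlike a least-recently-used policy, reusing a resident expert does not refresh its position here, so eviction order depends only on when each expert was loaded.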

4.4 Fidelity Analysis

We conduct the fidelity analysis to check how much performance SiDA can preserve. As Table 3 shows, SiDA can preserve up to nearly $99\%$ accuracy, leading to a performance degradation down to less than $1\%$ for Switch-base-8. For Switch-base-128, the fidelity is up to $96\%$, leading to a performance loss down to $3\%$. Our results demonstrate the superiority of SiDA, which achieves low inference latency and low GPU memory occupation with negligible loss on the model's performance.

Table 4. Top-3 Hash Hits Rate. Demonstrating SiDA's accuracy on expert activation prediction of up to $99\%$ across various models.

Backbone	SST2	MRPC	MultiRC
Switch-base-8	$99.00\%$	$97.41\%$	$91.74\%$
Switch-base-128	$98.78\%$	$98.65\%$	$90.49\%$

4.5 Hash Hits Rate

SiDA adopts a predictor to predict the experts to be activated for each token. We investigate the accuracy of the predictor in the hash-building thread, which we refer to as the hash hits rate. Results can be found in Table 4, where we report top-3 accuracy. Even for very long sentences, such as those in the MultiRC dataset, the hash hits rate exceeds $90\%$.
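The top-$k$ hash hits rate can be computed as sketched below, assuming the predictor outputs one score per expert for each token and the true activated expert comes from the model's router. The function name and the data layout are assumptions made for illustration.

```python
def hash_hits_rate(predicted_scores, true_experts, k=3):
    """Fraction of tokens whose true expert is among the predictor's k highest-scored experts."""
    hits = 0
    for scores, truth in zip(predicted_scores, true_experts):
        top_k = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_experts)
```

Setting `k=1` gives the stricter top-1 accuracy, which matches the setting in which only the single best predicted expert is used.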

Figure 10. Comparison of Inference Latency Across Different Methods. SiDA consistently outperforms baselines, especially evident on the Switch-base-256 model with latency reduced down to $28\%$. Notably, improvements are more pronounced as sentence lengths decrease.

Figure 11. Throughput Efficiency Relative to GPU Memory Budget. SiDA's advantage is particularly pronounced in constrained GPU memory scenarios, showcasing its superior efficiency by offloading experts compared to the conventional model parallelism, here denoted as 'Standard'.

With the rise of LLMs, efficient serving for large models has become a hot topic. Much research has been done by adopting classical model compression methods, such as knowledge distillation (Fu et al., 2023; Li et al., 2023b; Tan et al., 2023; Wang et al., 2023; Wu et al., 2023; Gu et al., 2023; Zhou et al., 2023; Yuan et al., 2023a), quantization (Chee et al., 2023; Frantar et al., 2022; Lin et al., 2023; Cheng et al., 2023; Liu et al., 2023a;b; Shang et al., 2023; Shao et al., 2023; Xiao et al., 2023; Yuan et al., 2023b), and pruning (Frantar & Alistarh, 2023; Ji et al., 2023; Ma et al., 2023; Sun et al., 2023; Xia et al., 2023; Li et al., 2023c). Further, others have been exploring more efficient network architectures (Del Corro et al., 2023; Liu et al., 2023c; Miao et al., 2023; Jiang et al., 2023b; Ning et al., 2023; Spector & Re, 2023; Xu et al., 2023). Besides, some have tackled the efficiency problem from a data perspective by performing text compression (Chevalier et al., 2023; Ge et al., 2023; Valmeekam et al., 2023; Jiang et al., 2023a). However, these works are not specifically designed for MoE models and ignore the sparse expert activation patterns. SiDA exploits the expert activation patterns to achieve efficient inference. Furthermore, SiDA is orthogonal to methods such as quantization and pruning, which can be applied to the activated experts' networks.

We notice several concurrent works that are specifically designed for efficient MoE-based model inference (Huang et al., 2023; Kong et al., 2023; Yi et al., 2023). However, SiDA is orthogonal to these works, which focus on designing better scheduling for caching experts. SiDA explores a data-aware path that predicts the experts to be activated. The data-aware approach and the caching scheduling can be combined to achieve better efficiency.

6 DISCUSSION

Enhanced Hierarchical Offloading. While SiDA offers offloading capabilities between main memory and GPU memory, its limitations are defined by the storage capacity of the main memory. This poses challenges, especially when deploying massive models like Switch-c-2048 with almost 5 TB of parameters. A logical progression would be to introduce a layered offloading mechanism that fluidly transfers experts between GPU memory, main memory, and SSD storage. Such an advanced hierarchical approach in SiDA would make it adept at handling models of any magnitude.

Optimized Hash Graph for Expert Activation Storage. Currently, SiDA utilizes an LSTM model to function as its hash system. It's evident that the expert activation is conditionally contingent upon the activation patterns observed in preceding MoE layers. To enhance efficiency, an ideal hash function could be designed as a graph. This graph would capture and store these conditional dependencies, enabling rapid and effective extraction of expert activation.

7 CONCLUSION

In summary, this paper presents SiDA, a novel data-aware method that adeptly addresses the challenges posed by the memory constraints of GPUs when serving expansive models, specifically leveraging the sparsity inherent in MoE architectures. Further, SiDA deploys an offline trained hash function running in the hash-building thread, which alleviates the expert selection overhead by a large margin. Through judicious utilization of both main and GPU memory, SiDA offers a promising route for serving large MoE models under limited GPU budgets with nearly zero performance setbacks.

REFERENCES

Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. IEEE, 2022.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877-1901, 2020.

Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. Quip: 2-bit quantization of large language models with guarantees. arXiv preprint arXiv:2307.13304, 2023.

Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. Towards understanding the mixture-of-experts layer in deep learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=MaYzugDmQV.

Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., and Lv, K. Optimize weight rounding via signed gradient descent for the quantization of llms. arXiv preprint arXiv:2309.05516, 2023.

Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023.

Choquette, J. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro, 2023.

Collobert, R., Bengio, S., and Bengio, Y. A parallel mixture of svms for very large scale problems. Advances in Neural Information Processing Systems, 14, 2001.

Del Corro, L., Del Giorno, A., Agarwal, S., Yu, B., Awadallah, A., and Mukherjee, S. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628, 2023.

Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232-5270, 2022.

Frantar, E. and Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Y., Peng, H., Ou, L., Sabharwal, A., and Khot, T. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.

Ge, T., Hu, J., Wang, X., Chen, S.-Q., and Wei, F. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.

Gu, Y., Dong, L., Wei, F., and Huang, M. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

Huang, H., Ardalani, N., Sun, A., Ke, L., Lee, H.-H. S., Sridhar, A., Bhosale, S., Wu, C.-J., and Lee, B. Towards moe deployment: Mitigating inefficiencies in mixture-of-expert (moe) inference. arXiv preprint arXiv:2303.06182, 2023.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.

Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., Wang, Z., Salas, R., Jose, J., Ram, P., et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 5, 2023.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79-87, 1991.

Ji, Y., Cao, Y., and Liu, J. Pruning large language models via accuracy predictor. arXiv preprint arXiv:2309.09507, 2023.

Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023a.

Jiang, Y., He, Q., Zhuang, X., Wu, Z., Wang, K., Zhao, W., and Yang, G. Recyclegpt: An autoregressive language model with recyclable module. arXiv preprint arXiv:2308.03421, 2023b.

Jordan, M., Ghahramani, Z., and Saul, L. Hidden markov decision trees. Advances in neural information processing systems, 9, 1996.

Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181-214, 1994.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

Kong, R., Li, Y., Feng, Q., Wang, W., Kong, L., and Liu, Y. Serving moe models on resource-constrained edge devices via dynamic expert swapping. arXiv preprint arXiv:2308.15030, 2023.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Lewis, M., Bhosale, S., Dettmers, T., Goyal, N., and Zettlemoyer, L. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pp. 6265-6274. PMLR, 2021.

Li, B., Shen, Y., Yang, J., Wang, Y., Ren, J., Che, T., Zhang, J., and Liu, Z. Sparse mixture-of-experts are domain generalizable learners. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=RecZ9nB9Q4.

Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K.-W., and Choi, Y. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint arXiv:2306.14050, 2023b.

Li, Y., Yu, Y., Zhang, Q., Liang, C., He, P., Chen, W., and Zhao, T. Losparse: Structured compression of large language models based on low-rank and sparse approximation. arXiv preprint arXiv:2306.11222, 2023c.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

Liu, J., Gong, R., Wei, X., Dong, Z., Cai, J., and Zhuang, B. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041, 2023a.

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023b.

Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Re, C., et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pp. 22137-22176. PMLR, 2023c.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

Ma, X., Fang, G., and Wang, X. Llm-pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.

Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International conference on machine learning, pp. 1614-1623. PMLR, 2016.

Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023.

Ning, X., Lin, Z., Zhou, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.

OpenAI. Gpt-4 technical report, 2023.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551, 2020.

Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International Conference on Machine Learning, pp. 18332-18346. PMLR, 2022.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583-8595, 2021.

Roller, S., Sukhbaatar, S., Szlam, A., and Weston, J. E. Hash layers for large sparse models. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=lMgDDWb1ULW.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479-36494, 2022.

Shang, Y., Yuan, Z., Wu, Q., and Dong, Z. Pb-llm: Partially binarized large language models. arXiv preprint arXiv:2310.00034, 2023.

Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.

Spector, B. and Re, C. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

Tan, S., Tam, W. L., Wang, Y., Gong, W., Zhao, S., Zhang, P., and Tang, J. [industry] gkd: A general knowledge distillation framework for large-scale pre-trained language model. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.

Tresp, V. Mixtures of gaussian processes. Advances in neural information processing systems, 13, 2000.

Valmeekam, C. S. K., Narayanan, K., Kalathil, D., Chamberland, J.-F., and Shakkottai, S. Llmzip: Lossless text compression using large language models. arXiv preprint arXiv:2306.04050, 2023.

Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.

Wang, P., Wang, Z., Li, Z., Gao, Y., Yin, B., and Ren, X. Scott: Self-consistent chain-of-thought distillation. arXiv preprint arXiv:2305.01879, 2023.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Wu, M., Waheed, A., Zhang, C., Abdul-Mageed, M., and Aji, A. F. Lamini-lm: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402, 2023.

Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., and Song, S. L. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087-38099. PMLR, 2023.

Xu, M., Xu, Y. L., and Mandic, D. P. Tensorgpt: Efficient compression of the embedding layer in llms based on the tensor-train decomposition. arXiv preprint arXiv:2307.00526, 2023.

Xue, F., Shi, Z., Wei, F., Lou, Y., Liu, Y., and You, Y. Go wider instead of deeper. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8779-8787, 2022.

Yang, H., Yue, S., and He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.

Yi, R., Guo, L., Wei, S., Zhou, A., Wang, S., and Xu, M. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint arXiv:2308.14352, 2023.

Yuan, S., Chen, J., Fu, Z., Ge, X., Shah, S., Jankowski, C. R., Yang, D., and Xiao, Y. Distilling script knowledge from large language models for constrained language planning. arXiv preprint arXiv:2305.05252, 2023a.

Yuan, Z., Niu, L., Liu, J., Liu, W., Wang, X., Shang, Y., Sun, G., Wu, Q., Wu, J., and Wu, B. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023b.

Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. Distillspec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023.

$^{1}$ Department of Electrical and Computer Engineering, Duke University, Durham, USA. $^{2}$ Department of Electrical and Computer Engineering, Clemson University, Clemson, USA. $^{3}$ Department of Electrical and Computer Engineering, University of Maryland, College Park, USA. Correspondence to: Zhixu Du <zhixu.du@duke.edu>.

Backbone	Method	SST2	MRPC	MultiRC
Switch-base-8	Finetuned	92.20	89.14	56.70
	SiDA	90.59	86.91	56.11
	Fidelity	$98.25\%$	$97.49\%$	$98.95\%$
Switch-base-128	Finetuned	93.57	89.66	59.95
	SiDA	87.04	83.01	55.49
	Fidelity	$93.02\%$	$92.59\%$	$92.56\%$
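
The Fidelity rows in the table above appear to be the SiDA score expressed as a percentage of the finetuned score on the same task. That reading is an assumption: the definition is not stated alongside the table, and the ratio matches the reported values only to within about 0.01 percentage points. A minimal Python sketch of the arithmetic, where the `fidelity` helper and the `SCORES` dictionary are illustrative names rather than anything from the paper:

```python
# Scores copied from the table above as (finetuned, SiDA) pairs.
SCORES = {
    "Switch-base-8": {
        "SST2": (92.20, 90.59),
        "MRPC": (89.14, 86.91),
        "MultiRC": (56.70, 56.11),
    },
    "Switch-base-128": {
        "SST2": (93.57, 87.04),
        "MRPC": (89.66, 83.01),
        "MultiRC": (59.95, 55.49),
    },
}


def fidelity(finetuned: float, sida: float) -> float:
    """Assumed definition: SiDA score as a percentage of the finetuned score."""
    return 100.0 * sida / finetuned


if __name__ == "__main__":
    for backbone, tasks in SCORES.items():
        for task, (finetuned, sida) in tasks.items():
            value = fidelity(finetuned, sida)
            print(f"{backbone:16s} {task:8s} fidelity ~ {value:.2f}%")
```

Under this reading, the larger drop for Switch-base-128 (roughly 93% against roughly 98% for Switch-base-8) is simply the wider gap between the SiDA and finetuned rows for that backbone.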
