
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Zhixu Du $^{1}$, Shiyu Li $^{1}$, Yuhao Wu $^{1}$, Xiangyu Jiang $^{2}$, Jingwei Sun $^{1}$, Qilin Zheng $^{1}$, Yongkai Wu $^{2}$, Ang Li $^{3}$, Hai "Helen" Li $^{1}$, Yiran Chen $^{1}$

Abstract

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference, with up to a $3.93\times$ throughput increase, up to a 75% latency reduction, and up to 80% GPU memory saving, with as little as a 1% performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.

1 INTRODUCTION

Recently, rapid advances in large models with striking performance have surprised the community in several areas, such as vision (Ramesh et al., 2022; Kirillov et al., 2023; Saharia et al., 2022), language (Brown et al., 2020; OpenAI, 2023; Smith et al., 2022), decision making (Yang et al., 2023), and robotics (Vemprala et al., 2023). For example, GPT-4 has demonstrated capabilities that match or even exceed human-level understanding on several tasks (OpenAI, 2023), and DALL-E 2 can generate astonishingly high-quality images. The outstanding performance of large models relies heavily on their enormous number of parameters, in line with the scaling law (Kaplan et al., 2020). Broadly speaking, the scaling law asserts that as the model size increases, various characteristics such as training loss, test performance, and the amount of required data exhibit predictable scaling behaviors.
Mixture-of-Experts (MoE), a classical model architecture, has an advantage that naturally fits the era of large models. MoE can improve a model's performance by drastically increasing the number of parameters while incurring only little computational overhead. Although the number of parameters involved in the forward pass of an MoE model remains almost unchanged, research (Fedus et al., 2022) suggests that augmenting parameter counts using the MoE architecture still conforms to the scaling law. Encouraged by this advantage, many MoE-based large models have been proposed and have achieved strong performance in computer vision (Li et al., 2023a; Riquelme et al., 2021; Xue et al., 2022) and natural language processing (Shazeer et al., 2017; Fedus et al., 2022). Specifically, the Sparsely-Gated Mixture-of-Experts layer (Shazeer et al., 2017) scales LSTM models to 137 billion parameters, which improves model capacity by $1000\times$ with only a marginal increase in computational overhead. Switch Transformers (Fedus et al., 2022) scale to 1.6 trillion parameters with the same perplexity as T5-XXL (Raffel et al., 2020) while achieving a $4\times$ speedup during inference. However, the success of MoE comes at the cost of effective GPU memory utilization: the model occupies a large amount of memory, while only a small fraction of the parameters residing in memory are used for inference on the current batch. Fig. 1 depicts the architecture of MoE-based transformers, where only a small portion of experts are activated in each MoE layer during each inference.

Figure 1. Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer, only a limited number of experts are activated for inference.
Further, with the trend of model scaling, we have observed a substantial gap between the memory demands of large models and the memory capacity of GPUs. For instance, in the past three years, the number of parameters in state-of-the-art models has scaled from 175 billion in GPT-3 (Brown et al., 2020) to 1.76 trillion in the newly announced GPT-4 (OpenAI, 2023), an increase of over $10\times$. In contrast, the memory capacity of high-end GPUs remains around 80GB (Choquette, 2023), and commodity GPUs are still limited to 48GB or even less. This growing discrepancy motivates techniques to improve memory utilization efficiency. Thus, we seek to answer a compelling research question:
How to serve large Mixture-of-Experts models in an efficient and scalable manner under constrained memory?
Previous efforts have studied the efficiency problem of MoE models to some extent. Deepspeed-MoE (Rajbhandari et al., 2022) optimizes the MoE module in the Deepspeed framework for efficient grouping and scheduling. A later version of the work (Aminabadi et al., 2022) focused on optimizing inference efficiency with optimized computation kernels and careful coordination of communication and parallelism. Tutel (Hwang et al., 2023) enables adaptive parallelism and pipelining at runtime. However, these methods only focus on optimizing device-to-device communication and ignore data-awareness, let alone exploit it to improve efficiency during inference. Data-awareness refers to a design in which the technique or strategy is determined based on the incoming data. Our proposed framework embraces data-awareness, which brings three advantages. First, data-awareness can squeeze more out of the sparsity, leading to a further increase in memory efficiency compared to previous methods. Second, data-awareness preserves the structure crucial for a sample's unique features, better maintaining the model's performance. Third, data-awareness offers better adaptability, since the framework adapts to the data distribution.

Table 1. Comparison of SiDA and Baseline Methods. This table delineates the capabilities of various methods in terms of data-awareness, effective GPU memory utilization, and inference speed on large MoE models. SiDA excels in its data-aware approach with high effective GPU memory utilization and high inference speed on large MoE models.

| Methods | Data-aware | Effective GPU memory utilization | Inference speed on large MoE |
| :---: | :---: | :---: | :---: |
| Standard | ✗ | low | slow |
| Deepspeed | ✗ | medium | slow |
| Tutel | ✗ | medium | slow |
| SiDA | ✓ | extremely high | extremely high |
In this paper, we present an efficient inference system, SiDA (Sparsity-inspired Data-Aware), for serving large MoE models. Noting that modern server CPUs support terabytes (TB) of main memory, dwarfing GPU capacity, SiDA dynamically leverages both main memory and GPU memory by exploiting sparsity in MoE models in a data-aware manner. Table 1 summarizes the comparison between SiDA and the baselines. SiDA contains two threads that run in parallel, an inference thread and a hash-building thread. The hash-building thread exploits the sparsity of expert activation in a data-aware manner, and its core is a network-based hash function. Specifically, the hash function is an offline-trained predictor that predicts the experts to be activated. In this work, we employ an LSTM (Hochreiter & Schmidhuber, 1997) with sparse attention and a truncated knowledge distillation to boost the performance of the hash function. The inference thread offloads the experts that the hash-building thread predicts to be inactive, maximizing effective GPU memory utilization. In addition, SiDA brings a significant speedup during inference.
Our contributions are summarized as follows:
  • To the best of our knowledge, SiDA is the first sparsity-inspired data-aware system for efficient and scalable inference on large MoE models.
  • We propose an offline training strategy to build a data-aware hash function, deployed in SiDA, that replaces the router function in MoE layers. Our design boosts the throughput of MoE models by up to $3.93\times$ and reduces latency to as low as 25% of the original.
  • Our offloading scheme achieves up to 80% GPU memory saving with less than a 1% performance drop. Our hash function can achieve up to 99% prediction accuracy on expert activation.
The paper is organized as follows. Section 2 introduces the background and motivation. Section 3 is devoted to the framework of SiDA. Section 4 presents our experimental results. Sections 5, 6, and 7 are devoted to related work, discussion, and conclusions, respectively.

2 BACKGROUND AND MOTIVATION

We introduce the background and motivation for SiDA in this section. For notation, we use $a$, $\boldsymbol{a}$, $\mathbf{a}$, $\boldsymbol{A}$, and $\mathbb{A}$ to denote a scalar, vector, random vector variable, matrix, and set, respectively. We use $[K]$ to denote $\{1, 2, \ldots, K\}$.

2.1 Mixture of Experts

Since the first proposal of Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994), different MoE models have been proposed based on various expert models, for example, hidden Markov models (Jordan et al., 1996), Gaussian processes (Tresp, 2000), and support vector machines (Collobert et al., 2001). With the rise of deep learning, Eigen et al. proposed using several sets of routers and experts to build a stacked model, namely Deep MoE (Eigen et al., 2013).
An MoE layer consists of a router function, denoted as $h(\cdot; \boldsymbol{W}_{r})$, followed by $K$ experts in parallel, denoted as $\{f_{i}(\cdot; \boldsymbol{\theta}_{i})\}_{i=1}^{K}$. Usually, the router function is a linear function, i.e., $h(\mathbf{x}; \boldsymbol{W}_{r}) = \boldsymbol{W}_{r}^{\top} \mathbf{x}$, where $\boldsymbol{W}_{r} \in \mathbb{R}^{d \times K}$ for input $\mathbf{x} \in \mathbb{R}^{d}$, and the experts are multi-layer perceptrons (MLPs) with a non-linear activation function (Chen et al., 2022; Fedus et al., 2022; Shazeer et al., 2017). The output of an MoE layer takes the form:

$$
M\left(\mathbf{x}; \boldsymbol{W}_{r}, \boldsymbol{\theta}_{1}, \ldots, \boldsymbol{\theta}_{K}\right) = \sum_{i \in \mathbb{I}} \alpha_{i}(\mathbf{x}) f_{i}\left(\mathbf{x}; \boldsymbol{\theta}_{i}\right), \tag{1}
$$

where $\mathbb{I}$ contains the indices of the selected experts and the scaling factor $\alpha_{i}$ is defined as

$$
\alpha_{i}(\mathbf{x}) = \frac{\exp\left\{\boldsymbol{W}_{r}[:, i]^{\top} \mathbf{x}\right\}}{\sum_{j=1}^{K} \exp\left\{\boldsymbol{W}_{r}[:, j]^{\top} \mathbf{x}\right\}}.
$$

Different selection mechanisms for $\mathbb{I}$ lead to different models. The soft-routing model (Jordan & Jacobs, 1994) selects all experts, i.e., $\mathbb{I} = [K]$, which leads to high computational overhead. The switch-routing model (Fedus et al., 2022) selects the top-1 expert, i.e., $\mathbb{I} = \arg\max_{i \in [K]} \alpha_{i}(\cdot)$, introducing little extra computational overhead.
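To make Eq. (1) concrete, the following is a minimal NumPy sketch of a single MoE layer applied to one input vector. It is an illustration under stated assumptions rather than the implementation used in any of the cited systems: the helper names (`softmax`, `moe_layer`, `make_expert`), the two-layer ReLU experts, and the dimensions are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_layer(x, W_r, experts, top_k=None):
    """Evaluate Eq. (1) for a single input vector x.

    W_r has shape (d, K); experts is a list of K callables.
    top_k=None selects all experts (soft routing); top_k=1 is switch routing.
    """
    alpha = softmax(W_r.T @ x)           # scaling factors alpha_i(x), shape (K,)
    k = len(experts) if top_k is None else top_k
    selected = np.argsort(alpha)[::-1][:k]   # index set I: the k largest alpha_i
    out = sum(alpha[i] * experts[i](x) for i in selected)
    return out, selected, alpha

def make_expert(rng, d, hidden):
    """Hypothetical expert: a two-layer MLP with a ReLU activation."""
    W1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
    W2 = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d, K = 8, 4                              # illustrative sizes only
W_r = rng.standard_normal((d, K))
experts = [make_expert(rng, d, 16) for _ in range(K)]
x = rng.standard_normal(d)

y_switch, idx_switch, alpha = moe_layer(x, W_r, experts, top_k=1)  # one expert runs
y_soft, idx_soft, _ = moe_layer(x, W_r, experts)                   # all K experts run
```

With `top_k=1` only one expert's parameters are touched for this input, while the parameters of the other $K-1$ experts stay resident but unused.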

2.2 Low Effective Utilization of GPU Memory

Encouraged by the advantage of MoE-based large models, namely that drastically increasing the number of parameters incurs little computational overhead, many large-scale architectures have been proposed, such as the Sparsely-Gated MoE (Shazeer et al., 2017), GShard (Lepikhin et al., 2020), and Switch Transformers (Fedus et al., 2022). Specifically, the Sparsely-Gated MoE proposes a trainable router function to determine the experts to be activated for each sample, which makes it possible to build very large MoE-based models, as it improves computational efficiency by a large margin compared to soft routing, which selects all experts. The Sparsely-Gated MoE scales LSTM models to 137 billion parameters and achieves outstanding performance. Switch Transformers, the most widely used transformer-based large MoE models, convert T5 models (Raffel et al., 2020) to their MoE versions. All Switch Transformers outperform their dense foundation models at the same FLOPs.

Figure 2. Memory Efficiency of Switch Transformers on SST2. The $x$-axis represents the sentence length, and the bars record the number of sentences of the corresponding length. The line represents the effective memory utilization of Switch Transformers on SST2 for varying sentence lengths. Utilization as low as 5% can be observed for large models.
In our study, we found that large MoE models do not utilize GPUs efficiently. Following Eq. 1, we call expert $i$ activated if $i \in \mathbb{I}$. Effective GPU memory refers to the memory storing parameters that are actually used in the forward pass of the model. Inactivated experts occupy a large amount of GPU memory while remaining idle in the forward pass, leading to low effective GPU memory utilization. To quantitatively analyze GPU memory utilization, we summarize the model size and MoE layer size of Switch Transformers in Table 2. For all Switch Transformers, especially the large ones, MoE layers occupy a large portion of GPU memory. Meanwhile, most of the parameters of the MoE layers are idle during one forward pass. To ascertain the amount of ineffective GPU memory, we feed samples from the SST2 dataset to Switch Transformers and record the corresponding effective memory utilization rates. The results are depicted in Fig. 2. For large Switch Transformers such as Switch-base-128 and Switch-base-256, the ineffective GPU memory for short sentences is around 24GB and 50GB, respectively. Even for the longest sentences with 80 tokens, the ineffective GPU memory is around 20GB and 46GB, respectively. Our method, SiDA, can save all ineffective GPU memory, outperforming baselines by a large margin. Further results on GPU memory reduction across datasets can be found in Section 4.
Table 2. Memory Occupation of Switch Transformers. This table highlights the allocation of parameters in gigabytes (GB) for different models. MoE parameters dominate memory usage, especially in larger models. In contrast, mainstream GPUs peak at 48GB, with many at 24GB, while mobile GPUs range from 4GB to 12GB.

| Model | Model size (GB) | MoE size (GB) | MoE percentage (%) |
| :---: | :---: | :---: | :---: |
| Switch-base-8 | 2.298 | 1.7932 | 78.03 |
| Switch-base-64 | 14.112 | 13.608 | 96.42 |
| Switch-base-128 | 27.614 | 27.11 | 98.17 |
| Switch-base-256 | 54.62 | 54.114 | 99.07 |
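The relationship between the sizes in Table 2 and effective GPU memory utilization can be sketched with a short calculation. This is a back-of-the-envelope model, not the measurement procedure behind Fig. 2: it assumes that all non-MoE parameters are always used, that MoE memory is used in proportion to the fraction of activated experts, and the value `active_fraction=0.2` is a purely hypothetical input. The function name `effective_utilization` is likewise illustrative.

```python
def effective_utilization(total_gb, moe_gb, active_fraction):
    """Fraction of resident parameter memory that a forward pass actually uses.

    Assumption (for illustration): non-MoE parameters are always used, and a
    fraction `active_fraction` of the MoE parameters belongs to activated experts.
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must lie in [0, 1]")
    dense_gb = total_gb - moe_gb
    return (dense_gb + active_fraction * moe_gb) / total_gb

# (model size, MoE size) in GB, copied from Table 2
switch_models = {
    "Switch-base-8": (2.298, 1.7932),
    "Switch-base-64": (14.112, 13.608),
    "Switch-base-128": (27.614, 27.11),
    "Switch-base-256": (54.62, 54.114),
}

for name, (total_gb, moe_gb) in switch_models.items():
    # 0.2 is a hypothetical activation fraction, not a measured value
    util = effective_utilization(total_gb, moe_gb, active_fraction=0.2)
    idle_gb = total_gb * (1.0 - util)
    print(f"{name}: utilization {util:.1%}, idle memory {idle_gb:.1f} GB")
```

Because MoE layers account for almost all of the parameters in the larger models, the utilization in this simple model is close to the fraction of activated experts itself.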
Figure 3. Expert Selection Overhead on SST2. The bars depict the percentage breakdown of expert selection overhead within the total inference latency. Up to 74% of the time on Switch-base-256 is spent on expert selection. Notably, the share of expert selection overhead grows as the model size increases.

2.3 High Expert Selection Overhead

Apart from the low effective GPU memory utilization, we also observed a high overhead for expert selection in the forward pass of MoE. Specifically, in all baseline implementations of MoE models, a non-negligible amount of time is consumed in selecting the most suitable experts. We conduct experiments on SST2 with multiple MoE models and provide the profiling results for average inference time and expert selection overhead in Fig. 3. The expert selection process consumes nearly 75% of the total inference time for Switch-base-256, making it a bottleneck for inference latency. Notably, the overhead associated with expert selection grows with the scale of the model, further emphasizing the need to address this bottleneck in inference efficiency.

2.4 Sparse Activation of Experts in Large MoE Models

The sparse selection of experts is one of the critical observations that motivate SiDA. Our observation verifies that only a small portion of experts will be activated during inference.
For each token, the router function selects either the top-$K$ (Shazeer et al., 2017) or the top-1 (Fedus et al., 2022) experts, inducing token-level sparsity in expert activation. However, the sparsity at the level of sentences, typically with 512 or 768 tokens, remains unclear. Moreover, during training, an expert load-balancing loss must be applied, which forces the router to assign an almost equal number of tokens to each expert. Otherwise, the router's outputs collapse onto a few experts, leading to capacity degradation (Chen et al., 2022).

Figure 4. Expert Activation in Switch Transformers on SST2. The $x$-axis denotes the sentence length, with bars illustrating the number of sentences of each length. The line depicts the ratio of idle experts. Notably, Switch-base-256 and Switch-base-128 activate less than 20% and 40% of their experts, respectively.
We test Switch Transformers with different numbers of experts on the SST2 dataset and report the sentence-level sparsity in Fig. 4. Our observation verifies that the sparse activation pattern still exists at the sentence level for large MoE models such as Switch-base-128 and Switch-base-256. As shown in the figure, fewer than 40% and 20% of the experts are activated for Switch-base-128 and Switch-base-256, respectively. Even for the longest sentences with around 80 tokens, the ratio of idle experts is still higher than 70% for Switch-base-128 and 80% for Switch-base-256.
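Sentence-level sparsity can be measured from the router's per-token decisions. The sketch below assumes top-1 routing in a single MoE layer; the function names and the example routing array are hypothetical. It also computes a simple lower bound: with top-1 routing, a sentence of $n$ tokens can activate at most $n$ distinct experts in a layer, so at least a fraction $1 - n/K$ of the $K$ experts is idle.

```python
import numpy as np

def idle_expert_ratio(token_to_expert, num_experts):
    """Fraction of experts in one MoE layer that no token of the sentence selects.

    token_to_expert: 1-D integer array with the top-1 expert index of each token.
    """
    active = np.unique(token_to_expert)
    return 1.0 - active.size / num_experts

def min_idle_ratio(num_tokens, num_experts):
    """Lower bound on the idle ratio under top-1 routing (at most one expert per token)."""
    return max(0.0, 1.0 - num_tokens / num_experts)

# Hypothetical routing decisions for a six-token sentence in a 128-expert layer
routing = np.array([3, 3, 17, 42, 3, 17])
ratio = idle_expert_ratio(routing, num_experts=128)   # 3 distinct experts are active

# Bounds for an 80-token sentence, the longest length reported for SST2 above
bound_128 = min_idle_ratio(80, 128)
bound_256 = min_idle_ratio(80, 256)
```

The idle ratios reported for 80-token sentences (above 70% and 80%) are well above these bounds (37.5% and 68.75%), which is consistent with several tokens of the same sentence being routed to the same expert.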

3 SIDA

3.1 Overview: workflow

We introduce a novel framework, Sparsity-inspired Data-Aware (SiDA), for efficient inference of large MoE models; an overview is shown in Fig. 5. SiDA contains two threads that run concurrently, namely the inference thread and the hash-building thread. Consider a sequence of incoming batches. Batch $\mathbb{X}_{j}$ is fed to the hash-building thread, which builds the hash table $\mathbb{H}_{j}$ storing the expert activation patterns for batch $\mathbb{X}_{j}$ and pushes it to the hash table queue. At the same time, the inference thread handles the preceding batch $\mathbb{X}_{i}$ and dynamically offloads experts in the MoE layers based on the hash table $\mathbb{H}_{i}$.
Hash-building thread. The hash-building thread consists of two components, a hash function and a hash table queue. For each incoming batch (①-a), the hash function determines the experts to be activated for each token at each layer and the corresponding scaling factor $\alpha$ (①-b). The predictions are stored in the hash table $\mathbb{H}_j$ for the batch $\mathbb{X}_j$ and pushed to the hash table queue (①-c). The hash function can be a predefined hash function if the MoE model is trained with the Hash layer (Roller et al., 2021). More commonly, for MoE models using trained router functions, such as Switch Transformers, the hash function is trained offline. We propose hash function training techniques dedicated to modern MoE models, which are introduced in later sections.
Figure 5. Overview of SiDA. SiDA contains two threads, the inference thread and the hash-building thread, that run concurrently. As each batch $\mathbb{X}_j$ arrives, the hash-building thread constructs the expert hash table $\mathbb{H}_j$ and queues it. In tandem, the inference thread processes the preceding batch $\mathbb{X}_i$, dynamically managing experts in MoE layers based on the hash table $\mathbb{H}_i$.
Inference thread. The inference thread performs two tasks: it dynamically loads activated experts and offloads inactive experts according to the hash table built by the hash-building thread, and it uses the SiDA MoE layers to run inference on input batches. Specifically, for each incoming batch $\mathbb{X}_i$ (②-a), the inference thread first pops the hash table $\mathbb{H}_i$ from the hash table queue (②-b) and remains idle if $\mathbb{H}_i$ is not found. Notably, in practice, the inference thread takes longer to run inference on a batch than the hash-building thread takes to build a hash table for a batch. As a result, the inference thread never idles except at the very beginning. With the popped hash table $\mathbb{H}_i$, the next step is to dynamically load and offload experts. Based on the GPU memory budget and the expert activation pattern of the current batch, the inference thread loads activated experts to the GPU and offloads inactive experts to RAM (②-c). A first-in-first-out (FIFO) scheme is applied to experts if no memory budget remains. The dynamic loading of a MoE layer is performed right after inference on the previous batch finishes, following the pipeline parallelism mechanism (Huang et al., 2019). Note that, in our system, all routers are offloaded to the main memory and do not participate in the forward pass. Lastly, the incoming batch $\mathbb{X}_i$ is forwarded using the SiDA MoE layers specific to $\mathbb{X}_i$ (②-d).
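The interaction between the two threads can be illustrated with a small producer/consumer sketch. This is only a toy model under stated assumptions, not the SiDA implementation: tokens are plain integers, the "hash function" is a hypothetical stand-in (`tok % num_experts`), experts are represented by their ids, and `ExpertCache`, `hash_building_thread`, `inference_thread`, and `serve` are illustrative names that do not appear in the paper.

```python
import queue
import threading
from collections import OrderedDict


class ExpertCache:
    """Toy FIFO cache of expert ids resident on a simulated GPU."""

    def __init__(self, budget):
        self.budget = budget            # max number of resident experts
        self.resident = OrderedDict()   # insertion order = load order

    def ensure_loaded(self, experts):
        for e in experts:
            if e in self.resident:
                continue                # already on the GPU, nothing to move
            if len(self.resident) >= self.budget:
                # No budget left: offload the expert that was loaded first.
                self.resident.popitem(last=False)
            self.resident[e] = True


def hash_building_thread(batches, table_queue, num_experts):
    # Hypothetical stand-in for the trained hash function:
    # map each token to an expert id and queue one table per batch.
    for batch in batches:
        table_queue.put({tok: tok % num_experts for tok in batch})


def inference_thread(batches, table_queue, cache, log):
    for batch in batches:
        table = table_queue.get()       # blocks (idles) until the table exists
        cache.ensure_loaded(sorted(set(table.values())))
        # A real system would now run the forward pass on `batch`.
        log.append((batch, list(cache.resident)))


def serve(batches, num_experts=8, budget=3):
    table_queue = queue.Queue()
    cache = ExpertCache(budget)
    log = []
    builder = threading.Thread(
        target=hash_building_thread, args=(batches, table_queue, num_experts))
    worker = threading.Thread(
        target=inference_thread, args=(batches, table_queue, cache, log))
    builder.start()
    worker.start()
    builder.join()
    worker.join()
    return log


if __name__ == "__main__":
    for batch, resident in serve([[1, 2, 9], [3, 4], [5]]):
        print(batch, "->", resident)
```

Because the queue is first-in-first-out with a single producer and a single consumer, the table popped by the inference thread always belongs to the batch it is about to process. The sketch does not guard against a budget smaller than the number of experts a single batch activates; a real system would need to handle that case.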

3.2 Design challenges

In the design of SiDA, we identify three key challenges.
Challenge 1: How to efficiently determine beforehand which experts are to be offloaded? Given the observation that experts are activated sparsely, it is straightforward to save GPU memory by offloading inactive experts to RAM. However, this naive implementation sacrifices latency, since expert activation patterns are inaccessible without the output of the router functions. Moving experts between CPU and GPU after each router function incurs large overheads because it breaks the forwarding pipeline. We propose to use an offline-trained hash function to acquire the expert activation pattern before inference starts for each batch. Furthermore, we design the hash function to run independently of model inference and build a hash-building thread that runs in parallel with the inference thread to meet the efficiency requirements. By employing the hash-building thread, SiDA achieves lower latency than the baselines, since expert selection, dynamic offloading, and inference all run in parallel.

Challenge 2: How to leverage the sparse cross-embedding dependency of expert activation to design a lightweight, offline-trained hash function? Considering the inference efficiency and the GPU memory consumption of the system, the hash function must be a lightweight predictor. However, simple predictors can hardly capture the contextual information of the sequence and are easily distracted. Hence, it becomes crucial to force the predictor to focus on critical information. We empirically verify that there exists a sparse cross-embedding dependency in expert activation, i.e., only a limited number of embeddings in the sequence jointly affect expert activation. This sparse cross-embedding dependency explains why lightweight predictors can succeed. However, it is impractical and inefficient to exhaustively search all possible outcomes to find the cross-embedding dependency for every token. In response to this challenge, we propose a sparse attention mechanism on an LSTM that forces the predictor to focus on the most important embeddings automatically.
Challenge 3: How to improve the expert selection accuracy and approximate the scaling factor simultaneously? The hash function needs to determine not only the expert activation but also the scaling factor $\alpha$ in Eq. 1. As the scaling factor is derived from the SoftMax logits output by the model, it is natural to apply knowledge distillation (KD), setting the router functions as teacher models and the hash function as the student model. However, the hash function cannot approximate the scaling factor distribution over all experts through KD because of its limited capacity. To solve this challenge, we propose truncated knowledge distillation (TKD), where the KD loss is computed over the top-$T$ experts only. TKD alone, however, cannot guarantee adequate prediction accuracy, so we further add a cross-entropy loss to boost the prediction accuracy.
We introduce how SiDA deals with each challenge in detail in the following sections.

3.3 Data-Aware and Efficient Expert Activation Prediction

SiDA proposes a data-aware solution to efficiently determine beforehand which experts are to be offloaded. Specifically, we propose to use a trained hash function that takes the sequence of embeddings as input and predicts all the activated experts for each token in the sequence. SiDA, augmented by the data-aware expert activation prediction, enjoys two advantages while incurring only a small loss in model performance, down to less than 1%. Firstly, the system can acquire the activation pattern of each sample beforehand and perform dynamic loading and offloading according to the GPU memory budget without interrupting the inference process. Secondly, since the hash function determines the expert activation across all the MoE layers for a sample independently of the inference, the system can build the hash table in a hash-building thread running in parallel with the inference thread. By doing this, we remove the overhead caused by expert selection from the inference time, which boosts the throughput by up to $3.93\times$.
Prior work has also proposed improvements to the router function of MoE, such as the Hash layer (Roller et al., 2021) and the Base layer (Lewis et al., 2021). SiDA is orthogonal to these router functions, as they can be accommodated in the hash-building thread. For MoE models with trained routers, we propose to train an LSTM with sparse attention as the hash function, boosted by our truncated knowledge distillation, as detailed in the following sections.

3.4 LSTM with Sparse Attention

3.4.1 Sparse cross-embedding dependency on expert activation

In the MoE layer, each word embedding is fed to the router function to decide which expert to activate for inference on the token. However, the expert activation does not depend solely on the embedding corresponding to the token, because of the self-attention layer before each MoE layer (shown in Fig. 1), where the word embeddings are mixed together. Because of the positional embedding, the position of tokens also affects the expert activation. While the process by which embeddings collectively influence expert activation is complex, we identify a sparse cross-embedding dependency in expert activation, indicating that only a limited number of other tokens and positions are critical to the expert activation for the current token.
Figure 6. Visualization of Eq. 2 over different $p$ and $c$.
Figure 7. Cross-embedding dependency for expert activation on Switch-base-128 on C4. (a) Token dependency. (b) Position dependency. The $x$-axis shows the proportion of corruption, while the $y$-axis represents the empirical probability of expert activation change. Over 100 random embedding positions are examined, with the average trend displayed.
Consider a sequence of length $L$, and let $c_i$ denote the number of critical tokens for the token at position $i$. We define the critical tokens as the tokens in the sequence, other than the selected $i$-th token, whose changes lead to a change in the expert activation of the $i$-th token. To empirically verify that $c_i$ is small for all $i$, we derive a combinatorial equation involving $c_i$ and quantities we can measure. If we select a set of tokens at random from the sequence, excluding the $i$-th token, the probability that the set contains at least one critical token is:
$$\mathbb{E}\left[\hat{p}_{i}\right] = 1 - \frac{\binom{L-1-c_{i}}{\lfloor pL \rfloor}}{\binom{L-1}{\lfloor pL \rfloor}} \tag{2}$$
where $\lfloor pL \rfloor$ denotes the size of the set and $p$ denotes the fraction of the sequence that is selected. Note that the probability that the selected set of tokens contains a critical token equals the probability that the $i$-th token's expert activation changes, denoted as $\hat{p}_{i}$, if we change all selected tokens in the set. We refer to the process of changing tokens in a sequence as 'corruption.' In Eq. 2, $p$ and $\hat{p}$ are quantities that we can acquire empirically: by randomly corrupting a portion $p$ of the tokens, we can measure the probability that the $i$-th token's expert activation changes. Fig. 6 shows the relation between $c$ and $\hat{p}$ under different $p$.
Empirically, to study the token dependency of the token at position $i$, the corruption is executed by randomly modifying a fraction $p$ of tokens chosen from $[L] \setminus \{i\}$ to values distinct from both their original values and the $i$-th token. To examine the position dependency for the $i$-th token, the corruption instead involves randomly choosing a fraction $p$ of positions from $[L] \setminus \{i\}$ and swapping the token positions. We use the English division of the C4 dataset (Raffel et al., 2020) to measure the probability that the $i$-th token's expert activation changes under different $p$, depicted in Fig. 7. We set the length $L = 512$ and truncate or pad sentences that are not of length 512. We randomly test over 100 word embedding positions (i.e., 100 values of $i$) on Switch-base-128 and plot all of them in Fig. 7, with the average trend shown. Fig. 7a and Fig. 7b show the cross-embedding dependency on tokens and positions, respectively. Only a large portion of corruption leads to a high chance of expert activation change, which demonstrates that most of the other tokens have no impact on the expert activation of the current token.
By combining Fig. 6 and Fig. 7, we can read off the best approximation of $c_{i}$ for the different pairs $(p, \hat{p})$ in Fig. 7. We find that the best approximation $\hat{c}$ ranges from 1 to 4, demonstrating the sparse cross-embedding dependency.
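The way Eq. 2 links a measured pair $(p, \hat{p})$ to the number of critical tokens can be sketched numerically. The snippet below is an illustration only: `expected_change_prob` evaluates the right-hand side of Eq. 2, and `estimate_c` is a hypothetical helper (not from the paper) that picks the integer $c$ whose predicted probability is closest to a measured $\hat{p}$, which mimics reading the value off Fig. 6. The search bound `c_max` is an assumption.

```python
from math import comb, floor


def expected_change_prob(L, c, p):
    """Right-hand side of Eq. 2: probability that a random set of
    floor(p * L) tokens, drawn from the L - 1 tokens other than token i,
    contains at least one of the c critical tokens."""
    k = floor(p * L)
    return 1.0 - comb(L - 1 - c, k) / comb(L - 1, k)


def estimate_c(L, p, p_hat, c_max=64):
    """Hypothetical helper: integer c in [0, c_max] whose Eq. 2 value
    is closest to the measured probability p_hat."""
    return min(
        range(c_max + 1),
        key=lambda c: abs(expected_change_prob(L, c, p) - p_hat),
    )


if __name__ == "__main__":
    L = 512
    for c in (1, 2, 4):
        print(c, [round(expected_change_prob(L, c, p), 3) for p in (0.1, 0.5, 0.9)])
```

With a single critical token, the probability grows roughly linearly in $p$, so a high chance of activation change requires corrupting a large portion of the sequence, which matches the qualitative reading of Fig. 7.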

3.4.2 Design of the hash function

The design of the hash function must satisfy the following conditions: (1) be able to capture the sequential information, (2) be lightweight to preserve efficiency, and (3) be able to extract and focus on the critical embeddings automatically. We adopt a 2-layer LSTM followed by a fully connected layer to satisfy the first two conditions. Further, we add one fully connected layer to compress the embedding dimension. To satisfy the third condition, we adopt the sparse attention mechanism with the SparseMax activation (Martins & Astudillo, 2016).
Attention mechanism. The attention mechanism, first proposed by Bahdanau et al. (2015), has proven influential in deep learning. It was proposed to allow the decoder to focus on different parts of the input, resolving the problem that the encoder must encode the entire sentence into a single fixed-length vector. Given a query $\boldsymbol{q}$ and a set of key-value pairs $(\boldsymbol{k}, \boldsymbol{v})$, the attention mechanism computes a weighted sum of the values based on the similarity of the query to the keys. Formally, the output $\boldsymbol{o}$ is computed as $\boldsymbol{o} = \sum_{i} w_{i} \boldsymbol{v}_{i}$, with attention weights
$$w_{i} = \frac{\exp\left(\operatorname{score}\left(\boldsymbol{q}, \boldsymbol{k}_{i}\right)\right)}{\sum_{j} \exp\left(\operatorname{score}\left(\boldsymbol{q}, \boldsymbol{k}_{j}\right)\right)}$$
where $\operatorname{score}(\boldsymbol{q}, \boldsymbol{k})$ is a function that calculates the similarity between the query and a key. One common choice for the score is the dot product of the query and the key.
We append one attention layer right after the LSTM layer, where the key, value, and query are all set to the output sequence of the LSTM. Consequently, each embedding becomes a weighted sum of the sequence, with weights determined by the similarity between two vectors. The attention mechanism allows the predictor to pay different amounts of attention to different embeddings. However, the naive attention mechanism cannot impose a sparse focus. We therefore apply the SparseMax activation over $\boldsymbol{w}$.
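As a concrete reading of the attention layer described above, the following sketch applies dot-product attention with the query, key, and value all set to the same sequence. It is a minimal pure-Python illustration rather than the authors' code: `seq` stands in for the LSTM output sequence, the dot product is assumed as the score function, and `softmax`, `dot`, and `self_attention` are illustrative names.

```python
import math


def softmax(scores):
    """Numerically stable SoftMax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(seq):
    """Dot-product attention where query, key, and value are all `seq`.

    Each output vector is a weighted sum of every vector in the sequence;
    the weights are the SoftMax of the query's dot products with the keys."""
    outputs = []
    for q in seq:
        weights = softmax([dot(q, k) for k in seq])
        dim = len(q)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-in for LSTM outputs
    for row in self_attention(seq):
        print([round(x, 3) for x in row])
```

Note that SoftMax assigns a strictly positive weight to every position, so every vector in the sequence contributes to every output. This is the dense behaviour that motivates a sparse alternative.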
SparseMax activation. In contrast to the SoftMax activation, which provides a dense distribution, that is, non-zero probabilities assigned to all classes or positions, SparseMax provides a sparse distribution, where zero probability is assigned to many positions. We apply the SparseMax activation over the attention weights $\boldsymbol{w}$ to obtain a sparse attention mechanism. Given an input vector $\boldsymbol{w} \in \mathbb{R}^{L}$, the SparseMax transformation is defined as:
$$\operatorname{SparseMax}(\boldsymbol{w}) = \operatorname{argmin}_{\boldsymbol{u} \in \Delta^{L-1}} \|\boldsymbol{u} - \boldsymbol{w}\|_{2}^{2}$$
where $\Delta^{L-1}$ denotes the $(L-1)$-dimensional simplex, i.e.,
$$\Delta^{L-1} = \left\{\boldsymbol{u} \in \mathbb{R}^{L} \mid \boldsymbol{u} \geq 0, \sum_{i=1}^{L} u_{i} = 1\right\}$$
Although the expert selection is affected by other tokens in the sequence, the current token is always the most crucial for expert selection. Hence, we adopt a residual connection (He et al., 2016) right before the final fully connected layer to boost the performance.
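The SparseMax projection defined above has a closed-form solution based on sorting, given by Martins & Astudillo (2016). The sketch below implements that procedure in pure Python to show how it zeroes out low-scoring positions; the function name `sparsemax` and the example inputs are illustrative and not taken from the paper.

```python
def sparsemax(z):
    """Euclidean projection of the vector z onto the probability simplex.

    Sort-based procedure: find the largest k such that
    1 + k * z_(k) > sum of the k largest entries, set the threshold
    tau = (that sum - 1) / k, and return max(z - tau, 0) elementwise."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        if 1.0 + k * z_k > cumulative:
            tau = (cumulative - 1.0) / k   # threshold for the current support
        else:
            break                          # the support cannot grow further
    return [max(v - tau, 0.0) for v in z]


if __name__ == "__main__":
    for scores in ([2.0, 1.0, 0.1], [1.0, 0.8, 0.1], [0.5, 0.5]):
        print(scores, "->", sparsemax(scores))
```

The output always sums to one, like SoftMax, but positions whose score falls below the threshold receive exactly zero weight, so an attention layer using it attends to only a few embeddings.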

3.5 Truncated Knowledge Distillation

The hash function of SiDA is required to predict the expert to be activated and the corresponding scaling factor $\alpha$. Knowledge distillation (KD) (Hinton et al., 2015), which aims to minimize the distance between the logits of the teacher and student models, should in principle be the best training strategy for our hash function. However, our hash function, a 2-layer LSTM, has far less capacity than the MoE model. The predictor cannot fully capture the behavior of the logits of the router functions in the MoE model, and the naive use of KD greatly harms the performance of the system.
We propose Truncated KD (TKD) to tackle this challenge. Unlike traditional KD, TKD only considers the positions with the top-$T$ SoftMax logits, which helps the hash function focus on predicting the scaling factor for experts with a higher chance of being activated. Notably, a large $T$ provides a smooth ground truth for the hash function, while a small $T$ forces the hash function to focus on fewer experts. Further, we add the cross-entropy loss to ensure prediction accuracy. The training objective is $\lambda \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{TKD}}(T)$.
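To make the shape of the objective $\lambda \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{TKD}}(T)$ concrete, here is one possible instantiation for a single token. The text above does not give the exact form of $\mathcal{L}_{\mathrm{TKD}}$, so everything specific here is an assumption: the truncated term is taken to be a KL divergence between the teacher (router) and student (hash function) distributions, both restricted to the teacher's top-$T$ experts and renormalized, and the cross-entropy target is taken to be the router's top-1 expert. The function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tkd_loss(teacher_logits, student_logits, T):
    """Assumed form of truncated KD: KL(teacher || student) restricted to
    the teacher's top-T experts, with both distributions renormalized."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    top = sorted(range(len(p_t)), key=lambda i: p_t[i], reverse=True)[:T]
    z_t = sum(p_t[i] for i in top)
    z_s = sum(p_s[i] for i in top)
    return sum(
        (p_t[i] / z_t) * math.log((p_t[i] / z_t) / (p_s[i] / z_s)) for i in top
    )


def ce_loss(teacher_logits, student_logits):
    """Cross-entropy against the router's top-1 expert (assumed target)."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(softmax(student_logits)[target])


def objective(teacher_logits, student_logits, T, lam):
    """lam * L_CE + L_TKD(T) for one token."""
    return lam * ce_loss(teacher_logits, student_logits) + tkd_loss(
        teacher_logits, student_logits, T
    )


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0, -1.0]   # toy router logits over 4 experts
    student = [0.0, 1.0, 2.0, -1.0]   # toy hash-function logits
    for T in (1, 2, 4):
        print(T, round(objective(teacher, student, T, lam=1.0), 4))
```

Under this assumed form, $T = 1$ makes the truncated term vanish and leaves only the cross-entropy loss, while larger $T$ asks the student to match the relative scaling of more experts.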