Brought to you by explained.ai

The Matrix Calculus You Need For Deep Learning

Terence Parr and Jeremy Howard

(Terence is a tech lead at Google and ex-Professor of computer/data science in University of San Francisco's MS in Data Science program. You might know Terence as the creator of the ANTLR parser generator. For more material, see Jeremy's fast.ai courses and University of San Francisco's Data Institute in-person version of the deep learning course.)

Please send comments, suggestions, or fixes to Terence.

Printable version (This HTML was generated from markup using bookish). A Chinese version is also available (content not verified by us).

Abstract

This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way; just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here.

Contents

Introduction

Most of us last saw calculus in school, but derivatives are a critical part of machine learning, particularly deep neural networks, which are trained by optimizing a loss function. Pick up a machine learning paper or the documentation of a library such as PyTorch and calculus comes screeching back into your life like distant relatives around the holidays. And it's not just any old scalar calculus that pops up; you need differential matrix calculus, the shotgun wedding of linear algebra and multivariate calculus.

Well... maybe need isn't the right word; Jeremy's courses show how to become a world-class deep learning practitioner with only a minimal level of scalar calculus, thanks to leveraging the automatic differentiation built in to modern deep learning libraries. But if you really want to understand what's going on under the hood of these libraries, and grok academic papers discussing the latest advances in model training techniques, you'll need to understand certain bits of the field of matrix calculus.

For example, the activation of a single computation unit in a neural network is typically calculated using the dot product (from linear algebra) of an edge weight vector w with an input vector x plus a scalar bias (threshold): z(x) = w · x + b. Function z(x) is called the unit's affine function and is followed by a rectified linear unit, which clips negative values to zero: activation(x) = max(0, w · x + b). Such a computational unit is sometimes referred to as an “artificial neuron” and looks like:

neuron.png

Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activation of one layer's units become the input to the next layer's units. The activation of the unit or units in the final layer is called the network output.

Training this neuron means choosing weights w and bias b so that we get the desired output for all N inputs x. To do that, we minimize a loss function that compares the network's final activation(x) with the target(x) (desired output of x) for all input x vectors. To minimize the loss, we use some variation on gradient descent, such as plain stochastic gradient descent (SGD), SGD with momentum, or Adam. All of those require the partial derivative (the gradient) of the loss with respect to the model parameters w and b. Our goal is to gradually tweak w and b so that the overall loss function keeps getting smaller across all x inputs.

If we're careful, we can derive the gradient by differentiating the scalar version of a common loss function (mean squared error):

But this is just one neuron, and neural networks must train the weights and biases of all neurons in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network outputs, we really need general rules for the derivative of a function with respect to a vector and even rules for the derivative of a vector-valued function with respect to a vector.

This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks. This field is known as matrix calculus, and the good news is, we only need a small subset of that field, which we introduce here. While there is a lot of online material on multivariate calculus and linear algebra, they are typically taught as two separate undergraduate courses so most material treats them in isolation. The pages that do discuss matrix calculus often are really just lists of rules with minimal explanation or are just pieces of the story. They also tend to be quite obscure to all but a narrow audience of mathematicians, thanks to their use of dense notation and minimal discussion of foundational concepts. (See the annotated list of resources at the end.)

In contrast, we're going to rederive and rediscover some key matrix calculus rules in an effort to explain them. It turns out that matrix calculus is really not that hard! There aren't dozens of new rules to learn; just a couple of key concepts. Our hope is that this short paper will get you started quickly in the world of matrix calculus as it relates to training neural networks. We're assuming you're already familiar with the basics of neural network architecture and training. If you're not, head over to Jeremy's course and complete part 1 of that, then we'll see you back here when you're done. (Note that, unlike many more academic approaches, we strongly suggest first learning to train and use neural networks in practice and then study the underlying math. The math will be much more understandable with the context in place; besides, it's not necessary to grok all this calculus to become an effective practitioner.)

A note on notation: Jeremy's course exclusively uses code, instead of math notation, to explain concepts since unfamiliar functions in code are easy to search for and experiment with. In this paper, we do the opposite: there is a lot of math notation because one of the goals of this paper is to help you understand the notation that you'll see in deep learning papers and books. At the end of the paper, you'll find a brief table of the notation used, including a word or phrase you can use to search for more details.

Review: Scalar derivative rules

Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy on this, have a look at Khan academy vid on scalar derivative rules.

Rule                          f(x)       Scalar derivative notation with respect to x    Example
Constant                      c          0                                                d/dx 99 = 0
Multiplication by constant    cf         c df/dx                                          d/dx 3x = 3
Power Rule                    x^n        n x^(n-1)                                        d/dx x^3 = 3x^2
Sum Rule                      f + g      df/dx + dg/dx                                    d/dx (x^2 + 3x) = 2x + 3
Difference Rule               f - g      df/dx - dg/dx                                    d/dx (x^2 - 3x) = 2x - 3
Product Rule                  fg         f dg/dx + df/dx g                                d/dx x^2 x = x^2 + x 2x = 3x^2
Chain Rule                    f(g(x))    df(u)/du du/dx, let u = g(x)                     d/dx ln(x^2) = (1/x^2) 2x = 2/x

There are other rules for trigonometry, exponentials, etc., which you can find at Khan Academy differential calculus course.

When a function has a single parameter, f(x), you'll often see f'(x) and f' used as shorthands for d/dx f(x). We recommend against this notation as it does not make clear the variable we're taking the derivative with respect to.

You can think of d/dx as an operator that maps a function of one parameter to another function. That means that d/dx f(x) maps f(x) to its derivative with respect to x, which is the same thing as df(x)/dx. Also, if y = f(x), then dy/dx = df(x)/dx = d/dx f(x). Thinking of the derivative as an operator helps to simplify complicated derivatives because the operator is distributive and lets us pull out constants. For example, in the following equation, we can pull out the constant 9 and distribute the derivative operator across the elements within the parentheses.

d/dx 9(x + x^2) = 9 d/dx (x + x^2) = 9 (d/dx x + d/dx x^2) = 9(1 + 2x) = 9 + 18x

That procedure reduced the derivative of 9(x + x^2) to a bit of arithmetic and the derivatives of x and x^2, which are much easier to solve than the original derivative.
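If you want to verify that arithmetic numerically, here is a small Python/numpy sketch of ours (the names f and analytic_df are just illustrative) that compares the closed-form result against a central finite difference:

import numpy as np

def f(x):
    return 9 * (x + x**2)

def analytic_df(x):
    return 9 + 18 * x                           # result of pulling out the 9 and distributing d/dx

h = 1e-6
for x in [0.0, 1.5, -2.0]:
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # central-difference approximation of df/dx
    print(x, numeric, analytic_df(x))           # the last two columns should agree closely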

Introduction to vector calculus and partial derivatives

Neural network layers are not single functions of a single parameter, f(x). So, let's move on to functions of multiple parameters such as f(x,y). For example, what is the derivative of xy (i.e., the multiplication of x and y)? In other words, how does the product xy change when we wiggle the variables? Well, it depends on whether we are changing x or y. We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for x and one for y). Instead of using operator d/dx, the partial derivative operator is ∂/∂x (a stylized d and not the Greek letter δ). So, ∂(xy)/∂x and ∂(xy)/∂y are the partial derivatives of xy; often, these are just called the partials. For functions of a single parameter, operator ∂/∂x is equivalent to d/dx (for sufficiently smooth functions). However, it's better to use d/dx to make it clear you're referring to a scalar derivative.

The partial derivative with respect to x is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider function f(x,y) = 3x^2y. The partial derivative with respect to x is written ∂(3x^2y)/∂x. There are three constants from the perspective of ∂/∂x: 3, 2, and y. Therefore, ∂f(x,y)/∂x = 3y ∂(x^2)/∂x = 3y · 2x = 6yx. The partial derivative with respect to y treats x like a constant: ∂f(x,y)/∂y = 3x^2 ∂y/∂y = 3x^2. It's a good idea to derive these yourself before continuing otherwise the rest of the article won't make sense. Here's the Khan Academy video on partials if you need help.

To make it clear we are doing vector calculus and not just multivariate calculus, let's consider what we do with the partial derivatives ∂f(x,y)/∂x and ∂f(x,y)/∂y (another way to say ∂(3x^2y)/∂x and ∂(3x^2y)/∂y) that we computed for f(x,y) = 3x^2y. Instead of having them just floating around and not organized in any way, let's organize them into a horizontal vector. We call this vector the gradient of f(x,y) and write it as:

∇f(x,y) = [ ∂f(x,y)/∂x , ∂f(x,y)/∂y ] = [ 6yx , 3x^2 ]

So the gradient of f(x,y) is simply a vector of its partials. Gradients are part of the vector calculus world, which deals with functions that map n scalar parameters to a single scalar. Now, let's get crazy and consider derivatives of multiple functions simultaneously.
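As a quick sanity check on those partials, here's a small numpy sketch (grad_f and the sample point are our own illustrative choices) that compares the gradient [6yx, 3x^2] against central finite differences:

import numpy as np

def f(x, y):
    return 3 * x**2 * y

def grad_f(x, y):
    return np.array([6 * y * x, 3 * x**2])      # [df/dx, df/dy]

h = 1e-6
x, y = 1.7, -0.4
numeric = np.array([
    (f(x + h, y) - f(x - h, y)) / (2 * h),      # partial with respect to x
    (f(x, y + h) - f(x, y - h)) / (2 * h),      # partial with respect to y
])
print(numeric)
print(grad_f(x, y))                             # should match to about 1e-6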

Matrix calculus

When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let's compute partial derivatives for two functions, both of which take two parameters. We can keep the same f(x,y) = 3x^2y from the last section, but let's also bring in g(x,y) = 2x + y^8. The gradient for g has two entries, a partial derivative for each parameter:

∂g(x,y)/∂x = ∂(2x)/∂x + ∂(y^8)/∂x = 2 + 0 = 2

and

∂g(x,y)/∂y = ∂(2x)/∂y + ∂(y^8)/∂y = 0 + 8y^7 = 8y^7

giving us gradient ∇g(x,y) = [ 2 , 8y^7 ].

Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:

J = \begin{bmatrix}
\nabla f(x,y) \\
\nabla g(x,y)
\end{bmatrix} = \begin{bmatrix}
\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\
\frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}
\end{bmatrix} = \begin{bmatrix}
6yx & 3x^2 \\
2 & 8y^7
\end{bmatrix}

Welcome to matrix calculus!

Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout but many papers and software will use the denominator layout. This is just transpose of the numerator layout Jacobian (flip it around its diagonal):
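To make the layout concrete, here is a numpy sketch of ours (the finite-difference helper jacobian_fd is purely illustrative, not a library function) that builds the numerator-layout Jacobian of f(x,y) = 3x^2y and g(x,y) = 2x + y^8 numerically and compares it with the analytic matrix; the denominator layout would simply be its transpose:

import numpy as np

def F(v):                                       # stack both scalar functions into one vector function
    x, y = v
    return np.array([3 * x**2 * y, 2 * x + y**8])

def jacobian_fd(F, v, h=1e-6):                  # central-difference Jacobian: rows = functions, cols = parameters
    J = np.zeros((F(v).shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

v = np.array([1.2, 0.9])
x, y = v
analytic = np.array([[6 * y * x, 3 * x**2],
                     [2.0,       8 * y**7]])
print(np.round(jacobian_fd(F, v), 6))
print(analytic)                                 # numerator layout; analytic.T would be denominator layout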

Generalization of the Jacobian

So far, we've looked at a specific example of a Jacobian matrix. To define the Jacobian matrix more generally, let's combine multiple parameters into a single vector argument: f(x, y, z) ⇒ f(x). (You will sometimes see notation \vec{x} for vectors in the literature as well.) Lowercase letters in bold font such as x are vectors and those in italics font like x are scalars. xi is the ith element of vector x and is in italics because a single vector element is a scalar. We also have to define an orientation for vector x. We'll assume that all vectors are vertical by default of size n × 1:

With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length n = |x| where |x| is the cardinality (count) of elements in x. Each fi function within f returns a scalar just as in the previous section:

For instance, we'd represent and from the last section as

It's very often the case that m = n because we will have a scalar function result for each element of the x vector. For example, consider the identity function y = f(x) = x:

So we have m = n functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns), which is the stack of m gradients with respect to x:

Each is a horizontal n-vector because the partial derivative is with respect to a vector, x, whose length is . The width of the Jacobian is n if we're taking the partial derivative with respect to x because there are n parameters we can wiggle, each potentially changing the function's value. Therefore, the Jacobian is always m rows for m equations. It helps to think about the possible Jacobian shapes visually:


\begin{tabular}{c|ccl}
  & \begin{tabular}[t]{c}
  scalar\\
  \framebox(18,18){$x$}\\
  \end{tabular} & \begin{tabular}{c}
  vector\\
  \framebox(18,40){$\mathbf{x}$}
  \end{tabular}\\
\hline
%\[\dimexpr-\normalbaselineskip+5pt]
\begin{tabular}[b]{c}
  scalar\\
  \framebox(18,18){$f$}\\
  \end{tabular} &\framebox(18,18){$\frac{\partial f}{\partial {x}}$} & \framebox(40,18){$\frac{\partial f}{\partial {\mathbf{x}}}$}&\\
\begin{tabular}[b]{c}
  vector\\
  \framebox(18,40){$\mathbf{f}$}\\
  \end{tabular} & \framebox(18,40){$\frac{\partial \mathbf{f}}{\partial {x}}$} & \framebox(40,40){$\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$}\\
\end{tabular}

The Jacobian of the identity function f(x) = x, with fi(x) = xi, has n functions and each function has n parameters held in a single vector x. The Jacobian is, therefore, a square matrix since m = n:


\begin{eqnarray*}
	\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix}
	\frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\
	\frac{\partial}{\partial \mathbf{x}} f_2(\mathbf{x})\\
	\ldots\\
	\frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x})
	\end{bmatrix} &=& \begin{bmatrix}
	\frac{\partial}{\partial {x_1}} f_1(\mathbf{x})~ \frac{\partial}{\partial {x_2}} f_1(\mathbf{x}) ~\ldots~  \frac{\partial}{\partial {x_n}} f_1(\mathbf{x}) \\
	\frac{\partial}{\partial {x_1}} f_2(\mathbf{x})~ \frac{\partial}{\partial {x_2}} f_2(\mathbf{x}) ~\ldots~  \frac{\partial}{\partial {x_n}} f_2(\mathbf{x}) \\
	\ldots\\
	~\frac{\partial}{\partial {x_1}} f_m(\mathbf{x})~ \frac{\partial}{\partial {x_2}} f_m(\mathbf{x}) ~\ldots~ \frac{\partial}{\partial {x_n}} f_m(\mathbf{x}) \\
	\end{bmatrix}\\\\
	& = & \begin{bmatrix}
	\frac{\partial}{\partial {x_1}} x_1~ \frac{\partial}{\partial {x_2}} x_1 ~\ldots~ \frac{\partial}{\partial {x_n}} x_1 \\
	\frac{\partial}{\partial {x_1}} x_2~ \frac{\partial}{\partial {x_2}} x_2 ~\ldots~ \frac{\partial}{\partial {x_n}} x_2 \\
	\ldots\\
	~\frac{\partial}{\partial {x_1}} x_n~ \frac{\partial}{\partial {x_2}} x_n ~\ldots~ \frac{\partial}{\partial {x_n}} x_n \\
	\end{bmatrix}\\\\
	& & (\text{and since } \frac{\partial}{\partial {x_j}} x_i = 0 \text{ for } j \neq i)\\
	 & = & \begin{bmatrix}
	\frac{\partial}{\partial {x_1}} x_1 & 0 & \ldots& 0 \\
	0 & \frac{\partial}{\partial {x_2}} x_2 &\ldots & 0 \\
	& & \ddots\\
	0 & 0 &\ldots& \frac{\partial}{\partial {x_n}} x_n \\
	\end{bmatrix}\\\\
	 & = & \begin{bmatrix}
	1 & 0 & \ldots& 0 \\
	0 &1 &\ldots & 0 \\
	& & \ddots\\
	0 & 0 & \ldots &1 \\
	\end{bmatrix}\\\\
	& = & I ~~~(I \text{ is the identity matrix with ones down the diagonal})\\
	\end{eqnarray*}

Make sure that you can derive each step above before moving on. If you get stuck, just consider each element of the matrix in isolation and apply the usual scalar derivative rules. That is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.

Also be careful to track whether a matrix is vertical, x, or horizontal, x^T, where x^T means x transpose. Also make sure you pay attention to whether something is a scalar-valued function, y = f(x), or a vector of functions (or a vector-valued function), y = f(x) (bold y and f).
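As a tiny numerical confirmation of the derivation above, the following sketch (reusing the same illustrative finite-difference helper) shows that the Jacobian of the identity function is the identity matrix:

import numpy as np

def jacobian_fd(F, v, h=1e-6):
    J = np.zeros((F(v).shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

identity = lambda x: x.copy()                   # f_i(x) = x_i
x = np.array([3.0, -1.0, 2.5])
print(np.round(jacobian_fd(identity, x), 6))    # prints the 3x3 identity matrix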

Derivatives of vector element-wise binary operators

Element-wise binary operations on vectors, such as vector addition , are important because we can express many common vector operations, such as the multiplication of a vector by a scalar, as element-wise binary operations. By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. This is how all the basic math operators are applied by default in numpy or tensorflow, for example. Examples that often crop up in deep learning are and (returns a vector of ones and zeros).

We can generalize the element-wise binary operations with notation where . (Reminder: is the number of items in x.) The symbol represents any element-wise operator (such as ) and not the function composition operator. Here's what equation looks like when we zoom in to examine the scalar equations:

where we write n (not m) equations vertically to emphasize the fact that the result of element-wise operators gives n-sized vector results.

Using the ideas from the last section, we can see that the general case for the Jacobian with respect to w is the square matrix:

and the Jacobian with respect to x is:

That's quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal. Because this greatly simplifies the Jacobian, let's examine in detail when the Jacobian reduces to a diagonal matrix for element-wise operations.

In a diagonal Jacobian, all elements off the diagonal are zero, where . (Notice that we are taking the partial derivative with respect to wj not wi.) Under what conditions are those off-diagonal elements zero? Precisely when fi and gi are constants with respect to wj, . Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero, no matter what, and the partial derivative of a constant is zero.

Those partials go to zero when fi and gi are not functions of wj. We know that element-wise operations imply that fi is purely a function of wi and gi is purely a function of xi. For example, sums . Consequently, reduces to and the goal becomes . and look like constants to the partial differentiation operator with respect to wj when so the partials are zero off the diagonal. (Notation is technically an abuse of our notation because fi and gi are functions of vectors not individual elements. We should really write something like , but that would muddy the equations further, and programmers are comfortable overloading functions, so we'll proceed with the notation anyway.)

We'll take advantage of this simplification later and refer to the constraint that and access at most wi and xi, respectively, as the element-wise diagonal condition.

Under this condition, the elements along the diagonal of the Jacobian are :

(The large “0”s are a shorthand indicating all of the off-diagonal are 0.)

More succinctly, we can write:

and

where constructs a matrix whose diagonal elements are taken from vector x.

Because we do lots of simple vector arithmetic, the general function in the binary element-wise operation is often just the vector w. Any time the general function is a vector, we know that reduces to . For example, vector addition fits our element-wise diagonal condition because has scalar equations that reduce to just with partial derivatives:

That gives us , the identity matrix, because every element along the diagonal is 1. I represents the square identity matrix of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.

Given the simplicity of this special case, reducing to , you should be able to derive the Jacobians for the common element-wise binary operations on vectors:

The and operators are element-wise multiplication and division; is sometimes called the Hadamard product. There isn't a standard notation for element-wise multiplication and division so we're using an approach consistent with our general binary operation notation.
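Here is a short numpy sketch (all names are our own) confirming the diagonal structure for two common element-wise operations, vector addition and the Hadamard product:

import numpy as np

def jacobian_fd(F, v, h=1e-6):
    J = np.zeros((F(v).shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

print(np.round(jacobian_fd(lambda w_: w_ + x, w), 6))   # d(w+x)/dw: the identity matrix
print(np.round(jacobian_fd(lambda w_: w_ * x, w), 6))   # d(w*x)/dw: diag(x)
print(np.diag(x))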

Derivatives involving scalar expansion

When we multiply or add scalars to vectors, we're implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar z to vector x, , is really where and . (The notation represents a vector of ones of appropriate length.) z is any scalar that doesn't depend on x, which is useful because then for any xi and that will simplify our partial derivative computations. (It's okay to think of variable z as a constant for our discussion here.) Similarly, multiplying by a scalar, , is really where is the element-wise multiplication (Hadamard product) of the two vectors.

The partial derivatives of vector-scalar addition and multiplication with respect to vector x use our element-wise rule:

This follows because functions and clearly satisfy our element-wise diagonal condition for the Jacobian (that refer at most to xi and refers to the value of the vector).

Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of the Jacobian for vector-scalar addition:

So, .

Computing the partial derivative with respect to the scalar parameter z, however, results in a vertical vector, not a diagonal matrix. The elements of the vector are:

Therefore, .

The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for scalar derivatives:

So, .

The partial derivative with respect to scalar parameter z is a vertical vector whose elements are:

This gives us .
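The following sketch (again with an illustrative finite-difference helper and made-up values) checks the scalar-expansion results above: the Jacobians with respect to x are diagonal, while the partials with respect to z are vertical vectors:

import numpy as np

def jacobian_fd(F, v, h=1e-6):
    v = np.atleast_1d(np.asarray(v, dtype=float))
    J = np.zeros((np.atleast_1d(F(v)).shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        J[:, j] = (np.atleast_1d(F(v + e)) - np.atleast_1d(F(v - e))) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
z = 5.0

print(np.round(jacobian_fd(lambda x_: x_ + z, x), 6))        # d(x+z)/dx = I
print(np.round(jacobian_fd(lambda z_: x + z_[0], [z]), 6))   # d(x+z)/dz = column of ones
print(np.round(jacobian_fd(lambda x_: x_ * z, x), 6))        # d(xz)/dx  = z * I
print(np.round(jacobian_fd(lambda z_: x * z_[0], [z]), 6))   # d(xz)/dz  = column equal to x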

Vector sum reduction

Summing up the elements of a vector is an important operation in deep learning, such as the network loss function, but we can also use it as a way to simplify computing the derivative of vector dot product and other operations that reduce vectors to scalars.

Let . Notice we were careful here to leave the parameter as a vector x because each function fi could use all values in the vector, not just xi. The sum is over the results of the function and not the parameter. The gradient ( Jacobian) of vector summation is:

(The summation inside the gradient elements can be tricky so make sure to keep your notation consistent.)

Let's look at the gradient of the simple y = sum(x). The function inside the summation is just fi(x) = xi and the gradient is then:

Because ∂xi/∂xj = 0 for j ≠ i, we can simplify to:

Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is 1^T. (The T exponent of 1^T represents the transpose of the indicated vector. In this case, it flips a vertical vector to a horizontal vector.) It's very important to keep the shape of all of your vectors and matrices in order otherwise it's impossible to compute the derivatives of complex functions.

As another example, let's sum the result of multiplying a vector by a constant scalar. If y = sum(xz) then y = Σ_i x_i z. The gradient is:

The derivative with respect to scalar variable z is :
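A small numerical check of these sum-reduction gradients (helper and variable names are illustrative):

import numpy as np

def grad_fd(f, v, h=1e-6):                       # finite-difference gradient of a scalar-valued f
    v = np.asarray(v, dtype=float)
    g = np.zeros_like(v)
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        g[j] = (f(v + e) - f(v - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 4.0])
z = 3.0
h = 1e-6

print(grad_fd(np.sum, x))                        # gradient of sum(x): a vector of ones (1^T)
print(grad_fd(lambda x_: np.sum(x_ * z), x))     # gradient of sum(xz) w.r.t. x: z in every slot
dy_dz = (np.sum(x * (z + h)) - np.sum(x * (z - h))) / (2 * h)
print(dy_dz, np.sum(x))                          # derivative w.r.t. z equals sum(x)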

The Chain Rules

We can't compute partial derivatives of very complicated functions using just the basic matrix calculus rules we've seen so far. For example, we can't take the derivative of nested expressions like directly without reducing it to its scalar equivalent. We need to be able to combine our basic vector rules using what we can call the vector chain rule. Unfortunately, there are a number of rules for differentiation that fall under the name “chain rule” so we have to be careful which chain rule we're talking about. Part of our goal here is to clearly define and name three different chain rules and indicate in which situation they are appropriate. To get warmed up, we'll start with what we'll call the single-variable chain rule, where we want the derivative of a scalar function with respect to a scalar. Then we'll move on to an important concept called the total derivative and use it to define what we'll pedantically call the single-variable total-derivative chain rule. Then, we'll be ready for the vector chain rule in its full glory as needed for neural networks.

The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.

The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like . The outermost expression takes the sin of an intermediate result, a nested subexpression that squares x. Specifically, we need the single-variable chain rule, so let's start by digging into that in more detail.

Single-variable chain rule

Let's start with the solution to the derivative of our nested expression: . It doesn't take a mathematical genius to recognize components of the solution that smack of scalar differentiation rules, and . It looks like the solution is to multiply the derivative of the outer expression by the derivative of the inner expression or “chain the pieces together,” which is exactly right. In this section, we'll explore the general principle at work and provide a process that works for highly-nested expressions of a single variable.

Chain rules are typically defined in terms of nested functions, such as for single-variable chain rules. (You will also see the chain rule defined using function composition , which is the same thing.) Some sources write the derivative using shorthand notation , but that hides the fact that we are introducing an intermediate variable: , which we'll see shortly. It's better to define the single-variable chain rule of explicitly so we never take the derivative with respect to the wrong variable. Here is the formulation of the single-variable chain rule we recommend:

To deploy the single-variable chain rule, follow these steps:

  1. Introduce intermediate variables for nested subexpressions and subexpressions for both binary and unary operators; e.g., multiplication is binary, sin(x) and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
  2. Compute derivatives of the intermediate variables with respect to their parameters.
  3. Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
  4. Substitute intermediate variables back in if any are referenced in the derivative equation.

The third step puts the “chain” in “chain rule” because it chains together intermediate results. Multiplying the intermediate derivatives together is the common theme among all variations of the chain rule.

Let's try this process on sin(x^2):

  1. Introduce intermediate variables. Let u represent subexpression x^2 (shorthand for u(x) = x^2). This gives us:

    u = x^2 (the inner subexpression)
    y = sin(u) (y as a function of the intermediate variable u)

    The order of these subexpressions does not affect the answer, but we recommend working in the reverse order of operations dictated by the nesting (innermost to outermost). That way, expressions and derivatives are always functions of previously-computed elements.

  2. Compute derivatives.
  3. Combine.
  4. Substitute.

Notice how easy it is to compute the derivatives of the intermediate variables in isolation! The chain rule says it's legal to do that and tells us how to combine the intermediate results to get .
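Here is the same four-step process written out as a short Python sketch for y = sin(x^2); the variable names (u, dudx, dydu) just mirror the intermediate quantities above:

import numpy as np

x = 1.3
u = x**2                   # step 1: intermediate variable u = x^2
dudx = 2 * x               # step 2: derivative of u with respect to x
dydu = np.cos(u)           # step 2: derivative of y = sin(u) with respect to u
dydx = dydu * dudx         # step 3: chain (multiply) the pieces together
                           # step 4: substituting u = x^2 gives dydx = cos(x^2) * 2x

h = 1e-6
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
print(dydx, numeric)       # the two should agree closely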

You can think of the combining step of the chain rule in terms of units canceling. If we let y be miles, x be the gallons in a gas tank, and u be gallons, we can interpret as . The gallon denominator and numerator cancel.

Another way to think about the single-variable chain rule is to visualize the overall expression as a dataflow diagram or chain of operations (or abstract syntax tree for compiler people):

sin-square.png

Changes to function parameter x bubble up through a squaring operation then through a sin operation to change result y. You can think of as “getting changes from x to u” and as “getting changes from u to y.” Getting from x to y requires an intermediate hop. The chain rule is, by convention, usually written from the output variable down to the parameter(s), . But, the x-to-y perspective would be more clear if we reversed the flow and used the equivalent .

Conditions under which the single-variable chain rule applies. Notice that there is a single dataflow path from x to the root y. Changes in x can influence output y in only one way. That is the condition under which we can apply the single-variable chain rule. An easier condition to remember, though one that's a bit looser, is that none of the intermediate subexpression functions, and , have more than one parameter. Consider , which would become after introducing intermediate variable u. As we'll see in the next section, has multiple paths from x to y. To handle that situation, we'll deploy the single-variable total-derivative chain rule.


As an aside for those interested in automatic differentiation, papers and library documentation use the terms forward differentiation and backward differentiation (for use in the back-propagation algorithm). From a dataflow perspective, we are computing a forward differentiation because it follows the normal data flow direction. Backward differentiation, naturally, goes the other direction and we're asking how a change in the output would affect function parameter x. Because backward differentiation can determine changes in all function parameters at once, it turns out to be much more efficient for computing the derivative of functions with lots of parameters. Forward differentiation, on the other hand, must consider how a change in each parameter, in turn, affects the function output y. The following table emphasizes the order in which partial derivatives are computed for the two techniques.

Forward differentiation from x to y        Backward differentiation from y to x

Automatic differentiation is beyond the scope of this article, but we're setting the stage for a future article.


Many readers can solve d/dx sin(x^2) in their heads, but our goal is a process that will work even for very complicated expressions. This process is also how automatic differentiation works in libraries like PyTorch. So, by solving derivatives manually in this way, you're also learning how to define functions for custom neural networks in PyTorch.

With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler unravels nested function calls like into a sequence (chain) of calls. The result of calling function fi is saved to a temporary variable called a register, which is then passed as a parameter to . Let's see how that looks in practice by using our process on a highly-nested equation like :

  1. Introduce intermediate variables.
  2. Compute derivatives.
  3. Combine four intermediate values.
  4. Substitute.

Here is a visualization of the data flow through the chain of operations from x to y:

chain-tree.png

At this point, we can handle derivatives of nested expressions of a single variable, x, using the chain rule but only if x can affect y through a single data flow path. To handle more complicated expressions, we need to extend our technique, which we'll do next.

Single-variable total-derivative chain rule

Our single-variable chain rule has limited applicability because all intermediate variables must be functions of single variables. But, it demonstrates the core mechanism of the chain rule, that of multiplying out all derivatives of intermediate subexpressions. To handle more general expressions such as , however, we need to augment that basic chain rule.

Of course, we immediately see , but that is using the scalar addition derivative rule, not the chain rule. If we tried to apply the single-variable chain rule, we'd get the wrong answer. In fact, the previous chain rule is meaningless in this case because derivative operator does not apply to multivariate functions, such as among our intermediate variables:

Let's try it anyway to see what happens. If we pretend that and , then instead of the right answer .

Because has multiple parameters, partial derivatives come into play. Let's blindly apply the partial derivative operator to all of our equations and see what we get:

Ooops! The partial is wrong because it violates a key assumption for partial derivatives. When taking the partial derivative with respect to x, the other variables must not vary as x varies. Otherwise, we could not act as if the other variables were constants. Clearly, though, is a function of x and therefore varies with x. because . A quick look at the data flow diagram for shows multiple paths from x to y, thus, making it clear we need to consider direct and indirect (through ) dependencies on x:

plus-square.png

A change in x affects y both as an operand of the addition and as the operand of the square operator. Here's an equation that describes how tweaks to x affect the output:

Then, , which we can read as “the change in y is the difference between the original y and y at a tweaked x.”

If we let , then . If we bump x by 1, , then . The change in y is not , as would lead us to believe, but !

Enter the “law” of total derivatives, which basically says that to compute , we need to sum up all possible contributions from changes in x to the change in y. The total derivative with respect to x assumes all variables, such as in this case, are functions of x and potentially vary as x varies. The total derivative of that depends on x directly and indirectly via intermediate variable is given by:

Using this formula, we get the proper answer:

That is an application of what we can call the single-variable total-derivative chain rule:

The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.

There is something subtle going on here with the notation. All of the derivatives are shown as partial derivatives because f and ui are functions of multiple variables. This notation mirrors that of MathWorld's notation but differs from Wikipedia, which uses instead (possibly to emphasize the total derivative nature of the equation). We'll stick with the partial derivative notation so that it's consistent with our discussion of the vector chain rule in the next section.

In practice, just keep in mind that when you take the total derivative with respect to x, other variables might also be functions of x so add in their contributions as well. The left side of the equation looks like a typical partial derivative but the right-hand side is actually the total derivative. It's common, however, that many temporary variables are functions of a single parameter, which means that the single-variable total-derivative chain rule degenerates to the single-variable chain rule.
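To see the total derivative at work numerically, here is a short sketch for y = x + x^2 with intermediate variable u1 = x^2 (all names and the sample point are illustrative); the partial alone gives 1, while the total derivative 1 + 2x matches a finite difference:

x = 2.0
u1 = x**2
du1_dx = 2 * x
du2_du1 = 1.0                                 # u2(x, u1) = x + u1
du2_dx_partial = 1.0                          # holding u1 fixed: the incomplete answer
total = du2_dx_partial + du2_du1 * du1_dx     # total derivative: 1 + 2x

h = 1e-6
numeric = (((x + h) + (x + h)**2) - ((x - h) + (x - h)**2)) / (2 * h)
print(du2_dx_partial, total, numeric)         # 1.0 vs 5.0 vs ~5.0 at x = 2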

Let's look at a nested subexpression, such as . We introduce three intermediate variables:

and partials:

where both and have terms that take into account the total derivative.

Also notice that the total derivative formula always sums versus, say, multiplies terms . It's tempting to think that summing up terms in the derivative makes sense because, for example, adds two terms. Nope. The total derivative is adding terms because it represents a weighted sum of all x contributions to the change in y. For example, given instead of , the total-derivative chain rule formula still adds partial derivative terms. ( simplifies to but for this demonstration, let's not combine the terms.) Here are the intermediate variables and partial derivatives:

The form of the total derivative remains the same, however:

It's the partials (weights) that change, not the formula, when the intermediate variable operators change.

Those readers with a strong calculus background might wonder why we aggressively introduce intermediate variables even for the non-nested subexpressions such as in . We use this process for three reasons: (i) computing the derivatives for the simplified subexpressions is usually trivial, (ii) we can simplify the chain rule, and (iii) the process mirrors how automatic differentiation works in neural network libraries.

Using the intermediate variables even more aggressively, let's see how we can simplify our single-variable total-derivative chain rule to its final form. The goal is to get rid of the sticking out on the front like a sore thumb:

We can achieve that by simply introducing a new temporary variable as an alias for x: . Then, the formula reduces to our final form:

This total-derivative chain rule degenerates to the single-variable chain rule when all intermediate variables are functions of a single variable. Consequently, you can remember this more general formula to cover both cases. As a bit of dramatic foreshadowing, notice that the summation sure looks like a vector dot product, , or a vector multiply .

Before we move on, a word of caution about terminology on the web. Unfortunately, the chain rule given in this section, based upon the total derivative, is universally called “multivariable chain rule” in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate functions. The overall function, say, , is a scalar function that accepts a single parameter x. The derivative and parameter are scalars, not vectors, as one would expect with a so-called multivariate chain rule. (Within the context of a non-matrix calculus class, “multivariate chain rule” is likely unambiguous.) To reduce confusion, we use “single-variable total-derivative chain rule” to spell out the distinguishing feature between the simple single-variable chain rule, , and this one.

Vector chain rule

Now that we've got a good handle on the total-derivative chain rule, we're ready to tackle the chain rule for vectors of functions and vector variables. Surprisingly, this more general chain rule is just as simple looking as the single-variable chain rule for scalars. Rather than just presenting the vector chain rule, let's rediscover it ourselves so we get a firm grip on it. We can start by computing the derivative of a sample vector function with respect to a scalar, , to see if we can abstract a general formula.

Let's introduce two intermediate variables, and , one for each fi so that y looks more like :

The derivative of vector y with respect to scalar x is a vertical vector with elements computed using the single-variable total-derivative chain rule:

Ok, so now we have the answer using just the scalar rules, albeit with the derivatives grouped into a vector. Let's try to abstract from that result what it looks like in vector form. The goal is to convert the following vector of scalar operations to a vector operation.

If we split the terms, isolating the terms into a vector, we get a matrix by vector multiplication:

That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool. Let's check our results:

Whew! We get the same answer as the scalar approach. This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule. Compare the vector rule:

with the single-variable chain rule:

To make this formula work for multiple parameters or vector x, we just have to change x to vector x in the equation. The effect is that and the resulting Jacobian, , are now matrices instead of vertical vectors. Our complete vector chain rule is:

The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of fi with respect to gj and gi with respect to xj. For completeness, here are the two Jacobian components in their full glory:

where , , and . The resulting Jacobian is (an matrix multiplied by a matrix).

Even within this formula, we can simplify further because, for many applications, the Jacobians are square () and the off-diagonal entries are zero. It is the nature of neural networks that the associated mathematics deals with functions of vectors not vectors of functions. For example, the neuron affine function has term and the activation function is ; we'll consider derivatives of these functions in the next section.

As we saw in a previous section, element-wise operations on vectors w and x yield diagonal matrices with elements because wi is a function purely of xi but not xj for . The same thing happens here when fi is purely a function of gi and gi is purely a function of xi:

In this situation, the vector chain rule simplifies to:

Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain rule values.
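Here is a small numpy sketch of the vector chain rule in action: we multiply the two Jacobians and compare against a finite-difference Jacobian of the composition. The functions f and g below are arbitrary smooth examples of ours, not anything from the article:

import numpy as np

def jacobian_fd(F, v, h=1e-6):
    J = np.zeros((F(v).shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

g = lambda x: np.array([x[0] * x[1], x[0] + x[1]**2])          # R^2 -> R^2
f = lambda u: np.array([np.sin(u[0]), u[0] * u[1], u[1]**3])   # R^2 -> R^3

x = np.array([0.7, -1.2])
J_chain  = jacobian_fd(f, g(x)) @ jacobian_fd(g, x)            # (3x2) = (3x2) @ (2x2)
J_direct = jacobian_fd(lambda x_: f(g(x_)), x)
print(np.round(J_chain, 6))
print(np.round(J_direct, 6))                                   # the two should match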

After slogging through all of that mathematics, here's the payoff. All you need is the vector chain rule because the single-variable formulas are special cases of the vector chain rule. The following table summarizes the appropriate components to multiply in order to get the Jacobian.


\begin{tabular}[t]{c|cccc}
  & 
\multicolumn{2}{c}{
  \begin{tabular}[t]{c}
  scalar\\
  \framebox(18,18){$x$}\\
  \end{tabular}} & &\begin{tabular}{c}
  vector\\
  \framebox(18,40){$\mathbf{x}$}\\
  \end{tabular} \\
  
  \begin{tabular}{c}$\frac{\partial}{\partial \mathbf{x}} \mathbf{f}(\mathbf{g}(\mathbf{x}))$
	   = $\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial\mathbf{g}}{\partial \mathbf{x}}$
		\\
		\end{tabular} & \begin{tabular}[t]{c}
  scalar\\
  \framebox(18,18){$u$}\\
  \end{tabular} & \begin{tabular}{c}
  vector\\
  \framebox(18,40){$\mathbf{u}$}
  \end{tabular}& & \begin{tabular}{c}
  vector\\
  \framebox(18,40){$\mathbf{u}$}\\
  \end{tabular} \\
\hline
%\[dimexpr-\normalbaselineskip+5pt]

\begin{tabular}[b]{c}
  scalar\\
  \framebox(18,18){$f$}\\
  \end{tabular} &\framebox(18,18){$\frac{\partial f}{\partial {u}}$} \framebox(18,18){$\frac{\partial u}{\partial {x}}$} ~~~& \raisebox{22pt}{\framebox(40,18){$\frac{\partial f}{\partial {\mathbf{u}}}$}} \framebox(18,40){$\frac{\partial \mathbf{u}}{\partial x}$} & ~~~&
\raisebox{22pt}{\framebox(40,18){$\frac{\partial f}{\partial {\mathbf{u}}}$}} \framebox(40,40){$\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$}
\\
  
\begin{tabular}[b]{c}
  vector\\
  \framebox(18,40){$\mathbf{f}$}\\
  \end{tabular} & \framebox(18,40){$\frac{\partial \mathbf{f}}{\partial {u}}$} \raisebox{22pt}{\framebox(18,18){$\frac{\partial u}{\partial {x}}$}} & \framebox(40,40){$\frac{\partial \mathbf{f}}{\partial \mathbf{u}}$} \framebox(18,40){$\frac{\partial \mathbf{u}}{\partial x}$} & & \framebox(40,40){$\frac{\partial \mathbf{f}}{\partial \mathbf{u}}$} \framebox(40,40){$\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$}\\
  
\end{tabular}

The gradient of neuron activation

We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, w and b:

(This represents a neuron with fully connected weights and rectified linear unit activation. There are, however, other affine functions such as convolution and other activation functions, such as exponential linear units, that follow similar logic.)

Let's worry about max later and focus on computing and . (Recall that neural networks learn through optimization of their weights and biases.) We haven't discussed the derivative of the dot product yet, , but we can use the chain rule to avoid having to memorize yet another rule. (Note notation y not y as the result is a scalar not a vector.)

The dot product is just the summation of the element-wise multiplication of the elements: . (You might also find it useful to remember the linear algebra notation .) We know how to compute the partial derivatives of and but haven't looked at partial derivatives for . We need the chain rule for that and so we can introduce an intermediate vector variable u just as we did using the single-variable chain rule:

Once we've rephrased y, we recognize two subexpressions for which we already know the partial derivatives:

The vector chain rule says to multiply the partials:

To check our results, we can grind the dot product down into a pure scalar function:

Then:

Hooray! Our scalar results match the vector chain rule results.
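For one more numerical confirmation, this tiny sketch (illustrative names and values) checks that the gradient of y = w · x with respect to w is x:

import numpy as np

def grad_fd(f, v, h=1e-6):
    g = np.zeros_like(v)
    for j in range(v.shape[0]):
        e = np.zeros_like(v); e[j] = h
        g[j] = (f(v + e) - f(v - e)) / (2 * h)
    return g

w = np.array([0.5, -1.0, 2.0])
x = np.array([3.0, 4.0, 5.0])
print(grad_fd(lambda w_: np.dot(w_, x), w))     # gradient of w . x with respect to w
print(x)                                        # ...equals x (written x^T as a horizontal vector)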

Now, let , the full expression within the max activation function call. We have two different partials to compute, but we don't need the chain rule:

Let's tackle the partials of the neuron activation, . The use of the function call on scalar z just says to treat all negative z values as 0. The derivative of the max function is a piecewise function. When , the derivative is 0 because z is a constant. When , the derivative of the max function is just the derivative of z, which is :


An aside on broadcasting functions across scalars. When one or both of the max arguments are vectors, such as , we broadcast the single-variable function max across the elements. This is an example of an element-wise unary operator. Just to be clear:

For the derivative of the broadcast version then, we get a vector of zeros and ones where:


To get the derivative of the function, we need the chain rule because of the nested subexpression, . Following our process, let's introduce intermediate scalar variable z to represent the affine function giving:

The vector chain rule tells us:

which we can rewrite as follows:

and then substitute back in:

That equation matches our intuition. When the activation function clips affine function output z to 0, the derivative is zero with respect to any weight wi. When , it's as if the max function disappears and we get just the derivative of z with respect to the weights.

Turning now to the derivative of the neuron activation with respect to b, we get:
现在,我们来看看神经元激活相对于 b 的导数:

Let's use these partial derivatives now to handle the entire loss function.

The gradient of the neural network loss function

Training a neuron requires that we take the derivative of our loss or “cost” function with respect to the parameters of our model, w and b. For this example, we'll use mean-squared-error as our loss function. Because we train with multiple vector inputs (e.g., multiple images) and scalar targets (e.g., one classification per image), we need some more notation. Let

where , and then let

where yi is a scalar. Then the cost equation becomes:

Following our chain rule process introduces these intermediate variables:

Let's compute the gradient with respect to w first.

The gradient with respect to the weights

From before, we know:

and

Then, for the overall gradient, we get:
那么,对于整体梯度,我们可以得到


\begin{eqnarray*}
 \frac{\partial C(v)}{\partial \mathbf{w}} & = & \frac{\partial }{\partial \mathbf{w}}\frac{1}{N} \sum_{i=1}^N v^2\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial \mathbf{w}} v^2\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \frac{\partial v^2}{\partial v} \frac{\partial v}{\partial \mathbf{w}} \\\\
 & = & \frac{1}{N} \sum_{i=1}^N 2v \frac{\partial v}{\partial \mathbf{w}} \\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
	2v\overrightarrow{\mathbf 0}^T = \overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	-2v\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
	\overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	-2(y_i-u)\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
	\overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	-2(y_i-max(0, \mathbf{w}\cdot\mathbf{x}_i+b))\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}\\
\phantom{\frac{\partial C(v)}{\partial \mathbf{w}}} & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
	\overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	-2(y_i-(\mathbf{w}\cdot\mathbf{x}_i+b))\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}\\\\
 & = & \begin{cases}
	\overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	\frac{-2}{N} \sum_{i=1}^N (y_i-(\mathbf{w}\cdot\mathbf{x}_i+b))\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}\\\\
 & = & \begin{cases}
	\overrightarrow{\mathbf 0}^T & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	\frac{2}{N} \sum_{i=1}^N (\mathbf{w}\cdot\mathbf{x}_i+b-y_i)\mathbf{x}_i^T & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
\end{cases}
\end{eqnarray*}

To interpret that equation, we can substitute an error term yielding:

From there, notice that this computation is a weighted average across all xi in X. The weights are the error terms, the difference between the target output and the actual neuron output for each xi input. The resulting gradient will, on average, point in the direction of higher cost or loss because large ei emphasize their associated xi. Imagine we only had one input vector, , then the gradient is just . If the error is 0, then the gradient is zero and we have arrived at the minimum loss. If is some small positive difference, the gradient is a small step in the direction of . If is large, the gradient is a large step in that direction. If is negative, the gradient is reversed, meaning the highest cost is in the negative direction.

Of course, we want to reduce, not increase, the loss, which is why the gradient descent recurrence relation takes the negative of the gradient to update the current position (for scalar learning rate ):

Because the gradient indicates the direction of higher cost, we want to update w in the opposite direction.
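Here is a sketch of that weight gradient in code, checked against finite differences of the mean-squared-error loss on a tiny made-up dataset (X, y, w, b and the learning rate are arbitrary illustrative values), followed by one gradient-descent step:

import numpy as np

def loss(w, b, X, y):
    acts = np.maximum(0.0, X @ w + b)             # activation for each input row x_i
    return np.mean((y - acts)**2)

def grad_w(w, b, X, y):
    z = X @ w + b
    active = (z > 0).astype(float)                # derivative of max(0, z) is 0 or 1
    return (2.0 / len(y)) * ((z - y) * active) @ X

X = np.array([[ 1.0,  2.0, -1.0],
              [ 0.5, -0.3,  0.8],
              [-1.2,  0.4,  0.1]])
y = np.array([1.0, -0.5, 0.2])
w = np.array([0.3, -0.2, 0.5]); b = 0.1

h = 1e-6
numeric = np.array([(loss(w + h * np.eye(3)[j], b, X, y) -
                     loss(w - h * np.eye(3)[j], b, X, y)) / (2 * h) for j in range(3)])
print(grad_w(w, b, X, y))
print(numeric)                                     # should agree closely

eta = 0.05
w_next = w - eta * grad_w(w, b, X, y)              # one gradient-descent update of the weights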

The derivative with respect to the bias

To optimize the bias, b, we also need the partial with respect to b. Here are the intermediate variables again:

We computed the partial with respect to the bias for equation previously:

For v, the partial is:

And for the partial of the cost function itself we get:


\begin{eqnarray*}
 \frac{\partial C(v)}{\partial b} & = & \frac{\partial }{\partial b}\frac{1}{N} \sum_{i=1}^N v^2\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial b} v^2\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \frac{\partial v^2}{\partial v} \frac{\partial v}{\partial b} \\\\
 & = & \frac{1}{N} \sum_{i=1}^N 2v \frac{\partial v}{\partial b} \\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
 	0 & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
 	-2v & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
 \end{cases}\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
 	0 & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
 	-2(y_i-max(0, \mathbf{w}\cdot\mathbf{x}_i+b)) & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
 \end{cases}\\\\
 & = & \frac{1}{N} \sum_{i=1}^N \begin{cases}
 	0 & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
 	2(\mathbf{w}\cdot\mathbf{x}_i+b-y_i) & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
 \end{cases}\\\\
& = & \frac{2}{N} \sum_{i=1}^N \begin{cases}
	0 & \mathbf{w} \cdot \mathbf{x}_i + b \leq 0\\
	\mathbf{w}\cdot\mathbf{x}_i+b-y_i & \mathbf{w} \cdot \mathbf{x}_i + b > 0\\
 \end{cases}
\end{eqnarray*}

As before, we can substitute an error term:

The partial derivative is then just the average of the error or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

In practice, it is convenient to combine w and b into a single vector parameter rather than having to deal with two different partials: ŵ = [w^T, b]^T. This requires a tweak to the input vector x as well but simplifies the activation function. By tacking a 1 onto the end of x, x̂ = [x^T, 1]^T, the affine function w · x + b becomes ŵ · x̂.
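A two-line illustration of that trick, with made-up numbers:

import numpy as np

w = np.array([0.3, -0.2, 0.5]); b = 0.1
x = np.array([1.0, 2.0, -1.0])

w_hat = np.append(w, b)                          # parameters w and b packed into one vector
x_hat = np.append(x, 1.0)                        # input with a trailing 1
print(np.dot(w, x) + b, np.dot(w_hat, x_hat))    # the two values are identical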

This finishes off the optimization of the neural network loss function because we have the two partials necessary to perform a gradient descent.

Summary

Hopefully you've made it all the way through to this point. You're well on your way to understanding matrix calculus! We've included a reference that summarizes all of the rules from this article in the next section. Also check out the annotated resource link below.

Your next step would be to learn about the partial derivatives of matrices not just vectors. For example, you can take a look at the matrix differentiation section of Matrix calculus.

Acknowledgements. We thank Yannet Interian (Faculty in MS data science program at University of San Francisco) and David Uminsky (Faculty/director of MS data science) for their help with the notation presented here.

Matrix Calculus Reference

Gradients and Jacobians

The gradient of a function of two variables is a horizontal 2-vector:

The Jacobian of a vector-valued function that is a function of a vector is an m × n (m = |f| and n = |x|) matrix containing all possible scalar partial derivatives:

The Jacobian of the identity function f(x) = x is I.

Element-wise operations on vectors

Define generic element-wise operations on vectors w and x using operator such as :

The Jacobian with respect to w (similar for x) is:

Given the constraint (element-wise diagonal condition) that and access at most wi and xi, respectively, the Jacobian simplifies to a diagonal matrix:

Here are some sample element-wise operators:

Scalar expansion

Adding scalar z to vector x, , is really where and .

Scalar multiplication yields:

Vector reductions

The partial derivative of a vector sum with respect to one of the vectors is:

For :

For and , we get:

Vector dot product . Substituting and using the vector chain rule, we get:

Similarly, .

Chain rules

The vector chain rule is the general form as it degenerates to the others. When f is a function of a single variable x and all intermediate variables u are functions of a single variable, the single-variable chain rule applies. When some or all of the intermediate variables are functions of multiple variables, the single-variable total-derivative chain rule applies. In all other cases, the vector chain rule applies.

Single-variable rule:                   dy/dx = (dy/du)(du/dx)
Single-variable total-derivative rule:  ∂f(x, u_1, ..., u_n)/∂x = ∂f/∂x + Σ_i (∂f/∂u_i)(∂u_i/∂x)
Vector rule:                            ∂/∂x f(g(x)) = (∂f/∂g)(∂g/∂x)

Notation

Lowercase letters in bold font such as x are vectors and those in italics font like x are scalars. xi is the ith element of vector x and is in italics because a single vector element is a scalar. |x| means “length of vector x.”

The T exponent of x^T represents the transpose of the indicated vector.

The summation Σ_{i=a}^{b} x_i is just a for-loop that iterates i from a to b, summing all the xi.

Notation f(x) refers to a function called f with an argument of x.

I represents the square “identity matrix” of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.

diag(x) constructs a matrix whose diagonal elements are taken from vector x.

The dot product is the summation of the element-wise multiplication of the elements: . Or, you can look at it as .

Differentiation is an operator that maps a function of one parameter to another function. That means that maps to its derivative with respect to x, which is the same thing as . Also, if , then .

The partial derivative of the function with respect to x, , performs the usual scalar derivative holding all other variables constant.

The gradient of f with respect to vector x, , organizes all of the partial derivatives for a specific scalar function.

The Jacobian organizes the gradients of multiple functions into a matrix by stacking them:

The following notation means that y has the value a upon and value b upon .

Resources

Wolfram Alpha can do symbolic matrix algebra and there is also a cool dedicated matrix calculus differentiator.

When looking for resources on the web, search for “matrix calculus” not “vector calculus.” Here are some comments on the top links that come up from a Google search:

To learn more about neural networks and the mathematics behind optimization and back propagation, we highly recommend Michael Nielsen's book.

For those interested specifically in convolutional neural networks, check out A guide to convolution arithmetic for deep learning.

We reference the law of total derivatives, which is an important concept that just means derivatives with respect to x must take into consideration the derivative with respect to x of all variables that are a function of x.