PyTorch 一小时教程：从张量到在多个 GPU 上训练神经网络

This tutorial aims to introduce you to the most essential topics of the popular open-source deep learning library, PyTorch, in about one hour of reading time. My primary goal is to get you up to speed with the essentials so that you can get started with using and implementing deep neural networks, such as large language models (LLMs).
本教程旨在以大约一小时的阅读时间向您介绍流行的开源深度学习库 PyTorch 的最基本主题。我的主要目标是让您快速掌握基本知识，以便您可以开始使用和实现深度神经网络，例如大型语言模型（LLM）。

This tutorial covers the following topics:
本教程涵盖以下主题：

An overview of the PyTorch deep learning library
PyTorch 深度学习库概述
Setting up an environment and workspace for deep learning
为深度学习设置环境和工作区
Tensors as a fundamental data structure for deep learning
张量作为深度学习的基本数据结构
The mechanics of training deep neural networks
训练深度神经网络的机制
Training models on GPUs
在 GPU 上训练模型

You’ll learn about the essential concept of tensors and their usage in PyTorch. We will also go over PyTorch’s automatic differentiation engine, a feature that enables us to conveniently and efficiently use backpropagation, which is a crucial aspect of neural network training.
您将了解张量的基本概念及其在 PyTorch 中的用法。我们还将介绍 PyTorch 的自动微分引擎，该功能使我们能够方便高效地使用反向传播，这是神经网络训练的一个重要方面。

Note that this tutorial is meant as a primer for those who are new to deep learning in PyTorch. While this chapter explains PyTorch from the ground up, it’s not meant to be an exhaustive coverage of the PyTorch library. Instead, this chapter focuses on the PyTorch fundamentals that are useful to, for example, implement LLMs.
请注意，本教程是为 PyTorch 中深度学习的新手提供的入门知识。虽然本章从头开始介绍了 PyTorch，但它并不意味着要详尽地介绍 PyTorch 库。相反，本章重点介绍 PyTorch 基础知识，这些基础知识对于实现 LLM 等非常有用。

I’ve spent nearly a decade using, building with, and teaching PyTorch. And in this tutorial, I try to distill what I believe are the most essential concepts. Everything you need to know to get started, and but nothing more, since your time is valuable, and you want to get to building things!
我花了将近十年的时间使用、构建和教授 PyTorch。在本教程中，我尝试提炼我认为最基本的概念。开始之前你需要知道的一切，但仅此而已，因为你的时间很宝贵，而且你想开始构建东西！

1. What is PyTorch
1. 什么是 PyTorch

PyTorch (https://pytorch.org/) is an open-source Python-based deep learning library. According to Papers With Code (https://paperswithcode.com/trends), a platform that tracks and analyzes research papers, PyTorch has been the most widely used deep learning library for research since 2019 by a wide margin. And according to the Kaggle Data Science and Machine Learning Survey 2022 (https://www.kaggle.com/c/kaggle-survey-2022), the number of respondents using PyTorch is approximately 40% and constantly grows every year.
PyTorch （https://pytorch.org/）是一个基于 Python 的开源深度学习库。根据跟踪和分析研究论文的平台 Papers With Code （https://paperswithcode.com/trends）的数据，自 2019 年以来，PyTorch 一直是研究中使用最广泛的深度学习库。根据 2022 年 Kaggle 数据科学和机器学习调查 （https://www.kaggle.com/c/kaggle-survey-2022），使用 PyTorch 的受访者数量约为 40%，并且每年都在不断增长。

One of the reasons why PyTorch is so popular is its user-friendly interface and efficiency. However, despite its accessibility, it doesn’t compromise on flexibility, providing advanced users the ability to tweak lower-level aspects of their models for customization and optimization. In short, for many practitioners and researchers, PyTorch offers just the right balance between usability and features.
PyTorch 如此受欢迎的原因之一是其用户友好的界面和效率。然而，尽管它具有可访问性，但它并没有在灵活性上妥协，为高级用户提供了调整其模型的较低级别方面以进行自定义和优化的能力。简而言之，对于许多从业者和研究人员来说，PyTorch 在可用性和功能之间提供了恰到好处的平衡。

In the following subsections, we will define the main features PyTorch has to offer.
在以下小节中，我们将定义 PyTorch 必须提供的主要功能。

1.1 The three core components of PyTorch
1.1 PyTorch 的三个核心组件

PyTorch is a relatively comprehensive library, and one way to approach it is to focus on its three broad components, which are summarized in Figure 1.
PyTorch 是一个相对全面的库，实现它的一种方法是关注它的三个主要组件，如图 1 所示。

Figure 1. PyTorch's three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.
图 1.PyTorch 的三个主要组件包括作为计算基本构建块的张量库、用于模型优化的自动微分和深度学习实用函数，从而更轻松地实现和训练深度神经网络模型。

Firstly, PyTorch is a tensor library that extends the concept of array-oriented programming library NumPy with the additional feature of accelerated computation on GPUs, thus providing a seamless switch between CPUs and GPUs.
首先，PyTorch 是一个 张量库 ，它扩展了面向数组的编程库 NumPy 的概念，增加了在 GPU 上加速计算的功能，从而在 CPU 和 GPU 之间提供无缝切换。

Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.
其次，PyTorch 是一个 自动微分引擎 ，也称为 autograd，它能够自动计算张量运算的梯度，从而简化反向传播和模型优化。

Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.
最后，PyTorch 是一个 深度学习库 ，这意味着它提供模块化、灵活且高效的构建块（包括预训练模型、损失函数和优化器），用于设计和训练各种深度学习模型，同时满足研究人员和开发人员的需求。

After defining the term deep learning and installing PyTorch in the two following subsections, the remainder of this tutorial will go over these three core components of PyTorch in more detail, along with hands-on code examples.
在以下两个小节中定义术语 Deep learning 并安装 PyTorch 之后，本教程的其余部分将更详细地介绍 PyTorch 的这三个核心组件，并提供动手代码示例。

1.2 Defining deep learning
1.2 定义深度学习

LLMs are often referred to as AI models in the news. However, LLMs are also a type of deep neural network, and PyTorch is a deep learning library. Sounds confusing? Let’s take a brief moment and summarize the relationship between these terms before we proceed.
LLM 在新闻中通常被称为 AI 模型。但是，LLM 也是一种深度神经网络，而 PyTorch 是一个深度学习库。听起来很困惑？在继续之前，让我们花点时间总结一下这些术语之间的关系。

AI is fundamentally about creating computer systems capable of performing tasks that usually require human intelligence. These tasks include understanding natural language, recognizing patterns, and making decisions. (Despite significant progress, AI is still far from achieving this level of general intelligence.)
AI 从根本上讲是关于创建能够执行通常需要人类智能的任务的计算机系统。这些任务包括理解自然语言、识别模式和做出决策。（尽管取得了重大进展，但 AI 仍远未达到这种通用智能水平。

Machine learning represents a subfield of AI (as illustrated in Figure 2) that focuses on developing and improving learning algorithms. The key idea behind machine learning is to enable computers to learn from data and make predictions or decisions without being explicitly programmed to perform the task. This involves developing algorithms that can identify patterns and learn from historical data and improve their performance over time with more data and feedback.
机器学习 g 代表 AI 的一个子领域（如图 2 所示），专注于开发和改进学习算法。机器学习背后的关键思想是使计算机能够从数据中学习并做出预测或决策，而无需被明确编程来执行任务。这涉及开发算法，这些算法可以识别模式并从历史数据中学习，并随着时间的推移通过更多的数据和反馈来提高其性能。

Figure 2. Deep learning is a subcategory of machine learning that is focused on the implementation of deep neural networks. In turn, machine learning is a subcategory of AI that is concerned with algorithms that learn from data. AI is the broader concept of machines being able to perform tasks that typically require human intelligence.
图 2.深度学习是机器学习的一个子类别，专注于深度神经网络的实现。反过来，机器学习是 AI 的一个子类别，它与从数据中学习的算法有关。AI 是机器能够执行通常需要人类智能的任务的更广泛概念。

Machine learning has been integral in the evolution of AI, powering many of the advancements we see today, including LLMs. Machine learning is also behind technologies like recommendation systems used by online retailers and streaming services, email spam filtering, voice recognition in virtual assistants, and even self-driving cars. The introduction and advancement of machine learning have significantly enhanced AI’s capabilities, enabling it to move beyond strict rule-based systems and adapt to new inputs or changing environments.
机器学习一直是 AI 发展中不可或缺的一部分，为我们今天看到的许多进步提供动力，包括 LLM。机器学习也是在线零售商和流媒体服务使用的推荐系统、电子邮件垃圾邮件过滤、虚拟助手中的语音识别甚至自动驾驶汽车等技术的幕后推手。机器学习的引入和进步显著增强了 AI 的能力，使其能够超越严格的基于规则的系统，适应新的输入或不断变化的环境。

Deep learning is a subcategory of machine learning that focuses on the training and application of deep neural networks. These deep neural networks were originally inspired by how the human brain works, particularly the interconnection between many neurons. The “deep” in deep learning refers to the multiple hidden layers of artificial neurons or nodes that allow them to model complex, nonlinear relationships in the data.
深度学习 是机器学习的一个子类别，专注于深度神经网络的训练和应用。这些深度神经网络最初的灵感来自人脑的工作方式，尤其是许多神经元之间的互连。深度学习中的“深度”是指人工神经元或节点的多个隐藏层，这些节点允许它们对数据中复杂的非线性关系进行建模。

Unlike traditional machine learning techniques that excel at simple pattern recognition, deep learning is particularly good at handling unstructured data like images, audio, or text, so deep learning is particularly well suited for LLMs.
与擅长简单模式识别的传统机器学习技术不同，深度学习特别擅长处理图像、音频或文本等非结构化数据，因此深度学习特别适合 LLM。

The typical predictive modeling workflow (also referred to as supervised learning) in machine learning and deep learning is summarized in Figure 3.
图 3 总结了机器学习和深度学习中的典型预测建模工作流程（也称为 监督学习 ）。

Figure 3. The supervised learning workflow for predictive modeling consists of a training stage where a model is trained on labeled examples in a training dataset. The trained model can then be used to predict the labels of new observations.
图 3.预测建模的监督式学习工作流程由一个训练阶段组成，在该阶段中，模型在训练数据集中的标记样本上进行训练。然后，可以使用经过训练的模型来预测新观测值的标签。

Using a learning algorithm, a model is trained on a training dataset consisting of examples and corresponding labels. In the case of an email spam classifier, for example, the training dataset consists of emails and their spam and not-spam labels that a human identified. Then, the trained model can be used on new observations (new emails) to predict their unknown label (spam or not spam).
使用学习算法，在由示例和相应标签组成的训练数据集上训练模型。例如，对于电子邮件垃圾邮件分类器，训练数据集由电子邮件及其人类识别的 垃圾邮件 和非 垃圾邮件 标签组成。然后，可以将经过训练的模型用于新观察（新电子邮件）来预测其未知标签（ 垃圾邮件 或 非垃圾邮件 ）。

Of course, we also want to add a model evaluation between the training and inference stages to ensure that the model satisfies our performance criteria before using it in a real-world application.
当然，我们还希望在训练和推理阶段之间添加模型评估，以确保模型在实际应用程序中使用之前满足我们的性能标准。

Note that the workflow for training and using LLMs, for example, is similar to the workflow depicted in Figure 3 if we train them to classify texts. And if we are interested in training LLMs for generating texts, for example as covered in my Build A Large Language Model (From Scratch) book, Figure 3 still applies. In this case, the labels during pretraining can be derived from the text itself. And the LLM will generate entirely new text (instead of predicting labels) given an input prompt during inference.
请注意，例如，训练和使用 LLM 的工作流程类似于图 3 中描述的工作流程，如果我们训练它们对文本进行分类。如果我们对训练 LLM 以生成文本感兴趣，例如我的 Build A Large Language Model （From Scratch）一书中介绍的，图 3 仍然适用。在这种情况下，预训练期间的标签可以从文本本身派生。在推理过程中，LLM 将在给定输入提示的情况下生成全新的文本（而不是预测标签）。

1.3 Installing PyTorch 1.3 安装 PyTorch

PyTorch can be installed just like any other Python library or package. However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible codes, the installation may require additional explanation.
PyTorch 可以像任何其他 Python 库或软件包一样安装。但是，由于 PyTorch 是一个具有 CPU 和 GPU 兼容代码的综合库，因此安装可能需要额外的说明。

Python version. Many scientific computing libraries do not immediately support the newest version of Python. Therefore, when installing PyTorch, it’s advisable to use a version of Python that is one or two releases older. For instance, if the latest version of Python is 3.13, using Python 3.11 or 3.12 is recommended.
Python 版本。 许多科学计算库不会立即支持最新版本的 Python。因此，在安装 PyTorch 时，建议使用比 Python 版本低一两个版本的版本。例如，如果 Python 的最新版本是 3.13，建议使用 Python 3.11 或 3.12。

For instance, there are two versions of PyTorch: a leaner version that only supports CPU computing and a version that supports both CPU and GPU computing. If your machine has a CUDA-compatible GPU that can be used for deep learning (ideally an NVIDIA T4, RTX 2080 Ti, or newer), I recommend installing the GPU version. Regardless, the default command for installing PyTorch is as follows in a code terminal:
例如，PyTorch 有两个版本：一个仅支持 CPU 计算的精简版本和一个同时支持 CPU 和 GPU 计算的版本。如果您的机器具有可用于深度学习的兼容 CUDA 的 GPU（最好是 NVIDIA T4、RTX 2080 Ti 或更高版本），我建议安装 GPU 版本。无论如何，安装 PyTorch 的默认命令在代码终端中如下所示：

pip install torch

Suppose your computer supports a CUDA-compatible GPU. In that case, this will automatically install the PyTorch version that supports GPU acceleration via CUDA, given that the Python environment you’re working on has the necessary dependencies (like pip) installed.
假设您的计算机支持兼容 CUDA 的 GPU。在这种情况下，这将自动安装支持通过 CUDA 进行 GPU 加速的 PyTorch 版本，前提是您正在使用的 Python 环境已安装必要的依赖项（如 pip）。

AMD GPUs for deep learning. As of this writing, PyTorch has also added experimental support for AMD GPUs via ROCm. Please see https://pytorch.org for additional instructions.
用于深度学习的 AMD GPU。 在撰写本文时，PyTorch 还通过 ROCm 添加了对 AMD GPU 的实验性支持。有关其他说明，请参阅 https://pytorch.org 。

However, to explicitly install the CUDA-compatible version of PyTorch, it’s often better to specify the CUDA you want PyTorch to be compatible with. PyTorch’s official website (https://pytorch.org) provides commands to install PyTorch with CUDA support for different operating systems as shown in Figure 4.
但是，要显式安装 PyTorch 的 CUDA 兼容版本，通常最好指定您希望 PyTorch 兼容的 CUDA。PyTorch 的官方网站（https://pytorch.org）提供了为不同作系统安装支持 CUDA 的 PyTorch 的命令，如图 4 所示。

Figure 4. Access the PyTorch installation recommendation on https://pytorch.org to customize and select the installation command for your system.
图 4.访问 https://pytorch.org 上的 PyTorch 安装建议以自定义并选择适用于您的系统的安装命令。

As of this writing, this tutorial is based on PyTorch 2.4.1, so it’s recommended to use the following installation command to install the exact version to guarantee compatibility with this tutorial:
在撰写本文时，本教程基于 PyTorch 2.4.1，因此建议使用以下安装命令来安装确切的版本，以保证与本教程的兼容性：

pip install torch==2.4.1

However, as mentioned earlier, given your operating system, the installation command might slightly differ from the one shown above. Thus, I recommend visiting the https://pytorch.org website and using the installation menu (see Figure 4) to select the installation command for your operating system and replace torch with torch==2.4.1 in this command.
但是，如前所述，根据您的作系统，安装命令可能与上面显示的命令略有不同。因此，我建议访问 https://pytorch.org 网站并使用安装菜单（参见图 4）为您的作系统选择安装命令，并将此命令中的 torch 替换为 torch==2.4.1 。

To check the version of PyTorch, you can execute the following code in PyTorch:
要查看 PyTorch 的版本，您可以在 PyTorch 中执行以下代码：

import torch
torch.__version__

This prints: 这将打印出：

'2.4.1'

PyTorch and Torch. Note that the Python library is named “torch” primarily because it’s a continuation of the Torch library but adapted for Python (hence, “PyTorch”). The name “torch” acknowledges the library’s roots in Torch, a scientific computing framework with wide support for machine learning algorithms, which was initially created using the Lua programming language.
PyTorch 和 Torch 的 Torch 中。 请注意，Python 库被命名为“torch”，主要是因为它是 Torch 库的延续，但适用于 Python（因此称为“PyTorch”）。“torch”这个名字承认了该库起源于 Torch，Torch 是一个科学计算框架，广泛支持机器学习算法，最初是使用 Lua 编程语言创建的。

After installing PyTorch, you can check whether your installation recognizes your built-in NVIDIA GPU by running the following code in Python:
安装 PyTorch 后，您可以通过在 Python 中运行以下代码来检查您的安装是否识别您的内置 NVIDIA GPU：

import torch
torch.cuda.is_available()
This returns: 这将返回：

True

If the command returns True, you are all set. If the command returns False, your computer may not have a compatible GPU, or PyTorch does not recognize it. While GPUs are not required for training neural network models in PyTorch, they can significantly speed up deep learning-related computations and train these models magnitudes faster.
如果命令返回 True，则表示您已完成所有设置。如果命令返回 False，则您的计算机可能没有兼容的 GPU，或者 PyTorch 无法识别它。虽然在 PyTorch 中训练神经网络模型不需要 GPU，但它们可以显著加快与深度学习相关的计算速度，并更快地训练这些模型。

If you don’t have access to a GPU, there are several cloud computing providers where users can run GPU computations against an hourly cost. A popular Jupyter-notebook-like environment is Google Colab (https://colab.research.google.com), which provides time-limited access to GPUs as of this writing. Using the “Runtime” menu, it is possible to select a GPU, as shown in the screenshot in Figure 5.
如果您无法访问 GPU，则有几家云计算提供商，用户可以按小时付费运行 GPU 计算。一个流行的类似 Jupyter 笔记本的环境是 Google Colab （https://colab.research.google.com），在撰写本文时，它提供对 GPU 的限时访问。使用 “Runtime” 菜单，可以选择一个 GPU，如图 5 中的屏幕截图所示。

Figure 5. Select a GPU device for Google Colab under the *Runtime/Change runtime type* menu.
图 5.在 *Runtime/Change runtime type* 菜单下为 Google Colab 选择一个 GPU 设备。

PyTorch on Apple Silicon. If you have an Apple Mac with an Apple Silicon chip (like the M1, M2, M3, M4 or newer models), you have the option to leverage its capabilities to accelerate PyTorch code execution. To use your Apple Silicon chip for PyTorch, you first need to install PyTorch as you normally would. Then, to check if your Mac supports PyTorch acceleration with its Apple Silicon chip, you can run a simple code snippet in Python: print(torch.backends.mps.is_available()). If it returns True, it means that your Mac has an Apple Silicon chip that can be used to accelerate PyTorch code.
Apple Silicon 上的 PyTorch。 如果您的 Apple Mac 配备 Apple Silicon 芯片（如 M1、M2、M3、M4 或更新型号），则可以选择利用其功能来加速 PyTorch 代码执行。要将 Apple Silicon 芯片用于 PyTorch，您首先需要像往常一样安装 PyTorch。然后，要检查您的 Mac 是否支持使用 Apple Silicon 芯片进行 PyTorch 加速，您可以在 Python 中运行一个简单的代码片段： print(torch.backends.mps.is_available()) 。如果返回 True，则表示您的 Mac 具有可用于加速 PyTorch 代码的 Apple Silicon 芯片。

2 Understanding tensors 2 了解张量

Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. In other words, tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions. For example, a scalar (just a number) is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank 2, as illustrated in Figure 6.
张量表示一个数学概念，它将向量和矩阵推广到可能更高的维度。换句话说，张量是数学对象，可以通过它们的顺序（或秩）来表征，这提供了维度数。例如，标量（只是一个数字）是 0 阶的张量，向量是 1 阶的张量，矩阵是 2 阶的张量，如图 6 所示。

Figure 6. An illustration of tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. Note that a 3D vector, which consists of 3 elements, is still a rank 1 tensor.
图 6.具有不同秩的张量的图示。这里 0D 对应于等级 0,1D 对应于等级 1,2D 对应于等级 2。请注意，由 3 个元素组成的 3D 向量仍然是 1 阶张量。

From a computational perspective, tensors serve as data containers. For instance, they hold multi-dimensional data, where each dimension represents a different feature. Tensor libraries, such as PyTorch, can create, manipulate, and compute with these multi-dimensional arrays efficiently. In this context, a tensor library functions as an array library.
从计算角度来看，张量充当数据容器。例如，它们保存多维数据，其中每个维度代表不同的特征。PyTorch 等张量库可以有效地使用这些多维数组创建、作和计算。在这种情况下，张量库充当数组库。

PyTorch tensors are similar to NumPy arrays but have several additional features important for deep learning. For example, PyTorch adds an automatic differentiation engine, simplifying computing gradients, as discussed later in section 2.4. PyTorch tensors also support GPU computations to speed up deep neural network training, which we will discuss later in section 2.8.
PyTorch 张量类似于 NumPy 数组，但具有一些对深度学习很重要的附加功能。例如，PyTorch 添加了一个自动微分引擎，简化了 计算梯度 ，如后面的 2.4 节所述。PyTorch 张量还支持 GPU 计算以加速深度神经网络训练，我们将在后面的 2.8 节中讨论。

PyTorch has a NumPy-like API. As you will see in the upcoming sections, PyTorch adopts most of the NumPy array API and syntax for its tensor operations. If you are new to NumPy, you can get a brief overview of the most relevant concepts via my article Scientific Computing in Python: Introduction to NumPy and Matplotlib at https://sebastianraschka.com/blog/2020/numpy-intro.html.
PyTorch 有一个类似 NumPy 的 API。 正如您将在接下来的部分中看到的那样，PyTorch 在其张量运算中采用了大部分 NumPy 数组 API 和语法。如果你是 NumPy 的新手，你可以通过我的文章 Python 中的科学计算：https://sebastianraschka.com/blog/2020/numpy-intro.html 的 NumPy 和 Matplotlib 简介，简要了解最相关的概念。

The following subsections will look at the basic operations of the PyTorch tensor library, showing how to create simple tensors and going over some of the essential operations.
以下小节将介绍 PyTorch 张量库的基本作，展示如何创建简单的张量并介绍一些基本作。

2.1 Scalars, vectors, matrices, and tensors
2.1 标量、向量、矩阵和张量

As mentioned earlier, PyTorch tensors are data containers for array-like structures. A scalar is a 0-dimensional tensor (for instance, just a number), a vector is a 1-dimensional tensor, and a matrix is a 2-dimensional tensor. There is no specific term for higher-dimensional tensors, so we typically refer to a 3-dimensional tensor as just a 3D tensor, and so forth.
如前所述，PyTorch 张量是数组类结构的数据容器。标量是 0 维张量（例如，只是一个数字），向量是 1 维张量，矩阵是 2 维张量。高维张量没有特定的术语，因此我们通常将 3 维张量称为 3D 张量，依此类推。

We can create objects of PyTorch’s Tensor class using the torch.tensor function as follows:
我们可以使用 torch.tensor 函数创建 PyTorch 的 Tensor 类的对象，如下所示：

import torch

# create a 0D tensor (scalar) from a Python integer  
tensor0d = torch.tensor(1)

# create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])

# create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2], [3, 4]])

# create a 3D tensor from a nested Python list
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

2.2 Tensor data types
2.2 Tensor 数据类型

In the previous section, we created tensors from Python integers. In this case, PyTorch adopts the default 64-bit integer data type from Python. We can access the data type of a tensor via the .dtype attribute of a tensor:
在上一节中，我们从 Python 整数创建了张量。在这种情况下，PyTorch 采用 Python 中默认的 64 位整数数据类型。我们可以通过张量的 .dtype 属性访问张量的数据类型：

tensor1d = torch.tensor([1, 2, 3])  
print(tensor1d.dtype)

This prints: 这将打印出：

torch.int64

If we create tensors from Python floats, PyTorch creates tensors with a 32-bit precision by default, as we can see below:
如果我们从 Python 浮点数创建张量，PyTorch 默认创建精度为 32 位的张量，如下所示：

floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

The output is: 输出为：

torch.float32

This choice is primarily due to the balance between precision and computational efficiency. A 32-bit floating point number offers sufficient precision for most deep learning tasks, while consuming less memory and computational resources than a 64-bit floating point number. Moreover, GPU architectures are optimized for 32-bit computations, and using this data type can significantly speed up model training and inference.
这种选择主要是由于精度和计算效率之间的平衡。32 位浮点数为大多数深度学习任务提供了足够的精度，同时比 64 位浮点数消耗更少的内存和计算资源。此外，GPU 架构针对 32 位计算进行了优化，使用这种数据类型可以显著加快模型训练和推理速度。

Moreover, it is possible to readily change the precision using a tensor’s .to method. The following code demonstrates this by changing a 64-bit integer tensor into a 32-bit float tensor:
此外，可以使用 tensor 的 .to 方法轻松更改精度。以下代码通过将 64 位整数张量更改为 32 位浮点张量来演示这一点：

floatvec = tensor1d.to(torch.float32)  
print(floatvec.dtype)

This returns: 这将返回：

torch.float32

For more information about different tensor data types available in PyTorch, I recommend checking the official documentation at https://pytorch.org/docs/stable/tensors.html.
有关 PyTorch 中可用的不同张量数据类型的更多信息，我建议您查看 https://pytorch.org/docs/stable/tensors.html 的官方文档。

2.3 Common PyTorch tensor operations
2.3 常见的 PyTorch 张量作

Comprehensive coverage of all the different PyTorch tensor operations and commands is outside the scope of this tutorial. However, we will briefly describe the essentials that you may require or stumble upon in almost any project.
全面涵盖所有不同的 PyTorch 张量作和命令不在本教程的讨论范围之内。但是，我们将简要描述您在几乎任何项目中可能需要或偶然发现的基本要素。

Before we move on to the next section covering the concept behind computation graphs, below is a list of the most essential PyTorch tensor operations.
在我们继续介绍计算图背后的概念的下一节之前，以下是最基本的 PyTorch 张量作列表。

We already introduced the torch.tensor() function to create new tensors.
我们已经引入了 torch.tensor（） 函数来创建新的张量。

tensor2d = torch.tensor([[1, 2, 3], 
                         [4, 5, 6]])
tensor2d

This prints: 这将打印出：

tensor([[1, 2, 3],
        [4, 5, 6]])

In addition, the .shape attribute allows us to access the shape of a tensor:
此外， .shape 属性允许我们访问张量的形状：

print(tensor2d.shape)

The output is: 输出为：

torch.Size([2, 3])

As you can see above, .shape returns [2, 3], which means that the tensor has 2 rows and 3 columns. To reshape the tensor into a 3 by 2 tensor, we can use the .reshape method:
如上所示， .shape 返回 [2， 3]，这意味着张量有 2 行和 3 列。要将张量重塑为 3 x 2 张量，我们可以使用 .reshape 方法：

tensor2d.reshape(3, 2)

This prints: 这将打印出：

tensor([[1, 2],
        [3, 4],
        [5, 6]])

However, note that the more common command for reshaping tensors in PyTorch is .view():
但是，请注意，在 PyTorch 中用于重塑张量的更常用命令是 .view（）：

tensor2d.view(3, 2)

The output is: 输出为：

tensor([[1, 2],
        [3, 4],
        [5, 6]])

Similar to .reshape and .view, there are several cases where PyTorch offers multiple syntax options for executing the same computation. This is because PyTorch initially followed the original Lua Torch syntax convention but then also added syntax to make it more similar to NumPy upon popular request.
与 .reshape 和 .view 类似，PyTorch 在几种情况下提供了多个语法选项来执行相同的计算。这是因为 PyTorch 最初遵循了最初的 Lua Torch 语法约定，但后来也根据普遍要求添加了语法，使其更类似于 NumPy。

Next, we can use .T to transpose a tensor, which means flipping it across its diagonal. Note that this is similar from reshaping a tensor as you can see based on the result below:
接下来，我们可以使用 .T 转置张量，这意味着将其翻转过其对角线。请注意，这与重塑张量类似，根据下面的结果可以看到：

tensor2d.T

The output is: 输出为：

tensor([[1, 4],
        [2, 5],
        [3, 6]])

Lastly, the common way to multiply two matrices in PyTorch is the .matmul method:
最后，在 PyTorch 中将两个矩阵相乘的常用方法是 .matmul 方法：

tensor2d.matmul(tensor2d.T)

The output is: 输出为：

tensor([[14, 32],
        [32, 77]])

However, we can also adopt the @ operator, which accomplishes the same thing more compactly:
但是，我们也可以采用 @ 运算符，它可以更紧凑地完成相同的作：

tensor2d @ tensor2d.T

This prints: 这将打印出：

tensor([[14, 32],
        [32, 77]])

For readers who’d like to browse through all the different tensor operations available in PyTorch (hint: we won’t need most of these), I recommend checking out the official documentation at https://pytorch.org/docs/stable/tensors.html.
对于想要浏览 PyTorch 中所有可用的不同张量运算的读者（提示：我们不需要其中的大部分），我建议您查看 https://pytorch.org/docs/stable/tensors.html 的官方文档。

3 Seeing models as computation graphs
3 将模型视为计算图

In the previous section, we covered one of the major three components of PyTorch, namely, its tensor library. Next in line is PyTorch’s automatic differentiation engine, also known as autograd. PyTorch’s autograd system provides functions to compute gradients in dynamic computational graphs automatically. But before we dive deeper into computing gradients in the next section, let’s define the concept of a computational graph.
在上一节中，我们介绍了 PyTorch 的三个主要组件之一，即其张量库。接下来是 PyTorch 的自动微分引擎，也称为 autograd。PyTorch 的 autograd 系统提供了自动计算动态计算图中梯度的函数。但在下一节深入探讨计算梯度之前，让我们定义计算图的概念。

A computational graph (or computation graph in short) is a directed graph that allows us to express and visualize mathematical expressions. In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network – we will need this later to compute the required gradients for backpropagation, which is the main training algorithm for neural networks.
计算图（或简称计算图）是一种有向图，它允许我们表达和可视化数学表达式。在深度学习的背景下，计算图列出了计算神经网络输出所需的计算顺序——我们稍后将需要它来计算反向传播所需的梯度，这是神经网络的主要训练算法。

Let’s look at a concrete example to illustrate the concept of a computation graph. The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network, returning a score between 0 and 1 that is compared to the true class label (0 or 1) when computing the loss:
让我们看一个具体的例子来说明计算图的概念。以下代码实现了简单 Logistic 回归分类器的前向传递（预测步骤），该分类器可以看作是一个单层神经网络，在计算损失时返回一个介于 0 和 1 之间的分数，该分数与真实的类标签（0 或 1）进行比较：

import torch.nn.functional as F

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

The result is: 结果是：

tensor(0.0852)

If not all components in the code above make sense to you, don’t worry. The point of this example is not to implement a logistic regression classifier but rather to illustrate how we can think of a sequence of computations as a computation graph, as shown in Figure 7.
如果不是上面代码中的所有组件都对您有意义，请不要担心。这个例子的重点不是实现一个逻辑回归分类器，而是说明我们如何将计算序列视为计算图，如图 7 所示。

Figure 7. A logistic regression forward pass as a computation graph. The input feature `x1` is multiplied by a model weight `w1` and passed through an activation function *σ* after adding the bias. The loss is computed by comparing the model output `a` with a given label `y`.
图 7.作为计算图的 Logistic 回归前向传递。输入特征 'x1' 乘以模型权重 'w1'，并在添加偏差后通过激活函数 *σ* 传递。通过将模型输出 'a' 与给定的标签 'y' 进行比较来计算损失。

In fact, PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here w1 and b) to train the model, which is the topic of the upcoming sections.
事实上，PyTorch 在后台构建了这样的计算图，我们可以使用它来计算损失函数相对于模型参数（此处为 w1 和 b）的梯度来训练模型，这是下一节的主题。

4 Automatic differentiation made easy
4 轻松实现自动区分

In the previous section, we introduced the concept of computation graphs. If we carry out computations in PyTorch, it will build such a graph internally by default if one of its terminal nodes has the requires_grad attribute set to True. This is useful if we want to compute gradients. Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the chain rule from calculus for neural networks, which is illustrated in Figure 8.
在上一节中，我们介绍了计算图的概念。如果我们在 PyTorch 中执行计算，如果它的一个终端节点的 requires_grad 属性设置为 True，它将默认在内部构建这样的图。如果我们想计算梯度，这很有用。通过流行的反向传播算法训练神经网络时，需要梯度，该算法可以被视为神经网络微积分链 式规则 的实现，如图 8 所示。

Figure 8. The most common way of computing the loss gradients in a computation graph involves applying the chain rule from right to left, which is also called reverse-model automatic differentiation or backpropagation. It means we start from the output layer (or the loss itself) and work backward through the network to the input layer. This is done to compute the gradient of the loss with respect to each parameter (weights and biases) in the network, which informs how we update these parameters during training.
图 8.在计算图中计算损失梯度的最常见方法涉及从右到左应用链式法则，也称为逆向模型自动微分或反向传播。这意味着我们从输出层（或损失本身）开始，然后通过网络向后工作到输入层。这样做是为了计算相对于网络中每个参数（权重和偏差）的损失梯度，从而告知我们如何在训练期间更新这些参数。

Partial derivatives and gradients. Figure 8 shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables. A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input. If you are not familiar or don’t remember the partial derivatives, gradients, or the chain rule from calculus, don’t worry. On a high level, the chain rule is a way to compute gradients of a loss function with respect to the model’s parameters in a computation graph. This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model’s performance, using a method such as gradient descent. We will revisit the computational implementation of this training loop in PyTorch in section 2.7, A typical training loop.
偏导数和梯度。 图 8 显示了偏导数，它测量函数相对于其变量之一的变化速率。梯度是包含多元函数的所有偏导数的向量，该函数具有多个变量作为输入。如果您不熟悉或不记得微积分中的偏导数、梯度或链式规则，请不要担心。在高层次上，链式法则是一种计算损失函数相对于计算图中模型参数的梯度的方法。这提供了以最小化损失函数的方式更新每个参数所需的信息，损失函数用作使用梯度下降等方法测量模型性能的代理。我们将在第 2.7 节 典型训练循环 中重新审视 PyTorch 中此训练循环的计算实现。

Now, how is this all related to the second component of the PyTorch library we mentioned earlier, the automatic differentiation (autograd) engine? By tracking every operation performed on tensors, PyTorch’s autograd engine constructs a computational graph in the background. Then, calling the grad function, we can compute the gradient of the loss with respect to model parameter w1 as follows:
现在，这与我们之前提到的 PyTorch 库的第二个组件自动微分（autograd）引擎有什么关系呢？通过跟踪对张量执行的每项作，PyTorch 的 autograd 引擎在后台构建了一个计算图。然后，调用 grad 函数，我们可以计算损失相对于模型参数 w1 的梯度，如下所示：

import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b 
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

By default, PyTorch destroys the computation graph after calculating the gradients to free memory. However, since we are going to reuse this computation graph shortly, we set retain_graph=True so that it stays in memory.
默认情况下，PyTorch 在计算梯度以释放内存后销毁计算图。但是，由于我们很快要重用这个计算图，因此我们设置 retain_graph=True ，以便它保留在内存中。

Let’s show the resulting values of the loss with respect to the model’s parameters:
让我们显示相对于模型参数的损失结果值：

print(grad_L_w1)
print(grad_L_b)

The prints: 版画：

(tensor([-0.0898]),)
(tensor([-0.0817]),)

Above, we have been using the grad function “manually,” which can be useful for experimentation, debugging, and demonstrating concepts. But in practice, PyTorch provides even more high-level tools to automate this process. For instance, we can call .backward on the loss, and PyTorch will compute the gradients of all the leaf nodes in the graph, which will be stored via the tensors’ .grad attributes:
在上面，我们一直在 “手动” 使用 grad 函数，该函数可用于实验、调试和演示概念。但在实践中，PyTorch 提供了更多高级工具来自动化此过程。例如，我们可以对损失调用 .backward，PyTorch 将计算图中所有叶节点的梯度，这些梯度将通过张量的 .grad 属性进行存储：

loss.backward()

print(w1.grad)
print(b.grad)

The outputs are: 输出为：

tensor([-0.0898])
tensor([-0.0817])

If this section is packed with a lot of information and you may be overwhelmed by the calculus concepts, don’t worry. While this calculus jargon was a means to explain PyTorch’s autograd component, all you need to take away from this section is that PyTorch takes care of the calculus for us via the .backward method – we usually don’t need to compute any derivatives or gradients by hand when using PyTorch.
如果本节包含大量信息，并且您可能会被微积分概念所淹没，请不要担心。虽然这个演算术语是解释 PyTorch 的 autograd 组件的一种方式，但您需要从本节中学到的是，PyTorch 通过 .backward 方法为我们处理演算——在使用 PyTorch 时，我们通常不需要手动计算任何导数或梯度。

5 Implementing multilayer neural networks
5 实现多层神经网络

In the previous sections, we covered PyTorch’s tensor and autograd components. This section focuses on PyTorch as a library for implementing deep neural networks.
在前面的部分中，我们介绍了 PyTorch 的 tensor 和 autograd 组件。本节重点介绍 PyTorch 作为实现深度神经网络的库。

To provide a concrete example, we focus on a multilayer perceptron, which is a fully connected neural network, as illustrated in Figure 9.
为了提供一个具体的例子，我们重点介绍一个多层感知器，它是一个完全连接的神经网络，如图 9 所示。

Figure 9. An illustration of a multilayer perceptron with 2 hidden layers. Each node represents a unit in the respective layer. Each layer has only a very small number of nodes for illustration purposes.
图 9.具有 2 个隐藏层的多层感知器的图示。每个节点代表相应层中的一个单元。每个层只有非常少量的节点用于说明目的。

When implementing a neural network in PyTorch, we typically subclass the torch.nn.Module class to define our own custom network architecture. This Module base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model’s parameters.
在 PyTorch 中实现神经网络时，我们通常会将 torch.nn.Module 类子类化来定义我们自己的自定义网络架构。这个 Module 基类提供了许多功能，使构建和训练模型变得更加容易。例如，它允许我们封装层和作并跟踪模型的参数。

Within this subclass, we define the network layers in the __init__ constructor and specify how they interact in the forward method. The forward method describes how the input data passes through the network and comes together as a computation graph.
在这个子类中，我们在 __init__ 构造函数中定义网络层，并指定它们在 forward 方法中的交互方式。 forward 方法描述了 input 数据如何通过网络并汇集为计算图。

In contrast, the backward method, which we typically do not need to implement ourselves, is used during training to compute gradients of the loss function with respect to the model parameters, as we will see in Section 2.7, A typical training loop.
相比之下，我们通常不需要自己实现的逆向方法在训练期间用于计算损失函数相对于模型参数的梯度，正如我们将在第 2.7 节 典型的训练循环中看到的那样。

The following code implements a classic multilayer perceptron with two hidden layers to illustrate a typical usage of the Module class:
以下代码使用两个隐藏层实现一个经典的多层感知器，以说明 Module 类的典型用法：

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(
                
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

We can then instantiate a new neural network object as follows:
然后，我们可以实例化一个新的神经网络对象，如下所示：

model = NeuralNetwork(50, 3)

But before using this new model object, it is often useful to call print on the model to see a summary of its structure:
但是在使用这个新的模型对象之前，在模型上调用 print 来查看其结构的摘要通常很有用：

print(model)

This prints: 这将打印出：

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)

Note that we used the Sequential class when we implemented the NeuralNetwork class. Using Sequential is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here. This way, after instantiating self.layers = Sequential(...) in the __init__ constructor, we just have to call the self.layers instead of calling each layer individually in the NeuralNetwork’s forward method.
请注意，我们在实现 NeuralNetwork 类时使用了 Sequential 类。使用 Sequential 不是必需的，但如果我们想按特定顺序执行一系列层，它可以使我们的生活更轻松，就像这里的情况一样。这样，在 __init__ 构造函数中实例化 self.layers = Sequential（...） 后，我们只需要调用 self.layers，而不是在 NeuralNetwork 的 forward 方法中单独调用每一层。

Next, let’s check the total number of trainable parameters of this model:
接下来，我们来检查一下这个模型的可训练参数总数：

num_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print("Total number of trainable model parameters:", num_params)

This prints: 这将打印出：

Total number of trainable model parameters: 2213

Note that each parameter for which requires_grad=True counts as a trainable parameter and will be updated during training (more on that later in section 2.7, A typical training loop).
请注意， requires_grad=True 的每个参数都算作可训练参数，并将在训练期间更新（稍后将在第 2.7 节 典型的训练循环中详细介绍）。

In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the torch.nn.Linear layers. A linear layer multiplies the inputs with a weight matrix and adds a bias vector. This is sometimes also referred to as a feedforward or fully connected layer.
对于具有上述两个隐藏层的神经网络模型，这些可训练参数包含在 torch.nn.Linear 层中。线性层将输入与权重矩阵相乘，并添加偏置向量。这有时也称为前馈或 全连接 层。

Based on the print(model) call we executed above, we can see that the first Linear layer is at index position 0 in the layers attribute. We can access the corresponding weight parameter matrix as follows:
根据我们上面执行的 print（model） 调用，我们可以看到第一个 Linear 层位于 layers 属性中的索引位置 0。我们可以按如下方式访问相应的权重参数矩阵：

print(model.layers[0].weight)

This prints: 这将打印出：

Parameter containing:
tensor([[ 0.1182,  0.0606, -0.1292,  ..., -0.1126,  0.0735, -0.0597],
        [-0.0249,  0.0154, -0.0476,  ..., -0.1001, -0.1288,  0.1295],
        [ 0.0641,  0.0018, -0.0367,  ..., -0.0990, -0.0424, -0.0043],
        ...,
        [ 0.0618,  0.0867,  0.1361,  ..., -0.0254,  0.0399,  0.1006],
        [ 0.0842, -0.0512, -0.0960,  ..., -0.1091,  0.1242, -0.0428],
        [ 0.0518, -0.1390, -0.0923,  ..., -0.0954, -0.0668, -0.0037]],
       requires_grad=True)

Since this is a large matrix that is not shown in its entirety, let’s use the .shape attribute to show its dimensions:
由于这是一个没有完整显示的大型矩阵，因此让我们使用 .shape 属性来显示其维度：

print(model.layers[0].weight.shape)

The result is: 结果是：

torch.Size([30, 50])

(Similarly, you could access the bias vector via model.layers[0].bias.)
（同样，您可以通过 model.layers[0].bias 访问偏差向量。

The weight matrix above is a 30x50 matrix, and we can see that the requires_grad is set to True, which means its entries are trainable – this is the default setting for weights and biases in torch.nn.Linear.
上面的权重矩阵是一个 30x50 矩阵，我们可以看到 requires_grad 设置为 True，这意味着它的条目是可训练的——这是 torch.nn.Linear 中权重和偏差的默认设置。

Note that if you execute the code above on your computer, the numbers in the weight matrix will likely differ from those shown above. This is because the model weights are initialized with small random numbers, which are different each time we instantiate the network. In deep learning, initializing model weights with small random numbers is desired to break symmetry during training – otherwise, the nodes would be just performing the same operations and updates during backpropagation, which would not allow the network to learn complex mappings from inputs to outputs.
请注意，如果您在计算机上执行上述代码，则权重矩阵中的数字可能与上面显示的数字不同。这是因为模型权重是用小的随机数初始化的，每次我们实例化网络时，这些随机数都是不同的。在深度学习中，希望使用小随机数初始化模型权重以在训练期间打破对称性，否则，节点将在反向传播期间仅执行相同的作和更新，这将不允许网络学习从输入到输出的复杂映射。

However, while we want to keep using small random numbers as initial values for our layer weights, we can make the random number initialization reproducible by seeding PyTorch’s random number generator via manual_seed:
但是，虽然我们希望继续使用小随机数作为层权重的初始值，但我们可以通过 manual_seed 为 PyTorch 的随机数生成器设定种子，使随机数初始化可重现：

torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)

Now, after we spent some time inspecting the NeuralNetwork instance, let’s briefly see how it’s used via the forward pass:
现在，在我们花了一些时间检查 NeuralNetwork 实例之后，让我们简要地看看它是如何通过正向传递来使用的：

torch.manual_seed(123)

X = torch.rand((1, 50))
out = model(X)
print(out)

The result is: 结果是：

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)

In the code above, we generated a single random training example X as a toy input (note that our network expects 50-dimensional feature vectors) and fed it to the model, returning three scores. When we call model(x), it will automatically execute the forward pass of the model.
在上面的代码中，我们生成了一个随机训练示例 X 作为玩具输入（请注意，我们的网络需要 50 维特征向量）并将其馈送到模型，返回三个分数。当我们调用 model（x） 时，它会自动执行模型的 forward pass。

The forward pass refers to calculating output tensors from input tensors. This involves passing the input data through all the neural network layers, starting from the input layer, through hidden layers, and finally to the output layer.
前向传递是指从输入张量计算输出张量。这涉及通过所有神经网络层传递输入数据，从输入层开始，经过隐藏层，最后到输出层。

These three numbers returned above correspond to a score assigned to each of the three output nodes. Notice that the output tensor also includes a grad_fn value.
上面返回的这三个数字对应于分配给三个输出节点中每个节点的分数。请注意，输出张量还包括一个 grad_fn 值。

Here, grad_fn=<AddmmBackward0> represents the last-used function to compute a variable in the computational graph. In particular, grad_fn=<AddmmBackward0> means that the tensor we are inspecting was created via a matrix multiplication and addition operation. PyTorch will use this information when it computes gradients during backpropagation. The <AddmmBackward0> part of grad_fn=<AddmmBackward0> specifies the operation that was performed. In this case, it is an Addmm operation. Addmm stands for matrix multiplication (mm) followed by an addition (Add).
这里，grad_fn=<AddmmBackward0> 表示最后使用的函数来计算计算图中的变量。特别是，grad_fn=<AddmmBackward0> 表示我们正在检查的张量是通过矩阵乘法和加法运算创建的。PyTorch 在反向传播期间计算梯度时将使用此信息。grad_fn=<AddmmBackward0> 的 <AddmmBackward0> 部分指定执行的作。在本例中，它是一个 Addmm 作。Addmm 代表矩阵乘法（mm）后跟加法（Add）。

If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. So, when we use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the torch.no_grad() context manager, as shown below. This tells PyTorch that it doesn’t need to keep track of the gradients, which can result in significant savings in memory and computation.
例如，如果我们只想使用一个没有训练或反向传播的网络，如果我们在训练后将其用于预测，那么构建这个用于反向传播的计算图可能会很浪费，因为它会执行不必要的计算并消耗额外的内存。因此，当我们使用模型进行推理（例如，进行预测）而不是训练时，最佳实践是使用 torch.no_grad（） 上下文管理器，如下所示。这告诉 PyTorch 它不需要跟踪梯度，这可以显著节省内存和计算。

with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])

In PyTorch, it’s common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function. That’s because PyTorch’s commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class. The reason for this is numerical efficiency and stability. So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:
在 PyTorch 中，通常的做法是对模型进行编码，以便它们返回最后一层（logits）的输出，而不将其传递给非线性激活函数。这是因为 PyTorch 常用的损失函数将 softmax（或二进制分类的 sigmoid）运算与负对数似然损失组合在一个类中。其原因是数值效率和稳定性。因此，如果我们想计算预测的类隶属概率，我们必须显式调用 softmax 函数：

with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

This prints: 这将打印出：

tensor([[0.3113, 0.3934, 0.2952]])

The values can now be interpreted as class-membership probabilities that sum up to 1. The values are roughly equal for this random input, which is expected for a randomly initialized model without training.
这些值现在可以解释为总和为 1 的类成员资格概率。此随机输入的值大致相等，这对于没有训练的随机初始化模型来说是预期的。

In the following two sections, we will learn how to set up an efficient data loader and train the model.
在以下两节中，我们将学习如何设置高效的数据加载器并训练模型。

6 Setting up efficient data loaders
6 设置高效的数据加载器

In the previous section, we defined a custom neural network model. Before we can train this model, we have to briefly talk about creating efficient data loaders in PyTorch, which we will iterate over when training the model. The overall idea behind data loading in PyTorch is illustrated in Figure 10.
在上一节中，我们定义了一个自定义神经网络模型。在训练这个模型之前，我们必须简要地讨论一下在 PyTorch 中创建高效的数据加载器，我们将在训练模型时对其进行迭代。PyTorch 中数据加载的总体思路如图 10 所示。

Figure 10. PyTorch implements a `Dataset` and a `DataLoader` class. The `Dataset` class is used to instantiate objects that define how each data record is loaded. The `DataLoader` handles how the data is shuffled and assembled into batches.
图 10.PyTorch 实现了一个 'Dataset' 和一个 'DataLoader' 类。'Dataset' 类用于实例化定义每个数据记录如何加载的对象。'DataLoader' 处理数据如何被随机排列和组装成批次。

Following the illustration in Figure 10, in this section, we will implement a custom Dataset class that we will use to create a training and a test dataset that we’ll then use to create the data loaders.
按照图 10 中的插图，在本节中，我们将实现一个自定义 的 Dataset 类，我们将使用它来创建一个训练和测试数据集，然后我们将使用该数据集来创建数据加载器。

Let’s start by creating a simple toy dataset of five training examples with two features each. Accompanying the training examples, we also create a tensor containing the corresponding class labels: three examples belong to class 0, and two examples belong to class 1. In addition, we also make a test set consisting of two entries. The code to create this dataset is shown below.
我们首先创建一个简单的玩具数据集，其中包含 5 个训练示例，每个示例有两个特征。除了训练示例，我们还创建了一个包含相应类标签的张量：三个样本属于类 0，两个样本属于类 1。此外，我们还制作了一个由两个条目组成的测试集。用于创建此数据集的代码如下所示。

X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

Class label numbering PyTorch requires that class labels start with label 0, and the largest class label value should not exceed the number of output nodes minus 1 (since Python index counting starts at 0. So, if we have class labels 0, 1, 2, 3, and 4, the neural network output layer should consist of 5 nodes.
类标签编号 PyTorch 要求类标签以标签 0 开头，并且最大的类标签值不应超过输出节点数减 1（因为 Python 索引计数从 0 开始）。因此，如果我们有类标签 0、1、2、3 和 4，神经网络输出层应由 5 个节点组成。

Next, we create a custom dataset class, ToyDataset, by subclassing from PyTorch’s Dataset parent class, as shown below.
接下来，我们通过从 PyTorch 的 Dataset 父类进行子类化来创建自定义数据集类 ToyDataset，如下所示。

from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]        
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

This custom ToyDataset class’s purpose is to use it to instantiate a PyTorch DataLoader. But before we get to this step, let’s briefly go over the general structure of the ToyDataset code.
这个自定义 ToyDataset 类的目的是使用它来实例化 PyTorch DataLoader。但在开始此步骤之前，让我们简要地回顾一下 ToyDataset 代码的一般结构。

In PyTorch, the three main components of a custom Dataset class are the __init__ constructor, the __getitem__ method, and the __len__ method, as shown in code ToyDataset code above.
在 PyTorch 中，自定义 Dataset 类的三个主要组件是 __init__ 构造函数、 __getitem__ 方法和 __len__ 方法，如上面的代码 ToyDataset 代码所示。

In the __init__ method, we set up attributes that we can access later in the __getitem__ and __len__ methods. This could be file paths, file objects, database connectors, and so on. Since we created a tensor dataset that sits in memory, we are simply assigning X and y to these attributes, which are placeholders for our tensor objects.
在 __init__ 方法中，我们设置了稍后可以在 __getitem__ 和 __len__ 方法中访问的属性。这可以是文件路径、文件对象、数据库连接器等。由于我们创建了一个位于内存中的张量数据集，因此我们只需将 X 和 y 分配给这些属性，它们是张量对象的占位符。

In the __getitem__ method, we define instructions for returning exactly one item from the dataset via an index. This means the features and the class label corresponding to a single training example or test instance. (The data loader will provide this index, which we will cover shortly.)
在 __getitem__ 方法中，我们定义了通过索引从数据集中返回一个项目的说明。这意味着对应于单个训练示例或测试实例的特征和类标签。（数据加载器将提供此索引，我们稍后将介绍。

Finally, the __len__ method contains instructions for retrieving the length of the dataset. Here, we use the .shape attribute of a tensor to return the number of rows in the feature array. In the case of the training dataset, we have five rows, which we can double-check as follows:
最后， __len__ 方法包含用于检索数据集长度的说明。在这里，我们使用张量的 .shape 属性来返回特征数组中的行数。对于训练数据集，我们有五行，我们可以按如下方式仔细检查：

len(train_ds)

The result is 5.
结果为 5。

Now that we defined a PyTorch Dataset class we can use for our toy dataset, we can use PyTorch’s DataLoader class to sample from it, as shown in the code below:
现在我们已经定义了一个可以用于玩具数据集的 PyTorch Dataset 类，我们可以使用 PyTorch 的 DataLoader 类从中采样，如下面的代码所示：

from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

test_ds = ToyDataset(X_test, y_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

After instantiating the training data loader, we can iterate over it as shown below. (The iteration over the test_loader works similarly but is omitted for brevity.)
实例化训练数据加载器后，我们可以对其进行迭代，如下所示。（ test_loader 上的迭代工作方式类似，但为简洁起见，省略了。

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

The result is: 结果是：

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])

As we can see based on the output above, the train_loader iterates over the training dataset visiting each training example exactly once. This is known as a training epoch. Since we seeded the random number generator using torch.manual_seed(123) above, you should get the exact same shuffling order of training examples as shown above. However if you iterate over the dataset a second time, you will see that the shuffling order will change. This is desired to prevent deep neural networks getting caught in repetitive update cycles during training.
正如我们根据上面的输出所看到的， train_loader 迭代训练数据集，访问每个训练样本一次。这称为训练 epoch。由于我们使用上面的 torch.manual_seed（123） 为随机数生成器设定了种子，因此您应该得到与上面所示完全相同的训练示例的洗牌顺序。但是，如果您第二次迭代数据集，您将看到 shuffle 顺序将发生变化。这是为了防止深度神经网络在训练期间陷入重复的更新周期。

Note that we specified a batch size of 2 above, but the 3rd batch only contains a single example. That’s because we have five training examples, which is not evenly divisible by 2. In practice, having a substantially smaller batch as the last batch in a training epoch can disturb the convergence during training. To prevent this, it’s recommended to set drop_last=True, which will drop the last batch in each epoch, as shown below:
请注意，我们在上面指定了 batch size 2，但第 3 个批次仅包含一个示例。那是因为我们有 5 个训练示例，不能被 2 整除。在实践中，将一个小得多的批次作为训练 epoch 中的最后一个批次可能会干扰训练期间的收敛。为防止这种情况，建议设置 drop_last=True，这将删除每个 epoch 中的最后一批，如下所示：

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

Now, iterating over the training loader, we can see that the last batch is omitted:
现在，迭代训练加载器，我们可以看到最后一批被省略了：

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

The result is: 结果是：

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])

Lastly, let’s discuss the setting num_workers=0 in the DataLoader. This parameter in PyTorch’s DataLoader function is crucial for parallelizing data loading and preprocessing. When num_workers is set to 0, the data loading will be done in the main process and not in separate worker processes. This might seem unproblematic, but it can lead to significant slowdowns during model training when we train larger networks on a GPU. This is because instead of focusing solely on the processing of the deep learning model, the CPU must also take time to load and preprocess the data. As a result, the GPU can sit idle while waiting for the CPU to finish these tasks. In contrast, when num_workers is set to a number greater than zero, multiple worker processes are launched to load data in parallel, freeing the main process to focus on training your model and better utilizing your system’s resources, which is illustrated in Figure 11.
最后，让我们讨论一下 DataLoader 中的 num_workers=0 设置。PyTorch 的 DataLoader 函数中的这个参数对于并行数据加载和预处理至关重要。当 num_workers 设置为 0 时，数据加载将在主进程中完成，而不是在单独的工作进程中完成。这似乎没有问题，但当我们在 GPU 上训练更大的网络时，它可能会导致模型训练期间的明显速度变慢。这是因为 CPU 不仅仅关注深度学习模型的处理，还必须花时间来加载和预处理数据。因此，GPU 可以在等待 CPU 完成这些任务时处于空闲状态。相反，当 num_workers 设置为大于 0 的数字时，将启动多个工作进程以并行加载数据，从而使主进程能够专注于训练模型和更好地利用系统资源，如图 11 所示。

Figure 11. Loading data without multiple workers (setting `num_workers=0`) will create a data loading bottleneck where the model sits idle until the next batch is loaded as illustrated in the left subpanel. If multiple workers are enabled, the data loader can already queue up the next batch in the background as shown in the right subpanel.
图 11.在没有多个工作程序的情况下加载数据（设置 'num_workers=0'）将产生数据加载瓶颈，即模型处于空闲状态，直到加载下一批，如左侧子面板所示。如果启用了多个工作程序，则数据加载程序已经可以在后台对下一批进行排队，如右侧子面板所示。

However, if we are working with very small datasets, setting num_workers to 1 or larger may not be necessary since the total training time takes only fractions of a second anyway. On the contrary, if you are working with tiny datasets or interactive environments such as Jupyter notebooks, increasing num_workers may not provide any noticeable speedup. They might, in fact, lead to some issues. One potential issue is the overhead of spinning up multiple worker processes, which could take longer than the actual data loading when your dataset is small.
但是，如果我们使用的是非常小的数据集，则可能不需要将 num_workers 设置为 1 或更大，因为无论如何，总训练时间只需要几分之一秒。相反，如果您正在使用微小的数据集或交互式环境（如 Jupyter 笔记本），则增加 num_workers 可能不会提供任何明显的加速。事实上，它们可能会导致一些问题。一个潜在问题是启动多个工作进程的开销，当数据集较小时，这可能需要比实际数据加载更长的时间。

Furthermore, for Jupyter notebooks, setting num_workers to greater than 0 can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes. Therefore, it’s essential to understand the trade-off and make a calculated decision on setting the num_workers parameter. When used correctly, it can be a beneficial tool but should be adapted to your specific dataset size and computational environment for optimal results.
此外，对于 Jupyter 笔记本，将 num_workers 设置为大于 0 有时会导致与不同进程之间的资源共享相关的问题，从而导致错误或笔记本崩溃。因此，必须了解权衡并做出设置 num_workers 参数的计算决策。如果使用得当，它可能是一个有益的工具，但应适应您的特定数据集大小和计算环境以获得最佳结果。

In my experience, setting num_workers=4 usually leads to optimal performance on many real-world datasets, but optimal settings depend on your hardware and the code used for loading a training example defined in the Dataset class.
根据我的经验，设置 num_workers=4 通常会在许多实际数据集上实现最佳性能，但最佳设置取决于您的硬件和用于加载 Dataset 类中定义的训练示例的代码。

7 A typical training loop
7 典型的训练循环

So far, we’ve discussed all the requirements for training neural networks: PyTorch’s tensor library, autograd, the Module API, and efficient data loaders. Let’s now combine all these things and train a neural network on the toy dataset from the previous section. The training code is shown in code below.
到目前为止，我们已经讨论了训练神经网络的所有要求：PyTorch 的张量库、autograd、 Module API 和高效的数据加载器。现在让我们将所有这些内容结合起来，并在上一节的玩具数据集上训练神经网络。训练代码显示在下面的代码中。

import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):
    
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)
        
        loss = F.cross_entropy(logits, labels) # Loss function
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation

Running the code above yields the following outputs:
运行上面的代码会产生以下输出：

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00

As we can see, the loss reaches zero after 3 epochs, a sign that the model converged on the training set. However, before we evaluate the model’s predictions, let’s go over some of the details of the preceding code.
正如我们所看到的，损失在 3 个 epoch 后达到零，这表明模型在训练集上收敛。但是，在评估模型的预测之前，让我们先了解一下前面代码的一些细节。

First, note that we initialized a model with two inputs and two outputs. That’s because the toy dataset from the previous section has two input features and two class labels to predict. We used a stochastic gradient descent (SGD) optimizer with a learning rate (lr) of 0.5. The learning rate is a hyperparameter, meaning it’s a tunable setting that we have to experiment with based on observing the loss. Ideally, we want to choose a learning rate such that the loss converges after a certain number of epochs – the number of epochs is another hyperparameter to choose.
首先，请注意，我们初始化了一个具有两个 Importing 和两个 Outputs 的模型。这是因为上一节中的玩具数据集有两个输入特征和两个要预测的类标签。我们使用了学习率（lr）为 0.5 的随机梯度下降（SGD）优化器。学习率是一个超参数，这意味着它是一个可调设置，我们必须根据观察损失进行试验。理想情况下，我们希望选择一个学习率，以便损失在一定数量的 epoch 后收敛 – epoch 数是另一个可供选择的超参数。

In practice, we often use a third dataset, a so-called validation dataset, to find the optimal hyperparameter settings. A validation dataset is similar to a test set. However, while we only want to use a test set precisely once to avoid biasing the evaluation, we usually use the validation set multiple times to tweak the model settings.
在实践中，我们经常使用第三个数据集，即所谓的验证数据集，来查找最佳超参数设置。验证数据集类似于测试集。然而，虽然我们只想精确地使用一次测试集以避免评估偏差，但我们通常会多次使用验证集来调整模型设置。

We also introduced new settings called model.train() and model.eval(). As these names imply, these settings are used to put the model into a training and an evaluation mode. This is necessary for components that behave differently during training and inference, such as dropout or batch normalization layers. Since we don’t have dropout or other components in our NeuralNetwork class that are affected by these settings, using model.train() and model.eval() is redundant in our code above. However, it’s best practice to include them anyway to avoid unexpected behaviors when we change the model architecture or reuse the code to train a different model.
我们还引入了名为 model.train（） 和 model.eval（） 的新设置。顾名思义，这些设置用于将模型置于训练和评估模式。这对于在训练和推理期间行为不同的组件（例如 dropout 或 batch normalization 层）来说是必需的。由于我们的 NeuralNetwork 类中没有受这些设置影响的 dropout 或其他组件，因此在上面的代码中使用 model.train（） 和 model.eval（） 是多余的。但是，最佳做法是无论如何都包含它们，以避免在我们更改模型架构或重用代码来训练不同的模型时出现意外行为。

As discussed earlier, we pass the logits directly into the cross_entropy loss function, which will apply the softmax function internally for efficiency and numerical stability reasons. Then, calling loss.backward() will calculate the gradients in the computation graph that PyTorch constructed in the background. The optimizer.step() method will use the gradients to update the model parameters to minimize the loss. In the case of the SGD optimizer, this means multiplying the gradients with the learning rate and adding the scaled negative gradient to the parameters.
如前所述，我们将 logits 直接传递给 cross_entropy loss 函数，出于效率和数值稳定性的原因，该函数将在内部应用 softmax 函数。然后，调用 loss.backward（） 将计算 PyTorch 在后台构建的计算图中的梯度。 optimizer.step（） 方法将使用梯度来更新模型参数，以最大限度地减少损失。对于 SGD 优化器，这意味着将梯度乘以学习率，并将缩放后的负梯度添加到参数中。

Preventing undesired gradient accumulation. It is important to include an optimizer.zero_grad() call in each update round to reset the gradients to zero. Otherwise, the gradients will accumulate, which may be undesired.
防止不需要的梯度积累。 在每个更新轮次中包含 optimizer.zero_grad（）调用以将梯度重置为零非常重要。否则，梯度将累积，这可能是不希望的。

After we trained the model, we can use it to make predictions, as shown below:
训练模型后，我们可以使用它来进行预测，如下所示：

model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

The results are as follows:
结果如下：

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])

To obtain the class membership probabilities, we can then use PyTorch’s softmax function, as follows:
为了获得类成员资格概率，我们可以使用 PyTorch 的 softmax 函数，如下所示：

torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])

Let’s consider the first row in the code output above. Here, the first value (column) means that the training example has a 99.91% probability of belonging to class 0 and a 0.09% probability of belonging to class 1. (The set_printoptions call is used here to make the outputs more legible.)
让我们考虑上面代码输出中的第一行。此处，第一个值（列）表示训练示例属于类 0 的概率为 99.91%，属于类 1 的概率为 0.09%。（此处使用 set_printoptions 调用以使输出更清晰。

We can convert these values into class labels predictions using PyTorch’s argmax function, which returns the index position of the highest value in each row if we set dim=1 (setting dim=0 would return the highest value in each column, instead):
我们可以使用 PyTorch 的 argmax 函数将这些值转换为类标签预测，如果我们设置 dim=1 ，该函数将返回每行中最大值的索引位置（设置 dim=0 将返回每列中的最大值）：

predictions = torch.argmax(probas, dim=1)
print(predictions)

This outputs: 这将输出：

tensor([0, 0, 0, 1, 1])

Note that it is unnecessary to compute softmax probabilities to obtain the class labels. We could also apply the argmax function to the logits (outputs) directly:
请注意，无需计算 softmax 概率即可获得类标签。我们还可以直接将 argmax 函数应用于 logits（ 输出 ）：

predictions = torch.argmax(outputs, dim=1)
print(predictions)

This prints: 这将打印出：

tensor([0, 0, 0, 1, 1])

Above, we computed the predicted labels for the training dataset. Since the training dataset is relatively small, we could compare it to the true training labels by eye and see that the model is 100% correct. We can double-check this using the == comparison operator:
上面，我们计算了训练数据集的预测标签。由于训练数据集相对较小，我们可以通过肉眼将其与真实的训练标签进行比较，并发现模型是 100% 正确的。我们可以使用 == comparison 运算符仔细检查这一点：

predictions == y_train

The results are: 结果如下：

tensor([True, True, True, True, True])

Using torch.sum, we can count the number of correct prediction as follows:
使用 torch.sum，我们可以计算正确预测的数量，如下所示：

torch.sum(predictions == y_train)

The output is 5.
输出为 5。

Since the dataset consists of 5 training examples, we have 5 out of 5 predictions that are correct, which equals 5/5 × 100% = 100% prediction accuracy.
由于数据集由 5 个训练样本组成，因此 5 个预测中有 5 个是正确的，这等于 5/5 × 100% = 100% 的预测准确率。

However, to generalize the computation of the prediction accuracy, let’s implement a compute_accuracy function as shown in the following code.
但是，为了概括预测准确性的计算，让我们实现一个 compute_accuracy 函数，如下面的代码所示。

def compute_accuracy(model, dataloader):

    model = model.eval()
    correct = 0.0
    total_examples = 0
    
    for idx, (features, labels) in enumerate(dataloader):
        
        with torch.no_grad():
            logits = model(features)
        
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

Note that the following compute_accuracy function iterates over a data loader to compute the number and fraction of the correct predictions. This is because when we work with large datasets, we typically can only call the model on a small part of the dataset due to memory limitations. The compute_accuracy function above is a general method that scales to datasets of arbitrary size since, in each iteration, the dataset chunk that the model receives is the same size as the batch size seen during training.
请注意，以下 compute_accuracy 函数迭代数据加载器以计算正确预测的数量和分数。这是因为当我们处理大型数据集时，由于内存限制，我们通常只能在数据集的一小部分上调用模型。上面的 compute_accuracy 函数是一种通用方法，可以扩展到任意大小的数据集，因为在每次迭代中，模型接收的数据集块的大小与训练期间看到的批量大小相同。

Notice that the internals of the compute_accuracy function are similar to what we used before when we converted the logits to the class labels.
请注意， compute_accuracy 函数的内部结构与我们之前将 logit 转换为类标签时使用的内部结构类似。

We can then apply the function to the training as follows:
然后，我们可以将该函数应用于训练，如下所示：

compute_accuracy(model, train_loader)

The results is 1.0.
结果为 1.0。

Similarly, we can apply the function to the test set as follows:
同样，我们可以将函数应用于测试集，如下所示：

compute_accuracy(model, test_loader)

This prints 1.0.
这将打印 1.0。

In this section, we learned how we can train a neural network using PyTorch. Next, let’s see how we can save and restore models after training.
在本节中，我们学习了如何使用 PyTorch 训练神经网络。接下来，让我们看看如何在训练后保存和恢复模型。

8 Saving and loading models
8 保存和加载模型

In the previous section, we successfully trained a model. Let’s now see how we can save a trained model to reuse it later.
在上一节中，我们成功训练了一个模型。现在让我们看看如何保存经过训练的模型以便以后重用它。

Here’s the recommended way how we can save and load models in PyTorch:
以下是我们在 PyTorch 中保存和加载模型的推荐方法：

torch.save(model.state_dict(), "model.pth")

The model’s state_dict is a Python dictionary object that maps each layer in the model to its trainable parameters (weights and biases). Note that "model.pth" is an arbitrary filename for the model file saved to disk. We can give it any name and file ending we like; however, .pth and .pt are the most common conventions.
该模型的 state_dict 是一个 Python 字典对象，它将模型中的每个层映射到其可训练参数（权重和偏差）。请注意， “model.pth” 是保存到磁盘的模型文件的任意文件名。我们可以给它起任何我们喜欢的名字和文件结尾;但是， .pth 和 .pt 是最常见的约定。

Once we saved the model, we can restore it from disk as follows:
保存模型后，我们可以按如下方式从磁盘恢复它：

model = NeuralNetwork(2, 2) # needs to match the original model exactly
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>

The torch.load("model.pth") function reads the file "model.pth" and reconstructs the Python dictionary object containing the model’s parameters while model.load_state_dict() applies these parameters to the model, effectively restoring its learned state from when we saved it.
torch.load（“model.pth”） 函数读取文件 “model.pth” 并重建包含模型参数的 Python 字典对象，而 model.load_state_dict（） 将这些参数应用于模型，有效地恢复了我们保存时的学习状态。

Note that the line model = NeuralNetwork(2, 2) above is not strictly necessary if you execute this code in the same session where you saved a model. However, I included it here to illustrate that we need an instance of the model in memory to apply the saved parameters. Here, the NeuralNetwork(2, 2) architecture needs to match the original saved model exactly.
请注意，如果您在保存模型的同一会话中执行此代码，则上面的 line model = NeuralNetwork（2， 2） 并不是绝对必要的。但是，我将其包含在这里是为了说明我们需要内存中的模型实例来应用保存的参数。在这里， NeuralNetwork（2， 2） 架构需要与原始保存的模型完全匹配。

The last section will show you how to train PyTorch models faster using one or more GPUs (if available).
最后一部分将向您展示如何使用一个或多个 GPU（如果可用）更快地训练 PyTorch 模型。

9 Optimizing training performance with GPUs
9 使用 GPU 优化训练性能

In this last section of this tutorial, we will see how we can utilize GPUs, which will accelerate deep neural network training compared to regular CPUs. First, we will introduce the main concepts behind GPU computing in PyTorch. Then, we will train a model on a single GPU. Finally, we’ll then look at distributed training using multiple GPUs.
在本教程的最后一节中，我们将了解如何利用 GPU，与常规 CPU 相比，GPU 将加速深度神经网络训练。首先，我们将介绍 PyTorch 中 GPU 计算背后的主要概念。然后，我们将在单个 GPU 上训练模型。最后，我们将研究使用多个 GPU 的分布式训练。

9.1 PyTorch computations on GPU devices
9.1 GPU 设备上的 PyTorch 计算

As you will see, modifying the training loop from section 2.7 to optionally run on a GPU is relatively simple and only requires changing three lines of code.
如你所见，将 2.7 节中的训练循环修改为选择性地在 GPU 上运行相对简单，只需要更改三行代码。

Before we make the modifications, it’s crucial to understand the main concept behind GPU computations within PyTorch. First, we need to introduce the notion of devices. In PyTorch, a device is where computations occur, and data resides. The CPU and the GPU are examples of devices. A PyTorch tensor resides in a device, and its operations are executed on the same device.
在进行修改之前，了解 PyTorch 中 GPU 计算背后的主要概念至关重要。首先，我们需要介绍设备的概念。在 PyTorch 中，设备是进行计算和数据所在的位置。CPU 和 GPU 是设备的示例。PyTorch 张量驻留在设备中，其作在同一设备上执行。

Let’s see how this works in action. Assuming that you installed a GPU-compatible version of PyTorch as explained in section 2.1.3, Installing PyTorch, we can double-check that our runtime indeed supports GPU computing via the following code:
让我们看看它是如何工作的。假设您按照第 2.1.3 节安装 PyTorch 中所述安装了 GPU 兼容版本的 PyTorch，我们可以通过以下代码仔细检查运行时是否确实支持 GPU 计算：

print(torch.cuda.is_available())

The result is: 结果是：

True

Now, suppose we have two tensors that we can add as follows – this computation will be carried out on the CPU by default:
现在，假设我们有两个张量，我们可以按如下方式添加 – 默认情况下，此计算将在 CPU 上执行：

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

print(tensor_1 + tensor_2)

This outputs: 这将输出：

tensor([5., 7., 9.])

We can now use the .to() method to transfer these tensors onto a GPU and perform the addition there:
我们现在可以使用 .to（）方法将这些张量传输到 GPU 上，并在其中执行加法：

tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")

print(tensor_1 + tensor_2)

The output is as follows:
输出如下：

tensor([5., 7., 9.], device='cuda:0')

Notice that the resulting tensor now includes the device information, device='cuda:0', which means that the tensors reside on the first GPU. If your machine hosts multiple GPUs, you have the option to specify which GPU you’d like to transfer the tensors to. You can do this by indicating the device ID in the transfer command. For instance, you can use .to("cuda:0"), .to("cuda:1"), and so on.
请注意，生成的张量现在包含设备信息 device='cuda：0' ，这意味着张量驻留在第一个 GPU 上。如果您的计算机托管多个 GPU，则可以选择指定要将张量传输到的 GPU。您可以通过在 transfer 命令中指示设备 ID 来执行此作。例如，您可以使用 .to（“cuda：0”）、 .to（“cuda：1”） 等。

However, it is important to note that all tensors must be on the same device. Otherwise, the computation will fail, as shown below, where one tensor resides on the CPU and the other on the GPU:
但是，请务必注意，所有张量必须位于同一设备上。否则，计算将失败，如下所示，其中一个张量驻留在 CPU 上，另一个张量驻留在 GPU 上：

tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)

This results in the following:
这将产生以下结果：

    ---------------------------------------------------------------------------
    
    RuntimeError                              Traceback (most recent call last)
    
    /tmp/ipykernel_2321/2079609735.py in <cell line: 2>()
          1 tensor_1 = tensor_1.to("cpu")
    ----> 2 print(tensor_1 + tensor_2)


    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In this section, we learned that GPU computations on PyTorch are relatively straightforward. All we have to do is transfer the tensors onto the same GPU device, and PyTorch will handle the rest. Equipped with this information, we can now train the neural network from the previous section on a GPU.
在本节中，我们了解到 PyTorch 上的 GPU 计算相对简单。我们所要做的就是将张量传输到同一个 GPU 设备上，PyTorch 将处理其余部分。有了这些信息，我们现在可以在 GPU 上训练上一节中的神经网络。

9.2 Single-GPU training 9.2 单 GPU 训练

Now that we are familiar with transferring tensors to the GPU, we can modify the training loop from section 2.7, A typical training loop, to run on a GPU. This requires only changing three lines of code, as shown in the code below.
现在我们已经熟悉了如何将张量传输到 GPU，我们可以修改 第 2.7 节典型训练循环中的训练循环，使其在 GPU 上运行。这只需要更改 3 行代码，如下面的代码所示。

torch.manual_seed(123)  
model = NeuralNetwork(num_inputs=2, num_outputs=2)

# New: Define a device variable that defaults to a GPU. 
device = torch.device("cuda")
# New: Transfer the model onto the GPU. 
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):  
      
    model.train()  
    for batch_idx, (features, labels) in enumerate(train_loader):

        # New: Transfer the data onto the GPU.
        features, labels = features.to(device), labels.to(device)    #C  
        logits = model(features)  
        loss = F.cross_entropy(logits, labels) # Loss function  
          
        optimizer.zero_grad()  
        loss.backward()  
        optimizer.step()  
      
        ### LOGGING  
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"  
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"  
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()  
    # Optional model evaluation

Running the above code will output the following, similar to the results obtained on the CPU previously in section 2.7:
运行上述代码将输出以下内容，类似于之前第 2.7 节中在 CPU 上获得的结果：

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75  
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65  
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44  
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13  
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03  
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00

We can also use .to("cuda") instead of device = torch.device("cuda"). As we saw in section 2.9.1, transferring a tensor to "cuda" instead of torch.device("cuda") works as well and is shorter. We can also modify the statement to the following, which will make the same code executable on a CPU if a GPU is not available, which is usually considered best practice when sharing PyTorch code:
我们还可以使用 .to（“cuda”） 而不是 device = torch.device（“cuda”）。 正如我们在第 2.9.1 节中看到的那样，将张量转移到 “cuda” 而不是 torch.device（“cuda”） 也可以，而且时间更短。我们还可以将语句修改为以下内容，如果 GPU 不可用，这将使相同的代码在 CPU 上可执行，这通常被认为是共享 PyTorch 代码时的最佳实践：

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In the case of the modified training loop above, we probably won’t see a speed-up because of the memory transfer cost from CPU to GPU. However, we can expect a significant speed-up when training deep neural networks, especially large language models.
在上面修改后的训练循环的情况下，由于从 CPU 到 GPU 的内存传输成本，我们可能不会看到加速。但是，在训练深度神经网络（尤其是大型语言模型）时，我们可以预期速度会显著加快。

As we saw in this section, training a model on a single GPU in PyTorch is relatively easy. Next, let’s introduce another concept: training models on multiple GPUs.
正如我们在本节中所看到的，在 PyTorch 中的单个 GPU 上训练模型相对容易。接下来，我们介绍另一个概念：在多个 GPU 上训练模型。

PyTorch on macOS. On an Apple Mac with an Apple Silicon chip (like the M1, M2, M3, M4, or newer models) instead of a computer with an Nvidia GPU, you can change device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
macOS 上的 PyTorch。 在配备 Apple Silicon 芯片的 Apple Mac（如 M1、M2、M3、M4 或更新型号）而不是配备 Nvidia GPU 的计算机上，您可以更改 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") to take advantage of this chip.
自 device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") 来利用这个芯片。

9.3 Training with multiple GPUs
9.3 使用多个 GPU 进行训练

In this section, we will briefly go over the concept of distributed training. Distributed training is the concept of dividing the model training across multiple GPUs and machines.
在本节中，我们将简要介绍分布式训练的概念。分布式训练是将模型训练划分到多个 GPU 和机器上的概念。

Why do we need this? Even when it is possible to train a model on a single GPU or machine, the process could be exceedingly time-consuming. The training time can be significantly reduced by distributing the training process across multiple machines, each with potentially multiple GPUs. This is particularly crucial in the experimental stages of model development, where numerous training iterations might be necessary to finetune the model parameters and architecture.
我们为什么需要这个？即使可以在单个 GPU 或机器上训练模型，该过程也可能非常耗时。通过将训练过程分布在多台机器上，每台机器可能有多个 GPU，可以显著减少训练时间。这在模型开发的实验阶段尤为重要，因为可能需要大量的训练迭代来微调模型参数和架构。

In this section, we will look at the most basic case of distributed training: PyTorch’s DistributedDataParallel (DDP) strategy. DDP enables parallelism by splitting the input data across the available devices and processing these data subsets simultaneously.
在本节中，我们将介绍分布式训练的最基本案例：PyTorch 的 DistributedDataParallel （DDP）策略。DDP 通过在可用设备之间拆分输入数据并同时处理这些数据子集来实现并行性。

How does this work? PyTorch launches a separate process on each GPU, and each process receives and keeps a copy of the model – these copies will be synchronized during training. To illustrate this, suppose we have two GPUs that we want to use to train a neural network, as shown in Figure 12.
这是如何工作的？PyTorch 在每个 GPU 上启动一个单独的进程，每个进程接收并保留模型的副本——这些副本将在训练期间同步。为了说明这一点，假设我们有两个 GPU 要用于训练神经网络，如图 12 所示。

Figure 12. The model and data transfer in DDP involves two key steps. First, we create a copy of the model on each of the GPUs. Then we divide the input data into unique minibatches that we pass on to each model copy.
图 12.DDP 中的模型和数据传输涉及两个关键步骤。首先，我们在每个 GPU 上创建模型的副本。然后，我们将输入数据划分为唯一的小批量，然后传递给每个模型副本。

Each of the two GPUs will receive a copy of the model. Then, in every training iteration, each model will receive a minibatch (or just batch) from the data loader. We can use a DistributedSampler to ensure that each GPU will receive a different, non-overlapping batch when using DDP.
两个 GPU 中的每一个都将收到模型的副本。然后，在每次训练迭代中，每个模型都将从数据加载器接收一个小批量（或仅批量）。我们可以使用 DistributedSampler 来确保在使用 DDP 时，每个 GPU 都会收到一个不同的、不重叠的批处理。

Since each model copy will see a different sample of the training data, the model copies will return different logits as outputs and compute different gradients during the backward pass. These gradients are then averaged and synchronized during training to update the models. This way, we ensure that the models don’t diverge, as illustrated in Figure 13.
由于每个模型副本将看到不同的训练数据样本，因此模型副本将返回不同的 logit 作为输出，并在向后传递期间计算不同的梯度。然后在训练期间对这些梯度进行平均和同步，以更新模型。这样，我们就可以确保模型不会发散，如图 13 所示。

Figure 13. The forward and backward pass in DDP are executed independently on each GPU with its corresponding data subset. Once the forward and backward passes are completed, gradients from each model replica (on each GPU) are synchronized across all GPUs. This ensures that every model replica has the same updated weights.
图 13.DDP 中的前向和后向传递在每个 GPU 及其相应的数据子集上独立执行。完成向前和向后传递后，每个模型副本（在每个 GPU 上）的梯度将在所有 GPU 之间同步。这可确保每个模型副本具有相同的更新权重。

The benefit of using DDP is the enhanced speed it offers for processing the dataset compared to a single GPU. Barring a minor communication overhead between devices that comes with DDP use, it can theoretically process a training epoch in half the time with two GPUs compared to just one. The time efficiency scales up with the number of GPUs, allowing us to process an epoch eight times faster if we have eight GPUs, and so on.
使用 DDP 的好处是，与单个 GPU 相比，它为处理数据集提供了更高的速度。除了使用 DDP 时设备之间会带来轻微的通信开销外，理论上，它使用两个 GPU 处理训练 epoch 的时间是原来的一半，而不是只有一个 GPU。时间效率随着 GPU 数量的增加而增加，如果我们有 8 个 GPU，则处理 epoch 的速度可以提高 8 倍，依此类推。

Multi-GPU computing in interactive environments DDP does not function properly within interactive Python environments like Jupyter notebooks, which don’t handle multiprocessing in the same way a standalone Python script does. Therefore, the following code should be executed as a script, not within a notebook interface like Jupyter. This is because DDP needs to spawn multiple processes, and each process should have its own Python interpreter instance.
交互式环境中的多 GPU 计算 DDP 在 Jupyter 笔记本等交互式 Python 环境中无法正常运行，这些环境无法像独立 Python 脚本那样处理多处理。因此，以下代码应作为脚本执行，而不是在 Jupyter 等笔记本界面中执行。这是因为 DDP 需要生成多个进程，每个进程都应该有自己的 Python 解释器实例。

First, we will import a few additional submodules, classes, and functions for distributed training PyTorch as shown in code below.
首先，我们将为分布式训练 PyTorch 导入一些额外的子模块、类和函数，如下面的代码所示。

import platform
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

Before we dive deeper into the changes to make the training compatible with DDP, let’s briefly go over the rationale and usage for these newly imported utilities that we need alongside the DistributedDataParallel class.
在我们深入研究使训练与 DDP 兼容的更改之前，让我们简要回顾一下我们需要的这些新导入的实用程序的基本原理和用法以及 DistributedDataParallel 类。

When we execute the modified multi-GPU code later, under the hood, PyTorch will spawn multiple independent processes to train the model. If we spawn multiple processes for training, we will need a way to divide the dataset among these different processes. For this, we will use the DistributedSampler.
当我们稍后执行修改后的多 GPU 代码时，PyTorch 会在后台生成多个独立的进程来训练模型。如果我们生成多个用于训练的进程，我们将需要一种方法来在这些不同的进程之间划分数据集。为此，我们将使用 DistributedSampler。

The init_process_group and destroy_process_group are used to initialize and quit the distributed training mods. The init_process_group function should be called at the beginning of the training script to initialize a process group for each process in the distributed setup, and destroy_process_group should be called at the end of the training script to destroy a given process group and release its resources.
init_process_group 和 destroy_process_group 用于初始化和退出分布式训练 Mod。应在训练脚本开始时调用 init_process_group 函数，为分布式设置中的每个进程初始化一个进程组，并在训练脚本结束时调用 destroy_process_group 函数，以销毁给定的进程组并释放其资源。

The following code below illustrates how these new components are used to implement DDP training for the NeuralNetwork model we implemented earlier.
下面的代码说明了如何使用这些新组件为我们之前实现的 NeuralNetwork 模型实现 DDP 训练。

The full script is provided below:
下面提供了完整的脚本：

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# NEW imports:
import os
import platform
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group


# NEW: function to initialize a distributed process group (1 process / GPU)
# this allows communication among processes
def ddp_setup(rank, world_size):
    """
    Arguments:
        rank: a unique process ID
        world_size: total number of processes in the group
    """
    # Only set MASTER_ADDR and MASTER_PORT if not already defined by torchrun
    if "MASTER_ADDR" not in os.environ:
        os.environ["MASTER_ADDR"] = "localhost"
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = "12345"

    # initialize process group
    if platform.system() == "Windows":
        # Disable libuv because PyTorch for Windows isn't built with support
        os.environ["USE_LIBUV"] = "0"
        # Windows users may have to use "gloo" instead of "nccl" as backend
        # gloo: Facebook Collective Communication Library
        init_process_group(backend="gloo", rank=rank, world_size=world_size)
    else:
        # nccl: NVIDIA Collective Communication Library
        init_process_group(backend="nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]


class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits


def prepare_dataset():
    X_train = torch.tensor([
        [-1.2, 3.1],
        [-0.9, 2.9],
        [-0.5, 2.6],
        [2.3, -1.1],
        [2.7, -1.5]
    ])
    y_train = torch.tensor([0, 0, 0, 1, 1])

    X_test = torch.tensor([
        [-0.8, 2.8],
        [2.6, -1.6],
    ])
    y_test = torch.tensor([0, 1])

    # Uncomment these lines to increase the dataset size to run this script on up to 8 GPUs:
    # factor = 4
    # X_train = torch.cat([X_train + torch.randn_like(X_train) * 0.1 for _ in range(factor)])
    # y_train = y_train.repeat(factor)
    # X_test = torch.cat([X_test + torch.randn_like(X_test) * 0.1 for _ in range(factor)])
    # y_test = y_test.repeat(factor)

    train_ds = ToyDataset(X_train, y_train)
    test_ds = ToyDataset(X_test, y_test)

    train_loader = DataLoader(
        dataset=train_ds,
        batch_size=2,
        shuffle=False,  # NEW: False because of DistributedSampler below
        pin_memory=True,
        drop_last=True,
        # NEW: chunk batches across GPUs without overlapping samples:
        sampler=DistributedSampler(train_ds)  # NEW
    )
    test_loader = DataLoader(
        dataset=test_ds,
        batch_size=2,
        shuffle=False,
    )
    return train_loader, test_loader


# NEW: wrapper
def main(rank, world_size, num_epochs):

    ddp_setup(rank, world_size)  # NEW: initialize process groups

    train_loader, test_loader = prepare_dataset()
    model = NeuralNetwork(num_inputs=2, num_outputs=2)
    model.to(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

    model = DDP(model, device_ids=[rank])  # NEW: wrap model with DDP
    # the core model is now accessible as model.module

    for epoch in range(num_epochs):
        # NEW: Set sampler to ensure each epoch has a different shuffle order
        train_loader.sampler.set_epoch(epoch)

        model.train()
        for features, labels in train_loader:

            features, labels = features.to(rank), labels.to(rank)  # New: use rank
            logits = model(features)
            loss = F.cross_entropy(logits, labels)  # Loss function

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # LOGGING
            print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
                  f" | Batchsize {labels.shape[0]:03d}"
                  f" | Train/Val Loss: {loss:.2f}")

    model.eval()

    try:
        train_acc = compute_accuracy(model, train_loader, device=rank)
        print(f"[GPU{rank}] Training accuracy", train_acc)
        test_acc = compute_accuracy(model, test_loader, device=rank)
        print(f"[GPU{rank}] Test accuracy", test_acc)

    ####################################################
    # NEW:
    except ZeroDivisionError as e:
        raise ZeroDivisionError(
            f"{e}\n\nThis script is designed for 2 GPUs. You can run it as:\n"
            "torchrun --nproc_per_node=2 DDP-script-torchrun.py\n"
            f"Or, to run it on {torch.cuda.device_count()} GPUs, uncomment the code on lines 103 to 107."
        )
    ####################################################

    destroy_process_group()  # NEW: cleanly exit distributed mode


def compute_accuracy(model, dataloader, device):
    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):
        features, labels = features.to(device), labels.to(device)

        with torch.no_grad():
            logits = model(features)
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)
    return (correct / total_examples).item()


if __name__ == "__main__":
    # NEW: Use environment variables set by torchrun if available, otherwise default to single-process.
    if "WORLD_SIZE" in os.environ:
        world_size = int(os.environ["WORLD_SIZE"])
    else:
        world_size = 1

    if "LOCAL_RANK" in os.environ:
        rank = int(os.environ["LOCAL_RANK"])
    elif "RANK" in os.environ:
        rank = int(os.environ["RANK"])
    else:
        rank = 0

    # Only print on rank 0 to avoid duplicate prints from each GPU process
    if rank == 0:
        print("PyTorch version:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
        print("Number of GPUs available:", torch.cuda.device_count())

    torch.manual_seed(123)
    num_epochs = 3
    main(rank, world_size, num_epochs)

Before we run the code above, here is a summary of how it works. We have a __name__ == "__main__" clause at the bottom of the script that is executed when we run the file as a standalone Python script (as opposed to importing it as a module) – actually, we will not run it as a regular Python script, but more on that in a few moments. This __main__ block begins by printing the number of available GPUs using torch.cuda.device_count() and sets a random seed for reproducibility.
在我们运行上面的代码之前，以下是它的工作原理摘要。我们在脚本底部有一个 __name__ == “__main__” 子句，当我们将文件作为独立的 Python 脚本运行时（而不是将其作为模块导入）时执行 – 实际上，我们不会将其作为常规 Python 脚本运行，但稍后会详细介绍。此 __main__ 块首先使用 torch.cuda.device_count（） 打印可用 GPU 的数量，并设置随机种子以实现可重复性。

As teasered in the previous paragraph, rather than running the code as a “regular” Python script (via python ...py) and manually spawning processes from within Python using multiprocessing.spawn for the multi-GPU training aspect, we will rely on PyTorch’s modern and preferred utility: torchrun (the command will be shown after explaining the other main aspects in the code.)
如上一段所述，与其将代码作为“常规”Python 脚本运行（通过 python ...py）并使用 multiprocessing.spawn 从 Python 中手动生成进程对于多 GPU 训练方面，我们将依赖 PyTorch 的现代和首选实用程序： torchrun （该命令将在解释代码中的其他主要方面后显示。

When running the script using torchrun, it automatically launches one process per GPU and assigns each process a unique rank, along with other distributed training metadata (like world size and local rank), which are passed into the script via environment variables. In the __main__ block, we read these variables using os.environ and pass them to the main() function.
使用 torchrun 运行脚本时，它会自动为每个 GPU 启动一个进程，并为每个进程分配一个唯一的排名，以及其他分布式训练元数据（如世界大小和本地排名），这些元数据通过环境变量传递到脚本中。在 __main__ 块中，我们使用 os.environ 读取这些变量并将它们传递给 main（） 函数。

The main() function initializes the distributed environment via ddp_setup, which another helper function we define. Then, it loads the training and test sets, sets up the model, and performs the training loop. As in our single-GPU training setup from Section 2.12, we transfer both the model and data to the correct GPU using .to(rank), where rank corresponds to the GPU index for the current process. We also wrap the model using DistributedDataParallel (DDP), which enables synchronized gradient updates across all GPUs during training. Once training is complete and we evaluate the model, we call destroy_process_group() to properly shut down the distributed training processes and release associated resources.
main（） 函数通过 ddp_setup 初始化分布式环境，我们定义了另一个辅助函数。然后，它加载训练集和测试集，设置模型，并执行训练循环。与第 2.12 节中的单 GPU 训练设置一样，我们使用 .to（rank） 将模型和数据传输到正确的 GPU，其中 rank 对应于当前进程的 GPU 索引。我们还使用 DistributedDataParallel （DDP） 包装模型，它可以在训练期间跨所有 GPU 实现同步梯度更新。训练完成并评估模型后，我们调用 destroy_process_group（） 以正确关闭分布式训练进程并释放相关资源。

As mentioned earlier, each GPU should receive a different subset of the training data to ensure non-overlapping computation. To enable this, we use a DistributedSampler in the training data loader via the argument sampler=DistributedSampler(train_ds).
如前所述，每个 GPU 都应该接收不同的训练数据子集，以确保计算不重叠。为了实现这一点，我们通过参数 sampler=DistributedSampler(train_ds) 在训练数据加载器中使用 DistributedSampler 。

The last component to highlight is the ddp_setup() function. This function sets the master node’s address and communication port (unless already provided by torchrun), initializes the process group using the NCCL backend (which is optimized for GPU-to-GPU communication), and then sets the device for the current process using the provided rank.
最后一个要突出显示的组件是 ddp_setup（） 函数。此函数设置主节点的地址和通信端口（除非 torchrun 已提供），使用 NCCL 后端（针对 GPU 到 GPU 通信进行了优化）初始化进程组，然后使用提供的秩为当前进程设置设备。

This script is designed for 2 GPUs. After saving it as a file, DDP-script-torchrun.py, you can run it as follows using the torchrun utility from the command line, which is automatically installed when you install PyTorch, assuming you saved the above code as DDP-script-torchrun.py file:
此脚本专为 2 个 GPU 设计。 DDP-script-torchrun.py，将其保存为文件后，您可以使用命令行的 torchrun 实用程序按如下方式运行它，该实用程序会在您安装 PyTorch 时自动安装，假设您将上述代码保存为 DDP-script-torchrun.py 文件：

torchrun --nproc_per_node=2 DDP-script-torchrun.py

If you want to run it on all available GPUs, you can use:
如果您想在所有 可用的 GPU 上运行它，您可以使用：

torchrun --nproc_per_node=$(nvidia-smi -L | wc -l) DDP-script-torchrun.py

However, since this code uses only a very small dataset, you have to uncomment the following lines in the script code above to run it on more GPUs:
但是，由于此代码仅使用非常小的数据集，因此您必须取消上述脚本代码中以下行的注释，才能在更多 GPU 上运行它：

# Uncomment these lines to increase the dataset size to run this script on up to 8 GPUs:
# factor = 4
# X_train = torch.cat([X_train + torch.randn_like(X_train) * 0.1 for _ in range(factor)])
# y_train = y_train.repeat(factor)
# X_test = torch.cat([X_test + torch.randn_like(X_test) * 0.1 for _ in range(factor)])
# y_test = y_test.repeat(factor)

Note that the previous script should work on both single- and multi-GPU machines. If we run this code on a single GPU, we should see the following output:
请注意，前面的脚本应该可以在单 GPU 和多 GPU 计算机上运行。如果我们在单个 GPU 上运行此代码，我们应该会看到以下输出：

PyTorch version: 2.0.1+cu117  
CUDA available: True  
Number of GPUs available: 1  
[GPU0] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.62  
[GPU0] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.32  
[GPU0] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.11  
[GPU0] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.07  
[GPU0] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.02  
[GPU0] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.03  
[GPU0] Training accuracy 1.0  
[GPU0] Test accuracy 1.0

The code output looks similar to the one in section 2.9.2, which is a good sanity check.
代码输出看起来类似于 2.9.2 节中的输出，这是一个很好的健全性检查。

Now, if we run the same command and code on a machine with two GPUs, we should see the following:
现在，如果我们在具有两个 GPU 的计算机上运行相同的命令和代码，我们应该会看到以下内容：

PyTorch version: 2.0.1+cu117  
CUDA available: True  
Number of GPUs available: 2  
[GPU1] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.60  
[GPU0] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.59  
[GPU0] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.16  
[GPU1] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.17  
[GPU0] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.05  
[GPU1] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.05  
[GPU1] Training accuracy 1.0  
[GPU0] Training accuracy 1.0  
[GPU1] Test accuracy 1.0  
[GPU0] Test accuracy 1.0

As expected, we can see that some batches are processed on the first GPU (GPU0) and others on the second (GPU1). However, we see duplicated output lines when printing the training and test accuracies. This is because each process (in other words, each GPU) prints the test accuracy independently. Since DDP replicates the model onto each GPU and each process runs independently, if you have a print statement inside your testing loop, each process will execute it, leading to repeated output lines.
正如预期的那样，我们可以看到一些批次在第一个 GPU （GPU0）上处理，而其他批次在第二个 GPU （GPU1）上处理。但是，在打印训练和测试精度时，我们会看到重复的输出行。这是因为每个进程（换句话说，每个 GPU）都独立打印测试准确性。由于 DDP 将模型复制到每个 GPU 上，并且每个进程独立运行，因此，如果测试循环中有 print 语句，则每个进程都会执行它，从而导致重复的输出行。
If this bothers you, you can fix this using the rank of each process to control your print statements.
如果这让您感到困扰，您可以使用每个进程的排名来控制您的打印语句来解决此问题。

if rank == 0: # only print in the first process  
    print("Test accuracy: ", accuracy)

This is, in a nutshell, how distributed training via DDP works. If you are interested in additional details, I recommend checking the official API documentation at https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html.
简而言之，这就是通过 DDP 进行分布式训练的工作原理。如果您对其他详细信息感兴趣，我建议您查看 https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html 的官方 API 文档。

Summary 总结

PyTorch is an open-source library that consists of three core components: a tensor library, automatic differentiation functions, and deep learning utilities.
PyTorch 是一个开源库，由三个核心组件组成：张量库、自动微分函数和深度学习实用程序。
PyTorch’s tensor library is similar to array libraries like NumPy
PyTorch 的张量库类似于 NumPy 等数组库
In the context of PyTorch, tensors are array-like data structures to represent scalars, vectors, matrices, and higher-dimensional arrays.
在 PyTorch 的上下文中，张量是类似数组的数据结构，用于表示标量、向量、矩阵和高维数组。
PyTorch tensors can be executed on the CPU, but one major advantage of PyTorch’s tensor format is its GPU support to accelerate computations.
PyTorch 张量可以在 CPU 上执行，但 PyTorch 张量格式的一个主要优势是其 GPU 支持加速计算。
The automatic differentiation (autograd) capabilities in PyTorch allow us to conveniently train neural networks using backpropagation without manually deriving gradients.
PyTorch 中的自动微分（autograd）功能使我们能够方便地使用反向传播来训练神经网络，而无需手动推导梯度。
The deep learning utilities in PyTorch provide building blocks for creating custom deep neural networks.
PyTorch 中的深度学习实用程序为创建自定义深度神经网络提供了构建块。
PyTorch includes Dataset and DataLoader classes to set up efficient data loading pipelines.
PyTorch 包括 Dataset 和 DataLoader 类，用于设置高效的数据加载管道。
It’s easiest to train models on a CPU or single GPU.
在 CPU 或单个 GPU 上训练模型最容易。
Using DistributedDataParallel is the simplest way in PyTorch to accelerate the training if multiple GPUs are available.
如果有多个 GPU 可用，使用 DistributedDataParallel 是 PyTorch 中加速训练的最简单方法。

PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs
PyTorch 一小时教程：从张量到在多个 GPU 上训练神经网络

1. What is PyTorch
1. 什么是 PyTorch

1.1 The three core components of PyTorch
1.1 PyTorch 的三个核心组件

1.2 Defining deep learning
1.2 定义深度学习

1.3 Installing PyTorch 1.3 安装 PyTorch

2 Understanding tensors 2 了解张量

2.1 Scalars, vectors, matrices, and tensors
2.1 标量、向量、矩阵和张量

2.2 Tensor data types
2.2 Tensor 数据类型

2.3 Common PyTorch tensor operations
2.3 常见的 PyTorch 张量作

3 Seeing models as computation graphs
3 将模型视为计算图

4 Automatic differentiation made easy
4 轻松实现自动区分

5 Implementing multilayer neural networks
5 实现多层神经网络

6 Setting up efficient data loaders
6 设置高效的数据加载器

7 A typical training loop
7 典型的训练循环

8 Saving and loading models
8 保存和加载模型

9 Optimizing training performance with GPUs
9 使用 GPU 优化训练性能

9.1 PyTorch computations on GPU devices
9.1 GPU 设备上的 PyTorch 计算

9.2 Single-GPU training 9.2 单 GPU 训练

9.3 Training with multiple GPUs
9.3 使用多个 GPU 进行训练

Summary 总结

Further reading 延伸阅读

1. What is PyTorch1. 什么是 PyTorch

1.1 The three core components of PyTorch1.1 PyTorch 的三个核心组件

1.2 Defining deep learning1.2 定义深度学习

1.3 Installing PyTorch 1.3 安装 PyTorch

2 Understanding tensors 2 了解张量

2.1 Scalars, vectors, matrices, and tensors2.1 标量、向量、矩阵和张量

2.2 Tensor data types2.2 Tensor 数据类型

2.3 Common PyTorch tensor operations2.3 常见的 PyTorch 张量作

3 Seeing models as computation graphs3 将模型视为计算图

4 Automatic differentiation made easy4 轻松实现自动区分

5 Implementing multilayer neural networks5 实现多层神经网络

6 Setting up efficient data loaders6 设置高效的数据加载器

7 A typical training loop7 典型的训练循环

8 Saving and loading models8 保存和加载模型

9 Optimizing training performance with GPUs9 使用 GPU 优化训练性能

9.1 PyTorch computations on GPU devices9.1 GPU 设备上的 PyTorch 计算

9.2 Single-GPU training 9.2 单 GPU 训练

9.3 Training with multiple GPUs9.3 使用多个 GPU 进行训练

Summary 总结

Further reading 延伸阅读

1. What is PyTorch
1. 什么是 PyTorch

1.1 The three core components of PyTorch
1.1 PyTorch 的三个核心组件

1.2 Defining deep learning
1.2 定义深度学习

2.1 Scalars, vectors, matrices, and tensors
2.1 标量、向量、矩阵和张量

2.2 Tensor data types
2.2 Tensor 数据类型

2.3 Common PyTorch tensor operations
2.3 常见的 PyTorch 张量作

3 Seeing models as computation graphs
3 将模型视为计算图

4 Automatic differentiation made easy
4 轻松实现自动区分

5 Implementing multilayer neural networks
5 实现多层神经网络

6 Setting up efficient data loaders
6 设置高效的数据加载器

7 A typical training loop
7 典型的训练循环

8 Saving and loading models
8 保存和加载模型

9 Optimizing training performance with GPUs
9 使用 GPU 优化训练性能

9.1 PyTorch computations on GPU devices
9.1 GPU 设备上的 PyTorch 计算

9.3 Training with multiple GPUs
9.3 使用多个 GPU 进行训练