Using NCCL with CUDA Graphs¶
Starting with NCCL 2.9, NCCL operations can be captured by CUDA Graphs.
CUDA Graphs provide a way to define workflows as graphs rather than single operations. They may reduce overhead by launching multiple GPU operations through a single CPU operation. More details about CUDA Graphs can be found in the CUDA Programming Guide.
NCCL’s collective, P2P and group operations all support CUDA Graph captures. This support requires a minimum CUDA version of 11.3.
Whether an operation launch is graph-captured is considered a collective property of that operation and therefore must be uniform over all ranks participating in the launch (for collectives this is all ranks in the communicator, for peer-to-peer this is both the sender and receiver). The launch of a graph (via cudaGraphLaunch, etc.) containing a captured NCCL operation is considered collective for the same set of ranks that were present in the capture, and each of those ranks must be using the graph derived from that collective capture.
The following sample code shows how to capture computational kernels and NCCL operations in a CUDA Graph:
cudaGraph_t graph;
// Begin capturing all work subsequently submitted to this stream
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
ncclAllReduce(..., stream);
kernel_C<<< ..., stream >>>(...);
// End capture and obtain the recorded graph
cudaStreamEndCapture(stream, &graph);
cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);
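The instantiated graph can be relaunched as many times as needed, which is where the overhead savings come from: a single `cudaGraphLaunch` per iteration replaces the separate launches of `kernel_A`, `kernel_B`, the allreduce, and `kernel_C`. A sketch of such a replay loop, continuing from the example above (the step loop and `num_steps` are illustrative, not part of the original example; this requires a GPU and an initialized NCCL communicator to actually run):

```c
// Replay the captured work each step: one CPU-side launch submits
// kernel_A, kernel_B, the ncclAllReduce, and kernel_C together.
// Launches on the same stream execute in order, so replays serialize.
for (int step = 0; step < num_steps; step++) {
    cudaGraphLaunch(instance, stream);
}
cudaStreamSynchronize(stream);  // wait for the final replay to finish

// Release graph resources once no further replays are needed
cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);
```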
Starting with NCCL 2.11, when NCCL communication is captured and the CollNet algorithm is used, NCCL allows for further performance improvement via user buffer registration. For details, please see the environment variable NCCL_GRAPH_REGISTER.
Having multiple outstanding NCCL operations that are any combination of graph-captured and non-captured is supported. One caveat: the mechanism NCCL uses internally to support this mixing has been observed to cause CUDA to deadlock when graphs from multiple communicators are launched via cudaGraphLaunch() from the same thread. To disable this mechanism, see the environment variable NCCL_GRAPH_MIXING_SUPPORT.