仓库源文，站点原文

首先是这篇文章

Horace He. (2022). Making Deep Learning Go Brrrr From First Principles

内容不再赘述, 下面是纲要和一些对原文的注解. 虽然文章用 Torch 和 GPU 举例, 但基本原理依然普适.

<details><summary>注 (Show more »)</summary> 机器之心 <a href="https://zhuanlan.zhihu.com/p/485490532">编译</a> 过这篇文章, 但有些错误. 比如只看第一段, 前几句话还行, 但光最后一句就有两个技术错误: (1) 把 in-place 说成内置; (2) 梯度设置为 None 说成 0. 关于第 2 点可以参考 torch 文档 <a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#use-parameter-grad-none-instead-of-model-zero-grad-or-optimizer-zero-grad">Use parameter.grad = None instead of model.zero_grad() or optimizer.zero_grad()</a>.</details>

简单介绍了模型耗时可以分为三部分

Compute: Time spent on your GPU computing actual floating point operations (FLOPS)
Bandwith: Time spent on moving the data from CPU to GPU, from one node to another, or even from CUDA global memory to CUDA shared memory
Overhead: Everything else

我们希望最大化利用 GPU 算力, 但实际上内存带宽才是瓶颈. 硬件层面, 每年内存带宽增速不及 FLOPS 增速, 导致这个问题更加严重.

简单起见, 文章只讨论了 GPU 内部数据转移带来的内存带宽开销, 并简单介绍了可以用算子融合降低带宽开销. 文章用 toy example 演示了什么时候受限于带宽, 什么时候瓶颈才是算力.

NVIDIA A100 specs
那张三类算子耗时图片出处论文 Data Movement Is All You Need: A Case Study on Optimizing Transformers 是 MLSys 2021 五篇 best papers 之一.

算子融合

也叫 kernel 融合

"Kernel" here is for computation kernels.
Torch 博文 Optimizing CUDA Recurrent Neural Networks with TorchScript 可以作为算子融合 (主要是逐点运算) 的例子. [网友笔记]
Torch 文档 fuse pointwise operations 介绍了用户怎么手动融合算子.
内核融合: GPU 深度学习的 "加速神器" 微软研究院一篇老博文, 不知道为什么官网上被删掉了. 可以读一读 "内核间的数据同步" 这节.

Overhead

最后文章提到虽然 PyTorch 等框架有额外调度开销, 但因为 CPU 和 GPU 并行, 只要 GPU 任务足够繁重, CPU 完成任务时间短, 那么 overhead 就不会成为瓶颈.

Overhead bound 那张手绘图的竖线应当是时间轴, 从上到下.
Torch 文档 asynchronous-execution 中也有介绍 "By default, GPU operations are asynchronous".
更详细的 Torch 内部调度方法可以参考 Let's talk about the PyTorch dispatcher.
Torch profiler

其他

NVIDIA 文档开头就写了

Operating In Math-Limited Regime Where Possible

尽量让计算成为瓶颈.

另外要注意 nvidia-smi 中的 GPU-Util (不要读成 Volatile GPU-Util) 定义只是时间维度上的

It indicates the percent of GPU utilization i.e. percent of the time when kernels were using GPU over the sample period. Here, the period could be between 1 to 1/6th second.

只要时间段内有一个 kernel 执行就会算上 util. 最简单的例子: 写一个 1-block 的循环 kernel 就能把 GPU "利用率" 打到 100%, 但在空间层面上利用率很低.

NVIDIA 文档 Deep Learning Performance Documentation 介绍了 arithmetic intensity, 以及更详细的介绍包括 GPU Performance Background, Matrix Multiplication Background 等. 其他相关关键词还有 roofline modeling. (TODO 疑问: 为什么增大 batch size 可以提高 arithmetic intensity, 分子和分母应该都是线性增长?)
FLOPs 与模型推理速度提供具体的计算机视觉模型的例子.
Understanding Tensor Cores
教你如何继续压榨 GPU 的算力介绍了 MPS (Multi-Process Service) 模式.
为什么模型和数据都在 GPU 上, 却打不满 GPU 的使用率?
同一张 GPU 运行多个程序速度会变慢吗? 大概率会变慢. 该答主 lychee 写过不少 GPU 底层的文章.
GPU 内存 (显存) 的理解与基本使用

混合精度训练

rumor. (2020). 一文搞懂神经网络混合精度训练
Train With Mixed Precision. NVIDIA.
Automatic Mixed Precision for Deep Learning. NVIDIA.
自动混合精度训练（AMP）-使用文档-PaddlePaddle深度学习平台