Overview
Paper link: ReZero is All You Need: Fast Convergence at Large Depth
Posted to arXiv on 2020/03/10, this paper proposes a variant of the residual block that keeps back propagation effective even in very deep models, reducing the impact of gradient vanishing and gradient exploding while also speeding up convergence. Simple, yet very effective.
Abstract
Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing/exploding gradients. This is especially true for Transformer architectures where depth beyond 12 layers is difficult to train without large datasets and computational budgets. In general, we find that inefficient signal propagation impedes learning in deep networks. In Transformers, multi-head self-attention is the main cause of this poor signal propagation. To facilitate deep signal propagation, we propose ReZero, a simple change to the architecture that initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer. We apply this technique to language modeling and find that we can easily train ReZero-Transformer networks over a hundred layers. When applied to 12 layer Transformers, ReZero converges 56% faster on enwiki8. ReZero applies beyond Transformers to other residual networks, enabling 1,500% faster convergence for deep fully connected networks and 32% faster convergence for a ResNet-56 trained on CIFAR 10.
Introduction
If you have seen ResNet, you already know this (for more on ResNet, see [DL]淺談CNN在Object Classification上的各種架構 and [Python]逐步解釋ResNet34程式碼(Pytorch)): ResNet's residual block performs an identity operation. For an input $x_i$ and a neural network layer $F(\cdot)$, the previous layer's output is added directly to the output of the layer, that is,
$x_{i+1} = x_i + F(x_i)$
ReZero stands for "residual with zero initialization": each layer gets one additional trainable parameter $\alpha$, and zero initialization means that every layer's $\alpha$ is set to $0$ at the start of training. As training progresses, $\alpha$ is gradually adjusted. The formula is:
$x_{i+1} = x_i + \alpha_i F(x_i)$
The architecture diagram is shown below; comparing it side by side with ResNet makes the idea quite clear.
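To make the formula concrete, here is a minimal PyTorch sketch of a ReZero residual wrapper (my own illustration, not the authors' code; `ReZeroBlock` and `fn` are hypothetical names): `fn` is any sublayer $F(\cdot)$, and `alpha` is the per-layer trainable scalar initialized to zero, so every block starts out as an exact identity map.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """ReZero residual block: x_{i+1} = x_i + alpha_i * F(x_i), with alpha_i initialized to 0."""

    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn                               # the wrapped sublayer F(.)
        self.alpha = nn.Parameter(torch.zeros(1))  # trainable scalar, zero-initialized

    def forward(self, x):
        # At initialization alpha == 0, so the whole block is exactly the identity map.
        return x + self.alpha * self.fn(x)


# Example: a deep fully connected network built from ReZero blocks.
net = nn.Sequential(*[
    ReZeroBlock(nn.Sequential(nn.Linear(128, 128), nn.ReLU()))
    for _ in range(64)
])
x = torch.randn(8, 128)
y = net(x)  # equals x at initialization, because every alpha starts at 0
```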
The paper highlights two advantages of ReZero (a sketch of a ReZero Transformer sublayer follows the list):
- Deeper learning: Signals effectively propagate through deep networks, which allows for learning in otherwise untrainable networks. ReZero successfully trains 10,000 layers of fully-connected networks, and we are the first to train Transformers over 100 layers without learning rate warm-up or LayerNorm. In contrast to [11] we find that to get good results at this depth, it is not necessary to add auxiliary losses.
- Faster convergence: We observe significantly accelerated convergence in ReZero networks compared to regular residual networks with normalization. When ReZero is applied to Transformers, we converge 56% faster than the vanilla Transformer to reach 1.2 BPB on the enwiki8 language modeling benchmark. When applied to ResNets, we obtain 32% speed up to reach 85% accuracy on CIFAR 10.
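As the first point notes, ReZero Transformers are trained without LayerNorm or learning rate warm-up. Below is a minimal sketch of what such an encoder layer could look like, assuming PyTorch; `ReZeroTransformerLayer` and the choice of sharing one `alpha` between the attention and feed-forward sublayers are my own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ReZeroTransformerLayer(nn.Module):
    """Hypothetical ReZero-style Transformer encoder layer: no LayerNorm, and both
    residual branches are scaled by a single zero-initialized scalar per layer."""

    def __init__(self, d_model=512, nhead=8, dim_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)
        self.alpha = nn.Parameter(torch.zeros(1))  # shared by both sublayers (an assumption)

    def forward(self, x):
        # x: (seq_len, batch, d_model), the default layout of nn.MultiheadAttention
        attn_out, _ = self.attn(x, x, x)
        x = x + self.alpha * self.dropout(attn_out)    # ReZero residual instead of Post-/Pre-Norm
        x = x + self.alpha * self.dropout(self.ff(x))  # same for the feed-forward sublayer
        return x


layer = ReZeroTransformerLayer()
out = layer(torch.randn(10, 2, 512))  # (seq_len=10, batch=2, d_model=512)
```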
Finally, the paper also compares ReZero against other variants of normalization and residual connections, as shown in the figure below:
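For reference, the update rules usually being compared are roughly the following (my own summary of the standard definitions; the grouping in the paper's table may differ slightly):

$$
\begin{aligned}
\text{Plain deep network:} \quad & x_{i+1} = F(x_i) \\
\text{Residual (ResNet):} \quad & x_{i+1} = x_i + F(x_i) \\
\text{Post-Norm (vanilla Transformer):} \quad & x_{i+1} = \mathrm{Norm}\bigl(x_i + F(x_i)\bigr) \\
\text{Pre-Norm:} \quad & x_{i+1} = x_i + F\bigl(\mathrm{Norm}(x_i)\bigr) \\
\text{ReZero:} \quad & x_{i+1} = x_i + \alpha_i\, F(x_i)
\end{aligned}
$$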
Conclusion
ReZero looks like a new block design: with it you can reach the same or even better convergence speed and results without any normalization, and it can be plugged into any neural network. Unfortunately, the code link given in the paper appears to be broken, so the authors' source code cannot be viewed for now (?)