[Tensorflow] The Learning Path from Pytorch to TF2 - Different Padding Algorithms

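Introduction

In Conv and Pooling layers, depending on the kernel size and stride, the input can shrink with every operation. To keep the output the same size as the input, we add padding to the input (usually zeros).

TensorFlow and PyTorch handle padding slightly differently, though. TensorFlow's padding argument offers SAME and VALID, while the PyTorch documentation shows no comparable option. So where exactly do the two frameworks differ?

This article walks through simple examples with code, then draws on some online discussions and write-ups to clarify how padding works in TF and PyTorch.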

Padding in Tensorflow2

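Padding does not seem to have changed much between TF1 and TF2; the examples below use TF2 (referred to simply as TF from here on).

First, both the Conv family (e.g. tf.nn.conv1d) and the Pooling family (e.g. tf.nn.max_pool1d) in TF take a padding argument with two options: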

SAME
VALID

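The documentation does not go into much detail on either option, so both are explained below.

To keep things simple we work with one-dimensional data and use the same input throughout. Suppose the input is: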

inputs: 1 2 3 4 5 6 7 8 9

Let's build the data with a few lines of code:

import tensorflow as tf
data = tf.constant([1, 2, 3, 4, 5, 6, 7, 8, 9]) # shape: (dimension, )
data = tf.reshape(data, (1, -1, 1)) # reshape to (batch, dimension, channel)
print('input: ', data)
# input:  tf.Tensor(
# [[[1]
#   [2]
#   [3]
#   [4]
#   [5]
#   [6]
#   [7]
#   [8]
#   [9]]], shape=(1, 9, 1), dtype=int32)

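Although the data is one-dimensional, we still reshape it into a three-dimensional tensor to match the layout TensorFlow expects. The shape (1, 9, 1) stands for (batch, dimension, channel), which is TF's default channels-last data format (NWC for 1D data, the counterpart of NHWC for 2D data).

For two-dimensional data the shape is (batch, height, width, channel).
For more on TF data formats, see Channels first vs Channels last - what do these mean?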

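padding='VALID'

padding='VALID' means no padding at all: whatever does not fill a complete window is simply dropped, so the output is naturally shorter than the input.

The name VALID is easy to misread; some people online have suggested that NO-PADDING would be clearer.
In this case the output shape is $\left\lceil\frac{W-K+1}{S}\right\rceil$, where $W$ is the input size, $K$ is the kernel size, and $S$ is the stride.

Now consider padding='VALID' with max_pool1d under the following setting: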

kernel=4, stride=4

inputs: 1 2 3 4 5 6 7 8 (9)
        |-----|             -> output 4 
                |-----|     -> output 8

The leftover element, 9, is simply discarded. The program output agrees:

print(tf.nn.max_pool1d(data, ksize=4, strides=4, padding='VALID'))
# tf.Tensor(
# [[[4]
#   [8]]], shape=(1, 2, 1), dtype=int32)

As a second example, what should the output be with the following setting?

kernel=4, stride=1

print(tf.nn.max_pool1d(data, ksize=4, strides=1, padding='VALID'))
# tf.Tensor(
# [[[4]
#   [5]
#   [6]
#   [7]
#   [8]
#   [9]]], shape=(1, 6, 1), dtype=int32)
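The VALID output-shape formula can be checked against both examples with a few lines of plain Python. This is only a sketch of the formula given above; the function name valid_output_length is a hypothetical helper, not part of TensorFlow.

```python
import math


def valid_output_length(in_width: int, kernel_size: int, stride: int) -> int:
    """Output length for padding='VALID': ceil((W - K + 1) / S)."""
    return math.ceil((in_width - kernel_size + 1) / stride)


# The two settings used above, with an input of length 9.
print(valid_output_length(9, kernel_size=4, stride=4))  # 2
print(valid_output_length(9, kernel_size=4, stride=1))  # 6
```

Both values match the shapes (1, 2, 1) and (1, 6, 1) printed by tf.nn.max_pool1d.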

padding='SAME'

padding='SAME' means the input is padded (zero padding by default) to adjust the shape.

The output shape is $\left\lceil\frac{W}{S}\right\rceil$.

Let's look at a few examples. First, what happens to the same case as above under padding='SAME'?

kernel size=4, stride=4

print(tf.nn.max_pool1d(data, ksize=4, strides=4, padding='SAME'))
# tf.Tensor(
# [[[3]
#   [7]
#   [9]]], shape=(1, 3, 1), dtype=int32)

...Huh?

Why this odd-looking result? When padding, TF actually turned our input into the following:

inputs: 0 1 2 3 4 5 6 7 8 9 0 0
        |-----|                 -> output 3 
                |-----|         -> output 7
                        |-----| -> output 9

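What kind of zero padding is this? Why does the right side get one more 0 than the left?

To understand it, we first need the full formula for the output shape:

$\mathrm{output\_shape}=\frac{W+2 \times \text{padding}-\text{dilation} \times(\text{kernel\_size}-1)-1}{\text{stride}}+1$

Dilation belongs to a different kind of convolution (dilated convolution) and defaults to 1, so we ignore it here. Simplifying the expression above gives

$\mathrm{output\_shape}=\frac{W+2 \times \text{padding}-\text{kernel\_size}}{\text{stride}}+1$

TF2 works backwards from this formula to get the total number of zeros to add ($2 \times \text{padding}$).
It then pads the input on the left and right with padding zeros each (top, bottom, left and right in 2D).
However, when $2 \times \text{padding}$ is odd, TF adds the one extra zero on the right (on the right and bottom in 2D), so that the output shape comes out as an integer.
- As you may have noticed, this padding is not symmetric, which is why it is also called asymmetric padding.
- TF's choice of putting the extra zeros on the bottom-right differs from Caffe (which pads the top-left), so converting models between frameworks can cause problems; see Tensorflow's asymmetric padding assumptions.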

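Now that we know what padding='SAME' does, the whole procedure can be written as the formulas below, which directly give the number of zeros to add on the left and right:

# pad_width: total number of zeros added along the width
# pad_left, pad_right: number of zeros added on the left and on the right
# pad_top and pad_bottom follow the same pattern in 2D
# out_width = ceil(in_width / strides_width)
pad_width = max((out_width-1) * strides_width + kernel_size - in_width, 0)
pad_left = pad_width // 2 # round down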
pad_right  = pad_width - pad_left
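To make the formulas concrete, here is a minimal runnable sketch in plain Python. It assumes out_width = ceil(in_width / stride) as given by the SAME output-shape formula; the function name same_padding_1d is a hypothetical helper, not a TensorFlow API.

```python
import math


def same_padding_1d(in_width: int, kernel_size: int, stride: int) -> tuple[int, int]:
    """Return (pad_left, pad_right) following the SAME padding formulas."""
    out_width = math.ceil(in_width / stride)
    pad_width = max((out_width - 1) * stride + kernel_size - in_width, 0)
    pad_left = pad_width // 2          # round down
    pad_right = pad_width - pad_left   # the right side gets the extra zero
    return pad_left, pad_right


print(same_padding_1d(9, kernel_size=4, stride=4))  # (1, 2)
print(same_padding_1d(9, kernel_size=4, stride=1))  # (1, 2)
```

For the running example (length 9, kernel 4, stride 4) the total is 3 zeros: 1 on the left and 2 on the right.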

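You can find the corresponding code in the TF source as well:

tf.nn.atrous_conv2d_transpose(), implemented in Python
- Only the part that computes pad_left and pad_right matters here. The computation before it implements another kind of padding (FULL padding), which is often used in DeConv to enlarge the input shape.
- The padding='FULL' argument only exists in the original Keras (not the Keras API inside TF2); in TF the same effect is achieved by setting pad_width=kernel_size-1 directly. The details are not covered here.
tf.nn.conv2d(), implemented in C++

With all that said, look again at the earlier question. It should now be clear why that example got 2 zeros on the right but only 1 on the left. Here is the padded input once more as a refresher: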

inputs: 0 1 2 3 4 5 6 7 8 9 0 0
        |-----|                 -> output 3 
                |-----|         -> output 7
                        |-----| -> output 9

Now revisit the second VALID case. With padding='SAME', what should the output be for the following setting?

kernel=4, stride=1

print(tf.nn.max_pool1d(data, ksize=4, strides=1, padding='SAME'))
# tf.Tensor(
# [[[3]
#   [4]
#   [5]
#   [6]
#   [7]
#   [8]
#   [9]
#   [9]
#   [9]]], shape=(1, 9, 1), dtype=int32)

In this example the padding turns out to be the same as in the previous one: 1 zero on the left and 2 on the right.
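Both SAME results can be reproduced without TensorFlow by padding a Python list and sliding a window over it. This is a simulation under stated assumptions, not TF's implementation: it pads with literal zeros as in the diagrams above, which gives the same maxima here because every input value is positive. The name max_pool1d_same is hypothetical.

```python
import math


def max_pool1d_same(values, kernel_size, stride, pad_value=0):
    """Simulate 1D max pooling with TF-style SAME (asymmetric) padding."""
    in_width = len(values)
    out_width = math.ceil(in_width / stride)
    pad_width = max((out_width - 1) * stride + kernel_size - in_width, 0)
    pad_left = pad_width // 2
    pad_right = pad_width - pad_left
    padded = [pad_value] * pad_left + list(values) + [pad_value] * pad_right
    # Take the maximum of each window of length kernel_size.
    return [
        max(padded[i * stride : i * stride + kernel_size])
        for i in range(out_width)
    ]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(max_pool1d_same(data, kernel_size=4, stride=4))  # [3, 7, 9]
print(max_pool1d_same(data, kernel_size=4, stride=1))  # [3, 4, 5, 6, 7, 8, 9, 9, 9]
```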

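By now you should have a solid understanding of TF's two padding methods. To stress it once more: according to the formula, padding='SAME' only guarantees that the output shape equals the input shape when stride=1.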

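Padding in Pytorch

PyTorch's padding argument defaults to 0, meaning no padding by default (the equivalent of padding='VALID' in TF).

If you do want padding, you have to work out the correct output shape from the formula in the PyTorch documentation (anyone writing PyTorch has to stay alert to tensor dimensions, or one wrong shape blows up the model...). The output shape formula given for torch.nn.MaxPool1d is:

$\mathrm{L}_{\text{out}}=\left\lfloor\frac{\mathrm{L}_{\text{in}}+2 \times \text{padding}-\text{dilation} \times(\text{kernel\_size}-1)-1}{\text{stride}}+1\right\rfloor$

As an aside, working out 2D output shapes by hand used to drive me crazy, so I eventually turned the formula into a function; see [Python]Utility function of calculate convolution output shape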
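A one-dimensional version of the formula can be sketched as below. This is an illustration only, not the utility function from the linked post; the name pool_output_length is hypothetical.

```python
def pool_output_length(l_in, kernel_size, stride, padding=0, dilation=1):
    """L_out = floor((L_in + 2*padding - dilation*(kernel_size-1) - 1) / stride + 1)."""
    numerator = l_in + 2 * padding - dilation * (kernel_size - 1) - 1
    # Integer floor division implements the floor in the formula.
    return numerator // stride + 1


print(pool_output_length(9, kernel_size=4, stride=4))             # 2
print(pool_output_length(9, kernel_size=4, stride=1))             # 6
print(pool_output_length(9, kernel_size=4, stride=4, padding=1))  # 2
```

Note that padding=1 does not change the output length for kernel=4, stride=4: the padded length 11 still only fits two full windows.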

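PyTorch uses symmetric padding: in 1D the same amount is added on the left and right (in 2D, on the top, bottom, left and right), and whatever is left over in the subsequent operation is discarded.

Let's run some code to see this, starting with the same input as in the TF section: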

inputs: 1 2 3 4 5 6 7 8 9

Unlike TF, the data format has to be changed, because PyTorch uses channels first by default:

import torch 
data = torch.Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9]).view(1, 1, -1) # (batch, channel, dimension)
print('input:', data, 'shape:', data.shape)
# input: tensor([[[1., 2., 3., 4., 5., 6., 7., 8., 9.]]]) shape: torch.Size([1, 1, 9])

Then, again with max_pool1d, let's look at the two settings tested in TF:

kernel=4, stride=4
kernel=4, stride=1

print(torch.nn.functional.max_pool1d(data, kernel_size=4, stride=4))
# tensor([[[4., 8.]]])
print(torch.nn.functional.max_pool1d(data, kernel_size=4, stride=1))
# tensor([[[ 4., 5., 6., 7., 8., 9.]]])

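Since the padding argument defaults to 0, the behavior is identical to TF's VALID padding.

Next, let's test what happens when padding is specified. Consider the following setting:

kernel=4, stride=4, padding=1

print(torch.nn.functional.max_pool1d(data, kernel_size=4, stride=4, padding=1))
# tensor([[[3., 7.]]])

The padding in this case looks like this: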

inputs: 0 1 2 3 4 5 6 7 (8 9 0) 
        |-----|                 -> output 3 
                |-----|         -> output 7

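After padding symmetrically on both sides, the leftover part (8 9 0) is discarded. Put differently, PyTorch applies symmetric padding first and then treats the padded input as a new input for a VALID-style operation. (For max pooling, the PyTorch documentation describes the implicit padding as negative infinity rather than 0; with positive inputs like ours the result is the same.)

Finally, how can PyTorch reproduce TF's SAME padding (asymmetric padding)?
You can work out the padding needed on each side of the input yourself and apply it with torch.nn.ZeroPad2d(). At the time of writing there seemed to be no plan to build this into torch.nn.Conv2d() and the other Conv functions; it had been discussed on the forum for a long time and was left out for reasons such as performance. Newer PyTorch releases do accept padding='same' in the Conv layers, but only for stride 1. Those interested in the background can look up the related issue.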
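The padding amounts for that approach can be computed as in the sketch below. It applies the SAME formulas from the TF section to each spatial axis and returns them in the (left, right, top, bottom) order that torch.nn.ZeroPad2d takes. The function names are hypothetical, and the sketch assumes a square kernel and equal strides for brevity.

```python
import math


def same_pad_amounts(in_size: int, kernel_size: int, stride: int) -> tuple[int, int]:
    """Return (before, after) padding along one axis, TF SAME style."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    before = total // 2
    return before, total - before  # the extra zero goes after (right/bottom)


def zero_pad2d_args(in_height, in_width, kernel_size, stride):
    """Return (left, right, top, bottom), the order used by torch.nn.ZeroPad2d."""
    top, bottom = same_pad_amounts(in_height, kernel_size, stride)
    left, right = same_pad_amounts(in_width, kernel_size, stride)
    return (left, right, top, bottom)


print(zero_pad2d_args(9, 9, kernel_size=4, stride=4))  # (1, 2, 1, 2)
print(zero_pad2d_args(6, 6, kernel_size=3, stride=2))  # (0, 1, 0, 1)
```

The resulting tuple could then be passed as torch.nn.ZeroPad2d(zero_pad2d_args(...)) in front of a convolution that itself uses padding=0.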

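Summary

After all that writing, the short version is only a few lines:

TF has two padding algorithms: SAME padding and VALID padding
- VALID means no padding at all; whatever is left over during the operation is simply discarded
  - output shape: $\left\lceil\frac{W-K+1}{S}\right\rceil$
- SAME means the input is padded around its edges, and when stride=1 the output keeps the same size as the input
  - output shape: $\left\lceil\frac{W}{S}\right\rceil$
In PyTorch the padding argument defaults to 0, which corresponds to TF's VALID padding
- If padding is specified, the input is padded symmetrically on all sides, and whatever is left over during the operation is discarded (symmetric padding followed by a VALID-style operation)
  - output shape: $\mathrm{L}_{\text{out}}=\left\lfloor\frac{\mathrm{L}_{\text{in}}+2 \times \text{padding}-\text{dilation} \times(\text{kernel\_size}-1)-1}{\text{stride}}+1\right\rfloor$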
