Introduction
Convolution is a mathematical operation on two functions ($f$ and $g$) that produces a third function ($f * g$) expressing how the shape of one is modified by the other.
Definition
The convolution of $f$ and $g$ is written $f * g$. It is defined as the integral of the product of the two functions after one is reversed and shifted. As such, it is a particular kind of integral transform:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
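For discrete signals the integral becomes a sum: $(f * g)[n] = \sum_{m} f[m]\, g[n-m]$. As a quick illustration (our own example, not from the original text), NumPy's convolve computes exactly this sum:

import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

# (f * g)[n] = sum over m of f[m] * g[n - m]
print(np.convolve(f, g))  # [0.  1.  2.5 4.  1.5]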
Visual Explanation
In the usual visual explanation (an animation on the Wikipedia page cited in the references), the red-colored pulse $g$ is reflected and slid across the blue function $f$; at each shift $t$, the value of the convolution is the area under the product of the two curves.
Algebraic Properties
The convolution defines a product on the linear space of integrable functions.
- Commutativity: $f * g = g * f$
- Associativity: $f * (g * h) = (f * g) * h$
- Distributivity: $f * (g + h) = (f * g) + (f * h)$
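These identities are easy to check numerically. A small sanity check of our own, using NumPy's convolve on discrete signals:

import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
h = np.array([2.0, -1.0, 0.0])

# Commutativity: f * g == g * f
assert np.allclose(np.convolve(f, g), np.convolve(g, f))
# Associativity: f * (g * h) == (f * g) * h
assert np.allclose(np.convolve(f, np.convolve(g, h)),
                   np.convolve(np.convolve(f, g), h))
# Distributivity: f * (g + h) == f * g + f * h
assert np.allclose(np.convolve(f, g + h),
                   np.convolve(f, g) + np.convolve(f, h))
print("all three properties hold")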
Computation with the Image and Kernel
Nikita Detkov (2020) wrote convolution from scratch in pure Python and NumPy; the code can be found on his GitHub (linked in the references). A more detailed tutorial on convolving a 2D spatial input signal can be found in Song's article.
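In the same spirit, here is a minimal sketch of our own (not Detkov's actual code) of a naive 2D convolution of a single-channel image with a kernel, in pure NumPy:

import numpy as np

def conv2d(image, kernel):
    # Flip the kernel so this is a true convolution, not a cross-correlation.
    k = np.flipud(np.fliplr(kernel))
    kh, kw = k.shape
    oh = image.shape[0] - kh + 1  # "valid" mode: no padding
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # One output entry: sum of the element-wise products over the patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
    return out

image = np.random.rand(5, 5)
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])  # Laplacian kernel
print(conv2d(image, kernel).shape)  # (3, 3)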
Effect on Images
Computer vision is common in daily life whether we are aware of it or not. For starters, we see filtered photographs in our social media feeds, news stories, publications, and novels. Image processing can be divided into two categories: image filtering and image warping. Image filtering alters the range (or pixel values) of an image, causing the image’s colours to change without affecting the pixel locations, whereas image warping alters the domain (or pixel positions), causing points to be mapped to other points without changing the colours.
We’ll take a closer look at image filtering. Filters are used to change or enhance visual features and to extract valuable information from an image, such as edges, corners, and blobs. Applying filters to photographs can markedly improve their visual appeal. (Figure: four versions of the same image, with different filters applied, compared side by side.)
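For instance (an illustration of our own, with the kernels taken from the Wikipedia page on image-processing kernels cited in the references), a single 3×3 kernel is enough to sharpen an image or to detect its edges:

import numpy as np
from scipy.ndimage import convolve

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(64, 64)        # stand-in for a grayscale photograph
sharpened = convolve(image, sharpen)  # accentuates local contrast
edges = convolve(image, edge_detect)  # responds strongly at intensity changes
print(sharpened.shape, edges.shape)   # (64, 64) (64, 64)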
Convolutional Neural Networks
In a neural network, the convolution layer uses filters that perform convolution operations as they scan the input with respect to its dimensions; the resulting output is called a feature map (or activation map).
Generally, when a deep learning API runs on tensors, it is usually executing a whole batch of data at once. Suppose there are 100 samples of data and each sample is a 32×32 RGB image: the batch is then a single tensor of shape [100, 3, 32, 32] (batch, channels, height, width).
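To make the convention concrete (our own example), here is such a batch flowing through a PyTorch convolution layer:

import torch
import torch.nn as nn

batch = torch.rand(100, 3, 32, 32)  # [batch, channels, height, width]
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv(batch).shape)  # torch.Size([100, 8, 32, 32])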
Without further ado, let's look at a diagram. The following diagram (omitted here) shows the general convolution calculation: the kernel slides over the input image, and at each position the element-wise products between the kernel and the image patch it covers are summed to produce one entry of the feature map.
You may be wondering why, in the diagram above, the feature map does not get smaller after the convolution operation is applied. If you have played with a deep learning API, you know that, in addition to the basic input and filter (kernel map), there are usually two parameters, stride and padding, that can be tuned. In most cases, we would like the size of the output map not to be affected by the size of the kernel map. The formula for calculating the width and height of a new feature map after a convolution operation is as follows.
$$O = \frac{W - K + 2P}{S} + 1$$

where we assume that the width and height of the input are equal: $W$ is the input size, $K$ the kernel size, $P$ the padding, $S$ the stride, and the division is rounded down.
If you are not sure about the output size after a convolution operation, you can calculate it on Edward's Convolution Visualizer. This interactive visualisation shows how different convolution settings affect the shapes and data dependencies of the input, weight, and output matrices; alternatively, you can calculate it with the formula above. If the input has unequal width and height, simply apply the formula to the width and the height separately.
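The formula is easy to wrap in a small helper of our own and cross-check against PyTorch:

import torch
import torch.nn as nn

def conv_output_size(size, kernel, stride=1, padding=0):
    # One spatial dimension: O = (W - K + 2P) / S + 1, rounded down.
    return (size - kernel + 2 * padding) // stride + 1

print(conv_output_size(32, kernel=5, stride=2, padding=2))  # 16
conv = nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2)
print(conv(torch.rand(1, 3, 32, 32)).shape)  # torch.Size([1, 8, 16, 16])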
PyTorch does not support same padding the way Keras does, but we can still manage it easily by explicitly padding the tensor before passing it to the convolution layer. Here is a very simple Conv2d layer with same padding for reference. It only supports square kernels with stride=1, dilation=1, and groups=1.
import torch
import torch.nn as nn

class Conv2dSame(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 bias=True,
                 padding_layer=torch.nn.ReflectionPad2d):
        super().__init__()
        # Split the padding across both sides of each dimension; for an even
        # kernel size the right/bottom side gets one pixel less.
        ka = kernel_size // 2
        kb = ka - 1 if kernel_size % 2 == 0 else ka
        self.net = torch.nn.Sequential(
            # Padding order is (left, right, top, bottom).
            padding_layer((ka, kb, ka, kb)),
            torch.nn.Conv2d(in_channels, out_channels, kernel_size, bias=bias)
        )

    def forward(self, x):
        return self.net(x)

if __name__ == "__main__":
    input_ = torch.rand((16, 1, 32, 100)).float()
    output_ = Conv2dSame(in_channels=1, out_channels=3, kernel_size=4)(input_)
    print(input_.shape)   # torch.Size([16, 1, 32, 100])
    print(output_.shape)  # torch.Size([16, 3, 32, 100])
CNN Case Studies
LeNet-5
LeNet-5 is Yann LeCun’s convolutional network, designed in 1998 for handwritten and machine-printed character recognition. Introduced in the paper “Gradient-Based Learning Applied to Document Recognition”, LeNet-5 is popular due to its simple and straightforward architecture: a multi-layer convolutional neural network.
import torch
import torch.nn as nn
from collections import OrderedDict

class LeNet5(nn.Module):
    """
    Gradient-Based Learning Applied to Document Recognition
    """
    def __init__(self):
        super(LeNet5, self).__init__()
        # Feature extractor: [batch, 1, 32, 32] -> [batch, 16, 5, 5]
        self.conv = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(1, 6, kernel_size=5)),
            ('activate1', nn.Tanh()),
            ('pool1', nn.MaxPool2d(kernel_size=2)),
            ('conv2', nn.Conv2d(6, 16, kernel_size=5)),
            ('activate2', nn.Tanh()),
            ('pool2', nn.MaxPool2d(kernel_size=2))
        ]))
        # Classifier: 16*5*5 = 400 flattened features -> 10 classes
        self.fc = nn.Sequential(OrderedDict([
            ('fc1', nn.Linear(16*5*5, 120)),
            ('activate1', nn.Tanh()),
            ('fc2', nn.Linear(120, 84)),
            ('activate2', nn.Tanh()),
            ('fc3', nn.Linear(84, 10))
        ]))

    def forward(self, x):
        """
        x: torch.Tensor
            A torch tensor of shape [batch, 1, 32, 32]
        """
        x = self.conv(x)
        x = Flatten()(x)
        x = self.fc(x)
        return x

class Flatten(nn.Module):
    """Flattens [batch, C, H, W] into [batch, C*H*W]."""
    def forward(self, x):
        return x.view(x.size(0), -1)

if __name__ == "__main__":
    input_ = torch.rand((16, 1, 32, 32)).float()
    output_ = LeNet5()(input_)
    print(input_.shape)   # torch.Size([16, 1, 32, 32])
    print(output_.shape)  # torch.Size([16, 10])
AlexNet
AlexNet, proposed by Alex Krizhevsky in the paper “ImageNet Classification with Deep Convolutional Neural Networks”, is one of the most popular convolutional neural network architectures, and the model performs very well when used as a base for transfer learning.
import torch
import torch.nn as nn
from collections import OrderedDict

class AlexNet(nn.Module):
    """
    ImageNet Classification with Deep Convolutional Neural Networks
    """
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)),
            ('relu1', nn.ReLU(inplace=True)),
            ('pool1', nn.MaxPool2d(kernel_size=3, stride=2)),
            ('conv2', nn.Conv2d(64, 192, kernel_size=5, padding=2)),
            ('relu2', nn.ReLU(inplace=True)),
            ('pool2', nn.MaxPool2d(kernel_size=3, stride=2)),
            ('conv3', nn.Conv2d(192, 384, kernel_size=3, padding=1)),
            ('relu3', nn.ReLU(inplace=True)),
            ('conv4', nn.Conv2d(384, 256, kernel_size=3, padding=1)),
            ('relu4', nn.ReLU(inplace=True)),
            ('conv5', nn.Conv2d(256, 256, kernel_size=3, padding=1)),
            ('relu5', nn.ReLU(inplace=True)),
            ('pool5', nn.MaxPool2d(kernel_size=3, stride=2))
        ]))
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(OrderedDict([
            ('drop1', nn.Dropout()),
            ('fc1', nn.Linear(256*6*6, 4096)),
            ('relu1', nn.ReLU(inplace=True)),
            ('drop2', nn.Dropout()),
            ('fc2', nn.Linear(4096, 4096)),
            ('relu2', nn.ReLU(inplace=True)),
            ('fc3', nn.Linear(4096, num_classes)),
        ]))

    def forward(self, x):
        """
        x: torch.Tensor
            A torch tensor of shape [batch, 3, 224, 224]
        """
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

if __name__ == "__main__":
    input_ = torch.rand((16, 3, 224, 224)).float()
    output_ = AlexNet()(input_)
    print(input_.shape)
    print(output_.shape)
VGG
There are six model configurations in the VGG paper. VGG mainly has three parts: convolution, pooling, and fully connected layers. VGG brought a significant increase in accuracy, mostly due to the model’s increased depth and the availability of pretrained models. However, this model suffers from the vanishing gradient problem, which is a significant drawback; the ResNet architecture was later introduced to tackle this problem.
import math
import torch
import torch.nn as nn
from collections import OrderedDict

class VGG(nn.Module):
    # Numbers are conv output channels; 'M' marks a 2x2 max-pool.
    cfg = {
        'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
        'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
        'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
        'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
    }

    def __init__(self, structure, dropout=0.5, batch_norm=True, num_classes=1000):
        """
        Examples
        --------
        >>> input_ = torch.rand((16, 3, 224, 224)).float()
        >>> VGG11 = VGG('A')
        >>> VGG13 = VGG('B')
        >>> VGG16 = VGG('D')
        >>> VGG19 = VGG('E')
        """
        super(VGG, self).__init__()
        self.features = self.make_layers(self.cfg[structure], batch_norm=batch_norm)
        self.classifier = nn.Sequential(OrderedDict([
            ('fc1', nn.Linear(512*7*7, 4096)),
            ('relu1', nn.ReLU(inplace=True)),
            ('drop1', nn.Dropout(p=dropout)),
            ('fc2', nn.Linear(4096, 4096)),
            ('relu2', nn.ReLU(inplace=True)),
            ('drop2', nn.Dropout(p=dropout)),
            ('fc3', nn.Linear(4096, num_classes)),
        ]))
        self._initialize_weights()

    def forward(self, x):
        """
        x: torch.Tensor
            A torch tensor of shape [batch, 3, 224, 224]
        """
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()

    def make_layers(self, cfg, batch_norm=False):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
                if batch_norm:
                    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
                else:
                    layers += [conv2d, nn.ReLU(inplace=True)]
                in_channels = v
        return nn.Sequential(*layers)

if __name__ == "__main__":
    input_ = torch.rand((16, 3, 224, 224)).float()
    print(input_.shape)
    output_ = VGG('A')(input_)  # VGG11
    print(output_.shape)
    output_ = VGG('B')(input_)  # VGG13
    print(output_.shape)
    output_ = VGG('D')(input_)  # VGG16
    print(output_.shape)
    output_ = VGG('E')(input_)  # VGG19
    print(output_.shape)
ResNet
By integrating skip connections, the ResNet proposed in “Deep Residual Learning for Image Recognition” overcame the degradation problem. According to the authors, optimising the network with skip connections is easier than optimising the original, unreferenced mapping.
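In symbols, instead of forcing a stack of layers to fit a desired mapping $\mathcal{H}(x)$ directly, a residual block learns the residual $\mathcal{F}(x) = \mathcal{H}(x) - x$ and outputs $y = \mathcal{F}(x) + x$; the skip connection simply carries $x$ around the block unchanged.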
Representation of residual networks with 18, 34, 50, 101, and 152 layers.
import torch
import torch.nn as nn
from collections import OrderedDict

class Block(nn.Module):
    def __init__(self, in_channels, out_channels, num_layers, downsample=None, stride=1):
        assert num_layers in [18, 34, 50, 101, 152], "Not a valid architecture."
        super(Block, self).__init__()
        self.num_layers = num_layers
        # Bottleneck blocks (ResNet-50/101/152) expand the channel count by 4.
        if self.num_layers > 34:
            self.expansion = 4
        else:
            self.expansion = 1
        # 1x1 reduction conv, only used by the bottleneck variant.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(out_channels)
        if self.num_layers > 34:
            self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        else:
            self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1, padding=0)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU()
        self.downsample = downsample

    def forward(self, x):
        identity = x
        if self.num_layers > 34:
            x = self.conv1(x)
            x = self.bn1(x)
            x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv3(x)
        x = self.bn3(x)
        if self.downsample is not None:
            identity = self.downsample(identity)
        # Skip connection: add the (possibly downsampled) input back in.
        x += identity
        x = self.relu(x)
        return x

class ResNet(nn.Module):
    def __init__(self, in_channels=3, num_layers=18, block=Block, num_classes=1000):
        assert num_layers in [18, 34, 50, 101, 152], "Not a valid architecture."
        super(ResNet, self).__init__()
        if num_layers < 50:
            self.expansion = 1
        else:
            self.expansion = 4
        # Number of residual blocks in each of the four stages.
        if num_layers == 18:
            layers = [2, 2, 2, 2]
        elif num_layers == 34 or num_layers == 50:
            layers = [3, 4, 6, 3]
        elif num_layers == 101:
            layers = [3, 4, 23, 3]
        else:
            layers = [3, 8, 36, 3]
        self.in_channels = 64
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self.make_layers(num_layers, block, layers[0], intermediate_channels=64, stride=1)
        self.layer2 = self.make_layers(num_layers, block, layers[1], intermediate_channels=128, stride=2)
        self.layer3 = self.make_layers(num_layers, block, layers[2], intermediate_channels=256, stride=2)
        self.layer4 = self.make_layers(num_layers, block, layers[3], intermediate_channels=512, stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512*self.expansion, num_classes)

    def forward(self, x):
        """
        x: torch.Tensor
            A torch tensor of shape [batch, 3, 224, 224]
        """
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

    def make_layers(self, num_layers, block, num_residual_blocks, intermediate_channels, stride):
        layers = []
        # The first block of each stage may change the spatial size and channel
        # count, so its skip connection needs a projection (downsample).
        downsample = nn.Sequential(OrderedDict([
            ('conv', nn.Conv2d(self.in_channels, intermediate_channels*self.expansion, kernel_size=1, stride=stride)),
            ('bn', nn.BatchNorm2d(intermediate_channels*self.expansion))
        ]))
        layers.append(block(self.in_channels, intermediate_channels, num_layers, downsample, stride))
        self.in_channels = intermediate_channels * self.expansion
        for i in range(num_residual_blocks - 1):
            layers.append(block(self.in_channels, intermediate_channels, num_layers))
        return nn.Sequential(*layers)

if __name__ == "__main__":
    input_ = torch.rand((16, 3, 224, 224)).float()
    print(input_.shape)
    output_ = ResNet(num_layers=18)(input_)
    print(output_.shape)
    output_ = ResNet(num_layers=34)(input_)
    print(output_.shape)
    output_ = ResNet(num_layers=50)(input_)
    print(output_.shape)
    output_ = ResNet(num_layers=101)(input_)
    print(output_.shape)
    output_ = ResNet(num_layers=152)(input_)
    print(output_.shape)
Conclusions
LeNet-5, proposed by Yann LeCun and others in 1998, is one of the earliest pre-trained models and helped promote the development of deep learning. AlexNet and VGG share much the same concept, but VGG is deeper and has more parameters, as well as using only 3×3 convolution kernels. ResNet, in turn, goes deeper still, using skip connections to sidestep the degradation problem.
References
- https://en.wikipedia.org/wiki/Convolution
- https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
- https://medium.com/analytics-vidhya/2d-convolution-using-python-numpy-43442ff5f381
- https://github.com/detkov/Convolution-From-Scratch
- https://en.wikipedia.org/wiki/Kernel_(image_processing)
- https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html
- https://datascience.stackexchange.com/questions/73929/what-are-advantages-or-disadvantages-of-training-deep-learning-model-from-scratch
- https://github.com/vdumoulin/conv_arithmetic
- https://github.com/pytorch/pytorch/issues/3867#issuecomment-407663012
- https://chsasank.com/vision/_modules/torchvision/models/vgg.html
- Tommy Huang, Convolutional neural network (CNN): what 1×1 convolutional computing is doing [article]