Understand GoogLeNet (Inception v1) and Implement it easily from scratch using Tensorflow and Keras

Nitish Kumar Pilla
13 min read · Mar 22, 2021

The main goal of this blog is to help readers understand the architecture of GoogLeNet and implement it from scratch using TensorFlow and Keras.

Motivation:

To improve the performance of a neural network architecture, the network should be deeper (in terms of the number of layers), but building deeper networks comes with complications. The first problem with deeper neural networks is overfitting. The second problem with more layers is the increase in computational cost. In addition, earlier image classification models like VGG16 use only 3x3 filters throughout the network, which makes it harder to capture objects of different sizes in an image. GoogLeNet was developed in 2014 to solve these problems. It outperformed the VGG model in ILSVRC14 and was the best-performing image classification model of 2014.

Introduction:

In a traditional neural network or convolutional neural network, the output of one layer is the input of the next layer, and this pattern continues up to the prediction. The basic building block of the Inception network is the inception block. Instead of passing the previous layer's output through a single layer, it passes that output through four different operations in parallel and then concatenates the outputs of all these parallel branches. Below is a figure of the inception block. Don't worry about the details of the picture yet; we will get into them step by step.

fig (a) Inception block (made using Lucidchart)

Inception block:

Let's understand what an inception block is and how it works. GoogLeNet is made of 9 inception blocks. Before getting into inception blocks, I assume that you know backpropagation concepts like stochastic gradient descent and CNN-related concepts like max-pooling, convolution, stride, and padding; if not, check out those concepts first. Note that, in the above image, max-pooling is done using SAME padding. You may wonder what SAME padding is. There are two common padding schemes: VALID and SAME.

Visualization credits: vdumoulin@GitHub

In the above visualizations, the first one is VALID pooling, where you don't apply any padding to the image. The third one is SAME pooling. Here we apply padding to the input (if needed) so that the input gets fully covered by the filter and stride you specified. For stride 1, this ensures that the output is the same size as the input.
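To see the difference concretely, you can run the same pooling layer with both padding options and compare the output shapes (a minimal sketch using the tf.keras API):

import tensorflow as tf

# The same 3x3 max-pool with stride 1 under the two padding schemes
x = tf.random.normal((1, 28, 28, 192))
valid = tf.keras.layers.MaxPool2D((3, 3), strides=1, padding='valid')(x)
same = tf.keras.layers.MaxPool2D((3, 3), strides=1, padding='same')(x)
print(valid.shape)  # (1, 26, 26, 192): no padding, so the output shrinks
print(same.shape)   # (1, 28, 28, 192): padded input, size preserved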

Now let's look at fig (a) and understand the inception block. In a traditional CNN architecture, the output of one layer is connected as the input of the next layer, but in the inception block each branch is applied separately to the previous layer's output, and finally all the results are concatenated and sent as the input to the next layer. As we can see in fig (a), for a 28x28x192 (height, width, channels) input, we apply four parallel branches: a 1x1 convolution, a 3x3 convolution, a 5x5 convolution, and a 3x3 max-pooling. At this point you may have several doubts: why do we apply 1x1, 3x3, and 5x5 filters? And since a 1x1 convolution does not change the spatial size, why do we use a 1x1 filter at all?

Earlier architectures use a fixed convolution window; for example, in VGG16 the convolution filter is 3x3 everywhere. Here we use both 3x3 and 5x5 filters to address the different object sizes in the image. We apply a 1x1 filter to decrease the spectral dimension. The spectral dimension means the number of channels (bands); for example, if the input to an inception block is 28x28x192, then 192 is the band or spectral dimension. If we reduce that spectral dimension using a 1x1 filter, we can save a lot of computational power. That is why we also apply a 1x1 filter before the 3x3 and 5x5 convolutions. Let's see in detail how a 1x1 filter saves computational power.

See the above diagram: the 28x28x192 input is convolved with 32 filters of size 5x5 (each spanning all 192 input channels), giving an output of size 28x28x32. How did we get an output of 28x28x32? We can use two formulas for calculating the output size after convolving a filter over the input image:

result image (height) = ((original image height + 2 * padding value - filter height) / stride value) + 1

result image (width) = ((original image width + 2 * padding value - filter width) / stride value) + 1

Let's calculate the output size by considering the input parameters from the above image (a 5x5 filter with SAME padding, which adds 2 pixels of padding on each side):

Original image height = 28

Original image width = 28

Padding value = 2

Filter size = 5

Stride value = 1

result image (height) = ((28 + 2*2 - 5) / 1) + 1 = 28

result image (width) = ((28 + 2*2 - 5) / 1) + 1 = 28

As we used 32 filters of 5x5, the output has 32 channels.

So finally, we get an output of size 28x28x32.
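A tiny helper makes it easy to check such sizes for any filter (a minimal sketch; conv_output_size is just an illustrative name):

def conv_output_size(size, filter_size, padding, stride):
    # ((W + 2P - F) / S) + 1, applied per spatial dimension
    return (size + 2 * padding - filter_size) // stride + 1

# The 5x5 filter with SAME padding (P = 2) keeps 28x28 at stride 1
print(conv_output_size(28, filter_size=5, padding=2, stride=1))  # 28
# A 1x1 filter preserves the size even without padding
print(conv_output_size(28, filter_size=1, padding=0, stride=1))  # 28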

When we apply 32 filters of 5x5 directly on the 28x28x192 input, the number of multiplications to perform is (28x28x32) x (5x5x192) ≈ 120 million operations.

Now see the above image, where we first apply 16 filters of 1x1 and then 32 filters of 5x5. Here we only need about 12.4 million operations, (28x28x16 x 1x1x192) + (28x28x32 x 5x5x16), to complete the same task. That is roughly a tenfold saving, which is why the 1x1 filter is so advantageous; the short sketch below reproduces both counts.
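As a quick check of the arithmetic (a rough sketch that counts only multiply-accumulates and ignores biases and activations):

# Direct 5x5 convolution on 192 channels
direct = 28 * 28 * 32 * (5 * 5 * 192)
# 1x1 bottleneck down to 16 channels, then the 5x5 convolution
bottleneck = 28 * 28 * 16 * (1 * 1 * 192) + 28 * 28 * 32 * (5 * 5 * 16)
print(f"{direct / 1e6:.1f}M vs {bottleneck / 1e6:.1f}M")  # ~120.4M vs ~12.4M

The table below, taken from the research paper, contains the in-depth details of the GoogLeNet architecture layers.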

(Table 1) In-depth Architecture details are taken from the “Going deeper with convolutions” paper

From the above table, we can see that there are 9 inception blocks. Notice the two columns named #3x3 reduce and #5x5 reduce: they give the number of 1x1 filters applied before the 3x3 and 5x5 convolutions on the input. If the value of #3x3 reduce is 64, it means that 64 filters of 1x1 are applied before the 3x3 convolution; if the value of #5x5 reduce is 16, it means that 16 filters of 1x1 are applied before the 5x5 convolution. Note that applying a filter means convolving the input with that filter.

Below is the full architecture of GoogLeNet.

GoogLeNet architecture taken from “Going deeper with convolutions” paper

As you can see, it is a stack of different inception blocks. You may wonder what the loss function is for this architecture for the purpose of backpropagation. The total loss is Loss = Lre + 0.3 * L1 + 0.3 * L2, where Lre is the loss at the final output, and L1 and L2 are the losses of two auxiliary classifiers attached to the middle of the network. The auxiliary classifiers help gradients propagate back through such a deep network during training and act as extra regularization; they are discarded at inference time.
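Conceptually, the combined objective looks like the sketch below (googlenet_loss is an illustrative name); in Step 6 we will let Keras apply the same weighting for us by passing loss_weights=[1, 0.3, 0.3] to model.compile:

import tensorflow as tf

cce = tf.keras.losses.CategoricalCrossentropy()

def googlenet_loss(y_true, y_main, y_aux1, y_aux2):
    # Weighted sum of the final loss and the two auxiliary losses
    return (cce(y_true, y_main)
            + 0.3 * cce(y_true, y_aux1)
            + 0.3 * cce(y_true, y_aux2))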

Training :

GoogLeNet was trained using the DistBelief distributed machine learning system, with a modest amount of model and data parallelism. Data parallelism trains multiple instances of the same model on different subsets of the training dataset. To understand data parallelism better, I recommend this blog https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/ written by Lei Mao. The model was trained with asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs).

Drawbacks of inception v1 architecture:

  • The auxiliary classifiers in Inception v1 were expected to reduce the effect of the vanishing gradient problem, but while training the model the authors of the paper found that they did not improve convergence much during the early stages of training; until they start to help, they mostly add computational cost.
  • The 5x5 filters in Inception v1 cause a decrease in accuracy because they sharply reduce the input dimensions, which makes the network susceptible to information loss by a large margin. This problem was addressed in Inception v2 (by factorizing the 5x5 convolution into two 3x3 convolutions).

Implementation of GoogLeNet using Keras and TensorFlow:

We are going to use the CIFAR-10 dataset and develop a model for classifying its images. The CIFAR-10 training set contains 50,000 images; since my laptop's resources can't handle training on all 50,000 images, I am considering only the first 3,000 images for developing the model.

Step 1: import all the required libraries for building the GoogLeNet model

import cv2
import numpy as np
import math
import keras
import tensorflow as tf
from keras.datasets import cifar10  # the dataset
from keras import backend as K
from keras.utils import np_utils
from keras.layers import Layer
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler
from keras.models import Model
from keras.layers import Conv2D, MaxPool2D, Dropout, Dense, Input, concatenate, GlobalAveragePooling2D, AveragePooling2D, Flatten

Step 2: Creating a function to perform a train/test split on the CIFAR-10 data and resize the images to 224x224. We also divide the pixel values by 255 to keep all the values in the range 0 to 1.

num_classes = 10

def cifar10_data(img_rows, img_cols):
    # Load training and validation sets
    (X_train, Y_train), (X_valid, Y_valid) = cifar10.load_data()

    # Resize the first 3000 images to 224x224
    X_train = np.array([cv2.resize(img, (img_rows, img_cols)) for img in X_train[:3000]])
    X_valid = np.array([cv2.resize(img, (img_rows, img_cols)) for img in X_valid[:3000]])
    Y_train = Y_train[:3000]
    Y_valid = Y_valid[:3000]

    # Transform targets to Keras-compatible one-hot format
    Y_train = np_utils.to_categorical(Y_train, num_classes)
    Y_valid = np_utils.to_categorical(Y_valid, num_classes)

    X_train = X_train.astype('float32')
    X_valid = X_valid.astype('float32')

    # Making all the values range between 0 and 1
    X_train = X_train / 255.0
    X_valid = X_valid / 255.0

    return X_train, Y_train, X_valid, Y_valid

Step 3: Now we are calling the above-defined function

X_train, y_train, X_test, y_test = cifar10_data(224, 224)
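It's worth confirming the shapes before building the model; with the 3,000-image subset you should see the following:

print(X_train.shape, y_train.shape)  # (3000, 224, 224, 3) (3000, 10)
print(X_test.shape, y_test.shape)    # (3000, 224, 224, 3) (3000, 10)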

Step 4: Creating a function for the inception block

kernel_init = keras.initializers.glorot_uniform()
bias_init = keras.initializers.Constant(value=0.2)

def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3,
                     filters_5x5_reduce, filters_5x5, filters_pool_proj, name=None):
    # 1x1 branch
    conv_1x1 = Conv2D(filters_1x1, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)

    # 1x1 reduce followed by 3x3
    conv_3x3 = Conv2D(filters_3x3_reduce, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, (3, 3), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_3x3)

    # 1x1 reduce followed by 5x5
    conv_5x5 = Conv2D(filters_5x5_reduce, (1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, (5, 5), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_5x5)

    # 3x3 max-pool followed by a 1x1 projection
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu',
                       kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)

    # Concatenate the four branches along the channel axis
    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)
    return output
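A quick sanity check (a small sketch, not part of the final model): feeding a 28x28x192 tensor through the inception_3a configuration should give 256 output channels (64 + 128 + 32 + 32), matching Table 1.

test_input = Input(shape=(28, 28, 192))
test_output = inception_module(test_input, filters_1x1=64, filters_3x3_reduce=96,
                               filters_3x3=128, filters_5x5_reduce=16,
                               filters_5x5=32, filters_pool_proj=32)
print(test_output.shape)  # (None, 28, 28, 256)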

Step 5: Consider Table 1 and create the model using the same layers and values given in the table.

input_layer = Input(shape=(224, 224, 3))

x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu',
           name='conv_1_7x7/2', kernel_initializer=kernel_init,
           bias_initializer=bias_init)(input_layer)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu', name='conv_2a_3x3/1')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu', name='conv_2b_3x3/1')(x)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_2_3x3/2')(x)

x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128,
                     filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32,
                     name='inception_3a')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192,
                     filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64,
                     name='inception_3b')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_3_3x3/2')(x)

x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208,
                     filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64,
                     name='inception_4a')

# First auxiliary classifier, branching off inception_4a
x1 = AveragePooling2D((5, 5), strides=3)(x)
x1 = Conv2D(128, (1, 1), padding='same', activation='relu')(x1)
x1 = Flatten()(x1)
x1 = Dense(1024, activation='relu')(x1)
x1 = Dropout(0.7)(x1)
x1 = Dense(10, activation='softmax', name='auxilliary_output_1')(x1)

x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224,
                     filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4b')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256,
                     filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4c')
x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288,
                     filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4d')

# Second auxiliary classifier, branching off inception_4d
x2 = AveragePooling2D((5, 5), strides=3)(x)
x2 = Conv2D(128, (1, 1), padding='same', activation='relu')(x2)
x2 = Flatten()(x2)
x2 = Dense(1024, activation='relu')(x2)
x2 = Dropout(0.7)(x2)
x2 = Dense(10, activation='softmax', name='auxilliary_output_2')(x2)

x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320,
                     filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128,
                     name='inception_4e')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)

x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320,
                     filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128,
                     name='inception_5a')
x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384,
                     filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128,
                     name='inception_5b')

x = GlobalAveragePooling2D(name='avg_pool_5_3x3/1')(x)
x = Dropout(0.4)(x)
x = Dense(10, activation='softmax', name='output')(x)

model = Model(input_layer, [x, x1, x2], name='inception_v1')

To see the whole network, call model.summary():

Model: "inception_v1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_3 (InputLayer) [(None, 224, 224, 3) 0
__________________________________________________________________________________________________
conv_1_7x7/2 (Conv2D) (None, 112, 112, 64) 9472 input_3[0][0]
__________________________________________________________________________________________________
max_pool_1_3x3/2 (MaxPooling2D) (None, 56, 56, 64) 0 conv_1_7x7/2[0][0]
__________________________________________________________________________________________________
conv_2a_3x3/1 (Conv2D) (None, 56, 56, 64) 4160 max_pool_1_3x3/2[0][0]
__________________________________________________________________________________________________
conv_2b_3x3/1 (Conv2D) (None, 56, 56, 192) 110784 conv_2a_3x3/1[0][0]
__________________________________________________________________________________________________
max_pool_2_3x3/2 (MaxPooling2D) (None, 28, 28, 192) 0 conv_2b_3x3/1[0][0]
__________________________________________________________________________________________________
conv2d_113 (Conv2D) (None, 28, 28, 96) 18528 max_pool_2_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_115 (Conv2D) (None, 28, 28, 16) 3088 max_pool_2_3x3/2[0][0]
__________________________________________________________________________________________________
max_pooling2d_18 (MaxPooling2D) (None, 28, 28, 192) 0 max_pool_2_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_112 (Conv2D) (None, 28, 28, 64) 12352 max_pool_2_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_114 (Conv2D) (None, 28, 28, 128) 110720 conv2d_113[0][0]
__________________________________________________________________________________________________
conv2d_116 (Conv2D) (None, 28, 28, 32) 12832 conv2d_115[0][0]
__________________________________________________________________________________________________
conv2d_117 (Conv2D) (None, 28, 28, 32) 6176 max_pooling2d_18[0][0]
__________________________________________________________________________________________________
inception_3a (Concatenate) (None, 28, 28, 256) 0 conv2d_112[0][0]
conv2d_114[0][0]
conv2d_116[0][0]
conv2d_117[0][0]
__________________________________________________________________________________________________
conv2d_119 (Conv2D) (None, 28, 28, 128) 32896 inception_3a[0][0]
__________________________________________________________________________________________________
conv2d_121 (Conv2D) (None, 28, 28, 32) 8224 inception_3a[0][0]
__________________________________________________________________________________________________
max_pooling2d_19 (MaxPooling2D) (None, 28, 28, 256) 0 inception_3a[0][0]
__________________________________________________________________________________________________
conv2d_118 (Conv2D) (None, 28, 28, 128) 32896 inception_3a[0][0]
__________________________________________________________________________________________________
conv2d_120 (Conv2D) (None, 28, 28, 192) 221376 conv2d_119[0][0]
__________________________________________________________________________________________________
conv2d_122 (Conv2D) (None, 28, 28, 96) 76896 conv2d_121[0][0]
__________________________________________________________________________________________________
conv2d_123 (Conv2D) (None, 28, 28, 64) 16448 max_pooling2d_19[0][0]
__________________________________________________________________________________________________
inception_3b (Concatenate) (None, 28, 28, 480) 0 conv2d_118[0][0]
conv2d_120[0][0]
conv2d_122[0][0]
conv2d_123[0][0]
__________________________________________________________________________________________________
max_pool_3_3x3/2 (MaxPooling2D) (None, 14, 14, 480) 0 inception_3b[0][0]
__________________________________________________________________________________________________
conv2d_125 (Conv2D) (None, 14, 14, 96) 46176 max_pool_3_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_127 (Conv2D) (None, 14, 14, 16) 7696 max_pool_3_3x3/2[0][0]
__________________________________________________________________________________________________
max_pooling2d_20 (MaxPooling2D) (None, 14, 14, 480) 0 max_pool_3_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_124 (Conv2D) (None, 14, 14, 192) 92352 max_pool_3_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_126 (Conv2D) (None, 14, 14, 208) 179920 conv2d_125[0][0]
__________________________________________________________________________________________________
conv2d_128 (Conv2D) (None, 14, 14, 48) 19248 conv2d_127[0][0]
__________________________________________________________________________________________________
conv2d_129 (Conv2D) (None, 14, 14, 64) 30784 max_pooling2d_20[0][0]
__________________________________________________________________________________________________
inception_4a (Concatenate) (None, 14, 14, 512) 0 conv2d_124[0][0]
conv2d_126[0][0]
conv2d_128[0][0]
conv2d_129[0][0]
__________________________________________________________________________________________________
conv2d_132 (Conv2D) (None, 14, 14, 112) 57456 inception_4a[0][0]
__________________________________________________________________________________________________
conv2d_134 (Conv2D) (None, 14, 14, 24) 12312 inception_4a[0][0]
__________________________________________________________________________________________________
max_pooling2d_21 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4a[0][0]
__________________________________________________________________________________________________
conv2d_131 (Conv2D) (None, 14, 14, 160) 82080 inception_4a[0][0]
__________________________________________________________________________________________________
conv2d_133 (Conv2D) (None, 14, 14, 224) 226016 conv2d_132[0][0]
__________________________________________________________________________________________________
conv2d_135 (Conv2D) (None, 14, 14, 64) 38464 conv2d_134[0][0]
__________________________________________________________________________________________________
conv2d_136 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_21[0][0]
__________________________________________________________________________________________________
inception_4b (Concatenate) (None, 14, 14, 512) 0 conv2d_131[0][0]
conv2d_133[0][0]
conv2d_135[0][0]
conv2d_136[0][0]
__________________________________________________________________________________________________
conv2d_138 (Conv2D) (None, 14, 14, 128) 65664 inception_4b[0][0]
__________________________________________________________________________________________________
conv2d_140 (Conv2D) (None, 14, 14, 24) 12312 inception_4b[0][0]
__________________________________________________________________________________________________
max_pooling2d_22 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4b[0][0]
__________________________________________________________________________________________________
conv2d_137 (Conv2D) (None, 14, 14, 128) 65664 inception_4b[0][0]
__________________________________________________________________________________________________
conv2d_139 (Conv2D) (None, 14, 14, 256) 295168 conv2d_138[0][0]
__________________________________________________________________________________________________
conv2d_141 (Conv2D) (None, 14, 14, 64) 38464 conv2d_140[0][0]
__________________________________________________________________________________________________
conv2d_142 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_22[0][0]
__________________________________________________________________________________________________
inception_4c (Concatenate) (None, 14, 14, 512) 0 conv2d_137[0][0]
conv2d_139[0][0]
conv2d_141[0][0]
conv2d_142[0][0]
__________________________________________________________________________________________________
conv2d_144 (Conv2D) (None, 14, 14, 144) 73872 inception_4c[0][0]
__________________________________________________________________________________________________
conv2d_146 (Conv2D) (None, 14, 14, 32) 16416 inception_4c[0][0]
__________________________________________________________________________________________________
max_pooling2d_23 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4c[0][0]
__________________________________________________________________________________________________
conv2d_143 (Conv2D) (None, 14, 14, 112) 57456 inception_4c[0][0]
__________________________________________________________________________________________________
conv2d_145 (Conv2D) (None, 14, 14, 288) 373536 conv2d_144[0][0]
__________________________________________________________________________________________________
conv2d_147 (Conv2D) (None, 14, 14, 64) 51264 conv2d_146[0][0]
__________________________________________________________________________________________________
conv2d_148 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_23[0][0]
__________________________________________________________________________________________________
inception_4d (Concatenate) (None, 14, 14, 528) 0 conv2d_143[0][0]
conv2d_145[0][0]
conv2d_147[0][0]
conv2d_148[0][0]
__________________________________________________________________________________________________
conv2d_151 (Conv2D) (None, 14, 14, 160) 84640 inception_4d[0][0]
__________________________________________________________________________________________________
conv2d_153 (Conv2D) (None, 14, 14, 32) 16928 inception_4d[0][0]
__________________________________________________________________________________________________
max_pooling2d_24 (MaxPooling2D) (None, 14, 14, 528) 0 inception_4d[0][0]
__________________________________________________________________________________________________
conv2d_150 (Conv2D) (None, 14, 14, 256) 135424 inception_4d[0][0]
__________________________________________________________________________________________________
conv2d_152 (Conv2D) (None, 14, 14, 320) 461120 conv2d_151[0][0]
__________________________________________________________________________________________________
conv2d_154 (Conv2D) (None, 14, 14, 128) 102528 conv2d_153[0][0]
__________________________________________________________________________________________________
conv2d_155 (Conv2D) (None, 14, 14, 128) 67712 max_pooling2d_24[0][0]
__________________________________________________________________________________________________
inception_4e (Concatenate) (None, 14, 14, 832) 0 conv2d_150[0][0]
conv2d_152[0][0]
conv2d_154[0][0]
conv2d_155[0][0]
__________________________________________________________________________________________________
max_pool_4_3x3/2 (MaxPooling2D) (None, 7, 7, 832) 0 inception_4e[0][0]
__________________________________________________________________________________________________
conv2d_157 (Conv2D) (None, 7, 7, 160) 133280 max_pool_4_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_159 (Conv2D) (None, 7, 7, 32) 26656 max_pool_4_3x3/2[0][0]
__________________________________________________________________________________________________
max_pooling2d_25 (MaxPooling2D) (None, 7, 7, 832) 0 max_pool_4_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_156 (Conv2D) (None, 7, 7, 256) 213248 max_pool_4_3x3/2[0][0]
__________________________________________________________________________________________________
conv2d_158 (Conv2D) (None, 7, 7, 320) 461120 conv2d_157[0][0]
__________________________________________________________________________________________________
conv2d_160 (Conv2D) (None, 7, 7, 128) 102528 conv2d_159[0][0]
__________________________________________________________________________________________________
conv2d_161 (Conv2D) (None, 7, 7, 128) 106624 max_pooling2d_25[0][0]
__________________________________________________________________________________________________
inception_5a (Concatenate) (None, 7, 7, 832) 0 conv2d_156[0][0]
conv2d_158[0][0]
conv2d_160[0][0]
conv2d_161[0][0]
__________________________________________________________________________________________________
conv2d_163 (Conv2D) (None, 7, 7, 192) 159936 inception_5a[0][0]
__________________________________________________________________________________________________
conv2d_165 (Conv2D) (None, 7, 7, 48) 39984 inception_5a[0][0]
__________________________________________________________________________________________________
max_pooling2d_26 (MaxPooling2D) (None, 7, 7, 832) 0 inception_5a[0][0]
__________________________________________________________________________________________________
average_pooling2d_4 (AveragePoo (None, 4, 4, 512) 0 inception_4a[0][0]
__________________________________________________________________________________________________
average_pooling2d_5 (AveragePoo (None, 4, 4, 528) 0 inception_4d[0][0]
__________________________________________________________________________________________________
conv2d_162 (Conv2D) (None, 7, 7, 384) 319872 inception_5a[0][0]
__________________________________________________________________________________________________
conv2d_164 (Conv2D) (None, 7, 7, 384) 663936 conv2d_163[0][0]
__________________________________________________________________________________________________
conv2d_166 (Conv2D) (None, 7, 7, 128) 153728 conv2d_165[0][0]
__________________________________________________________________________________________________
conv2d_167 (Conv2D) (None, 7, 7, 128) 106624 max_pooling2d_26[0][0]
__________________________________________________________________________________________________
conv2d_130 (Conv2D) (None, 4, 4, 128) 65664 average_pooling2d_4[0][0]
__________________________________________________________________________________________________
conv2d_149 (Conv2D) (None, 4, 4, 128) 67712 average_pooling2d_5[0][0]
__________________________________________________________________________________________________
inception_5b (Concatenate) (None, 7, 7, 1024) 0 conv2d_162[0][0]
conv2d_164[0][0]
conv2d_166[0][0]
conv2d_167[0][0]
__________________________________________________________________________________________________
flatten_4 (Flatten) (None, 2048) 0 conv2d_130[0][0]
__________________________________________________________________________________________________
flatten_5 (Flatten) (None, 2048) 0 conv2d_149[0][0]
__________________________________________________________________________________________________
avg_pool_5_3x3/1 (GlobalAverage (None, 1024) 0 inception_5b[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 1024) 2098176 flatten_4[0][0]
__________________________________________________________________________________________________
dense_5 (Dense) (None, 1024) 2098176 flatten_5[0][0]
__________________________________________________________________________________________________
dropout_8 (Dropout) (None, 1024) 0 avg_pool_5_3x3/1[0][0]
__________________________________________________________________________________________________
dropout_6 (Dropout) (None, 1024) 0 dense_4[0][0]
__________________________________________________________________________________________________
dropout_7 (Dropout) (None, 1024) 0 dense_5[0][0]
__________________________________________________________________________________________________
output (Dense) (None, 10) 10250 dropout_8[0][0]
__________________________________________________________________________________________________
auxilliary_output_1 (Dense) (None, 10) 10250 dropout_6[0][0]
__________________________________________________________________________________________________
auxilliary_output_2 (Dense) (None, 10) 10250 dropout_7[0][0]
==================================================================================================
Total params: 10,334,030
Trainable params: 10,334,030
Non-trainable params: 0

As you can see, the layers and values we created match the GoogLeNet architecture.

Step 6: Creating the required variables for training the model. We use stochastic gradient descent with a momentum value of 0.9, as mentioned in the paper, and the LearningRateScheduler callback to reduce the learning rate during training.

epochs = 15
initial_lrate = 0.01

def decay(epoch, steps=100):
    # Drop the learning rate by 4% every 8 epochs, as in the paper
    initial_lrate = 0.01
    drop = 0.96
    epochs_drop = 8
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

sgd = SGD(lr=initial_lrate, momentum=0.9, nesterov=False)
lr_sc = LearningRateScheduler(decay, verbose=1)

model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[1, 0.3, 0.3], optimizer=sgd, metrics=['accuracy'])
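A quick check that the schedule behaves as intended (the rate steps down by 4% once every 8 epochs):

for epoch in [0, 7, 8, 15]:
    print(epoch, decay(epoch))
# 0 0.01
# 7 0.0096
# 8 0.0096
# 15 0.009216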

Now we fit the model to the data. Since the model has three outputs (the final output and the two auxiliary outputs), we pass the labels three times.

H = model.fit(X_train, [y_train, y_train, y_train],
              validation_data=(X_test, [y_test, y_test, y_test]),
              epochs=epochs, batch_size=256, callbacks=[lr_sc])

In this way, you can fit and train the model. After training the model on this subset, I got 78% accuracy on the validation dataset. Implement the code yourself and see what accuracy you get.
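If you want to inspect the training curves, something like the sketch below works. The exact history keys vary by Keras version, so check H.history.keys() first; 'output_accuracy' here is assumed to correspond to the final classifier head named 'output'.

import matplotlib.pyplot as plt

# Accuracy of the main output head over the training run
plt.plot(H.history['output_accuracy'], label='train')
plt.plot(H.history['val_output_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()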

Conclusion:

Inception v1 is the first Inception network. There are many later versions, such as Inception v2, Inception v3, Inception v4, and Inception-ResNet v2, which address the drawbacks of Inception v1. Do comment if you want a blog on any algorithm related to the Inception networks.

I hope you understood how the GoogLeNet algorithm works. If you have any doubts do comment and if you want to contact me do mail me at nitishkumar2902@gmail.com.
