Understand GoogLeNet (Inception v1) and Implement it easily from scratch using TensorFlow and Keras

Nitish Kumar Pilla
13 min read · Mar 22, 2021

The main goal of this blog is to help readers understand the architecture of GoogLeNet and implement it from scratch using TensorFlow and Keras.


A common way to improve the performance of a neural network is to make it deeper (add more layers), but deeper networks bring complications. The main problem is overfitting, since more layers mean more parameters. The second problem is computational cost, which grows with every added layer. Earlier image classification models such as VGG16 use only 3x3 filters, which makes it harder to capture objects of very different sizes in an image. GoogLeNet was developed in 2014 to address these problems. It outperformed the VGG model in ILSVRC14 and was the best image classification model of 2014.


In a traditional neural network or convolutional neural network, the output of one layer is the input to the next, and this pattern continues until the prediction. The basic idea of the Inception network is the inception block. Instead of passing the previous layer's output through a single layer, the block passes it through four different operations in parallel and then concatenates the outputs of all of them. The figure below shows the inception block. Don't worry if it looks complicated; we will get into the details.

Fig (a): Inception block (made using Lucidchart)

Inception block:

Let's understand what an inception block is and how it works. GoogLeNet is made of 9 inception blocks. I assume you already know backpropagation-related concepts such as stochastic gradient descent and CNN-related concepts such as max-pooling, convolution, stride, and padding; if not, review those first. Note that in the image above, max-pooling is done using SAME pooling. You may wonder what SAME pooling is. Pooling can be performed in different ways, such as VALID pooling and SAME pooling, which differ in how the input is padded.

Visualization credits: vdumoulin@GitHub

In the visualizations above, the first one is VALID pooling, where no padding is applied to the image. The third one is SAME pooling, where padding is applied to the input (if needed) so that the filter and stride you specified cover the whole input. With a stride of 1, this ensures that the output has the same size as the input.
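To make the difference concrete, here is a minimal sketch of how the output length along one spatial dimension can be computed for the two modes. It assumes the convention TensorFlow documents for `padding='valid'` and `padding='same'`; the helper name `pooled_size` is hypothetical and not part of any library.

```python
import math

def pooled_size(n, window, stride, mode):
    """Output length along one dimension (hypothetical helper, TensorFlow-style rules)."""
    if mode == "valid":
        # No padding: only windows that fit entirely inside the input are used.
        return math.ceil((n - window + 1) / stride)
    if mode == "same":
        # Padding is added as needed so that every input position is covered.
        return math.ceil(n / stride)
    raise ValueError("mode must be 'valid' or 'same'")

same_stride1 = pooled_size(28, 3, 1, "same")    # size is preserved
valid_stride1 = pooled_size(28, 3, 1, "valid")  # size shrinks by window - 1
same_stride2 = pooled_size(28, 3, 2, "same")    # size is halved
```

With a 28-pixel input and a 3x3 window, SAME pooling at stride 1 keeps 28 positions, while VALID pooling keeps 26.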

Now let's look at fig (a) and understand the inception block. In a traditional CNN architecture, the output of one layer is connected as the input of the next layer. In the inception block, each operation is applied separately to the previous layer's output, and all the results are then concatenated and sent as the input to the next layer. As we can see in fig (a), for a 28x28x192 (height, width, channels) input, we apply four different operations: 1x1, 3x3 and 5x5 convolutions and a 3x3 max-pooling. Right now you may have questions such as: why do we apply 1x1, 3x3, and 5x5 filters? And since a 1x1 convolution does not look at neighbouring pixels, why do we use a 1x1 filter at all?

Earlier architectures use a fixed convolution window size. In VGG16, for example, every convolution filter is 3x3. Here we use both 3x3 and 5x5 filters to handle objects of different sizes in the image. We apply a 1x1 filter to reduce the spectral dimension, that is, the number of channels (bands). For example, if the input to an inception block is 28x28x192, then 192 is the spectral (channel) dimension. Reducing that dimension with a 1x1 filter saves a lot of computation, which is why we also apply a 1x1 filter before the 3x3 and 5x5 filters. Let's see in detail how much computation the 1x1 filter saves.

In the diagram above, a 28x28x192 input is convolved with 32 filters of size 5x5 (each filter spans all 192 input channels), which gives an output of size 28x28x32. How do we get 28x28x32? The output size of a convolution can be calculated with two formulas:

result image (height) = ((original image height + 2 * padding value - filter size (height)) / stride value) + 1

result image (width) = ((original image width + 2 * padding value - filter size (width)) / stride value) + 1

Let's calculate the output size using the input parameters from the image above.

Original image height = 28

Original image width = 28

Padding value = 2 (the padding that keeps the size unchanged for a 5x5 filter with stride 1)

Filter size = 5

Stride value = 1

result image (height) = ((28 + 2*2 - 5) / 1) + 1 = 28

result image (width) = ((28 + 2*2 - 5) / 1) + 1 = 28

As we used 32 filters of 5x5, the output has 32 channels.

So finally, we get an output size of 28x28x32.
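The two formulas above can be wrapped in a small helper to check other filter sizes quickly. This is only a sketch: the function name `conv_output_size` is hypothetical, and it assumes symmetric padding and uses integer (floor) division.

```python
def conv_output_size(size, kernel, padding, stride):
    """Output length along one dimension: ((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

padded_5x5 = conv_output_size(28, kernel=5, padding=2, stride=1)    # padding keeps the size
unpadded_5x5 = conv_output_size(28, kernel=5, padding=0, stride=1)  # no padding shrinks it
plain_1x1 = conv_output_size(28, kernel=1, padding=0, stride=1)     # 1x1 never changes the size
```

A 1x1 filter needs no padding to preserve the spatial size, whereas a 5x5 filter needs a padding of 2 on each side.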

When we apply 32 filters of 5x5 directly to the 28x28x192 input, the number of operations to be performed is (28x28x32) x (5x5x192) = 120,422,400, or about 120 million operations.

In the image above, we first apply 16 filters of 1x1 and then 32 filters of 5x5. Here we need only about 12.4 million operations ((28x28x16x1x1x192) + (28x28x32x5x5x16)) to complete the same task, roughly a tenth of the cost. This is the main advantage of using a 1x1 filter. The table below is taken from the research paper and contains in-depth details of the GoogLeNet architecture layers.
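Before turning to the table, the two operation counts quoted above (about 120 million versus about 12.4 million) can be reproduced with a few lines of Python. This is a sketch that counts multiplications only; the helper name `conv_multiplications` is hypothetical.

```python
def conv_multiplications(height, width, in_channels, out_channels, kernel):
    """Multiplications for a convolution that produces a height x width output."""
    return height * width * out_channels * kernel * kernel * in_channels

# 32 filters of 5x5 applied directly to the 28x28x192 input
direct = conv_multiplications(28, 28, in_channels=192, out_channels=32, kernel=5)

# 16 filters of 1x1 first, then 32 filters of 5x5 on the reduced 16-channel tensor
reduce_1x1 = conv_multiplications(28, 28, in_channels=192, out_channels=16, kernel=1)
conv_5x5 = conv_multiplications(28, 28, in_channels=16, out_channels=32, kernel=5)
with_bottleneck = reduce_1x1 + conv_5x5

print(direct, with_bottleneck, round(direct / with_bottleneck, 2))
```

Note that the second 5x5 convolution sees only 16 input channels instead of 192, which is where the saving comes from.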

(Table 1) In-depth Architecture details are taken from the “Going deeper with convolutions” paper

From the above table, we can see that there are 9 inception blocks. We can see two columns named #3x3 reduce and #5x5 reduce, which means the number of 1x1 filters used before applying 3x3 and 5x5 filters on the input. If the value of #3x3 reduce is 64, it means that 64 filters of 1x1 are applied before applying a 3x3 filter on the input. If the value of #5x5 reduce is 16, it means that 16 filters of 1x1 are applied before applying a 5x5 filter on the input. Note that applying filter means doing convolution on the input using a filter.
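As a small illustration of how these columns fit together, the sketch below stores the values of the first inception block, inception (3a), in a dictionary and computes the number of output channels. The values are the ones that appear for `inception_3a` in the model summary later in this post; the key names and the helper `output_channels` are hypothetical.

```python
inception_3a = {
    "filters_1x1": 64,
    "filters_3x3_reduce": 96,   # 1x1 filters applied before the 3x3 convolution
    "filters_3x3": 128,
    "filters_5x5_reduce": 16,   # 1x1 filters applied before the 5x5 convolution
    "filters_5x5": 32,
    "filters_pool_proj": 32,    # 1x1 filters applied after the max-pooling
}

def output_channels(cfg):
    # The reduce layers only feed the 3x3 and 5x5 convolutions,
    # so they do not appear in the concatenated output.
    return (cfg["filters_1x1"] + cfg["filters_3x3"]
            + cfg["filters_5x5"] + cfg["filters_pool_proj"])

channels_3a = output_channels(inception_3a)
```

The four branches contribute 64 + 128 + 32 + 32 = 256 channels, so a 28x28x192 input leaves this block as 28x28x256.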

Below is the full architecture of GoogLeNet.

GoogLeNet architecture taken from “Going deeper with convolutions” paper

As you can see, it is a stack of inception blocks. You may wonder what loss function this architecture uses for backpropagation. The total loss is Loss = Lre + 0.3 * L1 + 0.3 * L2. Here L1 (loss 1) and L2 (loss 2) are the losses of two auxiliary classifiers attached to the middle of the network, and Lre is the loss at the final output. The auxiliary losses are added with a weight of 0.3 to help the gradient reach the earlier layers and to provide extra regularization; the auxiliary classifiers are discarded at inference time.
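The weighted sum can be written as a one-line function. This is a sketch with made-up loss values, only to show the arithmetic; the name `total_loss` is hypothetical.

```python
def total_loss(main_loss, aux_loss_1, aux_loss_2, aux_weight=0.3):
    """Loss = Lre + 0.3 * L1 + 0.3 * L2, with Lre the loss at the final output."""
    return main_loss + aux_weight * aux_loss_1 + aux_weight * aux_loss_2

# Made-up values for illustration only
example = total_loss(main_loss=1.0, aux_loss_1=2.0, aux_loss_2=3.0)
```

In Keras the same weighting is expressed through the `loss_weights` argument of `model.compile`, as done in Step 6.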

Training:

GoogLeNet was trained using the DistBelief distributed machine learning system with a modest amount of model and data parallelism. Data parallelism trains multiple instances of the same model on different subsets of the training dataset. To understand data parallelism better, I recommend this blog post written by Lei Mao: https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/. The model was trained with asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs).

Drawbacks of inception v1 architecture:

  • The auxiliary classifiers in Inception v1 are meant to reduce the effect of the vanishing gradient problem, but the authors later found that these classifiers did not improve convergence very much during the early stage of training, so the computation spent on them at that stage is largely wasted.
  • The 5x5 filters in Inception v1 are computationally expensive, and reducing the dimensions too aggressively in front of them can lose a large amount of information, which hurts accuracy. This problem was addressed in Inception v2, for example by replacing a 5x5 convolution with two stacked 3x3 convolutions.

Implementation of GoogLeNet using Keras and TensorFlow:

We are going to use the CIFAR-10 dataset and develop a model for classifying its images. The CIFAR-10 training set contains 50,000 images. My laptop cannot handle training on all of them, so I am using only the first 3,000 images to develop the model.

Step 1: import all the required libraries for building the GoogLeNet model

import math

import cv2
import numpy as np
import keras
from keras.callbacks import LearningRateScheduler
from keras.datasets import cifar10  # the dataset used in this post
from keras.layers import (AveragePooling2D, Conv2D, Dense, Dropout, Flatten,
                          GlobalAveragePooling2D, Input, MaxPool2D, concatenate)
from keras.models import Model
from keras.optimizers import SGD
from keras.utils import to_categorical  # replaces the older np_utils.to_categorical

Step 2: Create a function that loads the CIFAR-10 training and validation sets, keeps the first 3,000 images of each, and resizes the images to 224 x 224. We also divide the pixel values by 255 so that all values lie between 0 and 1.

num_classes = 10

def cifar10_data(img_rows, img_cols):
    # Load training and validation sets
    (X_train, Y_train), (X_valid, Y_valid) = cifar10.load_data()

    # Keep the first 3000 images and resize them (cv2.resize expects (width, height))
    X_train = np.array([cv2.resize(img, (img_cols, img_rows)) for img in X_train[:3000]])
    X_valid = np.array([cv2.resize(img, (img_cols, img_rows)) for img in X_valid[:3000]])
    Y_train = Y_train[:3000]
    Y_valid = Y_valid[:3000]

    # Transform targets to a Keras-compatible (one-hot) format
    Y_train = to_categorical(Y_train, num_classes)
    Y_valid = to_categorical(Y_valid, num_classes)

    X_train = X_train.astype('float32')
    X_valid = X_valid.astype('float32')

    # Scale all pixel values to the range 0 to 1
    X_train = X_train / 255.0
    X_valid = X_valid / 255.0

    return X_train, Y_train, X_valid, Y_valid

Step 3: Now we are calling the above-defined function

X_train, y_train, X_test, y_test = cifar10_data(224, 224)

Step 4: Creating a function for the inception block

# Initializers shared by all convolutions in the inception blocks
kernel_init = keras.initializers.glorot_uniform()
bias_init = keras.initializers.Constant(value=0.2)

def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3,
                     filters_5x5_reduce, filters_5x5, filters_pool_proj, name=None):
    # Branch 1: 1x1 convolution
    conv_1x1 = Conv2D(filters_1x1, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)

    # Branch 2: 1x1 reduction followed by a 3x3 convolution
    conv_3x3 = Conv2D(filters_3x3_reduce, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, (3, 3), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_3x3)

    # Branch 3: 1x1 reduction followed by a 5x5 convolution
    conv_5x5 = Conv2D(filters_5x5_reduce, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, (5, 5), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_5x5)

    # Branch 4: 3x3 max-pooling (SAME, stride 1) followed by a 1x1 projection
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)

    # Concatenate the four branches along the channel axis
    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)

    return output

Step 5: Using Table 1 as a reference, create the model with the same layers and values given in the table

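# Filter counts follow Table 1 and match the model summary shown below.
# Some newer Keras versions reject '/' in layer names; rename the layers if you hit that error.
input_layer = Input(shape=(224, 224, 3))

x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7/2', kernel_initializer=kernel_init, bias_initializer=bias_init)(input_layer)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu', name='conv_2a_3x3/1')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu', name='conv_2b_3x3/1')(x)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_2_3x3/2')(x)

x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128, filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32, name='inception_3a')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192, filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64, name='inception_3b')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_3_3x3/2')(x)

x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208, filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64, name='inception_4a')

# First auxiliary classifier, attached after inception_4a
x1 = AveragePooling2D((5, 5), strides=3)(x)
x1 = Conv2D(128, (1, 1), padding='same', activation='relu')(x1)
x1 = Flatten()(x1)
x1 = Dense(1024, activation='relu')(x1)
x1 = Dropout(0.7)(x1)
x1 = Dense(10, activation='softmax', name='auxilliary_output_1')(x1)

x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4b')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4c')
x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288, filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64, name='inception_4d')

# Second auxiliary classifier, attached after inception_4d
x2 = AveragePooling2D((5, 5), strides=3)(x)
x2 = Conv2D(128, (1, 1), padding='same', activation='relu')(x2)
x2 = Flatten()(x2)
x2 = Dense(1024, activation='relu')(x2)
x2 = Dropout(0.7)(x2)
x2 = Dense(10, activation='softmax', name='auxilliary_output_2')(x2)

x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_4e')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)

x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_5a')
x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384, filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128, name='inception_5b')

x = GlobalAveragePooling2D(name='avg_pool_5_3x3/1')(x)
x = Dropout(0.4)(x)
x = Dense(10, activation='softmax', name='output')(x)

# The model has three outputs: the main classifier and the two auxiliary classifiers
model = Model(input_layer, [x, x1, x2], name='inception_v1')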

To see the whole network, call model.summary()

Model: "inception_v1"
Layer (type) Output Shape Param # Connected to
input_3 (InputLayer) [(None, 224, 224, 3) 0
conv_1_7x7/2 (Conv2D) (None, 112, 112, 64) 9472 input_3[0][0]
max_pool_1_3x3/2 (MaxPooling2D) (None, 56, 56, 64) 0 conv_1_7x7/2[0][0]
conv_2a_3x3/1 (Conv2D) (None, 56, 56, 64) 4160 max_pool_1_3x3/2[0][0]
conv_2b_3x3/1 (Conv2D) (None, 56, 56, 192) 110784 conv_2a_3x3/1[0][0]
max_pool_2_3x3/2 (MaxPooling2D) (None, 28, 28, 192) 0 conv_2b_3x3/1[0][0]
conv2d_113 (Conv2D) (None, 28, 28, 96) 18528 max_pool_2_3x3/2[0][0]
conv2d_115 (Conv2D) (None, 28, 28, 16) 3088 max_pool_2_3x3/2[0][0]
max_pooling2d_18 (MaxPooling2D) (None, 28, 28, 192) 0 max_pool_2_3x3/2[0][0]
conv2d_112 (Conv2D) (None, 28, 28, 64) 12352 max_pool_2_3x3/2[0][0]
conv2d_114 (Conv2D) (None, 28, 28, 128) 110720 conv2d_113[0][0]
conv2d_116 (Conv2D) (None, 28, 28, 32) 12832 conv2d_115[0][0]
conv2d_117 (Conv2D) (None, 28, 28, 32) 6176 max_pooling2d_18[0][0]
inception_3a (Concatenate) (None, 28, 28, 256) 0 conv2d_112[0][0]
conv2d_119 (Conv2D) (None, 28, 28, 128) 32896 inception_3a[0][0]
conv2d_121 (Conv2D) (None, 28, 28, 32) 8224 inception_3a[0][0]
max_pooling2d_19 (MaxPooling2D) (None, 28, 28, 256) 0 inception_3a[0][0]
conv2d_118 (Conv2D) (None, 28, 28, 128) 32896 inception_3a[0][0]
conv2d_120 (Conv2D) (None, 28, 28, 192) 221376 conv2d_119[0][0]
conv2d_122 (Conv2D) (None, 28, 28, 96) 76896 conv2d_121[0][0]
conv2d_123 (Conv2D) (None, 28, 28, 64) 16448 max_pooling2d_19[0][0]
inception_3b (Concatenate) (None, 28, 28, 480) 0 conv2d_118[0][0]
max_pool_3_3x3/2 (MaxPooling2D) (None, 14, 14, 480) 0 inception_3b[0][0]
conv2d_125 (Conv2D) (None, 14, 14, 96) 46176 max_pool_3_3x3/2[0][0]
conv2d_127 (Conv2D) (None, 14, 14, 16) 7696 max_pool_3_3x3/2[0][0]
max_pooling2d_20 (MaxPooling2D) (None, 14, 14, 480) 0 max_pool_3_3x3/2[0][0]
conv2d_124 (Conv2D) (None, 14, 14, 192) 92352 max_pool_3_3x3/2[0][0]
conv2d_126 (Conv2D) (None, 14, 14, 208) 179920 conv2d_125[0][0]
conv2d_128 (Conv2D) (None, 14, 14, 48) 19248 conv2d_127[0][0]
conv2d_129 (Conv2D) (None, 14, 14, 64) 30784 max_pooling2d_20[0][0]
inception_4a (Concatenate) (None, 14, 14, 512) 0 conv2d_124[0][0]
conv2d_132 (Conv2D) (None, 14, 14, 112) 57456 inception_4a[0][0]
conv2d_134 (Conv2D) (None, 14, 14, 24) 12312 inception_4a[0][0]
max_pooling2d_21 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4a[0][0]
conv2d_131 (Conv2D) (None, 14, 14, 160) 82080 inception_4a[0][0]
conv2d_133 (Conv2D) (None, 14, 14, 224) 226016 conv2d_132[0][0]
conv2d_135 (Conv2D) (None, 14, 14, 64) 38464 conv2d_134[0][0]
conv2d_136 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_21[0][0]
inception_4b (Concatenate) (None, 14, 14, 512) 0 conv2d_131[0][0]
conv2d_138 (Conv2D) (None, 14, 14, 128) 65664 inception_4b[0][0]
conv2d_140 (Conv2D) (None, 14, 14, 24) 12312 inception_4b[0][0]
max_pooling2d_22 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4b[0][0]
conv2d_137 (Conv2D) (None, 14, 14, 128) 65664 inception_4b[0][0]
conv2d_139 (Conv2D) (None, 14, 14, 256) 295168 conv2d_138[0][0]
conv2d_141 (Conv2D) (None, 14, 14, 64) 38464 conv2d_140[0][0]
conv2d_142 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_22[0][0]
inception_4c (Concatenate) (None, 14, 14, 512) 0 conv2d_137[0][0]
conv2d_144 (Conv2D) (None, 14, 14, 144) 73872 inception_4c[0][0]
conv2d_146 (Conv2D) (None, 14, 14, 32) 16416 inception_4c[0][0]
max_pooling2d_23 (MaxPooling2D) (None, 14, 14, 512) 0 inception_4c[0][0]
conv2d_143 (Conv2D) (None, 14, 14, 112) 57456 inception_4c[0][0]
conv2d_145 (Conv2D) (None, 14, 14, 288) 373536 conv2d_144[0][0]
conv2d_147 (Conv2D) (None, 14, 14, 64) 51264 conv2d_146[0][0]
conv2d_148 (Conv2D) (None, 14, 14, 64) 32832 max_pooling2d_23[0][0]
inception_4d (Concatenate) (None, 14, 14, 528) 0 conv2d_143[0][0]
conv2d_151 (Conv2D) (None, 14, 14, 160) 84640 inception_4d[0][0]
conv2d_153 (Conv2D) (None, 14, 14, 32) 16928 inception_4d[0][0]
max_pooling2d_24 (MaxPooling2D) (None, 14, 14, 528) 0 inception_4d[0][0]
conv2d_150 (Conv2D) (None, 14, 14, 256) 135424 inception_4d[0][0]
conv2d_152 (Conv2D) (None, 14, 14, 320) 461120 conv2d_151[0][0]
conv2d_154 (Conv2D) (None, 14, 14, 128) 102528 conv2d_153[0][0]
conv2d_155 (Conv2D) (None, 14, 14, 128) 67712 max_pooling2d_24[0][0]
inception_4e (Concatenate) (None, 14, 14, 832) 0 conv2d_150[0][0]
max_pool_4_3x3/2 (MaxPooling2D) (None, 7, 7, 832) 0 inception_4e[0][0]
conv2d_157 (Conv2D) (None, 7, 7, 160) 133280 max_pool_4_3x3/2[0][0]
conv2d_159 (Conv2D) (None, 7, 7, 32) 26656 max_pool_4_3x3/2[0][0]
max_pooling2d_25 (MaxPooling2D) (None, 7, 7, 832) 0 max_pool_4_3x3/2[0][0]
conv2d_156 (Conv2D) (None, 7, 7, 256) 213248 max_pool_4_3x3/2[0][0]
conv2d_158 (Conv2D) (None, 7, 7, 320) 461120 conv2d_157[0][0]
conv2d_160 (Conv2D) (None, 7, 7, 128) 102528 conv2d_159[0][0]
conv2d_161 (Conv2D) (None, 7, 7, 128) 106624 max_pooling2d_25[0][0]
inception_5a (Concatenate) (None, 7, 7, 832) 0 conv2d_156[0][0]
conv2d_163 (Conv2D) (None, 7, 7, 192) 159936 inception_5a[0][0]
conv2d_165 (Conv2D) (None, 7, 7, 48) 39984 inception_5a[0][0]
max_pooling2d_26 (MaxPooling2D) (None, 7, 7, 832) 0 inception_5a[0][0]
average_pooling2d_4 (AveragePoo (None, 4, 4, 512) 0 inception_4a[0][0]
average_pooling2d_5 (AveragePoo (None, 4, 4, 528) 0 inception_4d[0][0]
conv2d_162 (Conv2D) (None, 7, 7, 384) 319872 inception_5a[0][0]
conv2d_164 (Conv2D) (None, 7, 7, 384) 663936 conv2d_163[0][0]
conv2d_166 (Conv2D) (None, 7, 7, 128) 153728 conv2d_165[0][0]
conv2d_167 (Conv2D) (None, 7, 7, 128) 106624 max_pooling2d_26[0][0]
conv2d_130 (Conv2D) (None, 4, 4, 128) 65664 average_pooling2d_4[0][0]
conv2d_149 (Conv2D) (None, 4, 4, 128) 67712 average_pooling2d_5[0][0]
inception_5b (Concatenate) (None, 7, 7, 1024) 0 conv2d_162[0][0]
flatten_4 (Flatten) (None, 2048) 0 conv2d_130[0][0]
flatten_5 (Flatten) (None, 2048) 0 conv2d_149[0][0]
avg_pool_5_3x3/1 (GlobalAverage (None, 1024) 0 inception_5b[0][0]
dense_4 (Dense) (None, 1024) 2098176 flatten_4[0][0]
dense_5 (Dense) (None, 1024) 2098176 flatten_5[0][0]
dropout_8 (Dropout) (None, 1024) 0 avg_pool_5_3x3/1[0][0]
dropout_6 (Dropout) (None, 1024) 0 dense_4[0][0]
dropout_7 (Dropout) (None, 1024) 0 dense_5[0][0]
output (Dense) (None, 10) 10250 dropout_8[0][0]
auxilliary_output_1 (Dense) (None, 10) 10250 dropout_6[0][0]
auxilliary_output_2 (Dense) (None, 10) 10250 dropout_7[0][0]
Total params: 10,334,030
Trainable params: 10,334,030
Non-trainable params: 0

As you can see, the layers and values we created match the GoogLeNet architecture.
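One way to convince yourself that the summary is consistent is to recompute a few parameter counts by hand. The sketch below does this for the six convolutions of `inception_3a`, whose input has 192 channels; the helper name `conv2d_params` is hypothetical, and the formula assumes a standard convolution with one bias per filter.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

params_3a = {
    "1x1": conv2d_params(1, 192, 64),
    "3x3_reduce": conv2d_params(1, 192, 96),
    "3x3": conv2d_params(3, 96, 128),
    "5x5_reduce": conv2d_params(1, 192, 16),
    "5x5": conv2d_params(5, 16, 32),
    "pool_proj": conv2d_params(1, 192, 32),
}

for branch, count in params_3a.items():
    print(branch, count)
```

These values agree with the Param # column for conv2d_112 to conv2d_117 in the summary above (12352, 18528, 110720, 3088, 12832 and 6176).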

Step 6: Create the variables required for training the model. We use stochastic gradient descent with a momentum value of 0.9, as mentioned in the paper, and the LearningRateScheduler callback to reduce the learning rate during training.

epochs = 15
initial_lrate = 0.01

def decay(epoch, lr=None):
    # Step decay: multiply the learning rate by 0.96 (a 4% drop) every 8 epochs.
    # Newer Keras versions also pass the current rate as `lr`; it is not used here.
    drop = 0.96
    epochs_drop = 8
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

sgd = SGD(learning_rate=initial_lrate, momentum=0.9, nesterov=False)
lr_sc = LearningRateScheduler(decay, verbose=1)

# One loss per output; the two auxiliary losses are weighted by 0.3
model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[1, 0.3, 0.3], optimizer=sgd, metrics=['accuracy'])
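It can help to look at the learning rates this schedule actually produces over the 15 epochs. The sketch below is a standalone copy of the step decay with the constants turned into hypothetical keyword arguments, so it runs without Keras.

```python
import math

def step_decay(epoch, initial_lrate=0.01, drop=0.96, epochs_drop=8):
    """Learning rate for a given zero-based epoch index."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

schedule = [step_decay(epoch) for epoch in range(15)]

for epoch, lrate in enumerate(schedule):
    print(epoch, round(lrate, 6))
```

Because of the `1 + epoch` term, the first drop happens at epoch index 7, so with 15 epochs the learning rate is reduced only once, from 0.01 to 0.0096.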

Now we are going to fit the model to the data. The same labels are passed three times, once for each of the three outputs.

H = model.fit(X_train, [y_train, y_train, y_train], validation_data=(X_test, [y_test, y_test, y_test]), epochs=epochs, batch_size=256, callbacks=[lr_sc])

In this way, you can fit and train the model. After training the model on this dataset, I got 78% accuracy on the validation dataset. Implement the code yourself and see what accuracy you get.


Inception v1 is the first Inception network. There are many later versions, such as Inception v2, Inception v3, Inception v4, and Inception-ResNet v2, which address the drawbacks of Inception v1. Do comment if you want a blog on any algorithm related to the Inception networks.

I hope you understood how GoogLeNet works. If you have any questions, leave a comment, and if you want to contact me, mail me at nitishkumar2902@gmail.com.



Data Science Enthusiast, Master's Student in Computer Science