Inception-ResNet-V2 Architecture

Object Recognition with Google’s Convolutional Neural Networks

Will Koehrsen
12 min read · Jul 28, 2017


Classifying Images Using Google’s Pre-Trained Inception CNN Models

Convolutional neural networks (CNNs) are the state-of-the-art technique for image recognition, that is, identifying objects such as people or cars in pictures. While object recognition comes naturally to humans, it has been difficult to implement with machine algorithms, and until the advent of convolutional neural networks (beginning in earnest with the development of LeNet-5 in 1998), the best computer was no match for the average child at this deceptively challenging task. Recent advances in CNN design, notably deeper models with more layers enabled by cheap computing power and techniques such as inception modules and skip connections, have produced models that rival human accuracy in object identification. Moreover, CNNs are poised to make real-world impacts in areas from self-driving vehicles to medical imaging evaluation (a field where computers are already outperforming humans).

However, training convolutional neural networks, in particular running the backpropagation procedure used to update the model parameters, is computationally expensive. The greater the amount of training data (labeled images) and the deeper the network, the longer the training time. Reducing network depth or the amount of training data is not advisable, as the performance of any machine learning system is directly related to the number of quality training examples, and deeper networks (up to a point) perform better. Additional performance-enhancing techniques, such as dropout or batch normalization, increase computation time as well. Properly training a useful image recognition network on tens of thousands of labeled images could take months or longer on a personal computer, and developing the right architecture and selecting the optimal hyperparameters requires training the network hundreds or thousands of times, which means we had better be prepared to spend several decades on this project if we limit ourselves to laptops.

Fortunately, Google has not only developed several iterations of an ideal architecture for image classification (in 2014, GoogLeNet won the ImageNet Large Scale Visual Recognition Challenge, in which models must identify 1000 different classes of objects), but it has also released models fully trained on 1.2 million images across those 1000 categories. This means that instead of building our own network and enduring training epoch after training epoch, we can use the Google pre-trained models to perform high-accuracy object recognition.

All of the Python code for this project was written in a Jupyter Notebook. The complete notebook and project are available in my machine learning projects GitHub repository. This project was adapted from the Google TensorFlow slim walkthrough Jupyter Notebook and was aided by the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

Inception Neural Network

Google has a number of neural network models that they have made available for use in TensorFlow. The two models we will use here are the Inception-v3 and Inception-v4.

Basic Inception CNN Architecture

They both make use of inception modules which take several convolutional kernels of different sizes and stack their outputs along the depth dimension in order to capture features at different scales.

Inception Module
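To make this concrete, here is a minimal sketch (illustrative only, and not the exact branch structure Google uses) of an inception-style module in TensorFlow: parallel convolutions with different kernel sizes whose outputs are stacked along the depth dimension.

import tensorflow as tf

def simple_inception_module(inputs, filters=32):
    # 1x1, 3x3, and 5x5 branches capture features at different scales
    branch_1x1 = tf.layers.conv2d(inputs, filters, 1, padding='same', activation=tf.nn.relu)
    branch_3x3 = tf.layers.conv2d(inputs, filters, 3, padding='same', activation=tf.nn.relu)
    branch_5x5 = tf.layers.conv2d(inputs, filters, 5, padding='same', activation=tf.nn.relu)
    # Stack the branch outputs along the depth (channel) dimension
    return tf.concat([branch_1x1, branch_3x3, branch_5x5], axis=3)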

The Inception-ResNet variants of these networks also borrow the concept of a residual network with skip connections, where the input is added to the output so that the model only has to learn the residual rather than the target itself.

Skip Connection Typically Used in Residual Network
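In code, a residual block can be sketched as follows (again illustrative only, not the actual Inception-ResNet block, and assuming the input and branch output have matching shapes): the input is added back to the branch output, so the convolutional layers only have to learn the residual.

def residual_block(inputs, filters):
    # Two convolutions form the residual branch
    x = tf.layers.conv2d(inputs, filters, 3, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, filters, 3, padding='same')
    # Skip connection: add the original input back to the branch output
    return tf.nn.relu(inputs + x)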

Using this architecture, Inception-v4 achieved 80.2% top-1 accuracy and 95.2% top-5 accuracy on the ImageNet dataset. In other words, the network correctly determined the object in an image about 4 times out of 5, and 19 times out of 20 the correct class appeared in the top five probabilities output by the model. Other models developed by Google (notably Inception-ResNet-v2) have achieved slightly better results, but the Inception-v3 and -v4 networks are still near the top of the field.
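As a quick aside, top-1 and top-5 accuracy can be computed from a model's class probabilities with a few lines of NumPy (top_k_accuracy is a hypothetical helper written for illustration, not part of the slim library):

import numpy as np

def top_k_accuracy(probabilities, true_labels, k=5):
    # probabilities: [num_images, num_classes]; true_labels: [num_images]
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(true_labels, top_k)]
    return np.mean(hits)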

Retrieving the Pre-Trained Models

To obtain the appropriate Python libraries, go to the tensorflow/models GitHub repository and download or git clone the repo. All the work we will be doing should be run from within the slim directory, so navigate to that folder and create a new Python script or Jupyter Notebook there. The next step is to download the most recent checkpoints of the Inception networks. The list of models can be found on the tensorflow/models GitHub. To download a different model, simply replace “inception_v3_2016_08_28.tar.gz” with the architecture of your choice (other code may also need to be modified).

import tensorflow as tf
from datasets import dataset_utils
import os

# Base url
TF_MODELS_URL = "http://download.tensorflow.org/models/"
# Modify this path for a different CNN
INCEPTION_V3_URL = TF_MODELS_URL + "inception_v3_2016_08_28.tar.gz"
INCEPTION_V4_URL = TF_MODELS_URL + "inception_v4_2016_09_09.tar.gz"
# Directory to save model checkpoints
MODELS_DIR = "models/cnn"
INCEPTION_V3_CKPT_PATH = MODELS_DIR + "/inception_v3.ckpt"
INCEPTION_V4_CKPT_PATH = MODELS_DIR + "/inception_v4.ckpt"

# Make the model directory if it does not exist
if not tf.gfile.Exists(MODELS_DIR):
    tf.gfile.MakeDirs(MODELS_DIR)

# Download and extract each model if it has not already been downloaded
if not os.path.exists(INCEPTION_V3_CKPT_PATH):
    dataset_utils.download_and_uncompress_tarball(INCEPTION_V3_URL, MODELS_DIR)

if not os.path.exists(INCEPTION_V4_CKPT_PATH):
    dataset_utils.download_and_uncompress_tarball(INCEPTION_V4_URL, MODELS_DIR)
Output:
>> Downloading inception_v3_2016_08_28.tar.gz 100.0%
Successfully downloaded inception_v3_2016_08_28.tar.gz 100885009 bytes.
>> Downloading inception_v4_2016_09_09.tar.gz 100.0%
Successfully downloaded inception_v4_2016_09_09.tar.gz 171177982 bytes.

Processing Images for Use

Now that the models have been downloaded, we need a way to ensure the images are in the right configuration for the network. The ImageNet images are all 299 pixels by 299 pixels (height x width) x 3 color channels (Red-Green-Blue). Therefore, any images we send through the network will have to be in the same format. The Inception networks also expect pixel values scaled to a small, fixed range rather than raw 0 to 255 intensities, so the pixel values need to be rescaled before evaluation. While this is relatively straightforward, the slim library we are working in already has a built-in picture processing function in the inception_preprocessing.py script in the preprocessing directory. This function takes in an image as a three-dimensional array of pixel values and returns the correctly formatted array for evaluation by the Inception network. It also has a number of other capabilities for use during training, such as shifting or altering the image, which make the network invariant to aspects of the image (such as orientation) that do not affect the object in it. This technique can also be used to augment a small dataset by including copies of each image that have been shifted, scaled, or rotated. We will pass in is_training=False so the image is processed for evaluation and only resized.

from preprocessing import inception_preprocessing

# This can be modified depending on the model used and the training image dataset
def process_image(image):
    root_dir = "images/"
    filename = root_dir + image
    with open(filename, "rb") as f:
        image_str = f.read()

    if image.endswith('jpg'):
        raw_image = tf.image.decode_jpeg(image_str, channels=3)
    elif image.endswith('png'):
        raw_image = tf.image.decode_png(image_str, channels=3)
    else:
        print("Image must be either jpg or png")
        return

    image_size = 299  # ImageNet image size; different models may expect different sizes
    processed_image = inception_preprocessing.preprocess_image(raw_image, image_size,
                                                               image_size, is_training=False)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        raw_image, processed_image = sess.run([raw_image, processed_image])

    return raw_image, processed_image.reshape(-1, 299, 299, 3)

The images that we want to classify should be placed in a new images directory located within the slim folder (or change the root_dir in the preceding code). For now, we will stick to jpg and png images, although other formats could also be processed. To generate the correct images, we create a TensorFlow session to run the TensorFlow operations. We return the raw image so we can plot it, as well as the processed image shaped into [batch_size, height, width, color_channels].

Display Example Images

We can download any image we want and place it in the images directory. However, for the network to have any chance of being correct, we will need images whose classes are included in the ImageNet dataset. The complete list of 1000 classes can be found as text here. (The pre-trained Inception checkpoints actually use 1001 classes, because an extra “background” category is added at index 0.) Choose a couple of images and download them to the images directory. It is best if the images are close-up and feature the object in the center. We can first write a small function to plot the image using matplotlib.pyplot, with the %matplotlib inline magic function to display plots in the Jupyter Notebook.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

def plot_color_image(image):
    plt.figure(figsize=(10, 10))
    plt.imshow(image.astype(np.uint8), interpolation='nearest')
    plt.axis('off')

raw_bison, processed_bison = process_image('bison.jpg')
plot_color_image(raw_bison)
Raw Image of Bison

We can also check the size of both the raw image and the processed image:

print(raw_bison.shape, processed_bison.shape)

Output:
(183, 275, 3) (1, 299, 299, 3)

In the case of this image, because the original size was too small, the preprocessing function adds extra pixels by interpolating between existing pixel values. This results in a blurry image which should not significantly affect the performance of the CNN.

One more image for fun:

raw_sombrero, processed_sombrero = process_image('sombrero.jpg')
plot_color_image(raw_sombrero)
Raw Image of Sombrero

Image Recognition

The heart of this project is predicting classes for the pictures. Now that we have several images (feel free to gather as many as you like; it might be interesting to see what the CNN guesses for images whose classes it never saw during training), we will write a function that takes in the name of an image and the version of the CNN to use (currently limited to the Inception architecture, either “V3” or “V4”), plots the raw image, and shows the top-10 predictions below the plot.

# These imports come from the slim library (tensorflow/models)
import tensorflow.contrib.slim as slim
from nets import inception
from datasets import imagenet

def predict(image, version='V3'):
    '''
    Takes in the name of an image and optionally the network to use for predictions.
    Currently, the only options for the net are Inception V3 and Inception V4.
    Plots the raw image and displays the top-10 class predictions.
    '''
    tf.reset_default_graph()

    # Process the image
    raw_image, processed_image = process_image(image)
    class_names = imagenet.create_readable_names_for_imagenet_labels()

    # Create a placeholder for the images
    X = tf.placeholder(tf.float32, [None, 299, 299, 3], name="X")

    # The inception_v3/v4 functions return logits and an end_points dictionary;
    # the logits are the output of the network before applying the softmax activation
    if version.upper() == 'V3':
        model_ckpt_path = INCEPTION_V3_CKPT_PATH
        with slim.arg_scope(inception.inception_v3_arg_scope()):
            # Set the number of classes and the is_training parameter
            logits, end_points = inception.inception_v3(X, num_classes=1001, is_training=False)

    elif version.upper() == 'V4':
        model_ckpt_path = INCEPTION_V4_CKPT_PATH
        with slim.arg_scope(inception.inception_v4_arg_scope()):
            # Set the number of classes and the is_training parameter
            logits, end_points = inception.inception_v4(X, num_classes=1001, is_training=False)

    predictions = end_points.get('Predictions', 'No key named predictions')
    saver = tf.train.Saver()

    with tf.Session() as sess:
        saver.restore(sess, model_ckpt_path)
        prediction_values = predictions.eval({X: processed_image})

    try:
        # Add an index to each prediction and then sort by probability
        prediction_values = [(i, prediction) for i, prediction in enumerate(prediction_values[0, :])]
        prediction_values = sorted(prediction_values, key=lambda x: x[1], reverse=True)

        # Plot the image
        plot_color_image(raw_image)
        plt.show()
        print("Using Inception_{} CNN\nPrediction: Probability\n".format(version))
        # Display the top-10 predictions
        for i in range(10):
            predicted_class = class_names[prediction_values[i][0]]
            probability = prediction_values[i][1]
            print("{}: {:.2f}%".format(predicted_class, probability * 100))

    # If the predictions do not come out right
    except:
        print(predictions)

The code is relatively straightforward. We apply the correct argument scope and model function depending on the version. (Note that we need to set the number of classes to 1001 and is_training=False.) After constructing the TensorFlow computational graph using the inception_v3 function (or inception_v4), we create a TensorFlow session to feed the image through the network, using a saver to restore the model weights that we downloaded earlier. The graph returns the logits, which are the unscaled outputs of the network, and the predictions, which are the result of passing the logits through the softmax activation function. Once we have the predictions, we build a list of tuples pairing each class index with its probability, sort the tuples by probability, and print out the top-10 class names along with their probabilities.

Prediction Results

Here are a few typical results.

predict('bison.jpg', version='V3')

Pretty good predictions from Inception_v3! Let’s see what v4 predicts:

predict('bison.jpg', version='V4')

It’s good to see that the models are in agreement. I’ll give the model a few more easy pictures.

predict('sombrero.jpg', version='V4')

It’s interesting to look not only at the CNN’s top (and correct) prediction, but also at the other candidates. The second option here at least makes sense, but some of the others seem way off (their probabilities are so small that they round to zero at two decimal places).

predict('tiger-shark.jpg', version='V4')
predict('albatross.jpg', version='V4')

These results are pretty impressive. Granted, all of the images I chose were relatively easy to identify and prominently featured the object of interest, conditions that are not likely to hold in the real world. In reality, things do not stay still, and a scene may contain hundreds or thousands of different objects that need to be identified (and to think we do this constantly without ever breaking a sweat). Nevertheless, this network is a decent start on the problem, and even with a busier image, we can obtain accurate results:

predict('basketball.jpg', version='V4')

This network is designed and trained to identify one image class in each image, a task to which it is well-suited. However, if we introduce more elements into a single picture, the predictions begin to break down.

predict('basketball_game.jpg', version='V4')

There are some good predictions in the mix, but overall the model becomes overwhelmed by the noise. Neural networks can also be applied to multi-label classification, as in the example above, where a single image contains objects from many different classes.
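For comparison, here is a minimal sketch (not part of this project) of the usual change for multi-label classification: independent sigmoid probabilities per class and a sigmoid cross-entropy loss, instead of a single softmax over all classes.

import tensorflow as tf

num_classes = 1001
logits = tf.placeholder(tf.float32, [None, num_classes])   # network outputs before any activation
labels = tf.placeholder(tf.float32, [None, num_classes])   # multi-hot label vectors

# One independent probability per class instead of a softmax over all classes
multi_label_probs = tf.nn.sigmoid(logits)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))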

The model is also limited in that it has only been trained on 1000 classes. If the object in an image is not among those classes, then we are out of luck: the network will give its best guess, but we would need to train it ourselves to expand its capabilities.

predict('giraffe.jpg', version='V4')

As far as the network is concerned, it has never seen a giraffe before, so there must be no such thing. Nonetheless, were we to show the network many examples of labeled giraffes, it would soon become adept at identifying them as well.

Next Steps

Currently, we are limited to the 1000 classes learned by the network. If we want to expand the range of pictures, then we will need more data. In particular, we need hundreds of images labeled with a new class if we actually want the network to learn how to identify that object. We can take that on in another post, so for now start gathering or hand-labeling images (or have a graduate student do it for you). CNNs are the state of the art, but in the end, they rely on millions of images that have been hand-labeled through thousands of human-hours of work. Luckily, this training data only has to be prepared once, and then it can be re-used. Maybe we will soon reach the point where we can have weaker CNNs train on images that have been labeled by stronger networks without any humans involved (although then we may have to worry about superintelligent AI, as discussed by Nick Bostrom).

To train the model on our own data, we will unfreeze at least one layer before the outputs (in order to adjust the model weights to our data) and add a new output layer with the correct number of classes (a rough sketch of how this could look with slim appears at the end of the post). There are a number of extra steps we could take with this example, such as drawing labeled boxes on the images or visualizing parts of the model with TensorBoard, which could provide some insight into how it works.

This demonstration is not ground-breaking by any means, but it illustrates the fundamental programming principle of DRY: Don’t Repeat Yourself. When you want to build your own object recognition system, your first words shouldn’t be “how do I start?” but “who has developed a model that I can improve upon?” In that spirit, feel free to use, disseminate, share, and most importantly, improve this code!
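As promised above, here is a minimal sketch of what that fine-tuning setup could look like with slim, building on the imports and checkpoint paths defined earlier. It is only an outline, not the notebook’s code: NUM_NEW_CLASSES is a hypothetical placeholder, and the excluded scope names assume the standard slim InceptionV3 variable scopes.

NUM_NEW_CLASSES = 10  # hypothetical number of classes in our own dataset

X = tf.placeholder(tf.float32, [None, 299, 299, 3], name="X")
with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(X, num_classes=NUM_NEW_CLASSES,
                                                is_training=True)

# Restore every pre-trained variable except the final logits layers, which now
# have a different shape and must be trained from scratch
variables_to_restore = slim.get_variables_to_restore(
    exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
restorer = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    restorer.restore(sess, INCEPTION_V3_CKPT_PATH)
    # ...train the new output layer (and optionally unfreeze a few of the
    # top layers) on our own labeled images from here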


Will Koehrsen

Senior Machine Learning Engineer at Cortex Sustainability Intelligence