Unlike computers, humans make sense of what we see based on our experiences, memories and biological structure. The human brain extracts and analyses enormous volumes of data from visual cues. To put the importance of sight in perspective, 40% of our nerve fibers are linked to the retina, and 90% of the information transmitted to the brain is visual. In fact, research at 3M Corporation is often cited as concluding that we process visuals 60,000 times faster than text.
To begin with, image classification is a fundamental task that assesses an entire image. The goal is to classify the image by assigning it a specific label. Typically, image classification deals with images in which only one object appears and is analysed by the computer. Object detection, by contrast, involves both classification and localization, and handles more realistic cases in which multiple objects may exist in an image. In general, there are two methods of classification: supervised and unsupervised.
Furthermore, the more advanced task of assigning each pixel in an image to a particular object or class requires dedicated computer vision techniques. Data scientists and computer vision specialists refer to this task as semantic segmentation or dense prediction. Semantic segmentation is particularly common in autonomous vehicles such as cars, drones and planes, as well as in medical image diagnosis. For an example, see Deep Learning based Semantic Segmentation | Keras on Kaggle.
In this post we will discuss Convolutional Neural Networks, data augmentation and EfficientNet, and how to achieve nearly 100% accuracy when classifying images into several classes, potentially across multiple datasets.
Convolutional Neural Networks (CNNs) are the most popular neural networks for image classification. Compared to a fully connected neural network, a CNN has fewer parameters, which greatly shortens training time and reduces the amount of data needed for training. Instead of learning a separate weight for every pixel, a CNN processes the image one small patch at a time, reusing the same weights for each patch.
Given an image, a CNN can efficiently scan it chunk by chunk, for instance with a 5 × 5 window. The window slides along the image (usually left to right, and top to bottom). The number of pixels it moves at each step is called its stride length. For example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it covers the entire image.
A convolution is a weighted sum of the pixel values inside the window, computed at each position as the window slides across the whole image. It turns out that convolving an image with a weight matrix in this way produces another image.
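The sliding-window weighted sum can be sketched in plain Python. This is a minimal illustration rather than how a deep learning framework implements it: the function name `convolve2d`, the toy 9 × 9 image and the averaging kernel are all assumptions made for the example, and no padding is applied.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the weighted sums."""
    k = len(kernel)                      # assumes a square k x k kernel
    height, width = len(image), len(image[0])
    output = []
    # The window moves left to right within a row, then down to the next row.
    for top in range(0, height - k + 1, stride):
        row = []
        for left in range(0, width - k + 1, stride):
            # Weighted sum of the pixels currently under the window.
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 9 x 9 "image" whose pixel at (r, c) has the value r + c.
image = [[r + c for c in range(9)] for r in range(9)]
# 5 x 5 averaging kernel: every weight is 1/25.
kernel = [[1 / 25] * 5 for _ in range(5)]

feature_map = convolve2d(image, kernel, stride=2)
print(len(feature_map), len(feature_map[0]))  # 3 3
```

With a stride of 2, the 5 × 5 window fits at three positions along each axis of the 9 × 9 image, so the result is a smaller 3 × 3 image.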
The sliding-window operations occur in the convolution layer of the neural network. In general, a CNN has multiple convolution layers. Each convolution layer typically produces many different convolutions, so its weights form a tensor of shape 5 × 5 × n, where n is the number of convolutions.
As an example, let's say an image passes through a convolution layer with a weight tensor of 5 × 5 × 64. The layer generates 64 convolutions by sliding a 5 × 5 window. Therefore, this layer has 5 × 5 × 64 = 1,600 parameters (for a single-channel input, ignoring biases), which is remarkably fewer than in a fully connected network, where every single unit needs one weight per pixel: 256 × 256 = 65,536 for a 256 × 256 image.
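The arithmetic behind this comparison is easy to check. The sketch below assumes a single-channel 256 × 256 input and ignores bias terms; the 64-unit fully connected layer is a hypothetical counterpart added only to make the comparison like for like.

```python
# Convolution layer: one 5 x 5 weight matrix per output convolution.
window = 5
n_filters = 64
conv_params = window * window * n_filters      # single input channel, no biases

# Fully connected layer: every unit needs one weight per input pixel.
image_side = 256                               # assumed input: 256 x 256, one channel
weights_per_unit = image_side * image_side
dense_params = weights_per_unit * n_filters    # hypothetical layer with 64 units

print(conv_params)       # 1600
print(weights_per_unit)  # 65536
print(dense_params)      # 4194304
```

The convolution layer's parameter count does not depend on the image size at all, whereas the fully connected count grows with every added pixel.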
More often than not, image classification datasets are significant in size. Nevertheless, we use data augmentation to help the model generalize further. Data augmentation generates additional training data from your existing examples by applying random transformations that yield similar-looking images. This exposes the model to more aspects of the data and helps prevent overfitting.
According to the TensorFlow documentation, we can combine data augmentation, rescaling and the CNN model in a single model definition.
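A minimal sketch of such a combined model, using Keras preprocessing layers, might look as follows. It assumes TensorFlow 2.6 or newer; the image size, number of classes, layer widths and augmentation strengths are illustrative assumptions, not values taken from this post.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 180     # assumed input height and width
NUM_CLASSES = 5    # assumed number of classes

# Random transformations; active during training only, skipped at inference.
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    data_augmentation,
    layers.Rescaling(1.0 / 255),          # map pixel values from [0, 255] to [0, 1]
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES),            # raw logits, one per class
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

Because augmentation and rescaling live inside the model, the same preprocessing travels with it when the model is saved or deployed.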
In general, we can also divide the process of image augmentation into four steps regardless of the neural network architecture:
To read images from disk and resize them we can use OpenCV. For data augmentation, Albumentations is a fast and flexible library that is compatible with different neural networks.
An example of data augmentation in Albumentations: flip horizontally with a probability of 50%; rotate by a random angle in the range from 0 to 15 degrees with a probability of 50%; with a further 50% probability apply exactly one of three transforms, each with a probability of 50%/3 = 16.67%: sharpen the input image, emboss it, or randomly change its brightness and contrast; and cut out 5 holes in 50% of instances.
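The probabilities in this pipeline can be verified with exact fractions. The sketch below does not call Albumentations; it only models the assumed structure, in which a "one of" group fires with probability 1/2 and then picks one of three equally weighted transforms, and in which the four steps are applied independently. All names are hypothetical labels.

```python
from fractions import Fraction

# Assumed structure: the "one of" group fires with probability 1/2 and then
# picks one of three equally weighted transforms.
group_p = Fraction(1, 2)
children = ["sharpen", "emboss", "brightness_contrast"]
per_child = {name: group_p / len(children) for name in children}

# Independent steps from the description, each applied with probability 1/2.
steps = {
    "horizontal_flip": Fraction(1, 2),
    "rotate": Fraction(1, 2),
    "one_of_group": group_p,
    "cutout": Fraction(1, 2),
}

# Probability that no transform fires for a given image.
p_untouched = Fraction(1)
for p in steps.values():
    p_untouched *= 1 - p

for name, p in per_child.items():
    print(f"{name}: {float(p):.2%}")            # 16.67% each
print(f"untouched: {float(p_untouched):.2%}")   # 6.25%
```

Under these assumptions only about one image in sixteen passes through with no transform applied, so the model rarely sees the same input twice.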
A data generator class reads the images, resizes and augments them, and passes them in batches to our neural network.
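One possible shape for such a class is sketched below in plain Python. It is not the original implementation: the class name `ImageBatchGenerator` and the `read_fn`, `resize_fn` and `augment_fn` parameters are hypothetical, and the stand-in lambdas replace the OpenCV and Albumentations calls so the example runs without image files. The `__len__`, `__getitem__` and `on_epoch_end` methods mirror the interface of `keras.utils.Sequence`.

```python
import math
import random


class ImageBatchGenerator:
    """Serves (images, labels) batches: read -> resize -> augment -> batch."""

    def __init__(self, paths, labels, read_fn, resize_fn, augment_fn,
                 batch_size=32, shuffle=True, seed=0):
        self.samples = list(zip(paths, labels))
        self.read_fn = read_fn          # e.g. a wrapper around cv2.imread
        self.resize_fn = resize_fn      # e.g. a wrapper around cv2.resize
        self.augment_fn = augment_fn    # e.g. a wrapper around an augmentation pipeline
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        chunk = self.samples[start:start + self.batch_size]
        images = [self.augment_fn(self.resize_fn(self.read_fn(path)))
                  for path, _ in chunk]
        labels = [label for _, label in chunk]
        return images, labels

    def on_epoch_end(self):
        # Reshuffle so each epoch sees the samples in a different order.
        if self.shuffle:
            self.rng.shuffle(self.samples)


# Stand-in callables so the sketch runs without any image files.
gen = ImageBatchGenerator(
    paths=[f"img_{i}.png" for i in range(10)],
    labels=list(range(10)),
    read_fn=lambda path: [[0, 1], [2, 3]],               # pretend 2 x 2 image
    resize_fn=lambda img: img,                           # no-op resize
    augment_fn=lambda img: [row[::-1] for row in img],   # horizontal flip
    batch_size=4,
    shuffle=False,
)
images, labels = gen[2]
print(len(gen), len(images), labels)   # 3 2 [8, 9]
```

In a real pipeline the lists returned by `__getitem__` would be stacked into arrays before being handed to the network.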
EfficientNets are state-of-the-art convolutional neural networks that Google Brain released as open source. This family of image classification models achieves state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models such as ResNet-152 and ResNet-50.
Finally, we combine the augmentation pipeline, the data generator and our model to predict the class of each image. By the fourth epoch the model reaches 100% accuracy. On the withheld dataset the model achieved a comparable 99.6%.
Jordan, J., 2018. An overview of semantic image segmentation. [online] Available at: https://www.jeremyjordan.me/semantic-segmentation/ [Accessed 1 September 2021].
Le, J., 2018. The 4 Convolutional Neural Network Models That Can Classify Your Fashion Images. [online] Medium. Available at: https://towardsdatascience.com/the-4-convolutional-neural-network-models-that-can-classify-your-fashion-images-9fe7f3e5399d [Accessed 1 September 2021].
Tan, M. and Le, Q., 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv.org