Now that we know how to create simple neural networks and how to optimize them, we can move on to more complex variants. This week, I'll be covering one that's used very often in image processing and computer vision: convolutional neural networks (CNNs).
Why CNNs?
In my first article, I gave a walkthrough of how to leverage scikit-learn to recognize digits by training a simple k-nearest neighbors model. You might think that you can apply the same methodology to recognizing certain objects in images.
Here's why that doesn't work. First off, all the images of digits shared the same black color and white background. In reality, digits are written down on paper or are shown on various displays with differing backgrounds and colors. Our k-nearest neighbors model would be overwhelmed by the staggering number of colors and differences in the image, implode on itself, and fail to make accurate predictions.
A CNN, on the other hand, is specifically designed to handle different aspects of an image, including colors and backgrounds across multiple environments. There are three concepts that CNNs rely on: convolution, pooling, and the fully-connected layer, in that order.
What is Convolution?
In the context of image processing, the convolution operation slides a small matrix, known as a kernel or, equivalently, a filter, across the image's pixel values and, at each position, calculates the dot product of the kernel with the patch of pixels beneath it. The dimensions of a kernel and its values depend on the operation being performed. If, for example, you were trying to blur an image, the values would be somewhat small around the edges of the matrix but larger towards the center, to more closely (but not exactly) preserve the color of the pixel that lies at the center of the kernel. Different kernels produce different dot products, or in other words, highlight different aspects of an image, and a CNN leverages these kernels to deduce the characteristics of an image.
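To make the sliding dot product concrete, here is a minimal sketch in plain Python. The function name `convolve2d`, the blur kernel weights, and the tiny 5x5 image are all made up for illustration; it assumes a single-channel image, no padding, and a stride of 1. Like most CNN libraries, it does not flip the kernel, which makes no difference for a symmetric kernel such as this one.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the result."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product: multiply the kernel element-wise with the patch
            # of pixels underneath it, then sum everything to one number.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 3x3 blur kernel: small at the edges, largest in the center.
# The weights sum to 1, so overall brightness is preserved.
blur_kernel = [
    [1 / 16, 2 / 16, 1 / 16],
    [2 / 16, 4 / 16, 2 / 16],
    [1 / 16, 2 / 16, 1 / 16],
]

# Hypothetical 5x5 grayscale image with a single bright pixel in the middle.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 16, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

blurred = convolve2d(image, blur_kernel)
for row in blurred:
    print(row)
```

The single bright pixel gets spread over its neighbors, with the largest share staying in the center, which is the blurring behavior described above. Plain lists are used here only so that every step stays visible.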
Sometimes, for images with lots of pixels, rather than convolving at every pixel position, the process can be sped up by moving the kernel several pixels at a time (such as every other position or every third position). In the first case, you would be performing convolution with a stride length of 2, and in the second case, a stride length of 3. Using a stride length greater than 1 makes the output less representative of the image, since some kernel positions are skipped. At the same time, the output is smaller, which cuts computation and may reduce the chances of overfitting. Again, a balance needs to be determined.
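As a rough illustration of what stride does to the output, the sketch below counts how many positions a kernel visits along one dimension. The helper names `conv_output_size` and `window_starts` are hypothetical, and the 28-pixel width and 3-pixel kernel are just example numbers; no padding is assumed.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Number of kernel positions along one dimension, assuming no padding."""
    return (input_size - kernel_size) // stride + 1


def window_starts(input_size, kernel_size, stride=1):
    """Starting indices the kernel visits along one dimension."""
    return list(range(0, input_size - kernel_size + 1, stride))


# Example: a 28-pixel-wide image and a 3-pixel-wide kernel.
for stride in (1, 2, 3):
    size = conv_output_size(28, 3, stride)
    first_starts = window_starts(28, 3, stride)[:4]
    print(f"stride {stride}: {size}x{size} output, first starts {first_starts}")
```

Going from a stride of 1 to a stride of 2 cuts each dimension of the output roughly in half, so the total number of dot products drops by about a factor of four.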
The above animation is a bit of a simplification, as pixels consist of RGB values rather than a single scalar, but the concept is still there. As the yellow rectangle moves across the matrix, that submatrix is multiplied element-wise by the kernel and then summed to a single number that becomes the corresponding element of the convolved image.
What is Pooling?
You can imagine that the more pixels an image has, the more computationally intensive convolution and interpreting its results become for the model. Pooling takes the convolved image and compresses it in a manner similar to the convolution process: slide a small window over the values and calculate some metric on them that outputs a single value for the corresponding element of a smaller matrix. The dimensions of the window once again depend on the developer's needs, as does the metric. If they are more focused on suppressing noise and keeping only the strongest response in each window, then the maximum of those values should be used (max pooling). Otherwise, they can use the average of the values (average pooling).
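Here is a small sketch of both pooling flavors, assuming non-overlapping windows (the window moves by its own width each step). The function `pool2d`, its `reducer` parameter, and the 4x4 feature map are invented for this example.

```python
def pool2d(matrix, size=2, reducer=max):
    """Apply `reducer` to each non-overlapping size x size window of `matrix`."""
    pooled = []
    for r in range(0, len(matrix) - size + 1, size):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, size):
            # Collect the values under the window, then squash them to one number.
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 output of a convolution step.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 8, 4],
    [2, 0, 4, 4],
]

max_pooled = pool2d(feature_map, size=2, reducer=max)
avg_pooled = pool2d(feature_map, size=2, reducer=mean)

print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```

Either way, the 4x4 matrix shrinks to 2x2. Max pooling keeps only the largest value in each window, while average pooling blends all four.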
The output of the pooling process is another, smaller matrix, which can subsequently be fed into further rounds of convolution and pooling layers.
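To get a feel for how quickly the matrix shrinks when these layers are stacked, the sketch below tracks the width of a square image through two convolution-pooling rounds. The numbers are assumptions chosen for illustration: a 28x28 input, 3x3 kernels with a stride of 1 and no padding, and 2x2 non-overlapping pooling. The helpers `conv_size` and `pool_size` are hypothetical.

```python
def conv_size(n, kernel, stride=1):
    """Output width of a convolution without padding."""
    return (n - kernel) // stride + 1


def pool_size(n, window):
    """Output width of non-overlapping pooling (leftover edge values dropped)."""
    return n // window


size = 28
sizes = [size]
for block in range(2):
    size = conv_size(size, kernel=3)   # convolution trims the border
    size = pool_size(size, window=2)   # pooling halves what is left
    sizes.append(size)

print(sizes)  # [28, 13, 5]
```

After just two rounds, the 28x28 image has been reduced to a 5x5 matrix, which is far cheaper to pass along to the rest of the network.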
The Fully-Connected Layer
Once processing in the convolution and pooling layers is complete, the resulting matrix is flattened into a column vector (like what we've seen with MNIST) and passed into a simple feedforward neural network trained with backpropagation. Activation functions such as ReLU and softmax can then be used to fit non-linear data and to classify the results, respectively.
This part of the CNN as a whole is called the fully-connected layer.
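The forward pass through this last stage can be sketched as follows. Everything here is illustrative: the 2x2 pooled matrix, the layer sizes, and especially the weights, which are hand-picked rather than learned (a real network would find them through backpropagation). The helper names `flatten`, `dense`, `relu`, and `softmax` are likewise my own.

```python
import math


def flatten(matrix):
    """Unroll a 2D matrix into a flat vector, row by row."""
    return [value for row in matrix for value in row]


def dense(inputs, weights, biases):
    """Fully-connected layer: weights[k] holds one weight per input for unit k."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def relu(values):
    """Zero out negative values; this is what lets the network fit non-linear data."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last pooling layer.
pooled = [[7, 2], [2, 8]]
x = flatten(pooled)  # [7, 2, 2, 8]

# Hand-picked weights for a 4 -> 3 -> 2 network (assumed, not trained).
hidden_w = [
    [0.1, -0.2, 0.0, 0.1],
    [-0.3, 0.1, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.1],
]
hidden_b = [0.0, 0.0, 0.0]
output_w = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
]
output_b = [0.0, 0.0]

hidden = relu(dense(x, hidden_w, hidden_b))
probs = softmax(dense(hidden, output_w, output_b))
print(probs)  # one probability per class
```

The class with the highest probability is the network's prediction. With these made-up weights, that happens to be the first class.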
That's It For Now!
Take some time to digest what you read. It's complicated! I'll provide a PyTorch demo in next week's article. Thanks for reading and see you in the demo!