An Unsupervised Neural Image Compression Algorithm


In this article, we will design a neural network-based unsupervised image compression algorithm that can work on images of arbitrary type and size. So sit back, clear your mind and read on!

How to compress images?

There are many ways of achieving image compression. A tried and tested method is to represent an image with fewer colours, reducing the file size while keeping the same resolution. Since we humans do not look at each pixel but rather at the image as a whole, removing a good chunk of the colours takes nothing away from the image, at least to the naked eye, while achieving good compression.

Consider a $4 \times 4$ matrix as shown in Fig. 1 (left). Imagine this matrix to represent a single-channel $4 \times 4$ image. Each cell contains a pixel value $p_1 \dots p_{16}$. A simple compression technique is to replace multiple pixel values with a single value as shown in Fig. 1 (right). Thus, based on the figure, we are representing the original $16$ different pixel values with only $4$. This allows encoding techniques such as JPEG to take advantage of the reduced colour space, thus minimizing the image size.

But how do we achieve this?

 

Fig. 1: Imagine an image compression algorithm that takes a $4 \times 4$ single-channel image on the left and replaces the pixel values, $p_i$, with $q_j$ on the right such that $q_j$ is repeated. Thus, instead of representing an image with $16$ values (left), the algorithm represents it with only $4$ (right) which allows JPEG to compress the image by a large margin.

An Initial Try: Average “Pooling”

Based on Fig. 1, can you think of a simple compression technique? How about replacing each $2 \times 2$ block of pixels in an image with the average of those pixels? For example, from Fig. 1, $q_1 = \frac{p_1 + p_2 + p_5 + p_6}{4}$, and $q_1$ is then repeated $4$ times. Since the average colour is a good representation of the colours it replaces, this should be a good idea, right? Well, let's try it out.
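Here is a minimal NumPy sketch of this averaging idea (the function name, file names and library choice are illustrative, not a fixed implementation):

```python
import numpy as np
from PIL import Image

def average_compress(img: np.ndarray, block: int = 2) -> np.ndarray:
    """Replace each block x block patch of pixels with its average colour."""
    h, w = img.shape[:2]
    # Crop so that both dimensions divide evenly by the block size.
    img = img[: h - h % block, : w - w % block].astype(np.float64)
    # Average over each block...
    pooled = img.reshape(img.shape[0] // block, block,
                         img.shape[1] // block, block, -1).mean(axis=(1, 3))
    # ...then repeat each average back to full resolution.
    out = np.repeat(np.repeat(pooled, block, axis=0), block, axis=1)
    return out.astype(np.uint8).squeeze()

# Hypothetical usage: block=2 gives the 2x compression, block=4 the 4x version.
img = np.array(Image.open("dragon.png").convert("RGB"))
Image.fromarray(average_compress(img, block=2)).save("dragon_2x.jpg")
```

Saving the result as JPEG lets the encoder exploit the flat blocks, which is where the file-size reduction comes from.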

Fig. 2 (a): Original image (a small dragon) with a file size of 88.2KB

Fig. 2 (b): 2x compressed image with a file size of 26.8KB

Fig. 2 (c): 4x compressed image with a file size of 20.6KB

Fig. 2 (a) is the original image while Fig. 2 (b) and (c) are the $2\times$ and $4\times$ compressed versions, respectively, using the averaging technique described above. $2\times$ compression is achieved by replacing each $2 \times 2$ block of pixels with its average. Similarly, $4\times$ compression replaces each $4 \times 4$ block with its average. As one can see, the images become heavily pixelated.

Can we do better? Well, let’s try to understand what’s wrong with the current approach.

We are picking adjacent pixels and then replacing them with their average. Now, look at Fig. 3, specifically the red box in the image. When considering adjacent pixels, you will reach a point where the background and foreground meet (Spider-Man's leg and the shadow behind it). At this point, the colours will most likely be extremely different, and replacing such different colours with their average makes the compression quite noticeable. Similarly, the green boxes denote regions with similar pixel values at different locations in the image. Hence, instead of adjacent pixels, we should choose pixels that are similar to each other in terms of colour value.

 

Fig. 3: The green boxes denote similar pixels but at different locations in the image while the pixels within the red box are quite dissimilar from each other even though they are at the same location (Photo by Jean-Philippe Delberghe on Unsplash)

 

Is replacing with the average really the best method? Again, consider a $2 \times 2$ grid in an image. Averaging the colours means giving each pixel in that grid the same weight of $0.25$. But what if, out of those $4$ pixels, three are mostly one colour and one is slightly different? Don't you think the $3$ pixels with similar colours should be given a higher weight? Thus, we need a method that can provide us with this capability.
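To see the difference numerically, here is a toy example (the pixel values and weights are made up purely for illustration):

```python
import numpy as np

# A 2x2 grid flattened to 4 RGB pixels: three shades of red, one blue outlier.
pixels = np.array([[200, 10, 10],
                   [205, 12,  8],
                   [198,  9, 11],
                   [ 90, 90, 200]], dtype=np.float64)

# Plain average: every pixel gets the same weight of 0.25.
print(pixels.mean(axis=0))        # [173.25  30.25  57.25] -- a muddy colour

# Weighted average: favour the three similar pixels.
weights = np.array([0.32, 0.32, 0.32, 0.04])
print(weights @ pixels)           # [196.56  13.52  17.28] -- still clearly red
```

The weighted result stays close to the dominant red instead of drifting towards the outlier, which is exactly the capability we want.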

The main ingredients for good compression

There are two main parts that we need to take into account to achieve a significant size reduction with little to no loss in quality:

  • Find pixels that are most similar in terms of their RGB values and replace them with a single colour. For example, if there are very slight variations/shades of the colour red, then replacing all those pixels with a single shade of red will lead to little to no visible loss in image quality.

  • While replacing a bunch of colours with a single colour, make sure that this colour is the best possible approximation. For example, replacing different shades of red with yellow will not help your cause.

Finding similar pixels

We need to group similar pixels together, and this has to be done in an unsupervised fashion. If you thought of clustering then you are correct! We can apply k-means to find the pixels that are similar. The upside of using k-means is that the number of clusters becomes our compression hyper-parameter: more clusters mean representing the image with more colours (less compression), while fewer clusters mean fewer colours and hence higher compression.

Fig. 4 (a): The image chosen to be clustered, a man in a dungeon (Photo by Linus Sandvide on Unsplash)

Fig. 4 (b): The semantic (clustered) version of Fig. 4 (a) with $K=6$ in k-means. Here, each colour denotes a separate cluster

 

Fig. 4 (a) is the image we use to generate our semantic/cluster map while Fig. 4 (b) is its clustered version. In order to generate such a map, we need to cluster at the pixel level, since we want to find similar pixels. Let us assume that we have an image of size $w \times h \times 3$, where $w$ is the width, $h$ is the height and $3$ is the number of channels (RGB). In order to cluster such an image, we need to reshape it into a matrix of size $wh \times 3$, where $wh$ is the number of rows (width multiplied by height) and $3$ is the number of columns. Now k-means can be run on this matrix and each pixel/row/sample will be assigned a cluster label. If you give each cluster a different colour, you can generate a visualization similar to Fig. 4 (b). Code 1 provides the necessary code for this purpose.

Code 1: A short script that can be used to visualize the semantic map (clustered version) of any image
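A minimal sketch of such a script is given below (the use of scikit-learn's KMeans and Matplotlib, the file name and the value of $K$ are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.cluster import KMeans

def semantic_map(path: str, k: int = 6) -> np.ndarray:
    """Cluster an image at pixel level and return a (h, w) map of cluster labels."""
    img = np.array(Image.open(path).convert("RGB"))
    h, w, _ = img.shape
    # Reshape (h, w, 3) -> (h*w, 3): each row is one pixel sample with 3 features.
    pixels = img.reshape(-1, 3).astype(np.float64)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)

# Hypothetical usage: reproduce something like Fig. 4 (b).
labels = semantic_map("dungeon.jpg", k=6)
plt.imshow(labels, cmap="tab10")  # a different colour per cluster
plt.axis("off")
plt.show()
```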

Replacing similar pixels with a single pixel value

The previous k-means step already provides us with the average pixel value (the centroid) of each cluster. Replacing all pixels in a cluster with their average is a good starting point but, as discussed earlier, not the end of the road. To achieve better compression, one way forward is to learn the colour value that best represents each cluster. Enter the Multi-Layer Perceptron (MLP).

The compression algorithm

Fig. 5: The entire compression algorithm, shown for a single-channel image. For multiple channels, simply stack this diagram three times.

 

As shown in Fig. 5, let us consider a single-channel image of size $4 \times 4$ represented by $16$ pixel values. The idea is to predict $K$ colour values, one for each of the $K$ clusters determined in the previous step. Hence, the image is first converted into a single vector of dimension $16$ and passed through an MLP which predicts the $K$ cluster values, in this case $2$. These values are then used to replace the cluster labels generated during the k-means step, as shown in Fig. 5. The replacement is denoted as $0:q_1$ and $1:q_2$, meaning all pixels labelled as cluster $0$ receive the pixel value $q_1$ while those labelled as cluster $1$ receive the pixel value $q_2$. The loss function is simply the absolute difference between the predicted image and the original image, a.k.a. the L1 loss.

Please note: for a three-channel image, simply stack Fig. 5 three times.

Code 2: The complete code to reproduce our compression results
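A minimal PyTorch sketch of the full pipeline is given below; the hidden size, learning rate, number of steps, sigmoid output and file names are illustrative assumptions rather than the exact original settings:

```python
import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from sklearn.cluster import KMeans

K = 6                                                     # compression hyper-parameter
img = np.array(Image.open("dragon.png").convert("RGB"))  # hypothetical input file
h, w, c = img.shape

# Step 1: pixel-level k-means assigns every pixel a cluster label (see Code 1).
pixels = img.reshape(-1, c).astype(np.float32)
labels = torch.from_numpy(
    KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(pixels)).long()
target = torch.from_numpy(pixels / 255.0)                 # (h*w, c) in [0, 1]
flat = target.flatten().unsqueeze(0)                      # (1, h*w*c) MLP input

# Step 2: an MLP maps the flattened image to K colours, one per cluster.
# (For large images this dense layer gets expensive; see the Conclusion.)
mlp = nn.Sequential(
    nn.Linear(h * w * c, 128), nn.ReLU(),
    nn.Linear(128, K * c), nn.Sigmoid(),                  # keep colours in [0, 1]
)
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for step in range(500):
    colours = mlp(flat).view(K, c)       # predicted colour q_j for each cluster j
    pred = colours[labels]               # replace label j with colour q_j everywhere
    loss = (pred - target).abs().mean()  # L1 loss against the original image
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rebuild the compressed image from the final cluster colours and save it.
out = (colours[labels].detach().numpy().reshape(h, w, c) * 255).astype(np.uint8)
Image.fromarray(out).save("dragon_compressed.jpg")
```

Saving the reconstruction as JPEG then lets the encoder exploit the reduced colour space, exactly as described for Fig. 1.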

 

Results

This section is all about eye candy, so enjoy! Also, let us know in the comments whether you can spot any differences between the original and compressed versions 😉

 
Fig. 6 (a): Original image of a large dragon standing on a mountain; file size is 362KB

Fig. 6 (b): Neural compressed version; file size is 77.4KB

 

Fig. 7 (a): Original Superman image; file size is 17.2KB

Fig. 7 (b): Neural compressed; file size is 8.77KB

Fig. 7 (c): GIF of the network compression in action

 
Fig. 8 (a): Original image of fireworks in the night sky; file size is 1.15MB (Photo by Andreas Dress on Unsplash)

Fig. 8 (b): Neural compressed; file size is 215KB

 
Fig. 9 (a): Original image of a small dragon on a peak under a blue sky; file size is 88.2KB

Fig. 9 (b): Neural compressed version; file size is 13.2KB

Fig. 9 (c): GIF of the network compression in action

 
Fig. 10 (a): Original image of an extract from a book; file size is 405KB (Photo by Annie Spratt on Unsplash)

Fig. 10 (b): Neural compressed; file size is 132KB

Conclusion

We just built a neural network-based compression algorithm that is completely unsupervised and can work on ANY type and size of image. Obviously, the larger the image, the longer it takes, but if you have the hardware you can, in theory, compress arbitrarily large images. The downside is the sheer number of computations required.

As you might have realized, the neural compression system is limited by the k-means step, since clusters are built around centroids. Hence, the best colour approximation is either the mean pixel or something close to it. If you can replace this with a better pixel-similarity technique, you can achieve superior results. The current implementation is also inefficient, as the entire data is loaded onto the GPU/CPU before processing; practically speaking, a data loader (e.g., PyTorch's DataLoader) is required.

There are many techniques out there for neural compression. The aim of this article was to get you excited about the domain and to spark your interest. So, armed with this knowledge, go forth and innovate!

 