
Deep Learning Image Classification [Parking Case Study]

by Marcin Budny, Head of R&D & Mateusz Walo, Software Developer

Computers can nowadays perform tasks that just a decade ago could only be handled by humans. One such task is accurate image classification. We previously had algorithms that served this purpose, but deep learning took image recognition to a whole new level.

Deep learning image classification has an advantage over traditional computer vision techniques because it does not require manual feature engineering. When recognizing images of cats, you don’t have to explicitly tell the model to look for cat ears or whiskers in the image. You let the model figure out the important visual features on its own. It is possible to get better results faster, with less human work involved – provided enough training data is available. That is why this approach has gained so much popularity in recent years.

In this article, we will present an interesting use case of applying this machine learning algorithm to automate tasks in parking management software.

Read on to learn how deep learning helped us accurately recognize the direction of vehicle movement from still images.

The challenge: detecting the direction of vehicle movement

We were faced with an interesting problem with the parking solution we’ve been developing for one of our Norwegian partners. In this solution, ALPR cameras are used to take pictures and to read the license plate numbers of vehicles coming in and out of parking facilities. In some deployments, we were also relying on the camera to give us the direction in which a vehicle was moving (towards the camera or away from it).

And that’s where the trouble started. The cameras have limited capabilities and, in some cases, they are not able to determine the direction reliably. This happens often enough to cause us serious headaches and manual work.

But what if we had a component in our system that looks at the problematic pictures and automatically determines whether the front or rear of a vehicle is visible? That would solve 99% of the problematic cases. The last 1% would be vehicles backing away from the camera, which doesn’t happen that often.

Note: since we only have pictures and not a video stream, we can’t really tell the actual direction of movement.

The approach: deep learning image classification

Our problem seemed to be a fairly straightforward image classification task. We were pretty sure that we could get good results with a modern deep learning model and transfer learning. There was one caveat though: the picture quality.

[Images 1–3: example pictures from the ALPR camera]

The ALPR camera highlights the part of the image it is most interested in: the license plate. It uses an infrared flash for this purpose because license plates have a special reflective surface. This is also the reason for the monochrome picture. Apart from the license plate, the only parts of the vehicle visible to the naked eye are elements emitting light and other reflective surfaces.

So in the first image, you can see the front of a car, judging by the headlights. The second image may also show the front of a car, but it is not obvious. The last one most probably shows the rear, judging by the reflective elements on the bumper.

These images are a bit challenging to label, right? Fortunately, it turns out the important visual features of the vehicle – such as the shape of the lamps and the lines of the car body – are actually in the picture; we just can’t see them right away.

Let’s try the last picture again, this time with brightness enhanced 10x and contrast 2x.

[Image 4: the same picture with brightness and contrast enhanced]

Now we can definitely confirm that this is a picture of the rear of a car. As it later turned out, these brightness and contrast enhancements are only needed for humans to label the data. The deep learning model can do just fine on the raw images.
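
For labeling, a simple Pillow-based script is enough to apply such an enhancement. This is only a minimal sketch under the assumption that Pillow is used; the file name and the default factor values are just examples.

```python
from PIL import Image, ImageEnhance

def enhance_for_labeling(path, brightness=10.0, contrast=2.0):
    """Brighten the image and boost contrast so a human labeler can see the vehicle.
    The model itself is trained on the raw, unmodified images."""
    img = Image.open(path)
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    return img

# enhance_for_labeling("capture_001.jpg").save("capture_001_enhanced.jpg")
```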

Knowing that important visual features of vehicles are present in the pictures, we could plan to test different deep learning models with transfer learning.

We experimented with random image augmentations (horizontal flip, rotation, skew), the image size fed to the network and the training hyperparameters. An example augmentation pipeline is sketched below.
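
For example, such a pipeline can be expressed with torchvision transforms roughly as follows. This is a sketch only; the image size and the degree/shear values are illustrative, not the ones we finally settled on.

```python
import torchvision.transforms as T

# Augmentations applied to the training images (parameter values are illustrative).
train_transforms = T.Compose([
    T.Resize((224, 224)),                 # image size fed to the network
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=10, shear=5),  # slight rotation and skew
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),   # ImageNet statistics (transfer learning)
])

# Validation images get the same resize and normalization, but no augmentation.
val_transforms = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```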

The process: training the image classification model

We needed a dataset large enough to fine-tune an image classification model. We also knew that having just the front and rear classes wouldn’t be enough, because a kind of “negative” class was needed as well. That one would represent situations where:

  • there is no vehicle in the picture (ALPR camera was triggered incorrectly)
  • the vehicle is in the picture, but it is impossible to tell whether front or rear is visible

And this is how we introduced a class called “unknown”.

[Image: the “unknown” class]

We iterated several times with the dataset, correcting labels and extending its size. It turned out that pictures coming from different facilities have different characteristics and we needed to:

  • balance the number of pictures coming from different facilities
  • balance the number of pictures in each of the classes

Here’s an overview of the dataset size in subsequent versions. For each version, an 80% / 20% train/validation split was applied (a minimal sketch of such a split follows the chart).

[Chart: dataset size in subsequent versions]
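
A minimal sketch of such a split, assuming the labelled images live in per-class subfolders; the folder name and batch size are only examples.

```python
import torch
from torchvision import datasets, transforms as T

# Assumes the labelled images are grouped into front/, rear/ and unknown/ subfolders.
# A plain transform is used here for brevity; in practice the training subset gets
# the augmentation pipeline shown earlier.
transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
dataset = datasets.ImageFolder("data/", transform=transform)

val_size = int(0.2 * len(dataset))                  # 20% for validation
train_size = len(dataset) - val_size                # 80% for training
train_set, val_set = torch.utils.data.random_split(
    dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(42))    # reproducible split

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)
```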

The Tools: PyTorch, Azure and VS Code to the rescue

Framework

[PyTorch logo]
PyTorch, the PyTorch logo and any related marks are trademarks of Facebook, Inc.

PyTorch’s advantages were numerous. First of all, the framework provides a very good developer experience. It offers a lot of pre-trained models and makes using them really easy. It provides a set of configurable image transformations, so it is easy to augment the dataset with randomly modified images. Integration with CUDA and running models on multiple GPUs is also straightforward.
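
As an illustration, setting up a pre-trained network for our three classes takes only a few lines. This is a hedged sketch rather than our exact training code; it assumes the ResNet34 architecture mentioned later in this article.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet34 and replace the classification head
# with one for our three classes: front, rear and unknown.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 3)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# The training machine exposes two GPUs, so DataParallel can split batches across them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
```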

Having worked a little with TensorFlow 1.x previously, we’ve seen a tremendous difference in ease of use, although that is changing with TensorFlow 2.0.

One issue we had with PyTorch was the need to translate between torch tensors and NumPy arrays, and from PIL images to NumPy arrays, since a lot of existing libraries and code samples assume NumPy.
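
For illustration, the typical conversions look roughly like this (the file name is just an example):

```python
import numpy as np
import torch
from PIL import Image

pil_image = Image.open("capture_001.jpg")   # hypothetical file name
np_image = np.array(pil_image)              # PIL -> NumPy; (H, W, C) uint8 for an RGB image

tensor = torch.from_numpy(np_image)         # NumPy -> torch tensor (shares memory)
back_to_numpy = tensor.numpy()              # torch -> NumPy; GPU tensors need .cpu() first

# PyTorch models additionally expect float (C, H, W) input, so a permute
# and a cast are usually needed on top of the raw conversion:
model_input = tensor.permute(2, 0, 1).float() / 255.0
```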

Computation

[Microsoft Azure logo]
Source: https://azure.microsoft.com/

For training, we used an Azure NC12 instance that comes with one Tesla K80 accelerator. The K80 is a single dual-GPU board, so it shows up as two GPUs in the OS.

At the time we worked on the project, Microsoft offered promotional prices for their Tesla K80 machines and we were able to use NC12 instance for as little as €0.87/hour (and this includes 12 cores and 112GiB of RAM).

Other tools

[Visual Studio Code logo]

Of course, Jupyter Lab was very useful, but another solution for remote work on the notebooks blew our minds: VS Code with remote SSH access. The Python extension for VS Code supports notebooks and you can use the external computation power of a cloud instance from the comfort of your development machine.

The results

In order to know how well the model performed, we kept track of several metrics.

[Chart: best accuracy / precision / recall on the validation set]

  • 95% Accuracy

    Accuracy tells us how often the model prediction is correct. Most of the time the results were very satisfying – we were able to reach a 95% accuracy rate.

    It is important to take note of the drop in accuracy rate for dataset 2. This dataset was created as a sort of side experiment. It contained around 200 images where the camera was not able to detect the direction of the car movement.

    An important factor was that these images came from multiple different parking facilities, while the main dataset 1.9 was dominated by images from one or two large facilities. Poor accuracy on dataset 2 prompted us to work on a better balance of image sources in later versions of the main dataset.

  • 92-94% Precision

    Precision is the ratio of true positives to all predicted positives. In other words, it tells us how many predictions for a given class were correct. We achieved a high precision rate of around 92 to 94 percent.

    It is important to note that precision is calculated for each class separately. The chart shows the average result of all classes.

  • 92-95% Recall

    Recall tells us what fraction of the actual examples of a given class the model found. Here we also got very high results: from 92 up to 95 percent. A sketch of how such per-class metrics can be computed follows below.
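
For reference, a minimal sketch of how such per-class metrics can be computed with scikit-learn; the label values below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_true / y_pred are class indices (0 = front, 1 = rear, 2 = unknown)
# collected over the validation set; the values here are made up for illustration.
y_true = [0, 0, 1, 2, 1, 0, 2]
y_pred = [0, 0, 1, 2, 0, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
# Precision and recall are computed per class and then averaged,
# which matches how the chart above reports them.
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```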

Understanding the model

One of the challenges in deep learning is to understand why the model decided to assign an image to a given class.

Grad-CAM and Guided Grad-CAM are techniques that allowed us to create heatmaps of the images. Regions that were crucial for the model’s decision are highlighted, so it is easy to explain on what basis the decision was made. This gives us better confidence in the model’s ability to make correct predictions outside of the training dataset.
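
Below is a minimal Grad-CAM sketch in PyTorch, using forward and backward hooks on the last convolutional block of a ResNet. It illustrates the technique rather than our exact visualization code; the function and parameter names are only examples.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image_tensor, target_class=None):
    """Compute a Grad-CAM heatmap for a single normalized image tensor of
    shape (1, C, H, W). Assumes a plain torchvision ResNet, whose last
    convolutional block is exposed as `model.layer4`."""
    activations, gradients = [], []

    def forward_hook(module, inputs, output):
        activations.append(output)

    def backward_hook(module, grad_input, grad_output):
        gradients.append(grad_output[0])

    handle_fwd = model.layer4.register_forward_hook(forward_hook)
    handle_bwd = model.layer4.register_full_backward_hook(backward_hook)
    try:
        logits = model(image_tensor)
        if target_class is None:
            target_class = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, target_class].backward()

        acts = activations[0]                            # (1, K, h, w) feature maps
        grads = gradients[0]                             # gradients w.r.t. feature maps
        weights = grads.mean(dim=(2, 3), keepdim=True)   # global average of gradients
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image_tensor.shape[2:],
                            mode="bilinear", align_corners=False)
        cam = cam / (cam.max() + 1e-8)                   # normalize to [0, 1] for overlaying
        return cam[0, 0].detach().cpu().numpy(), target_class
    finally:
        handle_fwd.remove()
        handle_bwd.remove()
```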

As you can see in the following pictures, the model performed predictions based on the most relevant features: the lamps, bumpers and license plates. The last image is especially interesting because we can learn which car the model is looking at.

Also, we now know that the model is not paying attention to the road markings showing the lane direction.

[Grad-CAM visualizations: three correctly classified examples]

We can also visualize the situations where the model failed to classify correctly.

[Grad-CAM visualizations: two misclassified examples]

In these cases, the model failed to capture important features of the vehicle, which caused invalid predictions.

The deployment concerns

While it is possible to build a simple Python service that exposes the deep learning model through a REST API (a sketch of such a service follows the list below), there are concerns that need to be addressed, such as:

  • parallelizing model usage according to underlying machine capabilities
  • queueing requests
  • packaging and versioning the model
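
For illustration, the naive service mentioned above could look roughly like this. Flask is just an example framework and the model file name is hypothetical; notably, the sketch handles none of the concerns listed.

```python
# A naive inference service: one model, one process, one request at a time.
# Queueing, parallelism and model versioning are exactly what it does not handle.
import io

import torch
from flask import Flask, request, jsonify
from PIL import Image
from torchvision import transforms as T

app = Flask(__name__)
CLASSES = ["front", "rear", "unknown"]

model = torch.load("model.pt", map_location="cpu")   # hypothetical model file
model.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@app.route("/classify", methods=["POST"])
def classify():
    # The request body is expected to contain the raw image bytes.
    image = Image.open(io.BytesIO(request.data)).convert("RGB")
    tensor = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        prediction = model(tensor).argmax(dim=1).item()
    return jsonify({"class": CLASSES[prediction]})
```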

A project we found useful was Amazon’s Multi Model Server. Its main focus is models produced by AWS machine learning services, but it is generic enough to support other use cases as well. In particular, there is a PyTorch example (it is a bit outdated). With a little bit of work, we were able to get it up and running.

Inference time

To find out how fast the model can do its job, we benchmarked the classification of 100 images. Tests were performed on both the CPU (12 cores) and the GPU (one half of a Tesla K80).

As shown in the chart, the GPU outperforms the CPU in the image classification task (ResNet34 model). However, inference on the CPU is also possible if the cost of the cloud instance becomes a deciding factor in the choice of processing method.

[Chart: inference time, CPU vs GPU]
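
A benchmark of this kind can be sketched as follows; this is a simplified illustration, and our actual measurement and batching code differed.

```python
import time

import torch

def benchmark(model, image_tensors, device):
    """Time how long the model takes to classify a list of preprocessed
    (C, H, W) image tensors, one by one, on the given device."""
    model = model.to(device).eval()
    images = [img.to(device) for img in image_tensors]
    if device.type == "cuda":
        torch.cuda.synchronize()      # do not measure asynchronous kernel launches
    start = time.perf_counter()
    with torch.no_grad():
        for img in images:
            model(img.unsqueeze(0))
    if device.type == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

# elapsed_cpu = benchmark(model, images, torch.device("cpu"))
# elapsed_gpu = benchmark(model, images, torch.device("cuda"))
```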

To be continued – future challenges

The main future challenge is to support the model that runs in production.

Pictures may change their characteristics over time with the change of seasons, facility lighting conditions and new models of cars.

The focus is to develop techniques to continuously monitor the quality of the results and improve the model.

If you have questions concerning deep learning image classification, machine learning or if you are interested in similar projects, contact us here.

Ready to create better software?

We are constantly working on new interesting projects.

Maybe yours will be one of them?
