Summary

Goal

Set up a YOLO neural network for object detection on an input image. The YOLO architecture has shown promise for real-time object detection, running fast enough to be applicable to dynamic environments.

Conclusions

YOLO works quite well on iconic images. Even on non-iconic images, the model can detect objects in the background as well as those in the foreground. However, accuracy needs to improve for cars further back in the image.

Detection takes under two seconds per image on an NVIDIA GTX 1070 GPU. Most of that time appears to be lost to the unoptimized PyTorch implementation of YOLO; it would likely improve with a preloaded model running on a server (for example, behind Gunicorn or Django) that accepts image requests. The GPU does not seem to be fully utilized either, running at ~11% VRAM usage during detection.
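
One way to check where the time goes is to time model loading separately from per-image detection. The sketch below is only an illustration, not the implementation used here: `load_model` and `detect` are hypothetical stand-ins (with simulated delays, not measured values) for the PyTorch YOLO loading and inference calls, and `functools.lru_cache` plays the role of a server process that keeps the model in memory between requests.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model():
    """Hypothetical stand-in for building the network and loading weights."""
    time.sleep(0.05)  # simulated one-time load cost, not a measured value
    return {"name": "yolo-stand-in"}


def detect(model, image):
    """Hypothetical stand-in for a forward pass; returns no real detections."""
    time.sleep(0.005)  # simulated per-image inference cost
    return []


def timed_detect(image):
    """Return (detections, load_seconds, inference_seconds) for one image."""
    t0 = time.perf_counter()
    model = load_model()  # served from the cache after the first call
    t1 = time.perf_counter()
    detections = detect(model, image)
    t2 = time.perf_counter()
    return detections, t1 - t0, t2 - t1


if __name__ == "__main__":
    for name in ["first.jpg", "second.jpg"]:
        _, load_s, infer_s = timed_detect(name)
        print(f"{name}: load {load_s:.3f}s, inference {infer_s:.3f}s")
```

With this structure only the first request pays the load cost; later requests reuse the cached model, which is the effect a preloaded model on a server would be expected to have.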

YOLO proves to be a powerful starting point for developing a model that can detect several objects in different poses in an image. See the references section for a more detailed overview of YOLO's effectiveness on different image types.

Image Types

Not all images are the same. They differ in time of day, number of objects per image, the number of different object types in an image, and even pixel density. Classical image detection also applies preprocessing techniques, such as conversion to grayscale, to support detection.
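
As a small illustration of the grayscale preprocessing step mentioned above, the sketch below converts RGB pixels to luminance values in plain Python. It assumes the image is given as rows of `(R, G, B)` tuples and uses the ITU-R BT.601 luma weights, which are one common choice and not necessarily what any particular pipeline uses; the function name `to_grayscale` is made up for this example.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values.

    Assumes 8-bit channels and uses the ITU-R BT.601 luma weights.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # A 1x3 image: pure red, pure green, pure blue.
    image = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)]]
    print(to_grayscale(image))
```

In practice an image library would do this conversion on whole arrays; the explicit loop is only meant to show the weighting.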

Two main distinctions are most prevalent in recent training sets such as VOC and COCO:

Number of classes across the dataset
How iconically the objects in the images are captured

Iconic

These images have the object front and center. Some examples of these include product shots found on Amazon, or portraits dedicated to a person or thing. These are also the easiest to annotate, since there is usually just one object.

These are also the images that YOLO detects most accurately. In one case, however, YOLO failed to classify toy cars in an iconic image, mistaking them for cell phones. This is most likely a consequence of the training set, which determines which car features the model learns to treat as significant.

Non-iconic

These have more objects in a frame. "More" is loosely defined here as more than three objects in an image, and these objects often belong to different classes.
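
The working definition above can be written down as a small check. This is a minimal sketch that assumes detections are available as a plain list of class labels; the names `describe_image` and `NON_ICONIC_MIN_OBJECTS` are made up for illustration, and the threshold simply encodes the loose "more than three objects" rule rather than any standard from VOC or COCO.

```python
from collections import Counter

# "More than three objects", per the loose working definition in the text.
NON_ICONIC_MIN_OBJECTS = 4


def describe_image(labels):
    """Summarize detected class labels and flag the image as non-iconic."""
    counts = Counter(labels)
    return {
        "num_objects": len(labels),
        "num_classes": len(counts),
        "non_iconic": len(labels) >= NON_ICONIC_MIN_OBJECTS,
    }


if __name__ == "__main__":
    print(describe_image(["car"]))
    print(describe_image(["car", "car", "person", "truck"]))
```

The class count is reported alongside the flag because non-iconic images often, but not always, mix several classes.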

YOLO has some success detecting objects in this type of image, even when they are in the background rather than front and center. However, in the parking lot use case, YOLO has had a hard time detecting cars further back in the lot.

Deliverable 4 - YOLO Detection