YOLO algorithm
What is YOLO and why is it so revolutionary?
YOLO (You Only Look Once) is a real-time object detection algorithm originally proposed by Redmon et al. in 2016. One of the ways YOLO revolutionized the field of object detection is that it was the first algorithm to take a one-look approach to the problem. In multi-class object detection, instead of running a detector once for each class, YOLO performs a single pass over the image and generates all of the bounding boxes for detected objects at once. Producing predictions in a single pass made YOLO much faster than the other state-of-the-art algorithms of 2016. While the original YOLO scored slightly below other object detectors of the time in accuracy, the drastic improvement in inference time made it suitable for real-time applications. Below are a timeline showing how YOLO gave rise to the class of one-look algorithms (left) and a plot of YOLO's accuracy and inference time against other state-of-the-art algorithms of the time (right).
[Figure: (left) timeline of one-look object detection algorithms descended from YOLO; (right) accuracy vs. inference time for YOLO and contemporary detectors]
How Does YOLO Work?
The YOLO algorithm begins by taking an image as input and dividing it into an S by S grid of cells. The algorithm's job is to predict, for each cell, whether it contains the center of an object and, if it does, a probability distribution over the C classes it is trying to detect. To make these predictions, the original YOLO v1 model used a convolutional neural network made up of 24 convolutional layers interleaved with max pooling layers, followed by fully connected layers.
Each cell makes B predictions for where an object is; each of these predictions is called a bounding box. For every cell, the algorithm outputs B prediction vectors, each made up of 5 + C values: [p, x, y, h, w, c], where p is the probability of an object being in the box, x and y are the coordinates of the object's center, h and w are the object's height and width, and c is a vector of scores over the C classes (a one-hot vector in the training labels). Everything in this section so far makes up the core YOLO network: it takes an image as input and returns B vectors for each of the S by S cells it divided the picture into at the beginning.
At this point, we have S × S × B predicted boxes, but it turns out that this is usually far too many: many of these boxes do not actually contain objects, and several often cover the same object. To remove the unwanted boxes, a two-stage approach is taken. First, any box with a low confidence p of containing an object is removed; this discards boxes that were unlikely to have held objects. Next, non-maximum suppression (NMS) is used to remove duplicate boxes that are detecting the same object: among overlapping boxes with the same predicted class, only the most confident one is kept. Overlap is measured with IoU (Intersection over Union), where two boxes' score is the area of their intersection divided by the area of their union; a box whose IoU with a more confident box exceeds a threshold is suppressed. After this two-stage filter, we are left with only boxes that have relatively high confidence and that cover distinct objects.
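As a concrete illustration, here is a minimal sketch in Python/NumPy of decoding one cell's raw predictions under the simplified per-box layout described above. This layout is an assumption for illustration: the actual YOLO v1 network predicts one set of C class probabilities per cell, shared across its B boxes, rather than one per box. The sizes S = 7, B = 2, C = 20 are the values from the original paper.

```python
import numpy as np

# Grid and prediction sizes from the YOLO v1 paper.
S, B, C = 7, 2, 20

# Hypothetical raw network output, laid out as described above:
# one (5 + C)-vector [p, x, y, h, w, class_scores...] per box, per cell.
preds = np.random.rand(S, S, B, 5 + C)

def decode_cell(preds, row, col):
    """Return (confidence, box, class_id) for each of the B boxes in one cell."""
    boxes = []
    for b in range(B):
        p, x, y, h, w = preds[row, col, b, :5]
        class_scores = preds[row, col, b, 5:]
        class_id = int(np.argmax(class_scores))
        # x and y locate the object's center relative to the cell, while
        # h and w are relative to the whole image. Convert the center to
        # normalized image coordinates.
        cx = (col + x) / S
        cy = (row + y) / S
        boxes.append((float(p), (cx, cy, w, h), class_id))
    return boxes
```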
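And here is a hedged sketch of the two-stage filter itself: a confidence threshold followed by greedy non-maximum suppression, with IoU computed exactly as defined above. The threshold values are illustrative rather than from the paper, and in practice NMS is usually run separately for each predicted class.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Two-stage filter: drop low-confidence boxes, then suppress duplicates.

    boxes: list of (x1, y1, x2, y2) corners; scores: matching confidences.
    Returns the indices of the boxes that survive.
    """
    # Stage 1: discard boxes unlikely to contain an object.
    idxs = [i for i, s in enumerate(scores) if s >= conf_thresh]
    # Stage 2: greedy non-maximum suppression. Keep the most confident
    # remaining box, then remove any box that overlaps it too much.
    idxs.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while idxs:
        best = idxs.pop(0)
        keep.append(best)
        idxs = [i for i in idxs if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```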
Here are some images that illustrate key concepts in how the YOLO algorithm detects objects. Left: each image is divided into cells, and a single prediction is made per cell by taking the best bounding box for that cell. Right: shows how, even after keeping only the best box per cell, there are far too many boxes, and how after the two-stage filtering only the high-confidence, distinct boxes remain.
Sample Inferences
The most common dataset YOLO is trained on is the Common Objects in Context (COCO) dataset, which provides annotations for 80 classes of objects. The YOLO algorithm takes in an image and returns bounding boxes around each of the object instances it detects. Here are some samples.
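For readers who want to reproduce detections like these, below is a minimal inference sketch using the Ultralytics Python package, a modern YOLO descendant pretrained on COCO (not the original 2016 model). The model file name and image path are assumptions for illustration.

```python
# Assumes `pip install ultralytics` and a local image file "street.jpg".
from ultralytics import YOLO

model = YOLO("yolov8n.pt")     # small model pretrained on COCO
results = model("street.jpg")  # single forward pass over the image

# Print the class name, confidence, and box corners for each detection.
for box in results[0].boxes:
    class_id = int(box.cls)
    print(model.names[class_id], float(box.conf), box.xyxy.tolist())
```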