Developed primarily for face detection, the Viola-Jones framework was the first object detection framework to provide competitive detection rates in real time.
The algorithm has three phases:
1. Representing the image as what the authors call an integral image, which allows Haar-like features to be computed quickly.
2. Selecting the relevant features using AdaBoost training.
3. Reducing the search space by eliminating sub-windows, successively applying more complex classifiers in a cascade structure.
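To make the first phase concrete, here is a minimal sketch of an integral image (summed-area table) and of a two-rectangle Haar-like feature computed from it. The function names and the toy 4x4 image are illustrative assumptions, not taken from the original paper; the point is only that any rectangle sum costs four lookups once the table is built.

```python
def integral_image(img):
    """Return a summed-area table padded with one leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy image (assumption): dark left half, bright right half.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
ii = integral_image(image)
print(rect_sum(ii, 0, 0, 4, 4))                  # 48
print(haar_two_rect_horizontal(ii, 0, 0, 4, 4))  # -32
```

The feature value is strongly negative here because the right half is brighter than the left, which is the kind of contrast pattern a Haar-like feature responds to.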
The third phase in particular is known as a cascade classifier.
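The early-rejection idea behind a cascade can be sketched as follows. This is only an illustration: the two stage functions are toy stand-ins (assumptions) for the boosted classifiers a real cascade would use, and the thresholds are arbitrary.

```python
def cascade_accepts(window, stages):
    """Return True only if the window passes every stage.

    Each stage is a (score_fn, threshold) pair. Stages are ordered from
    cheapest to most expensive, and evaluation stops at the first rejection.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # rejected early; later stages never run
    return True


# Toy stage functions (hypothetical, not boosted Haar classifiers).
def mean_intensity(window):
    return sum(window) / len(window)


def intensity_range(window):
    return max(window) - min(window)


stages = [(mean_intensity, 50), (intensity_range, 100)]
windows = {
    "dark": [0, 5, 10, 5],
    "flat": [100, 100, 100, 100],
    "textured": [20, 200, 30, 180],
}
accepted = [name for name, w in windows.items() if cascade_accepts(w, stages)]
print(accepted)  # ['textured']
```

Most sub-windows are discarded by the first, cheap stage, so the more expensive stages only run on the few that remain.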
Its accuracy is not comparable with that of current deep learning models, but because it was intended for use on low-power CPUs (phones, cameras), it is lightweight and fast.
A traditional approach similar to Viola-Jones. It uses Histogram of Oriented Gradients (HOG) features and a Support Vector Machine (SVM) for classification. It still requires a multi-scale sliding window, and even though it is superior to Viola-Jones, it is much slower.
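As a rough illustration of what a HOG feature is, the sketch below builds a magnitude-weighted histogram of gradient orientations for a single cell. It is deliberately simplified and rests on several assumptions: 9 unsigned orientation bins, centered differences, no interpolation between bins, no block normalization, and border pixels skipped.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # centered horizontal difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # centered vertical difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Toy cell (assumption): a vertical edge, so all gradients point horizontally.
vertical_edge = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
print(cell_histogram(vertical_edge))
```

In a full descriptor, histograms like this one would be computed over a grid of cells, normalized over blocks, and concatenated into the feature vector that the SVM classifies.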
Selective Search replaces the exhaustive search of the previous models with a selective one. It uses segmentation to generate a limited set of locations on which bag-of-words features are calculated.
Instead of searching a small number (tens) of accurate locations (usually selected with some sort of contour analysis), it generates a large number (thousands) of approximate locations at all scales. Initially the image is oversegmented, and the segments are then progressively grouped together, which makes it possible to account for all scales. Different grouping strategies can also be used to account for different types of features (e.g. color-based, texture-based, etc.). Finally, because it reduces the number of object locations to consider for the actual recognition, it allows a more computationally intensive classifier.
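The progressive grouping can be sketched as a greedy bottom-up merge that records a bounding box at every step. This is a simplified illustration under stated assumptions: regions are described only by a box, a pixel count and a scalar color in [0, 1]; the similarity combines a color term and a size term; and, unlike a real implementation, any two regions may merge (adjacency is not checked).

```python
def merge(a, b):
    """Merge two regions: union of boxes, summed size, size-weighted mean color."""
    size = a["size"] + b["size"]
    return {
        "box": (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
                max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3])),
        "size": size,
        "color": (a["color"] * a["size"] + b["color"] * b["size"]) / size,
    }


def similarity(a, b, image_size):
    color_sim = 1.0 - abs(a["color"] - b["color"])          # similar colors merge first
    size_sim = 1.0 - (a["size"] + b["size"]) / image_size   # small regions merge first
    return color_sim + size_sim


def hierarchical_grouping(regions, image_size):
    """Greedily merge the most similar pair until one region is left.

    Returns the boxes of the initial regions plus one box per merge, so the
    proposals cover every scale from small segments up to the whole image.
    """
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = ((i, j) for i in range(len(regions)) for j in range(i + 1, len(regions)))
        i, j = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]], image_size))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals


# Toy oversegmentation of a 4x4 image (assumption); boxes are (x0, y0, x1, y1).
segments = [
    {"box": (0, 0, 2, 2), "size": 4, "color": 0.10},
    {"box": (2, 0, 4, 2), "size": 4, "color": 0.15},
    {"box": (0, 2, 4, 4), "size": 8, "color": 0.90},
]
proposals = hierarchical_grouping(segments, image_size=16)
print(proposals)
```

Running the grouping several times with different similarity functions (for example a texture-based one) and pooling the resulting boxes corresponds to the different grouping strategies mentioned above.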
A single ConvNet is used for detection, recognition and localization. It uses multi-scale sliding windows to produce a distribution over categories for each window. In addition, it predicts the position and size of the bounding box relative to the window. In contrast to Selective Search, proposals are accumulated over subsequent passes.
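The two ideas in this paragraph, multi-scale sliding windows and a box predicted relative to its window, can be illustrated with the sketch below. The window sizes, the stride and especially the `(dx, dy, sw, sh)` offset encoding are hypothetical choices made for the example; the original model may parameterize its predictions differently, and it does not enumerate windows explicitly like this.

```python
def sliding_windows(image_w, image_h, base=32, scales=(1, 2, 4), stride_ratio=0.5):
    """Yield square (x, y, w, h) windows over the image at several scales."""
    for scale in scales:
        size = base * scale
        stride = max(1, int(size * stride_ratio))
        for y in range(0, image_h - size + 1, stride):
            for x in range(0, image_w - size + 1, stride):
                yield (x, y, size, size)


def decode_box(window, offsets):
    """Convert a window-relative prediction into absolute image coordinates.

    offsets = (dx, dy, sw, sh): center shift as a fraction of the window size,
    and box width/height as multiples of the window size (hypothetical encoding).
    """
    x, y, w, h = window
    dx, dy, sw, sh = offsets
    cx = x + w / 2 + dx * w
    cy = y + h / 2 + dy * h
    bw, bh = sw * w, sh * h
    return (cx - bw / 2, cy - bh / 2, bw, bh)


windows = list(sliding_windows(128, 64))
print(len(windows))  # 24 windows: 21 at 32 px, 3 at 64 px, none at 128 px
print(decode_box((32, 0, 64, 64), (0.25, 0.0, 0.5, 0.5)))
```

Each window would be scored by the network, and the decoded boxes from all windows and scales would then be combined into the final detections.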
R-CNN is an object detection system based on three modules:
1. Generation of category-independent region proposals. Various methods can be used (e.g. Selective Search).
2. Feature extraction for every region using a CNN.
3. Classification with SVMs.
The model used pre-training for region proposals and features. To slim the model down, a number of subsequent versions have been developed, such as Fast R-CNN and Faster R-CNN.
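The way the three modules fit together can be sketched as a simple pipeline. Everything below is a structural illustration: `propose_regions` and `extract_features` are hypothetical stand-ins for a proposal method and a CNN, and the single linear classifier with hand-picked weights stands in for the trained per-class SVMs.

```python
def detect(image, propose_regions, extract_features, svms):
    """Run the three modules in sequence and return (box, class, score) tuples.

    propose_regions(image) -> list of boxes
    extract_features(image, box) -> list of floats (stand-in for the CNN)
    svms: {class_name: (weights, bias)}, one linear classifier per class
    """
    detections = []
    for box in propose_regions(image):               # module 1: region proposals
        features = extract_features(image, box)      # module 2: features per region
        for name, (weights, bias) in svms.items():   # module 3: per-class scoring
            score = sum(w * f for w, f in zip(weights, features)) + bias
            if score > 0:
                detections.append((box, name, score))
    return detections


# Toy stand-ins (assumptions): fixed proposals and a mean-intensity "feature".
def fixed_proposals(image):
    return [(0, 0, 2, 2), (2, 0, 4, 2)]  # boxes are (x0, y0, x1, y1)


def mean_intensity_feature(image, box):
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(pixels) / len(pixels)]


image = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
]
svms = {"bright": ([1.0], -0.5)}
detections = detect(image, fixed_proposals, mean_intensity_feature, svms)
print(detections)
```

Note that the feature extractor runs once per proposal, which is the main cost that the later variants try to remove.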
The R-CNN approach quickly evolved into a purer deep learning one, Fast R-CNN. Like R-CNN, it used Selective Search to generate object proposals, but instead of extracting features for each of them independently and using SVM classifiers, it applied the CNN to the complete image and then used Region of Interest (RoI) Pooling on the feature map, followed by a final feed-forward network for classification and regression. Not only was this approach faster, but having the RoI Pooling layer and the fully connected layers allowed the model to be end-to-end differentiable and easier to train. The biggest downside was that the model still relied on Selective Search (or another region proposal algorithm), which became the bottleneck at inference time.
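RoI Pooling can be illustrated with a small pure-Python sketch: a region of arbitrary size on the feature map is divided into a fixed grid and each grid cell is max-pooled, so every proposal yields an output of the same shape. The 2x2 output size, the single-channel feature map and the exact bin rounding are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_h x out_w grid.

    Coordinates are in feature-map cells; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * h) // out_h                                  # floor of bin start
        ye = max(ys + 1, y0 + ((i + 1) * h + out_h - 1) // out_h)   # ceil of bin end
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy single-channel feature map (assumption).
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(feature_map, (1, 0, 6, 3)))  # [[10, 12], [16, 18]]
print(roi_max_pool(feature_map, (0, 0, 2, 2)))  # [[1, 2], [7, 8]]
```

Both regions produce a 2x2 output despite their different sizes, which is what lets the fully connected layers that follow operate on any proposal.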
Shortly after that, the paper You Only Look Once: Unified, Real-Time Object Detection (YOLO) was published by Joseph Redmon (with Girshick appearing as one of the co-authors). YOLO proposed a simple convolutional neural network approach with both great results and high speed, allowing real-time object detection with a deep learning model for the first time.
Faster R-CNN is the third iteration of the R-CNN series. It added what the authors called a Region Proposal Network (RPN), in an attempt to get rid of the Selective Search algorithm and make the model completely trainable end-to-end. The RPN has the task of outputting candidate objects based on an “objectness” score. These candidates are then used by the RoI Pooling and fully connected layers for classification.
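The role of the objectness score can be sketched as a ranking step: candidate boxes are scored, and only the highest-scoring ones are passed on. This is a heavily simplified illustration; the boxes and raw scores below are made up, and a real RPN also generates and refines the candidate boxes itself rather than receiving them as input.

```python
import math


def select_proposals(boxes, logits, k=2):
    """Keep the k boxes with the highest objectness score.

    logits are raw network outputs (hypothetical values here); a sigmoid
    turns each one into a score between 0 and 1.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [(box, score) for score, box in ranked[:k]]


# Made-up candidates, as (x0, y0, x1, y1), with made-up raw scores.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
logits = [-2.0, 3.0, 0.5]
kept = select_proposals(boxes, logits, k=2)
print([box for box, _ in kept])  # [(5, 5, 20, 20), (30, 30, 40, 40)]
```

The objectness score says only how likely a box is to contain some object; deciding which class it belongs to is left to the later layers.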
Single Shot Detector (SSD) takes on YOLO by using convolutional feature maps of multiple sizes, achieving better results and speed.
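To give an intuition for why feature maps of several sizes help, the sketch below lists the prediction locations of a few square grids in normalized image coordinates. The grid sizes are hypothetical and chosen for readability, not taken from the SSD paper.

```python
def cell_centers(grid_size):
    """Centers of a grid_size x grid_size feature map, in [0, 1] image coordinates."""
    step = 1.0 / grid_size
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(grid_size) for col in range(grid_size)]


grid_sizes = (8, 4, 2, 1)  # hypothetical feature-map resolutions, fine to coarse
total_locations = 0
for size in grid_sizes:
    centers = cell_centers(size)
    total_locations += len(centers)
    # Each cell covers 1/size of the image side: coarse grids suit large objects.
    print(size, len(centers), 1.0 / size)

print(total_locations)  # 85
```

Fine grids provide many locations that each cover a small part of the image, while coarse grids provide a few locations that each cover a large part, so objects of different sizes can be handled in a single pass.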
Region-based Fully Convolutional Networks (R-FCN) takes the architecture of Faster R-CNN but uses only convolutional layers.