Paper Reading — Object Detection in 20 Years: A Survey (Part 1)

Original Source Here

Deformable Part-based Model (DPM) (2008)

Original paper: Object Detection with Discriminatively Trained Part Based Models

DPM, as the winners of VOC-07, -08, and -09 detection challenges, was the peak of the traditional object detection methods. The DPM follows the detection philosophy of “divide and conquer”, where the training can be simply considered as the learning of a proper way of decomposing an object, and the inference can be considered as an ensemble of detections on different object parts. For example, the problem of detecting a “car” can be considered as the detection of its window, body, and wheels.

Figure: Objects are a collection of deformable parts.

Feature map and Filter

A feature map is an array whose entries are d-dimensional feature vectors computed from a dense grid of locations in an image. Intuitively each feature vector describes a local image patch. In practice we use a variation of the HOG features.

A filter is a rectangular template defined by an array of d-dimensional weight vectors. The response, or score, of a filter F at a position (x, y) in a feature map G is the “dot product” of the filter and a subwindow of the feature map with top-left corner at (x, y)

HOG feature pyramid

We would like to define a score at different positions and scales in an image. This is done using a feature pyramid, which specifies a feature map for a finite number of scales in a fixed range. In practice we compute feature pyramids by computing a standard image pyramid via repeated smoothing and subsampling, and then computing a feature map from each level of the image pyramid.

The scale sampling in a feature pyramid is determined by a parameter λ, which is the number of levels we need to go down in the pyramid to get to a feature map computed at twice the resolution of another one. In practice we have used λ = 5 in training and λ = 10 at test time. Fine sampling of scale space is important for obtaining high performance with our models.

Let F be a w × h filter. Let H be a feature pyramid and p = (x, y, l) specify a position (x, y) in the l-th level of the pyramid. Let φ(H, p, w, h) denote the vector obtained by concatenating the feature vectors in the w×h subwindow of H with top-left corner at p in row-major order. The score of F at p is F’ ·φ(H, p, w, h), where F’ is the vector obtained by concatenating the weight vectors in F in row-major order. Below we write F’ ·φ(H, p) since the subwindow dimensions are implicitly defined by the dimensions of the filter F.

Deformable Part Models

Our star models are defined by a coarse root filter that approximately covers an entire object (it defines a detection window) and higher resolution part filters (λ levels down in the pyramid) that cover smaller parts of the object.

A model for an object with n parts is formally defined by a (n + 2)-tuple (F0, P1, . . . , Pn, b) where F0 is a root filter, Pi is a model for the i-th part and b is a real-valued bias term. Each part model is defined by a 3-tuple (Fi, vi, di) where Fi is a filter for the i-th part, vi is a two-dimensional vector specifying an “anchor” position for part i relative to the root position, and di is a four- dimensional vector specifying coefficients of a quadratic function defining a deformation cost for each possible placement of the part relative to the anchor position.

An object hypothesis specifies the location of each filter in the model in a feature pyramid, z = (p0, . . . , pn), where pi = (xi, yi, li) specifies the level and position of the i-th filter. We require that the level of each part is such that the feature map at that level was computed at twice the resolution of the root level, li = l0−λ for i > 0.

The score of a hypothesis is given by the scores of each filter at their respective locations (the data term) minus a deformation cost that depends on the relative position of each part with respect to the root (the spatial prior), plus the bias,

If di = (0, 0, 1, 1) the deformation cost for
the i-th part is the squared distance between its actual position and its anchor position relative to the root. In general the deformation cost is an arbitrary separable quadratic function of the displacements. The bias term is introduced in the score to make the scores of multiple models comparable when we combine them into a mixture model.

The score of a hypothesis z can be expressed in terms of a dot product, β · ψ(H, z), between a vector of model parameters β and a vector ψ(H, z),


To detect objects in an image we compute an overall score for each root location according to the best possible placement of the parts,

High-scoring root locations define detections while the locations of the parts that yield a high-scoring root location define a full object hypothesis.

Fig 4. The matching process at one scale.

Mixture Models

A mixture model with m components is defined by a m-tuple, M = (M1, . . . ,Mm), where Mc is the model for the c-th component. An object hypothesis for a mixture model specifies a mixture component, 1 ≤ c ≤ m, and a location for each filter of Mc, z = (c, p0, . . . , pnc). Here nc is the number of parts in Mc. The score of this hypothesis is the score of the hypothesis z’ = (p0, . . . , pnc) for the c-th model component.

As in the case of a single component model the score of a hypothesis for a mixture model can be expressed by a dot product between a vector of model parameters β and a vector ψ(H, z). For a mixture model the vector β is the concatenation of the model parameter vectors for each component. The vector ψ(H, z) is sparse, with non-zero entries defined by ψ(H, z’) in a single interval matching the interval of βc in β,

To detect objects using a mixture model we use the matching algorithm described above to find root locations that yield high scoring hypotheses independently for each component.

Latent SVM

Since the training dataset only has the labels for the objects but not their parts, to train models using partially labeled data we use a latent variable formulation of MI-SVM that we call latent SVM (LSVM). In a latent SVM each example x is scored by a function of the following form,

Here β is a vector of model parameters, z are latent values, and Φ(x, z) is a feature vector. In the case of one of our star models β is the concatenation of the root filter, the part filters, and deformation cost weights, z is a specification of the object configuration, and Φ(x, z) is a concatenation of subwindows from a feature pyramid and part deformation features.

Hard negative mining

Hard-negative example

In the case of object detection the training problem is highly unbalanced because there is vastly more background than objects. This motivates a process of searching through the background data to find a relatively small number of potential false positives, or hard negative examples. Here we analyze data-mining algorithms for SVM and LSVM training. We prove that data-mining methods can be made to converge to the optimal model defined in terms of the entire training set.

PCA and Analytic Dimensionality Reduction

Our object models are defined by filters that score subwindows of a feature pyramid. We have investigated feature sets similar to the HOG features and found lower dimensional features which perform as well as the original ones. By doing principal component analysis on HOG features the dimensionality of the feature vector can be significantly reduced with no noticeable loss of information. Moreover, by examining the principal eigenvectors we discover structure that leads to “analytic” versions of low-dimensional features which are easily interpretable and can be computed efficiently.

Lots of details are not covered in this post, please refer to the original paper.

Post-processing: Bounding Box Prediction

In the previous work, the bounding boxes derived from root filter locations are directly returned as the detection results. However, the model actually also localizes each part filter in addition to the root filter. Furthermore, part filters are localized with greater spatial precision than root filters. Thus it makes sense to use the complete configuration of an object hypothesis, z, to predict a bounding box for the object.

This is implemented using functions that map a feature vector g(z), to the upper-left, (x1, y1), and lower-right, (x2, y2), corners of the bounding box. For a model with n parts, g(z) is a 2n + 3 dimensional vector containing the width of the root filter in image pixels (this provides scale information) and the location of the upper-left corner of each filter in the image. This is done via linear least-squares regression.

Post-processing: Non-Maximum Suppression

Using the matching procedure described in the previous session we usually get multiple overlapping detections for each instance of an object. We use a greedy procedure for eliminating repeated detections via non-maximum suppression.

We have a set of detections D for a particular object category in an image. Each detection is defined by a bounding box and a score. We sort the detections in D by score, and greedily select the highest scoring ones while skipping detections with bounding boxes that are at least 50% covered by a bounding box of a previously selected detection.

Post-processing: Contextual Information

A simple procedure to rescore detections using contextual information. Let (D1, . . . ,Dk) be a set of detections obtained using k different models (for different object categories) in an image I. Each detection (B, s) ∈ Di is defined by a bounding box B = (x1, y1, x2, y2) and a score s. We define the context of I in terms of a k-dimensional vector c(I) = (σ(s1), . . . , σ(sk)) where si is the score of the highest scoring detection in Di, and σ(x) = 1/(1+exp(−2x)) is a logistic function for renormalizing the scores. To rescore a detection (B, s) in an image I we build a 25-dimensional feature vector with the original score of the detection, the top-left and bottom-right bounding box coordinates, and the image context,

The coordinates x1, y1, x2, y2 ∈ [0, 1] are normalized by the width and height of the image. We use a category-specific classifier to score this vector to obtain a new score for the detection. The classifier is trained to distinguish correct detections from false positives by integrating contextual information defined by g.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: