Idea explained: SIFT
SIFT stands for Scale Invariant Feature Transform
Why?
To be able to detect key points even if they are in different scales (scale invariant). For example, a simple corner detector which detects a corner is unable to do so in a zoomed in image using the same sampling window.
How?
By creating a feature vector which includes a locally distinct point (key point) and a descriptor that describes the key point’s surrounding.
How to find key points? — use Difference of Gaussian (DoG) approach
○ Take the image and blur it with a gaussian blur at different magnitudes. This will give a set of images with different levels of blurs; Image slightly blurred, more blurred so on… (Left stack of Figure 2)
○ Then subtract those blurred images from the adjacent images (DoG)
○ Stack those difference images on top of each other and look for extreme points (Right stack of Figure 2 and then Figure 3)
○ This is performed in an image pyramid — images are differently scaled in size. Notice sets of different scaled octaves in Figure 2.
○ This enable to find key points which are invariant to scale changes
How to compute the descriptor vector?
○ Take the local neighborhood around a key point (16x16 block)
○ Compute the gradient for each pixel (orientation and magnitude)
○ Divide the block into smaller sub blocks ( 16, 4x4 blocks)
○ For each block compute gradient direction histogram over 8 directions
○ Concatenate the histograms to obtain a 128 (16*8) dimensional feature vector
Descriptor creation illustrated
Source: Gil’s CV Blog
SIFT features are shown to be invariant to image rotation and scale and robust across a substantial range of affine distortion, addition of noise, and change in illumination.
Note: Writing this while learning, as a self note. If there are any mistakes or comments, please let me know. Thank you!
References
[1] Distinctive Image Features from Scale-Invariant Keypoints
[2] Gil’s CV Blog