Sunday, July 18, 2010

Details behind SIFT Feature detection


First octave: blur the image progressively
Second octave: resize the image to half its size and blur progressively again

Getting the LoG (via the DoG): subtract image 1 from image 2, then image 2 from image 3, then image 3 from image 4.
Find the maxima/minima by comparing each point against the images one scale higher and one scale lower.

NOW FOR THE DETAILS (based on my understanding):
Images and details from:
Lowe, David G. “Distinctive Image Features from Scale-Invariant Keypoints”. International Journal of Computer Vision, 60, 2 (2004)

The scale-invariant feature transform (SIFT) is used to detect and extract local features of an image.
The steps are as follows:

  1. Generate a Difference-of-Gaussian (DoG) pyramid, an approximation of a Laplacian pyramid
  2. Detect extrema in the DoG pyramid: any point that is a local maximum or minimum is an extremum
  3. Eliminate low-contrast or poorly localized points; what remains are the keypoints
  4. Assign an orientation to each keypoint based on local image properties
  5. Compute and generate keypoint descriptors
The details:

The main SIFT implementation consists of four major stages described by Lowe. The first stage is finding scale-space extrema. This is done via image pyramids: the image is repeatedly smoothed by convolving it with a Gaussian operator and then sub-sampled to produce the higher levels of the pyramid. For this stage, Lowe suggests 4 octaves with 5 levels of blur each. An octave corresponds to a doubling of the value of σ.
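As a sketch of this first stage: the pyramid below blurs each octave at several σ levels and then halves the image to start the next octave. All function names (`gaussian_pyramid`, `blur`, `downsample`) are my own, borders are handled by simple clamping, and each octave re-blurs its base image from scratch for clarity; a real implementation blurs incrementally.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    vals = [math.exp(-(x * x) / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping at the borders."""
    r = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), len(row) - 1)
            acc += w * row[j]
        out.append(acc)
    return out

def blur(img, sigma):
    """Separable Gaussian blur: filter rows, then columns."""
    kern = gaussian_kernel(sigma)
    rows = [blur_1d(r, kern) for r in img]
    cols = [blur_1d(list(c), kern) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def downsample(img):
    """Keep every second pixel in each direction (half-size image)."""
    return [row[::2] for row in img[::2]]

def gaussian_pyramid(img, octaves=4, levels=5, sigma0=1.6):
    """Build the pyramid Lowe suggests: 4 octaves, 5 blur levels each.
    sigma doubles across an octave; k is the per-level blur ratio."""
    k = 2 ** (1.0 / (levels - 1))
    pyramid = []
    base = img
    for _ in range(octaves):
        octave = [blur(base, sigma0 * k ** i) for i in range(levels)]
        pyramid.append(octave)
        base = downsample(octave[-1])  # start the next octave half-size
    return pyramid
```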



After obtaining the full pyramid, subtracting adjacent blur levels within each octave results in an approximate solution to the Laplacian of Gaussian.
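In code, this subtraction step is just an element-wise difference of adjacent blur levels inside one octave (a sketch; the function name is mine):

```python
def difference_of_gaussian(octave):
    """Subtract each blurred image from the next within one octave.

    `octave` is a list of progressively more blurred images (2-D lists);
    n blur levels yield n-1 DoG images, approximating the LoG.
    """
    dogs = []
    for a, b in zip(octave, octave[1:]):
        dogs.append([[bv - av for av, bv in zip(row_a, row_b)]
                     for row_a, row_b in zip(a, b)])
    return dogs
```

With 4 blurred images per octave this gives the 3 DoG images the extrema search needs.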



The amount of blur per interval and octave is very important in obtaining keypoints (MATLAB source screenshot).


This is a manual computation I did in Excel:
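The same table can be generated in a few lines; `blur_schedule` is a hypothetical helper, and the constant ratio k = 2^(1/intervals) between adjacent levels follows the paper:

```python
def blur_schedule(sigma0=1.6, intervals=2, octaves=4, levels=5):
    """Per-octave blur amounts: sigma doubles once per octave.

    k = 2**(1/intervals) is the constant ratio between adjacent levels;
    each octave starts at twice the previous octave's base sigma.
    """
    k = 2 ** (1.0 / intervals)
    table = []
    for o in range(octaves):
        base = sigma0 * (2 ** o)
        table.append([base * k ** i for i in range(levels)])
    return table
```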


The next step is to iterate through each pixel and compare it with its 8 surrounding neighbors plus the 9 neighbors at one scale higher and the 9 at one scale lower. If the pixel value is greater than all 26 of these neighbors, or smaller than all of them, it is considered a keypoint candidate.
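A direct sketch of that 26-neighbor test, assuming `dog` is a list of DoG images (2-D lists) for one octave and the point is not on a border:

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is greater (or smaller) than all 26 neighbors:
    8 in the same DoG image plus 9 each at the scales above and below."""
    v = dog[s][y][x]
    neighbors = []
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == dy == dx == 0:
                    continue  # skip the center pixel itself
                neighbors.append(dog[s + ds][y + dy][x + dx])
    return all(v > n for n in neighbors) or all(v < n for n in neighbors)
```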



After determining a candidate keypoint, a Taylor expansion of the DoG function is used to generate subpixel values from the pixel data.

Extrema detection alone generates a lot of keypoints, many of which are low-contrast points or points along an edge. To eliminate these, the Taylor expansion is used to interpolate an intensity value at the subpixel location, which is then compared against a threshold: if its absolute value falls below the threshold, the point is rejected as low contrast, and poorly localized edge points are likewise discarded. Following this step, a set of stable keypoints remains that is scale invariant. To make the keypoints rotation invariant, a weighted histogram of local gradient directions and magnitudes around each keypoint is computed at the selected scale, and the most prominent orientation in that region is assigned to the keypoint.
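The Taylor fit is easiest to see in one dimension: fit a parabola through three adjacent DoG samples, step to its extremum, and reject the point if the interpolated value is too small. This is only an illustration of the idea; Lowe solves the same system over x, y and scale, and 0.03 is the contrast threshold he suggests:

```python
def refine_and_test(d_prev, d_cur, d_next, threshold=0.03):
    """1-D sketch of the subpixel step: quadratic fit through three
    adjacent DoG samples, then a contrast test on the fitted value.
    Returns (offset, interpolated_value) if accepted, else None."""
    deriv = (d_next - d_prev) / 2.0         # first derivative (central difference)
    second = d_next - 2.0 * d_cur + d_prev  # second derivative
    if second == 0.0:
        return None                         # flat: no well-defined extremum
    offset = -deriv / second                # extremum of the fitted parabola
    value = d_cur + 0.5 * deriv * offset    # interpolated DoG value there
    return (offset, value) if abs(value) >= threshold else None
```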


The orientations of these keypoints are broken into 36 bins, each representing 10 degrees of the 360-degree circle. After creating the histogram, any peak above 80% of the highest peak is converted into another keypoint with a different orientation.
To locate each peak accurately, take the highest bin in the histogram together with its left and right neighbors. This gives 3 points; fit a downward parabola through them and solve the resulting 3 equations for the vertex, which gives the refined orientation. Do this for all octaves and all intervals.
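A sketch of the 36-bin histogram and the parabola-refined peaks; the helper names are my own, and orientations are reported from the bin's starting angle for simplicity:

```python
def orientation_histogram(magnitudes, angles_deg, num_bins=36):
    """Accumulate gradient magnitudes into 10-degree orientation bins."""
    hist = [0.0] * num_bins
    for m, a in zip(magnitudes, angles_deg):
        hist[int(a % 360) // (360 // num_bins)] += m
    return hist

def dominant_orientations(hist, peak_ratio=0.8):
    """Return interpolated peak orientations: every local maximum above
    80% of the highest bin, refined with a parabola through the bin and
    its two circular neighbors."""
    n = len(hist)
    top = max(hist)
    out = []
    for i, h in enumerate(hist):
        left, right = hist[(i - 1) % n], hist[(i + 1) % n]
        if h >= peak_ratio * top and h > left and h > right:
            # vertex of the parabola through (-1, left), (0, h), (1, right)
            offset = 0.5 * (left - right) / (left - 2.0 * h + right)
            out.append(((i + offset) * (360.0 / n)) % 360.0)
    return out
```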


The next step is to produce a unique ID for each keypoint that differentiates it from the others. To accomplish this, a 16x16 window of pixels is taken around the keypoint. It is then split into sixteen 4x4 windows, and an 8-bin orientation histogram is generated for each. The histogram and gradient values are interpolated to produce a 128-dimensional feature vector. To obtain illumination invariance, the descriptor is normalized by the square root of the sum of its squared components.
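The final normalization can be sketched directly; Lowe additionally clips components at 0.2 before renormalizing, to reduce the influence of large gradient magnitudes:

```python
import math

def normalize_descriptor(vec, clip=0.2):
    """L2-normalize the 128-D descriptor (divide by the square root of
    the sum of squared components), clip each component, renormalize."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    clipped = [min(v / norm, clip) for v in vec]
    norm = math.sqrt(sum(v * v for v in clipped)) or 1.0
    return [v / norm for v in clipped]
```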

The keypoints are now invariant to affine transformations and partial illumination changes.
Although SIFT has been used in the computer vision community for quite some time, there are several studies that try to improve on it.

7 comments:

  1. Hello,
    I want to ask you one question.
    Do you know the purpose of downsampling in SIFT?
    I mean, why do we need to downsample the picture to create the Gaussian pyramid? Is there any purpose behind this action?

    I think it will make us find more keypoints? Is that right? Or will it make the calculation faster?
    If yes, which calculation needs to be faster here?
    I am not sure why this technique needs downsampling to create many levels of Gaussian.
    Do you have any idea?

  2. Please read Lindeberg, "Scale-Space Theory"... that will give you the answer on why you need to downsample...

  3. Hi, I read it as you suggested, and I understand that it is needed for scale invariance. Am I right? Or is it to make the calculation faster?
    Thanks.

  4. You are right on both. You do it to get scale invariance, and a DoG with a constant ratio of scales is a close approximation to Lindeberg’s scale-normalized Laplacian.

    Most likely you are also aiming to get points that give you information about the affine transformation.

  5. Hi, I am new to SIFT and am currently implementing the code in MATLAB. I have generated the keypoints by detecting the extrema of the DoG. How can I generate the subpixel values and apply the Hessian matrix?

  6. I have the same question as Trang: what is the purpose behind the resampling action? Thank you very much.

  7. I need help locating the extrema in the image. I've made the DoG images and a filter to apply to the 3 images to find the extrema, but I could not figure out how to make those yellow points appear as in your images. Please help me ASAP; I am a bad coder as well...
