Sunday, July 18, 2010

Details behind SIFT Feature detection


First octave: blur the image progressively.
Second octave: resize the image to half its size and blur progressively again.

Getting the LoG: subtract image 1 from image 2, image 2 from image 3, and image 3 from image 4.
Find the maxima/minima of each image by comparing against the scales one higher and one lower.

NOW FOR THE DETAILS (based on my understanding):
Images and details from:
Lowe, David G. "Distinctive Image Features from Scale-Invariant Keypoints". International Journal of Computer Vision, 60(2), 2004.

The Scale-Invariant Feature Transform (SIFT) is used to detect and extract local features of an image.
The steps are as follows:

  1. Generate a Difference of Gaussian (DoG) pyramid, an approximation of a Laplacian pyramid
  2. Detect extrema in the DoG pyramid: each local maximum or minimum found is an extremum
  3. Eliminate low-contrast or poorly localized points; what remains are the keypoints
  4. Assign an orientation to each keypoint based on the local image properties
  5. Compute and generate the keypoint descriptors
The details:

The main SIFT implementation consists of four major stages described by Lowe. The first is scale-space extrema detection, which is done with image pyramids. The process repeatedly smooths the image by convolving it with a Gaussian operator and then sub-samples it to build the higher levels of the pyramid. For this stage, Lowe suggests 4 octaves with 5 levels of blur each, where an octave corresponds to doubling the value of σ.
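To make the pyramid construction concrete, here is a minimal sketch in Python/NumPy (my own rewrite, not the MATLAB source referenced below; the function name, σ0 = 1.6 and k = √2 are assumptions, and the per-level σ bookkeeping is simplified compared to the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, n_octaves=4, n_levels=5, sigma0=1.6, k=np.sqrt(2)):
    """Build a Gaussian pyramid from a grey-level image: each level within an
    octave is blurred a bit more (sigma grows by a factor k), and each new
    octave starts from the previous one sub-sampled to half the size."""
    pyramid = []
    octave_base = img.astype(np.float64)
    for _ in range(n_octaves):
        octave = [gaussian_filter(octave_base, sigma0 * k ** i)
                  for i in range(n_levels)]
        pyramid.append(octave)
        octave_base = octave[-1][::2, ::2]   # take every 2nd pixel
    return pyramid
```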



After the full pyramid is obtained, subtracting adjacent blur levels within each octave gives an approximate solution to the Laplacian of Gaussian.
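Given the pyramid from the sketch above, the DoG approximation is just the per-octave differences described here:

```python
def dog_pyramid(pyramid):
    """Difference of Gaussians: subtract each blur level from the next one
    within every octave, approximating the Laplacian of Gaussian."""
    return [[octave[i + 1] - octave[i] for i in range(len(octave) - 1)]
            for octave in pyramid]
```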



The amount of blur per interval and per octave is very important for obtaining keypoints (MATLAB source screenshot)


This is a manual computation I did in Excel.


The next step is to iterate through each pixel and compare it with its 8 surrounding neighbors in the same scale and the 9 neighbors at one scale higher and one scale lower (26 neighbors in total). If the pixel value is greater or smaller than all of its neighbors, it is considered a keypoint candidate.
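A minimal sketch of that 26-neighbour test for one interior pixel (names and layout are my own; `dog_below`, `dog_same`, `dog_above` are three adjacent DoG images of the same octave):

```python
import numpy as np

def is_extremum(dog_below, dog_same, dog_above, r, c):
    """True if the pixel at (r, c) of the middle DoG image is a maximum or a
    minimum of its 26 neighbours: 8 in the same scale plus 9 in the scale
    above and 9 in the scale below.  (r, c) must be an interior pixel."""
    val = dog_same[r, c]
    cube = np.stack([dog_below[r - 1:r + 2, c - 1:c + 2],
                     dog_same[r - 1:r + 2, c - 1:c + 2],
                     dog_above[r - 1:r + 2, c - 1:c + 2]])
    return val >= cube.max() or val <= cube.min()
```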



After a candidate keypoint is found, a Taylor expansion is used to localize it to sub-pixel accuracy from the pixel data.
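For reference, the expansion given in Lowe's paper is the second-order Taylor series of the DoG function $D$ around the sample point, with the offset $\hat{\mathbf{x}}$ taken where its derivative is zero:

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\,\mathbf{x}^{T}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x}, \qquad \hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}}$$

where $\mathbf{x} = (x, y, \sigma)^{T}$ is the offset from the sample point. The refined value $D(\hat{\mathbf{x}}) = D + \tfrac{1}{2}\frac{\partial D}{\partial \mathbf{x}}^{T}\hat{\mathbf{x}}$ is what gets compared against the contrast threshold in the next step.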

The detection step generates a lot of candidate keypoints, many of which are low-contrast points or points lying along an edge. To eliminate the low-contrast ones, the Taylor expansion is used to get the DoG value at the sub-pixel location of each candidate, and that value is compared against a threshold: if its magnitude is below the threshold the point is rejected, otherwise it is accepted. Points lying along edges are removed with an additional check on the ratio of principal curvatures of the DoG function. Following these steps a set of stable keypoints is left that is scale invariant. To make the keypoints rotation invariant as well, a weighted histogram of local gradient directions and magnitudes around each keypoint is computed at its selected scale, and the most prominent orientation in that region is assigned to the keypoint.
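A small sketch of how the gradient magnitudes and orientations around a keypoint can be computed from the Gaussian-smoothed image L at the keypoint's scale (pixel differences as in Lowe's paper; the histogram building itself is sketched after the next paragraph):

```python
import numpy as np

def gradient_mag_ori(L):
    """Gradient magnitude and orientation of a Gaussian-smoothed image L,
    using simple pixel differences; returned for the interior pixels only."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]     # horizontal central difference
    dy = L[2:, 1:-1] - L[:-2, 1:-1]     # vertical central difference
    mag = np.sqrt(dx ** 2 + dy ** 2)
    ori = np.arctan2(dy, dx)            # radians in (-pi, pi]
    return mag, ori
```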


The orientations around each keypoint are collected into 36 bins, each representing 10 degrees of the 360-degree range. After the histogram is created, the highest peak gives the keypoint orientation, and any other peak within 80% of the highest one is converted into an additional keypoint with that orientation.
To refine an orientation, take the peak in the histogram and its left and right neighbors; with these 3 points, fit a downward parabola and solve for its vertex to get the interpolated orientation. Do this for all octaves and all intervals.
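A hedged sketch of the 36-bin histogram and the three-point parabola fit described above (`weight` is a Gaussian window centred on the keypoint, the same shape as the patch; the exact weighting and peak handling in Lowe's paper are more detailed):

```python
import numpy as np

def orientation_histogram(mag, ori, weight):
    """36-bin (10-degree) orientation histogram over a patch around the
    keypoint; each sample is weighted by its gradient magnitude times a
    Gaussian window `weight` of the same shape."""
    hist = np.zeros(36)
    bins = ((np.degrees(ori) % 360) / 10).astype(int) % 36
    np.add.at(hist, bins.ravel(), (mag * weight).ravel())
    return hist

def interpolate_peak(hist, i):
    """Fit a downward parabola through bin i and its two neighbours and
    return the refined orientation in degrees (the vertex of the parabola)."""
    l, c, r = hist[(i - 1) % 36], hist[i], hist[(i + 1) % 36]
    offset = 0.5 * (l - r) / (l - 2 * c + r)     # sub-bin offset of the vertex
    return ((i + 0.5 + offset) * 10) % 360       # bin centre + offset, in degrees
```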


The next step is to produce a distinctive descriptor for each keypoint that differentiates it from the others. To accomplish this, a 16x16 window of pixels is taken around the keypoint and split into sixteen 4x4 sub-windows, and an 8-bin orientation histogram is generated for each sub-window. The histogram and gradient values are interpolated to produce a 128-dimensional (16 x 8) feature vector. To obtain invariance to affine changes in illumination, the descriptor is normalized by the square root of the sum of its squared components, i.e. to unit length.
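A minimal sketch of that final normalization step (the clipping of components above 0.2 is from Lowe's paper; the naming here is my own):

```python
import numpy as np

def normalize_descriptor(vec, clip=0.2):
    """Normalise a 128-dimensional descriptor to unit length, clip components
    larger than `clip` to reduce the influence of large gradient magnitudes,
    then renormalise.  This gives the partial illumination invariance."""
    vec = np.array(vec, dtype=np.float64)
    vec /= np.linalg.norm(vec) + 1e-12
    vec = np.minimum(vec, clip)
    vec /= np.linalg.norm(vec) + 1e-12
    return vec
```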

The keypoints are now invariant to scale and rotation and partially invariant to affine transformations and illumination changes.
Although SIFT has been used in the computer vision community for quite some time, there are several studies that try to improve on it.

Wednesday, July 7, 2010

Stereo vision and motion vector estimation


INTERESTING:
I used 2 consecutive frames as if they were a stereo pair. Since the moving car won't be aligned when the frames are overlapped, the disparity computation treats the car as if it were near the camera, and the moving object is therefore identified.



The magnitude of the motion vector is very high because the object is closer; objects that are far away from the camera have a lower magnitude.




Tuesday, July 6, 2010

Standard deviation Map or Image variance


The implementation calculates the variance of neighboring pixels given a window around a center pixel.
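A minimal sketch of this kind of local-variance map using box filters (the original implementation may have looped over the window explicitly; function names here are mine):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def variance_map(img, win=3):
    """Local variance in a win x win window around every pixel, computed as
    E[x^2] - (E[x])^2 with two box (mean) filters."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, win)
    mean_sq = uniform_filter(img ** 2, win)
    return np.maximum(mean_sq - mean ** 2, 0.0)   # clamp tiny negatives

# Thresholding the map gives the binary results discussed below, e.g.:
# mask = variance_map(gray) > 100
```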

I've used a window size of 3, which means a square 3x3 window. The image I've used is the following:



This is the result. I've tried varying the threshold to get different results: in the bottom image, going from top left to right, I used thresholds of 10, 100, 160 and 200. The results are very similar to an adaptive threshold, except that the output is a negative image. You can see my results on adaptive thresholding HERE>>>

Take note that the variance is computed on a single image. Just like background subtraction, it is very sensitive to the threshold and may require trial and error to get a good result. On a cluttered image, however, this method seems impractical for identifying objects.

I used the same method on a single object in the scene and got different results by varying the threshold. Overall it can identify the object very clearly, although setting the right threshold requires a lot of trial and error. The result still has some noise, which needs to be removed afterwards using morphology.


Sunday, July 4, 2010

Disparity Map




In stereo vision we have 2 cameras at a fixed distance from each other, and we take a picture of the scene with each. The images from the two cameras are slightly different from each other: if we place them on top of each other, the scene looks a little blurred and some of the objects appear misaligned between the left and right image. If the misalignment (disparity) of an object is large, the object is close to the cameras; if the object lines up well between the two images, it is further away from the cameras.
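For the standard rectified two-camera setup this inverse relationship is usually written as $Z = \frac{fB}{d}$, where $Z$ is the depth, $f$ the focal length, $B$ the baseline between the cameras and $d$ the disparity: a large disparity means the object is close, a small disparity means it is far away.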













I'm using the dataset available here:

DATASET HERE >>>
H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, MN, June 2007.

This is because I don't have 2 cameras to take pictures with myself.
They also keep a ranked list of algorithms showing which implementation performs best: HERE>>



The disparity data measures the similarity of the right image to the left; the code matches from left to right. I made a mistake in preparing the image dataset: I thought the image on the left side was from the left camera, when in fact it was from the other one.

Comparing the wrong images gives unwanted results; the left image and the right image have to be clearly defined for the disparity map to be correct.
Another important factor is the disparity range. I used 0-16 with a window size of 9x9, and you can see the results are not so clear.

For this disparity map image I used the 9x9 window with a disparity range of 0-16. The similarity measure used is the sum of squared differences (SSD).
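A minimal block-matching sketch of this (plain nested loops, so it is slow; `left` and `right` are rectified grey-level images, the window size and disparity range match the values above, and this is my own rewrite rather than the original code):

```python
import numpy as np

def disparity_map_ssd(left, right, max_disp=16, win=9):
    """For each pixel of the left image, find the horizontal shift d in
    [0, max_disp] that minimises the sum of squared differences (SSD)
    between a win x win block in the left image and the block shifted
    d pixels to the left in the right image."""
    left, right = left.astype(np.float64), right.astype(np.float64)
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for r in range(half, h - half):
        for c in range(half + max_disp, w - half):
            block = left[r - half:r + half + 1, c - half:c + half + 1]
            best_d, best_ssd = 0, np.inf
            for d in range(max_disp + 1):
                cand = right[r - half:r + half + 1,
                             c - d - half:c - d + half + 1]
                ssd = np.sum((block - cand) ** 2)
                if ssd < best_ssd:
                    best_ssd, best_d = ssd, d
            disp[r, c] = best_d
    return disp
```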

Also, the higher the disparity range, the longer it takes to process. So I'm guessing each image pair needs its own disparity range, just like each scene needs a different threshold in background subtraction. You can see from the image below that the disparity range gave a different result.


It's clear that a higher disparity range gives a better result. In the image below I used a disparity range of 0-4 with the same window size of 9x9.

CODE TO FOLLOW