Lowe's SIFT (Scale Invariant Feature Transform) [5] detects similarity-invariant features in the Gaussian scale space of images, and it has been successfully applied to many computer vision problems [3, 7]. By exploiting the data-parallel computing capability of the GPU, the scale invariant feature transform can run much faster on a GPU than on a CPU. This report discusses the implementation details of the different stages of SIFT and shows their results.
Besides the original binary provided by Lowe, there have been several CPU and GPU implementations. The original version, dated 2005, works fine, but it is only a binary and leaves little that can be changed. Given the strong interest in SIFT among many researchers, several implementations followed in different programming languages, including C#, Matlab, and C++. SIFT++ [8], a nice C++ version developed by Andrea Vedaldi, gives users a lot of flexibility. With this implementation it is easy to change many SIFT parameters, for example the number of octaves, the number of DOG levels, and the edge threshold. This kind of flexibility is also a goal of our GPU implementation. Sudipta Sinha was the first to implement SIFT on GPU [6]. That version used OpenGL with Cg as the shader language and achieved a high speedup over the CPU. Due to hardware and OpenGL limitations at that time, several important steps did not run on the GPU, and data transfers between GPU and CPU took a fair portion of the time. Recently, [4] demonstrated another SIFT implementation on GPU. That version uses a smart idea to achieve high performance in scale space generation: it packs 2x2 squares into a single pixel to reduce the number of texture fetches. It is not clear how the result is finally downloaded from the GPU, and allocating a 128-D vector for every pixel instead of every feature wastes memory and bandwidth. That paper also does not claim support for multiple orientations per keypoint. The goal of this project is to combine the flexibility and generality of SIFT++ and to implement a free source-code library of SIFT on GPU. Unlike the SIFT++ library, both of the above GPU versions are commercial and cannot be distributed, and it is also unclear how flexible they are.
The parallel computing nature and the programmable pipeline make the GPU a powerful tool for data-parallel computation problems, and it has been widely used for general-purpose computation [1]. This project tries to utilize the computing power of the GPU to run a fast scale invariant feature transform. SIFT detects the local maxima and minima of the difference of Gaussian in the Gaussian scale space. Local dominant gradient orientations are then computed for each feature point, and sub-pixel localization is applied. Descriptors are then generated from the scale-and-orientation-normalized image patch of each feature. The first part, scale space computation, can be cast as a pixel-parallel computation. It runs Gaussian filters on the input images to get each pixel of the new filtered images, and the GPU can use a fragment shader to compute multiple pixels simultaneously. The second part, which consists of localization, orientation computation, and descriptor generation, can be seen as a feature-parallel computation. Each feature can also be mapped to a pixel to run in parallel on the GPU. The rest of this report is organized as follows: Section 2 discusses some existing SIFT implementations and their features. Section 3 then explains the implementation details of the different stages of SIFT. Conclusions and future work are given at the end.
[Figure 2 panel: feature list texture storing X, Y, and # of Orientations for each feature.]
This section discusses some details of implementing SIFT using GPU shaders. First the overall design of the library is discussed, and then the details of each stage of SIFT.
Shader Language
[Figure 2 panel: reshaped feature list texture; each RGBA entry stores X, Y, and a single orientation (Orientation 1, Orientation 2).]
Traditional GPU shaders are chosen here as the implementation tool instead of CUDA, considering that images map easily to textures on the GPU. Initially this work used GLSL (OpenGL Shading Language); a Cg version of the shaders was developed later, and a parameter is provided to switch between them. Both versions work on both nVidia and ATI graphics cards.
Figure 2. Storage of feature list as textures.

Separable filtering is necessary to achieve good performance, because the Gaussian kernel needs to be very large for large σ. A filter width of 6σ is used; when the number of DOG levels is 3, the largest Gaussian filter has σ = 3.0 and requires a 19x19 Gaussian kernel. Using separable filters saves a lot of texture fetches and also reduces the shader code size. Gaussian filter shaders are generated on the fly according to the parameters the user inputs, each with a different size and kernel. The multiple texture coordinate feature of OpenGL is used, and when the number of coordinates is more than 8, they are computed automatically in the shaders.
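The kernel size and the saving from separable filtering can be checked with a small CPU-side sketch. It is only an illustration in Python, not the shader code of this library: the function names are hypothetical, and it assumes the filter width is 6 times the standard deviation, rounded to an odd number of taps.

```python
import math

def gaussian_kernel_1d(sigma, width_factor=6.0):
    """Return a normalized 1D Gaussian kernel spanning about width_factor * sigma pixels."""
    # Assumption: half-width = ceil(width / 2), which gives an odd tap count.
    half = int(math.ceil(width_factor * sigma / 2.0))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def fetches_per_pixel(taps, separable):
    """Texture fetches needed for one output pixel."""
    # A full 2D kernel reads taps * taps texels; two 1D passes read taps each.
    return 2 * taps if separable else taps * taps

kernel = gaussian_kernel_1d(3.0)
print(len(kernel))                                       # 19
print(fetches_per_pixel(len(kernel), separable=False))   # 361
print(fetches_per_pixel(len(kernel), separable=True))    # 38
```

Under these assumptions a 19x19 kernel costs 361 fetches per pixel as a single 2D filter, against 38 for the horizontal and vertical passes together.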
[Figure 3 diagram labels: Input Texture (Intensity1, Temp), Horizontal Filtering, Vertical Filtering, Copy, Subtract, Output Texture (Intensity2, DOG, Temp).]
Storage Design
The level images in the scale space are stored as textures with a pixel-by-pixel mapping. As shown in Fig. 1, the four color channels RGBA store intensity, difference of Gaussian, gradient magnitude, and gradient orientation respectively.
[Figure 1 diagram: R = Intensity, G = DOG, B = Gradient Magnitude, A = Gradient Orientation.]
Figure 1. Storage of scale space as texture.

To save memory, a feature list is used in this implementation. The feature lists are also stored as textures, as shown in Fig. 2. Features on different levels are stored separately, so the scale information does not need to be stored. After the first stage of feature detection, a feature list texture that stores feature location and orientation count is used, and then the feature list texture is reshaped into a list of features with separate orientations. It is worth mentioning that all feature list generation and feature list reshaping are implemented on the GPU, which differs from Sinha's download/upload approach. Descriptor generation is not finished yet, but I am planning to use the method in [4]. In this method two textures are used to store descriptor data, and 32 pixels (16 pixels from each texture) are used to save the 128-D (32*4 = 128) feature vector.
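The reshaping of the feature list, from one entry per location with an orientation count to one entry per orientation, can be illustrated on the CPU as follows. This is a plain Python sketch with hypothetical names; the library performs the step on the GPU, and the tuple layout here is an assumption made for the example.

```python
def reshape_feature_list(features, orientations):
    """Expand (x, y, count) entries into one (x, y, angle) entry per orientation.

    features:     list of (x, y, count) tuples, one per detected location.
    orientations: list of angle lists, aligned with `features`.
    """
    reshaped = []
    for (x, y, count), angles in zip(features, orientations):
        # A location with several dominant orientations yields several features.
        for angle in angles[:count]:
            reshaped.append((x, y, angle))
    return reshaped

features = [(10, 20, 2), (5, 7, 1), (3, 3, 0)]
orientations = [[0.5, 2.0], [1.0], []]
print(reshape_feature_list(features, orientations))
```

A location with an orientation count of zero simply drops out of the reshaped list.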
Figure 3. Two passes of the Gaussian filter, using the destination texture as a source.
Similar to Sinha's work, separable Gaussian filtering is used, running the filter horizontally and vertically in separate passes.
Fig. 3 demonstrates the two-stage Gaussian filter. The second pass, by carefully writing the temporary intensity back to the original color channel, can read and write the same texture. My experiments show that reading and writing the same texture is faster than ping-pong rendering [2], which may be explained by ping-pong requiring more switching of texture caches. The difference of Gaussian is also computed in the second pass. After one octave is computed, sub-sampling is used to get the first several level images of the next octave. For example, when the level range is from 1 to s + 2, the scale doubles every s steps. There are 3 pairs of doubling in one octave, and the highest 3 levels of an octave can be used to generate the first 3 levels of the next octave. One restriction is that the filter size cannot be truncated for the higher levels, otherwise the Gaussian will be inaccurate for sub-sampling. When sub-sampling more than one level, both intensity and DOG can be generated by sub-sampling, which saves some filtering time. I have not seen this trick in other SIFT implementations.
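The two-pass filter of Fig. 3, with the difference of Gaussian produced in the second pass, can be sketched on the CPU as below. This is a hedged Python illustration rather than the shader code: the function names are hypothetical, borders are clamped by assumption, and the DOG is taken as the new intensity minus the previous one.

```python
def convolve_rows(image, kernel):
    """Filter every row with a 1D kernel, clamping coordinates at the border."""
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - half, 0), width - 1)
                acc += w * image[y][xx]
            out[y][x] = acc
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def filter_level(prev_intensity, kernel):
    """Return (intensity, dog) of the next level from the previous intensity."""
    temp = convolve_rows(prev_intensity, kernel)                    # pass 1: horizontal
    intensity = transpose(convolve_rows(transpose(temp), kernel))   # pass 2: vertical
    # The DOG is computed alongside the second pass (sign convention assumed).
    dog = [[new - old for new, old in zip(row_new, row_old)]
           for row_new, row_old in zip(intensity, prev_intensity)]
    return intensity, dog

impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
intensity, dog = filter_level(impulse, [0.25, 0.5, 0.25])
print(intensity[2][2], dog[2][2])   # 0.25 -0.75
```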
Keypoint detection is split into two passes, intra-level suppression and inter-level suppression, to save texture fetches. As shown in Fig. 5, the first pass compares the DOG value of a pixel with its 8 neighbours and saves whether the point is a local minimum or a local maximum to an auxiliary texture. The maximum and minimum of the 9 pixels are also stored in the auxiliary texture. Gradient magnitude and orientation are also computed in this pass, and edge elimination is applied to delete the features that lie on edges. In the second pass, early-z is first applied to exclude the pixels that were already filtered out in the first pass; then each pixel is compared with the maximum or minimum value of its 2 neighbours in the scale space. A point is the maximum in the 3x3x3 cube only when it is identified as an intra-level local maximum and it is larger than the maximum values stored for its two neighbours. The same applies to the minimum.
[Figure 5 diagram labels: Scale Space Texture (R G B A: Intensity, DOG, Gradient Magnitude, Orientation); Intra-level Suppression with 8 neighbours; Auxiliary Texture (R G B A: IsKey, DOG, Maximum, Minimum); Early-Z; Inter-level Suppression with 2 neighbours; Auxiliary Texture (R G B A: IsKey, DOG, Maximum, Minimum).]
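The split into an intra-level pass and an inter-level pass can be sketched on the CPU as follows. This is a simplified Python illustration with hypothetical names: it leaves out the gradient computation and edge elimination, and a plain `continue` stands in for the early-z rejection.

```python
def intra_level_pass(dog):
    """Pass 1: per pixel, store (flag, max, min) over the 3x3 neighbourhood.

    flag is +1 for a local maximum, -1 for a local minimum, 0 otherwise.
    """
    h, w = len(dog), len(dog[0])
    aux = [[(0, 0.0, 0.0)] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            block = [dog[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            center = block[4]
            others = block[:4] + block[5:]
            if center > max(others):
                flag = 1
            elif center < min(others):
                flag = -1
            else:
                flag = 0
            aux[y][x] = (flag, max(block), min(block))
    return aux

def inter_level_pass(aux_below, aux_mid, aux_above, dog_mid):
    """Pass 2: keep intra-level extrema that also beat both adjacent levels."""
    keys = []
    for y, row in enumerate(aux_mid):
        for x, (flag, _, _) in enumerate(row):
            if flag == 0:
                continue  # already rejected in pass 1 (role of early-z)
            v = dog_mid[y][x]
            if flag == 1 and v > aux_below[y][x][1] and v > aux_above[y][x][1]:
                keys.append((x, y, 1))
            elif flag == -1 and v < aux_below[y][x][2] and v < aux_above[y][x][2]:
                keys.append((x, y, -1))
    return keys

flat = [[1.0] * 3 for _ in range(3)]
peak = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
keys = inter_level_pass(intra_level_pass(flat), intra_level_pass(peak),
                        intra_level_pass(flat), peak)
print(keys)   # [(1, 1, 1)]
```

Because each level stores the maximum and minimum of its own 3x3 block, the second pass needs only two comparisons per candidate instead of 18 neighbour reads.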
The method in [9] is used here to generate feature lists on the GPU. Our implementation uses all 4 color channels to build the histogram pyramid, which can be seen as pointer textures, and the feature list is generated by traversing down the histogram pyramid. For every image, only one pixel at the top of the histogram pyramid needs to be read back, and the number of features is the sum of its four channels. This method avoids both the read-back of textures and the upload of the feature list. The left image in Fig. 2 shows the final result.
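The idea of a histogram pyramid can be sketched on the CPU as below. This Python version is a simplification with hypothetical names: it assumes a square, power-of-two keypoint mask and keeps a single count per node, whereas the implementation described above spreads the four child counts over the four color channels.

```python
def build_pyramid(mask):
    """Return count levels from full resolution up to the 1x1 top."""
    levels = [[[1 if v else 0 for v in row] for row in mask]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        n = len(prev) // 2
        # Each node stores the number of set pixels in its 2x2 children.
        levels.append([[prev[2 * y][2 * x] + prev[2 * y][2 * x + 1]
                        + prev[2 * y + 1][2 * x] + prev[2 * y + 1][2 * x + 1]
                        for x in range(n)] for y in range(n)])
    return levels

def locate(levels, index):
    """Walk down from the top to the (x, y) of the index-th set pixel."""
    x = y = 0
    for level in reversed(levels[:-1]):
        x, y = 2 * x, 2 * y
        for dy, dx in ((0, 0), (0, 1), (1, 0), (1, 1)):
            count = level[y + dy][x + dx]
            if index < count:
                x, y = x + dx, y + dy
                break
            index -= count  # skip the features held by this child
    return x, y

def feature_list(mask):
    levels = build_pyramid(mask)
    total = levels[-1][0][0]  # the only value that must be read back
    return [locate(levels, i) for i in range(total)]

mask = [[0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1]]
print(feature_list(mask))   # [(1, 0), (2, 3), (3, 3)]
```

Each output entry is found independently from its index, which is what allows one output pixel per feature on the GPU.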
Figure 4. Gaussian scale space pyramid and DOG pyramid (absolute value).

Fig. 4 shows the Gaussian scale space and the absolute value of DOG. Images with the same dimensions are the different levels of the same octave.
Orientation Computation
Keypoint Detection
Keypoint detection needs to compare the DOG value of a pixel with its 26 neighbours in the scale space. This step is split into two passes.
This step computes the orientation candidates for each feature. It first obtains a weighted orientation histogram in a circular window of radius 3, then smooths the histogram, and finally outputs the angles whose votes are larger than 0.8 times the maximum. The 36 angle bins of the orientation histogram are implemented as 9 float4/vec4 values. Since GPU arrays do not support dynamic indexing, a binary search on the index is used to locate the expected 4-angle bin. A voting vector is then added to this bin as follows: bin += weight * float4( fmod(idx,4) == float4(0,1,2,3) ).
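The masked vote into a 4-angle bin and the 0.8 threshold can be emulated on the CPU as follows. This is a Python sketch with hypothetical names; Python can index the group directly, so the binary search needed in the shader does not appear here.

```python
def vote(bins, idx, weight):
    """Add `weight` to angle bin `idx` (0..35) stored as 9 groups of 4 values."""
    group = idx // 4
    # Comparison mask, mirroring fmod(idx, 4) == float4(0, 1, 2, 3).
    mask = [1.0 if idx % 4 == c else 0.0 for c in range(4)]
    bins[group] = [b + weight * m for b, m in zip(bins[group], mask)]

def dominant_bins(bins, ratio=0.8):
    """Return the angle bins whose vote exceeds `ratio` times the maximum."""
    flat = [v for group in bins for v in group]
    peak = max(flat)
    return [i for i, v in enumerate(flat) if v > ratio * peak]

bins = [[0.0] * 4 for _ in range(9)]
for idx, weight in [(5, 1.0), (5, 0.5), (20, 1.3), (35, 0.2)]:
    vote(bins, idx, weight)
print(bins[1])              # [0.0, 1.5, 0.0, 0.0]
print(dominant_bins(bins))  # [5, 20]
```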
With this kind of 4-angle bin, smoothing with a larger window is easy to apply. The smoothing in SIFT++ runs a (1, 1, 1)/3 filter 6 times; because four values are stored in one bin, it can be implemented by running a (1, 3, 6, 7, 6, 3, 1)/27 filter twice. Finally, the orientations are written to the orientation texture, and the numbers of features are written to the original feature texture. Then the point list generation method used for feature list generation is applied again to reshape the feature list; this step is shown in Fig. 2. Instead of assigning different point locations as in the previous step, different feature orientations are assigned to different feature candidates.
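The equivalence of the two smoothing schemes follows from convolving the (1, 1, 1) kernel with itself three times. A short Python check with hypothetical helper names, using integer weights so that the normalization 3^3 = 27 is left out:

```python
def convolve(a, b):
    """Full discrete convolution of two integer sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def repeat(kernel, times):
    """Convolve `kernel` with itself `times` times."""
    result = [1]
    for _ in range(times):
        result = convolve(result, kernel)
    return result

box = [1, 1, 1]
wide = repeat(box, 3)
print(wide, sum(wide))                    # [1, 3, 6, 7, 6, 3, 1] 27
print(repeat(box, 6) == repeat(wide, 2))  # True
```

Three box passes collapse into one 7-tap pass, so six box passes equal two passes of the wider kernel.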
Descriptor Generation
This section shows that the display list can also be generated on the GPU without reading back the features. SIFT features are displayed here as scaled and rotated squares. A texture with 4 times the space is allocated to save the output vertices, and a shader automatically computes the feature index of each point and its sub-index in the rectangle. The point can then be rotated and translated according to the feature orientation and scale. Fig. 6 shows this vertex generation. The resulting vertices can then be copied to a Vertex Buffer Object to display the SIFT features. Fig. 7 shows an example of the result.
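The per-vertex computation can be sketched on the CPU as below. This is an illustrative Python version with hypothetical names; the corner order and the choice of the feature scale as the half side length of the square are assumptions of the example, not details taken from the shader.

```python
import math

# Corner offsets of a unit square around the feature centre (assumed order).
CORNERS = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]

def vertex(features, vertex_index):
    """Return the 2D position of one output vertex.

    features: list of (x, y, scale, angle) tuples, angle in radians.
    """
    # Four vertices per feature: index // 4 picks the feature, index % 4 the corner.
    feature_index, corner_index = divmod(vertex_index, 4)
    x, y, scale, angle = features[feature_index]
    cx, cy = CORNERS[corner_index]
    c, s = math.cos(angle), math.sin(angle)
    # Rotate the corner, scale it, then translate to the feature location.
    return (x + scale * (c * cx - s * cy), y + scale * (s * cx + c * cy))

features = [(10.0, 20.0, 2.0, 0.0), (0.0, 0.0, 1.0, math.pi / 2)]
print([vertex(features, i) for i in range(4)])
```

Every output vertex depends only on its own index, so the whole vertex texture can be filled in one pass with one output pixel per vertex.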
This version of SIFT can also handle very large images: it normally takes 0.9 seconds on a 1600x1200 image and 2.8 seconds on a 2048x1365 image.
The project has so far finished scale space generation, keypoint detection, feature list generation, orientation computation, and visualization on the GPU. The implementation also provides the flexibility of changing many parameters by generating shaders on the fly. Almost all the parameters in SIFT++ are ported. In the following days, I'll first finish descriptor generation and sub-pixel localization, and then SIFT matching on the GPU. The packed image format in [4] may also be worth a try. The code also needs to be optimized to make it a good library.
The current implementation runs at about 6-7 Hz on an nVidia 7900 GTX on average. Due to limited time, only one sample result from the nVidia 7900 GTX is presented here. The size of the test image is 800x600. In the first round, it spends 0.157 seconds in pyramid generation, 0.265 seconds in the first pass of keypoint detection, 0.031 seconds in the second pass, 0.094 seconds in generating the feature list on the GPU, 0.266 seconds in orientation computation, and 0.031 seconds in feature list reshaping. The first run of SIFT is slower because of the startup overhead, but the timing becomes stable from the second round. After the first round, subsequent processing finishes all the stages in about 0.157 seconds.
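The reported timings can be put side by side with a few lines of Python. The stage names used as dictionary keys are only labels for this example.

```python
first_round = {
    "pyramid generation": 0.157,
    "keypoint detection, pass 1": 0.265,
    "keypoint detection, pass 2": 0.031,
    "feature list generation": 0.094,
    "orientation computation": 0.266,
    "feature list reshaping": 0.031,
}
total_first = sum(first_round.values())   # seconds spent in the first round
steady_state = 0.157                      # seconds per image after the first round

print(round(total_first, 3))              # 0.844
print(round(1.0 / steady_state, 2))       # 6.37
```

The first round therefore takes about 0.84 seconds in total, while the steady-state time of 0.157 seconds corresponds to roughly 6.4 images per second, consistent with the 6-7 Hz quoted above.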
I thank Sudipta Sinha for many helpful discussions before I started this project and during its development. I also thank Florian Erik Muecke for giving me some helpful tips and sharing his work.
