Implementation of Conjugate Gradients (CG) on Programmable Graphics Hardware (GPU)

Andrew Corrigan

Advisor: Professor H. Quynh Dinh

Motivation

  1. Reconstruction of implicit surfaces (our particular application) with radial basis functions requires a large sparse system to be solved.
  2. The GPU is much more powerful than a CPU. [Harris et al. 2004]  This gap in performance will continue to widen.
  3. CG has been implemented on the GPU before, and it outperformed a counterpart CPU implementation [Bolz, et al. 2003]

Figures

  1.  Component-wise multiplication.

  2. In order to perform sum reduction, the vector must be of a power-of-two dimension.  If it is not, it is increased in dimension with the new entries set to the additive identity (zero).

  3. Sum reduction combines pairs of entries, until there is only one left.

  4. Encoding of nonzero entries.

  5. Encoding of columns.

  6. Sparse Matrix-Vector Multiplication

  7. Sorting Rows and Columns by Number of Nonzero Entries

Summary

    This summer I attempted to implement CG on the GPU.  I used NVIDIA’s Cg language with OpenGL to get access to the graphics hardware.  I used [Bolz, et al. 2003] as a guide to implement CG.  CG is composed of the following operations: matrix-vector multiplication, scaled vector addition and subtraction, dot products, and scalar operations.
    When I started off this summer I implemented dense matrix-vector multiplication and dot products by drawing to either the front or back buffer in OpenGL.  At the time I didn’t realize that this only allows for 8-bit channels, with values clamped between 0 and 1.  Within that domain matrix-vector multiplication and dot products seemed to work.  But then I realized that it was necessary to use 32-bit floating point numbers.  After some looking around I came across Mark Harris’s RenderTexture class [Harris, M. 2003].  RenderTexture provides a clean interface for working with Pbuffers.  Pbuffers are off-screen buffers that are allowed 32-bit unclamped floating point channels.  This is absolutely necessary in order to get meaningful results when performing matrix operations.  An example is that if the answer to a problem is actually five.  If you are using a clamped 8-bit floating point number, the computer will tell you the answer is one.  So now with RenderTexture in hand I was able to really implement matrix and vector algebra operations on the GPU.
    A dot product on the GPU consists of two operations.  The first is component wise multiplication into a temporary vector (Figure 1).  To implement component wise multiplication simply multiply the samples of each of the two textures at the same texture coordinate and return that as the value for the current fragment.  The temporary vector has dimension of the next highest power of two (Figure 2).  The second operation is sum reduction on that vector (Figure 3).  If you look at that figure you can see what is being done.  The reason why the elements in the vector are added up in pairs like this is that you cannot just write a for-loop like you might when programming a CPU.  If you eavesdrop on some threads at the gpgpu.org forums you’ll hear talk that “ping-ponging” buffers is the best way to perform sum reduction.  I did not do this.  I instead took what I thought was the intuitive (yet inefficient) route and created a pyramid of RenderTextures.  The top RenderTexture is the size of the previous  During each pass, neighboring values are added together.  This continues into there is only one value left which is the result of the dot product.
    Sparse matrix-vector multiplication works the same as matrix-vector multiplication, except any entries that are zero are not considered.  When most of the matrix is zero, this can save a lot of time.  Sparse matrix-vector multiplication uses a sparse representation of the matrix.  This means that only nonzero values are stored (Figure 4).  However storing the matrix this way discards which column each entry is stored in, which is necessary for matrix-vector multiplication, so this is also stored (Figure 5).  The number of nonzero entries is hard coded into each fragment program.  This means that there are many fragment programs involved in matrix-vector multiplication.  (Figure 6)
    Each row has a different fragment program based on its number of nonzero entries.  To allow for as many rows to be drawn at once, the rows are sorted by their number of nonzero entries.  There is only one system matrix in CG that remains static.  Because of this it is better to pre-sort the system matrix so that it only has to be done once.  However, CG requires that the system matrix is symmetric, positive definite.  To retain these properties the columns of the matrix must also be sorted.  (Figure 7).  Once the system is solved, since the columns were sorted, this means the solution vector was too and it must be unsorted.
    There is also scaled vector addition and subtraction.  Subtraction can be mapped into an addition problem by simply negating the scalar.  Scaled vector addition is much easier than sum reduction and sparse matrix-vector multiplication.  So just look at the implementation to see how it is done.
    To actually solve a sparse system, I wrapped each of these basic operations up into functions that take RenderTexture pointers, and return or store the result.  See the code for details on this.  Then the functions are called according to how [Shewchuk 1994] outlines CG.

Future Work

If this particular implementation is used for any application, I left a lot of room for optimizations and improvements.  Some of these include:
  1. Use of vertex shaders.  If you take a look at the file SumReduction.cg.  I’ve included in the download you’ll see that “left” varies linearly.  Because of this “left” could be computed per vertex and given as a texture coordinate.

  2. Taking advantage of newer hardware.  This implementation was made on an NVIDIA GeForce FX 5900.  Access to a newer piece of hardware would allow for all sparse rows to use the same  fragment program when performing matrix-vector multiplication.

  3. Use of 2D textures. Textures dimensions are limited to either 2048 or 4096.  If 2D textures are use, the square of either of those numbers is the amount of data allowed.  Currently only 2048 or 4096 entries are allowed to be stored per vector or matrix.

  4. Use of a dense representation and operations for dense rows.  There are two problems with using a sparse representation for all rows.  In our application not all rows are sparse, and the Cg compiler simply fails when compiling the corresponding fragment programs for those rows due to limits on the programs length.  Another reason is that, even if the compiler didn’t fail, the sparse matrix-vector multiply is slower than a dense multiply when the row itself is dense.  The dense representation would simply perform a dot product between a row of the matrix and the vector being multiplied.

  5. Taking advantage of a priori knowledge of the values of the matrix.  In our application there are a lot of entries guaranteed to be one.  This includes the diagonal, and the most of the last row.  Since one is the multiplicative identity, multiplying by one is superfluous.

    If this particular implementation is not used, there is a sparse conjugate gradient solver included with Brook for GPUs, which I could tweak to our needs.  Brook is a language that abstracts the GPU (and even the CPU) as a streaming coprocessor.  It is a desirable language to use since the nuances of GPU programming are very distracting, and its multiple back-ends allow for easy comparison between the CPU and the various flavors of the GPU.  The authors found that the performance using Brook was comparable to hand-coding the GPU.  Another advantage is that the nuances of programming the GPU are changing, and it would be nice to let someone else take advantage of those features for me.  For example very soon textures will be accessible in vertex programs and the use of Pbuffers in OpenGL will be made obsolete by Superbuffers.  [Buck, et al. 2004]

Acknowledgements

    My advisor for this project was Professor H. Quynh Dinh.  I received help from Professor Bruno M. Carvalho, as well as Cliff Woolley on the gpgpu.org forums.

Download

   Download my implementation here.  With this code, you can specify a system of equations in data.txt.  The program will simply display the solution.  Keep in mind, that you the matrix you provide should be sparse, symmetric, and positive definite.  I tried to comment my code sufficiently.

References

Questions, Comments, Criticism, Etc.