Andrew Corrigan
Advisor: Professor H. Quynh Dinh
Component-wise
multiplication.
In order to perform
sum reduction, the vector must be of a power-of-two dimension. If
it is not, it is increased in dimension with the new entries set to the
additive identity (zero).
Sum reduction
combines pairs of entries, until there is only one left.
Encoding of nonzero
entries.
Encoding of
columns.
Sparse
Matrix-Vector Multiplication
Sorting Rows and
Columns by Number of Nonzero Entries
This summer I attempted to implement
CG on the GPU. I used NVIDIA’s Cg language with OpenGL to
get access to the graphics hardware. I used [Bolz, et al. 2003]
as a guide to implement CG. CG is composed of the following
operations: matrix-vector multiplication, scaled vector addition and
subtraction, dot products, and scalar operations.
When I started off this summer I
implemented dense matrix-vector multiplication and dot products by
drawing to either the front or back buffer in OpenGL. At the
time I didn’t realize that this only allows for 8-bit channels,
with values clamped between 0 and 1. Within that domain
matrix-vector multiplication and dot products seemed to work.
But then I realized that it was necessary to use 32-bit floating
point numbers. After some looking around I came across Mark
Harris’s RenderTexture class [Harris, M. 2003].
RenderTexture provides a clean interface for working with
Pbuffers.
Pbuffers are off-screen buffers that are allowed 32-bit unclamped
floating point channels. This is absolutely necessary in order
to get meaningful results when performing matrix operations. An
example is that if the answer to a problem is actually five. If
you are using a clamped 8-bit floating point number, the computer
will tell you the answer is one. So now with RenderTexture in
hand I was able to really implement matrix and vector algebra
operations on the GPU.
A dot product on the GPU consists of
two operations. The first is component wise multiplication into
a temporary vector (Figure 1). To implement component wise
multiplication simply multiply the samples of each of the two
textures at the same texture coordinate and return that as the value
for the current fragment. The temporary vector has dimension of
the next highest power of two (Figure 2). The second operation
is sum reduction on that vector (Figure 3). If you look at that
figure you can see what is being done. The reason why the
elements in the vector are added up in pairs like this is that you
cannot just write a for-loop like you might when programming a
CPU.
If you eavesdrop on some threads at the gpgpu.org forums you’ll
hear talk that “ping-ponging” buffers is the best way to
perform sum reduction. I did not do this. I instead took
what I thought was the intuitive (yet inefficient) route and created
a pyramid of RenderTextures. The top RenderTexture is the size
of the previous During each pass, neighboring values are added
together. This continues into there is only one value left
which is the result of the dot product.
Sparse matrix-vector multiplication
works the same as matrix-vector multiplication, except any entries
that are zero are not considered. When most of the matrix is
zero, this can save a lot of time. Sparse matrix-vector
multiplication uses a sparse representation of the matrix. This
means that only nonzero values are stored (Figure 4). However
storing the matrix this way discards which column each entry is
stored in, which is necessary for matrix-vector multiplication, so
this is also stored (Figure 5). The number of nonzero entries
is hard coded into each fragment program. This means that there
are many fragment programs involved in matrix-vector
multiplication.
(Figure 6)
Each row has a different fragment
program based on its number of nonzero entries. To allow for as
many rows to be drawn at once, the rows are sorted by their number of
nonzero entries. There is only one system matrix in CG that
remains static. Because of this it is better to pre-sort the
system matrix so that it only has to be done once. However, CG
requires that the system matrix is symmetric, positive definite.
To retain these properties the columns of the matrix must also be
sorted. (Figure 7). Once the system is solved, since the
columns were sorted, this means the solution vector was too and it
must be unsorted.
There is also scaled vector addition
and subtraction. Subtraction can be mapped into an addition
problem by simply negating the scalar. Scaled vector addition
is much easier than sum reduction and sparse matrix-vector
multiplication. So just look at the implementation to see how it
is done.
To actually solve a sparse system, I
wrapped each of these basic operations up into functions that take
RenderTexture pointers, and return or store the result. See the
code for details on this. Then the functions are called
according to how [Shewchuk 1994] outlines CG.
Use of vertex shaders. If you take a look at the file SumReduction.cg. I’ve included in the download you’ll see that “left” varies linearly. Because of this “left” could be computed per vertex and given as a texture coordinate.
Taking advantage of newer hardware. This implementation was made on an NVIDIA GeForce FX 5900. Access to a newer piece of hardware would allow for all sparse rows to use the same fragment program when performing matrix-vector multiplication.
Use of 2D textures. Textures dimensions are limited to either 2048 or 4096. If 2D textures are use, the square of either of those numbers is the amount of data allowed. Currently only 2048 or 4096 entries are allowed to be stored per vector or matrix.
Use of a dense representation and operations for dense rows. There are two problems with using a sparse representation for all rows. In our application not all rows are sparse, and the Cg compiler simply fails when compiling the corresponding fragment programs for those rows due to limits on the programs length. Another reason is that, even if the compiler didn’t fail, the sparse matrix-vector multiply is slower than a dense multiply when the row itself is dense. The dense representation would simply perform a dot product between a row of the matrix and the vector being multiplied.
Taking advantage of a priori knowledge of the values of the matrix. In our application there are a lot of entries guaranteed to be one. This includes the diagonal, and the most of the last row. Since one is the multiplicative identity, multiplying by one is superfluous.
If this particular implementation is not used, there is a sparse conjugate gradient solver included with Brook for GPUs, which I could tweak to our needs. Brook is a language that abstracts the GPU (and even the CPU) as a streaming coprocessor. It is a desirable language to use since the nuances of GPU programming are very distracting, and its multiple back-ends allow for easy comparison between the CPU and the various flavors of the GPU. The authors found that the performance using Brook was comparable to hand-coding the GPU. Another advantage is that the nuances of programming the GPU are changing, and it would be nice to let someone else take advantage of those features for me. For example very soon textures will be accessible in vertex programs and the use of Pbuffers in OpenGL will be made obsolete by Superbuffers. [Buck, et al. 2004]
![]()