
Graphics Processing Units (GPUs) are becoming more and more popular, and even general-purpose GPU computing has been getting a lot of press lately. GPUs are dedicated graphics rendering devices that are highly parallelized. Because the pipeline of a GPU is hundreds of stages deep, it can be much faster than a general-purpose CPU at computation-heavy algorithms.
As the name implies, Graphics Processing Units are specially tailored to handle the rendering of 2D and 3D images on screen. This includes rendering not only the browser window that you’re currently viewing this blog through, but also complex ray tracing scenes.
Programming on GPUs is difficult: as programmers we often think about how to optimize code for space or runtime, but not often about how to optimize it for parallelization. GPU programmers tell us that “this is a level of optimization that requires a PhD-level programmer for success.” Not all code runs efficiently on GPUs, either: code that has more logic than computation causes the pipeline to stall, which eats into the GPU's performance advantage. (In contrast, CPU pipelines are only ~40 stages deep, so control hazards have less of an effect.)
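To put rough numbers on the stall argument, here is a toy back-of-the-envelope model in Python. It is only a sketch under loud assumptions: the function name `pipeline_cycles` is made up for this post, every control hazard is assumed to flush the whole pipeline, and the depth of 200 is just a stand-in for "hundreds of stages". Real hardware is far more complicated than this.

```python
def pipeline_cycles(instructions, hazards, depth):
    """Toy cycle count for a simple in-order pipeline (hypothetical model).

    Assumptions (mine, not measured from any real chip):
      - the first instruction needs `depth` cycles to get through the pipeline,
      - after that, one instruction completes per cycle,
      - each control hazard flushes the pipeline and wastes `depth - 1` cycles.
    """
    fill_and_run = depth + (instructions - 1)
    stall_penalty = hazards * (depth - 1)
    return fill_and_run + stall_penalty


if __name__ == "__main__":
    instructions = 1000
    # 40 matches the CPU figure quoted above; 200 is an illustrative
    # stand-in for a pipeline "hundreds of stages deep".
    for depth in (40, 200):
        ideal = pipeline_cycles(instructions, 0, depth)
        branchy = pipeline_cycles(instructions, 100, depth)
        print(f"depth {depth}: {ideal} cycles with no hazards, "
              f"{branchy} cycles with 100 hazards "
              f"({branchy / ideal:.1f}x slower)")
```

In this model the same number of hazards hurts the deeper pipeline far more, which is the intuition behind keeping logic-heavy code off the GPU.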
One might think that matrix operations would be easy to parallelize, since things like matrix multiplication are highly computationally intensive. Numerous studies have looked into using the GPU's performance advantages for applications other than graphics.
http://gamma.cs.unc.edu/LU-GPU/
Here’s one on LU decomposition using GPUs.

http://graphics.stanford.edu/papers/gpumatrixmult/
I found this one the most interesting. These Stanford researchers looked into using GPU algorithms for matrix-matrix multiplication. What they found was unexpected: even near-optimal GPU implementations were less efficient than cache-aware CPU ones. The cause? Bandwidth.
“Its regular data access pattern, and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs, but surprisingly we find that even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find that the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.”
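To make "input reuse" a bit more concrete, here is a small Python sketch of what a cache-aware (blocked) matrix multiply does differently from the naive triple loop. This is not the paper's code and it runs on the CPU, not a GPU. The function names and the `tile` parameter are my own, and the traffic estimate assumes each tile of the inputs is fetched from slow memory once per tile-level product and then reused from a fast cache.

```python
def matmul_naive(a, b):
    """Plain triple-loop product of two matrices given as lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0
            for k in range(m):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_blocked(a, b, tile=2):
    """Same product, computed tile by tile so a small working set is reused."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Multiply one tile of `a` by one tile of `b`.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def flops_per_element_loaded(n, tile):
    """Rough arithmetic intensity of an n x n multiply (n divisible by tile).

    Assumes each tile-level product loads one tile of each input
    (2 * tile * tile elements) from slow memory and reuses it from cache.
    """
    flops = 2 * n ** 3                 # one multiply and one add per inner step
    tile_products = (n // tile) ** 3
    loads = tile_products * 2 * tile * tile
    return flops / loads
```

Under those assumptions the work done per element fetched grows in proportion to the tile size, so a processor that can do lots of arithmetic but cannot keep a big tile in fast, high-bandwidth storage ends up waiting on memory. That is the flavor of the bottleneck the quote describes.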
Just goes to show we still have a lot to learn before we fully understand parallelization.
Other thoughts on GPUs:
GPUs: Threat or Menace?





