After reading the learning materials and finishing the preparation steps, you can start writing programs for GPUs using CUDA. However, you may be disappointed to find that your CUDA program is not much faster than a simple CPU implementation, or is even slower! This often happens when you simply "translate" serial CPU code into CUDA syntax.

Because the GPU architecture and its programming model are quite different from those of the CPU, writing an efficient CUDA program takes considerable effort. In this section, we provide a matrix multiplication example that demonstrates a number of important optimization techniques for writing CUDA programs. Byte accesses are very common in image processing, so a memory access grouping example is also provided here.

These examples demonstrate only a few optimization techniques. A real application may require others. NVIDIA provides a comprehensive list of optimization techniques in the "CUDA C Best Practices Guide".