Learning Materials

The content of this page was obsolete and has been replaced by the 2013 version.

Introductory Reading (Required)
1. GPU Architectures and Programming
2. Performance Modeling
CUDA Related Documents (Recommended)
Other GPU Courses (Optional)
Advanced Reading (Optional)

Introductory Reading (Required)

GPU Architectures and Programming

NVIDIA Tesla: A Unified Graphics and Computing Architecture, in IEEE Micro 2008, link: http://dx.doi.org/10.1109/MM.2008.31
Scalable Parallel Programming with CUDA, in ACM Queue 2008, link: http://dx.doi.org/10.1145/1365490.1365500
Understanding throughput-oriented architectures, in Communications of the ACM 2010, link: http://dx.doi.org/10.1145/1839676.1839694

Performance Modeling

Roofline: an insightful visual performance model for multicore architectures, in Communications of the ACM 2009, link: http://dx.doi.org/10.1145/1498765.1498785

CUDA Related Documents (Recommended)

NVIDIA provides extensive documentation, which you can read selectively according to your needs. Two documents are particularly relevant to the assignment (see the two bullets below), and we recommend reading them. This takes some time, but it will save you a lot of effort later.

CUDA Programming Guide
CUDA Best Practices Guide

Other GPU Courses (Optional)

Here is a list of courses related to GPU architecture and/or GPU programming:

Parallel and Heterogeneous Computer Architecture, given at MIT. (http://courses.csail.mit.edu/6.888/spring13/lectures.shtml)
Introduction to Parallel Programming, offered on Udacity. (https://www.udacity.com/course/cs344)

Advanced Reading (Optional)

This list highlights some recent research (2009--2012) on GPUs and other throughput-oriented SIMD architectures. Although the papers are sorted into categories, most of them touch on several architectural aspects of GPUs.

Thread Scheduling and Context Management

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor, in MICRO 2012, link: http://dx.doi.org/10.1109/MICRO.2012.18
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors, in ACM Transactions on Computer Systems (TOCS) 2012, link: http://dx.doi.org/10.1145/2166879.2166882
Energy-efficient mechanisms for managing thread context in throughput processors, in ISCA 2011, link: http://dx.doi.org/10.1145/2024723.2000093
A compile-time managed multi-level register file hierarchy, in MICRO 2011, link: http://dx.doi.org/10.1145/2155620.2155675
Improving GPU performance via large warps and two-level warp scheduling, in MICRO 2011, link: http://dx.doi.org/10.1145/2155620.2155656

Branch and Control Flow

Simultaneous branch and warp interweaving for sustained GPU performance, in ISCA 2012, link: http://dx.doi.org/10.1109/ISCA.2012.6237005
SIMD re-convergence at thread frontiers, in MICRO 2011, link: http://dx.doi.org/10.1145/2155620.2155676
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures, in ISCA 2012, link: http://dx.doi.org/10.1145/2366231.2337167
Thread block compaction for efficient SIMT control flow, in HPCA 2011, link: http://dx.doi.org/10.1109/HPCA.2011.5749714
Dynamic warp subdivision for integrated branch and memory divergence tolerance, in ISCA 2010, link: http://dx.doi.org/10.1145/1816038.1815992
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, in ACM Transactions on Architecture and Code Optimization (TACO) 2009, link: http://dx.doi.org/10.1145/1543753.1543765

Memory Hierarchy and Network-On-Chip

Cache-Conscious Wavefront Scheduling, in MICRO 2012, link: http://www.ece.ubc.ca/~aamodt/papers/tgrogers.micro2012.pdf
Throughput-Effective On-Chip Networks for Manycore Accelerators, in MICRO 2010, link: http://dx.doi.org/10.1109/MICRO.2010.50
Complexity effective memory access scheduling for many-core accelerator architectures, in MICRO 2009, link: http://dx.doi.org/10.1145/1669112.1669119

5KK73 GPU Assignment 2012