Memory Access Grouping Example

Memory access grouping is an important memory access optimization. Memory accesses can only be coalesced (see the MatrixMul Example) when 32-bit or larger words are accessed. Grouping four 8-bit memory accesses reduces the number of memory accesses by a factor of four, and in most cases this leads to a speed-up of a factor of four. Sometimes the grouped memory accesses can also be coalesced, which leads to an even greater speedup.

The kernel below adds two matrices (in1 and in2) into a thrid matrix (out). This kernel uses 8-bit memory access, which means they can never be coalesced.

__global__ void MatrixAddKernel(unsigned char *in1, unsigned char *in2, unsigned char *out)
    unsigned char tmp1 = in1[blockIdx.x * 64 + threadIdx.x];
    unsigned char tmp2 = in2[blockIdx.x * 64 + threadIdx.x];

    out[blockIdx.x * 64 + threadIdx.x] = tmp1 + tmp2;

MatrixAddKernel<<<64, 64>>>(in1, in2, out);

The kernel below performs the same function, but here four 8-bit memory accesses are grouped into one 32-bit access. Now the memory accesses can be coalesced, which leads to a speedup of 30 times compared to the previous kernel.

__global__ void MatrixAddKernel(unsigned char *in1, unsigned char *in2, unsigned char *out)
    uchar4 tmp1 = ((uchar4*)in1)[blockIdx.x * 64/4 + threadIdx.x];
    uchar4 tmp2 = ((uchar4*)in2)[blockIdx.x * 64/4 + threadIdx.x];
    uchar4 tmp3;
    tmp3.x = tmp1.x + tmp2.x;
    tmp3.y = tmp1.y + tmp2.y;
    tmp3.z = tmp1.z + tmp2.z;
    tmp3.w = tmp1.w + tmp2.w;

    ((uchar4*)out)[blockIdx.x * 64/4 + threadIdx.x] = tmp3;

MatrixAddKernel<<<64, 64/4>>>(in1, in2, out);