Performance Issues

Warp divergence

All threads in a warp execute the same instruction at the same time, so avoid having threads within a warp take different logic paths.

If the threads of a warp diverge, the warp serially executes each branch path, disabling the threads that are not on that path.

  1. Warp divergence (Example 1):

    __global__ void warpDivergenceKernel(int *data) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
           
        // Conditional statement causing warp divergence
        if (index % 2 == 0) {
            data[index] = index;  // Path for even indices
        } else {
            data[index] = 0;        // Path for odd indices
        }
    }
    
  2. No warp divergence (Example 1):

    __global__ void optimizedKernel(int *data) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
       
        // Using arithmetic operations to avoid conditional branching
        int isOdd = index % 2;  // 0 for even indices, 1 for odd
        data[index] = (1 - isOdd) * index;
    }
    
  3. Warp divergence (Example 2):

    __global__ void mathKernel1(float *c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a, b;
        a = b = 0.0f;
        if (tid % 2 == 0) {
            a = 100.0f;
        } else {
            b = 200.0f;
        }
        c[tid] = a + b;
    }
    
  4. No warp divergence (Example 2):

    __global__ void mathKernel2(float *c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a, b;
        a = b = 0.0f;
        if ((tid / warpSize) % 2 == 0) {
            a = 100.0f;
        } else {
            b = 200.0f;
        }
        c[tid] = a + b;
    }
    

Note that the two kernels in Example 2 produce the same set of output values (half 100.0f, half 200.0f), but arranged differently across thread indices.

The same branch-removal idea applies to ordinary scalar code: a range check can return the comparison result directly instead of branching.

    // Branching version
    int f(int x) {
        if (0 <= x && x <= 10) {
            return 1;
        } else {
            return 0;
        }
    }

    // Branchless version
    int f(int x) {
        return (int)(0 <= x && x <= 10);
    }

Occupancy

$\displaystyle\text{Warp occupancy}=\frac{\text{number of active warps per SM}}{\text{maximum active warps per SM}}$
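As a worked example of this ratio (the figures are illustrative, not tied to a specific GPU): an SM with 32 active warps out of a maximum of 48 runs at 2/3 occupancy. A minimal sketch:

```c
#include <assert.h>

/* Warp occupancy = active warps per SM / maximum active warps per SM.
 * The figures used below (32 active, 48 maximum) are illustrative only. */
static double warp_occupancy(int active_warps, int max_warps) {
    return (double)active_warps / (double)max_warps;
}
```

In practice, the CUDA runtime can report achievable occupancy for a given kernel and block size via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.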

Latency Hiding (Little’s law)

Streaming Multiprocessors

$\text{Needed parallelism (operations in flight)} = \text{throughput} \times\text{latency}$.

Instruction latency = 20 cycles

Throughput per SM = 32 operations/cycle

Thread parallelism $= 32\times 20 = 640$ operations.

Thus, if there are fewer than 640 independent operations in flight (e.g., fewer than 640 threads each issuing one operation), the SM will sometimes be idle.
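The arithmetic above is just Little's law with the figures from the text (32 ops/cycle throughput, 20-cycle instruction latency); a minimal sketch:

```c
#include <assert.h>

/* Little's law: operations in flight = throughput × latency.
 * Example figures from the text: 32 ops/cycle, 20-cycle latency. */
static int needed_parallelism(int throughput_per_cycle, int latency_cycles) {
    return throughput_per_cycle * latency_cycles;
}
```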

Memory

$\displaystyle \text{Needed data} = \left(\frac{\text{memory bandwidth}}{\text{memory frequency}}\right)\times \text{instruction latency}$

For example, take a memory bandwidth of 144 GB/s, a memory clock of 1.566 GHz, and a memory latency of 800 cycles. Dividing bandwidth by frequency gives the data moved in a single cycle:

$\displaystyle\frac{ 144 \text{ GB/sec} }{1.566 \text{ G cycle/sec}} = 92$ Bytes/cycle

Thus, data parallelism $= 800 \text{ cycles}\times 92 \text{ Bytes/cycle} \approx 74$ KB.

Suppose each thread moves 4 bytes from global memory to SM for computation, we need at least

$\displaystyle\frac{74 \text{ KB}}{ 4 \text{ bytes/thread} } =$ 18,500 threads

to hide memory latency or to fetch enough data to fulfill memory bandwidth.
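The same chain of arithmetic as a sketch, using the figures above (144 GB/s bandwidth, 1.566 GHz memory clock, an assumed 800-cycle memory latency, 4 bytes per thread):

```c
#include <assert.h>

/* Little's law for memory: bytes in flight = (bandwidth / clock) × latency.
 * Dividing by the bytes each thread moves gives the thread count needed
 * to keep the memory pipeline full. All figures are illustrative. */
static double threads_to_hide_memory_latency(double bw_gbytes_per_s,
                                             double clock_gcycles_per_s,
                                             double latency_cycles,
                                             double bytes_per_thread) {
    double bytes_per_cycle = bw_gbytes_per_s / clock_gcycles_per_s; /* ~92 B */
    double bytes_in_flight = bytes_per_cycle * latency_cycles;      /* ~74 KB */
    return bytes_in_flight / bytes_per_thread;
}
```

With unrounded inputs this gives roughly 18,400 threads; the text's 18,500 comes from rounding to 74 KB before dividing.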

Bank conflict

64-bit machine: $\displaystyle\text{Bank index}=\left(\frac{\text{byte address}}{8 \text{ bytes per bank}}\right)\%32\text{ banks}$
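A sketch of the bank-index calculation (8-byte-wide banks, 32 banks, as in the formula above): two threads conflict when they access different 8-byte words that map to the same bank.

```c
#include <assert.h>

/* Bank index on a 32-bank shared memory with 8-byte-wide banks. */
static unsigned bank_index(unsigned byte_address) {
    return (byte_address / 8) % 32;
}
```

Consecutive `double`s land in consecutive banks, so a warp reading `shmem[threadIdx.x]` is conflict-free, while addresses 256 bytes apart map back to the same bank.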

Coalesced and aligned memory access
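A host-side model of why coalescing matters (assuming 4-byte accesses and 128-byte memory transactions; both numbers are illustrative): count how many transactions the 32 accesses of one warp touch.

```c
#include <assert.h>
#include <string.h>

/* Count the distinct 128-byte transactions touched when the 32 threads of
 * a warp each read one 4-byte element at a given element stride. */
static int warp_transactions(int stride_elems) {
    int seen[1024];
    memset(seen, 0, sizeof seen);
    int count = 0;
    for (int t = 0; t < 32; ++t) {
        int segment = (t * stride_elems * 4) / 128;  /* 128-byte segment */
        if (!seen[segment]) { seen[segment] = 1; ++count; }
    }
    return count;
}
```

Stride 1 (coalesced) needs a single transaction for the whole warp; stride 32 needs one transaction per thread.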

CUDA streaming
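A minimal sketch of overlapping transfers with computation using two streams. All names are hypothetical: a kernel `work`, device buffers `d_in`/`d_out`, host buffers `h_in`/`h_out` (which must be pinned, e.g. allocated with `cudaMallocHost`, for the async copies to overlap), and a per-stream chunk size `chunk`.

```cuda
// Hypothetical setup: while stream 0's chunk is being computed,
// stream 1's chunk can be copied in, and vice versa.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    work<<<grid, block, 0, s[i]>>>(d_in + i * chunk, d_out + i * chunk, chunk);
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
```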

Guidelines for Grid and Block Size

  1. Avoid small grid size and large block size.
  2. Avoid large grid size and small block size.
  3. Keep the number of blocks per grid a multiple of the number of SMs, e.g., 48 for the Turing architecture.
  4. Keep the number of threads per block a multiple of the number of CUDA cores in an SM, e.g., 64.
  5. Keep the number of threads per block a power of 2.
  6. Avoid warp divergence.
  7. Avoid register spilling.
  8. Enhance warp occupancy.
  9. A large bank size may yield higher bandwidth for shared memory access, but may result in more bank conflicts depending on the application's shared memory access patterns.
  10. Adjust the amount of shared memory and L2 cache
  11. Avoid bank conflict
  12. Coalesced and aligned memory access.
  13. Make sure data size is a multiple of cache granularity.
  14. Concurrent GPU/CPU executions
  15. Concurrent GPU executions and data transfer
  16. Use constant memory for data that does not change over the course of a kernel execution.
  17. Loop Unrolling.
  18. Kernel Fusion.
  19. Dynamic Parallelism: Use dynamic parallelism to launch kernels from within other kernels where appropriate, reducing the need for CPU intervention and improving data locality.
  20. Efficient Use of Atomic Operations: Use atomic operations judiciously as they can serialize access to memory, but they are essential for certain operations like reductions and histograms.
  21. More computation per memory access
  22. Re-compute may be faster than re-loading data.
  23. Minimize memory transfers from host to device
  24. Check each metric with nvprof as much as possible.
  25. Turn on the MPS daemon.
  26. Ensure that no one else is using the GPU while you are.
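As one concrete instance of the list above (item 17, loop unrolling), a hedged sketch: `#pragma unroll` asks nvcc to unroll a fixed-trip-count loop, trading a few extra instructions for reduced branch overhead. The kernel name and layout are illustrative.

```cuda
// Each thread scales four consecutive elements; the fixed trip count
// lets the compiler fully unroll the loop.
__global__ void scale4(float *v, float s) {
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        v[base + i] *= s;
}
```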
