Performance Issues

Warp divergence

All threads in a warp execute the same instruction at the same time, so avoid having threads within a warp take different logic paths.

If the threads of a warp diverge, the warp serially executes each branch path, disabling the threads that are not on that path.

  1. Warp divergence (Example 1):

    __global__ void warpDivergenceKernel(int *data) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
           
        // Conditional statement causing warp divergence
        if (index % 2 == 0) {
            data[index] = index;  // Path for even indices
        } else {
            data[index] = 0;        // Path for odd indices
        }
    }
    
  2. No warp divergence (Example 1):

    __global__ void optimizedKernel(int *data) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
       
        // Using arithmetic operations to avoid conditional branching
        int isOdd = index % 2;  // 0 for even indices, 1 for odd
        data[index] = (1 - isOdd) * index;
    }
    
  3. Warp divergence (Example 2):

    __global__ void mathKernel1(float *c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a, b;
        a = b = 0.0f;
        if (tid % 2 == 0) {
            a = 100.0f;
        } else {
            b = 200.0f;
        }
        c[tid] = a + b;
    }
    
  4. No warp divergence (Example 2):

    __global__ void mathKernel2(float *c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a, b;
        a = b = 0.0f;
        if ((tid / warpSize) % 2 == 0) {
            a = 100.0f;
        } else {
            b = 200.0f;
        }
        c[tid] = a + b;
    }
    

Note that the two kernels in Example 2 produce the same set of output values (half 100.0f, half 200.0f), but arranged differently across thread indices.

The same branch-removal idea applies to ordinary scalar code: a range check can return the comparison result directly instead of branching.

    // Branching version
    int f(int x) {
        if (0 <= x && x <= 10) {
            return 1;
        } else {
            return 0;
        }
    }

    // Branchless version
    int f(int x) {
        return (int)(0 <= x && x <= 10);
    }

Occupancy

$\displaystyle\text{Warp occupancy}=\frac{\text{number of active warps per SM}}{\text{maximum active warps per SM}}$
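As a worked example of this ratio (the figures are illustrative, not tied to a specific GPU): an SM with 32 active warps out of a maximum of 48 runs at 2/3 occupancy. A minimal sketch:

```c
#include <assert.h>

/* Warp occupancy = active warps per SM / maximum active warps per SM.
 * The figures used below (32 active, 48 maximum) are illustrative only. */
static double warp_occupancy(int active_warps, int max_warps) {
    return (double)active_warps / (double)max_warps;
}
```

In practice, the CUDA runtime can report achievable occupancy for a given kernel and block size via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.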

Latency Hiding (Little’s law)

Streaming Multiprocessors

$\text{Needed parallelism (operations in flight)} = \text{throughput} \times\text{latency}$.

Instruction latency = 20 cycles

Throughput per SM = 32 operations/cycle

Thread parallelism $= 32\times 20 = 640$ operations.

Thus, if there are fewer than 640 independent operations in flight (e.g., fewer than 640 threads each issuing one operation), the SM will sometimes be idle.
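The arithmetic above is just Little's law with the figures from the text (32 ops/cycle throughput, 20-cycle instruction latency); a minimal sketch:

```c
#include <assert.h>

/* Little's law: operations in flight = throughput × latency.
 * Example figures from the text: 32 ops/cycle, 20-cycle latency. */
static int needed_parallelism(int throughput_per_cycle, int latency_cycles) {
    return throughput_per_cycle * latency_cycles;
}
```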

Memory

$\displaystyle \text{Needed data} = \left(\frac{\text{memory bandwidth}}{\text{memory frequency}}\right)\times \text{instruction latency}$

For example, take a memory bandwidth of 144 GB/s, a memory clock of 1.566 GHz, and a memory latency of 800 cycles. Dividing bandwidth by frequency gives the data moved in a single cycle:

$\displaystyle\frac{ 144 \text{ GB/sec} }{1.566 \text{ G cycle/sec}} = 92$ Bytes/cycle

Thus, data parallelism $= 800 \text{ cycles}\times 92 \text{ Bytes/cycle} \approx 74$ KB.

Suppose each thread moves 4 bytes from global memory to SM for computation, we need at least

$\displaystyle\frac{74 \text{ KB}}{ 4 \text{ bytes/thread} } =$ 18,500 threads

to hide memory latency or to fetch enough data to fulfill memory bandwidth.
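The same chain of arithmetic as a sketch, using the figures above (144 GB/s bandwidth, 1.566 GHz memory clock, an assumed 800-cycle memory latency, 4 bytes per thread):

```c
#include <assert.h>

/* Little's law for memory: bytes in flight = (bandwidth / clock) × latency.
 * Dividing by the bytes each thread moves gives the thread count needed
 * to keep the memory pipeline full. All figures are illustrative. */
static double threads_to_hide_memory_latency(double bw_gbytes_per_s,
                                             double clock_gcycles_per_s,
                                             double latency_cycles,
                                             double bytes_per_thread) {
    double bytes_per_cycle = bw_gbytes_per_s / clock_gcycles_per_s; /* ~92 B */
    double bytes_in_flight = bytes_per_cycle * latency_cycles;      /* ~74 KB */
    return bytes_in_flight / bytes_per_thread;
}
```

With unrounded inputs this gives roughly 18,400 threads; the text's 18,500 comes from rounding to 74 KB before dividing.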

Bank conflict

64-bit machine: $\displaystyle\text{Bank index}=\left(\frac{\text{byte address}}{8 \text{ bytes per bank}}\right)\%32\text{ banks}$
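A sketch of the bank-index calculation (8-byte-wide banks, 32 banks, as in the formula above): two threads conflict when they access different 8-byte words that map to the same bank.

```c
#include <assert.h>

/* Bank index on a 32-bank shared memory with 8-byte-wide banks. */
static unsigned bank_index(unsigned byte_address) {
    return (byte_address / 8) % 32;
}
```

Consecutive `double`s land in consecutive banks, so a warp reading `shmem[threadIdx.x]` is conflict-free, while addresses 256 bytes apart map back to the same bank.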

Coalesced and aligned memory access
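A host-side model of why coalescing matters (assuming 4-byte accesses and 128-byte memory transactions; both numbers are illustrative): count how many transactions the 32 accesses of one warp touch.

```c
#include <assert.h>
#include <string.h>

/* Count the distinct 128-byte transactions touched when the 32 threads of
 * a warp each read one 4-byte element at a given element stride. */
static int warp_transactions(int stride_elems) {
    int seen[1024];
    memset(seen, 0, sizeof seen);
    int count = 0;
    for (int t = 0; t < 32; ++t) {
        int segment = (t * stride_elems * 4) / 128;  /* 128-byte segment */
        if (!seen[segment]) { seen[segment] = 1; ++count; }
    }
    return count;
}
```

Stride 1 (coalesced) needs a single transaction for the whole warp; stride 32 needs one transaction per thread.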

CUDA streaming
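A minimal sketch of overlapping transfers with computation using two streams. All names are hypothetical: a kernel `work`, device buffers `d_in`/`d_out`, host buffers `h_in`/`h_out` (which must be pinned, e.g. allocated with `cudaMallocHost`, for the async copies to overlap), and a per-stream chunk size `chunk`.

```cuda
// Hypothetical setup: while stream 0's chunk is being computed,
// stream 1's chunk can be copied in, and vice versa.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    work<<<grid, block, 0, s[i]>>>(d_in + i * chunk, d_out + i * chunk, chunk);
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
```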

Guidelines for Grid and Block Size

  1. Avoid small grid size and large block size.
  2. Avoid large grid size and small block size.
  3. Keep the number of blocks per grid a multiple of the number of SMs, e.g., 48 for the Turing architecture.
  4. Keep the number of threads per block a multiple of the number of CUDA cores in an SM, e.g., 64.
  5. Keep the number of threads per block a power of 2.
  6. Avoid warp divergence.
  7. Avoid register spilling.
  8. Enhance warp occupancy.
  9. A large bank size may yield higher bandwidth for shared memory access, but may result in more bank conflicts depending on the application's shared memory access patterns.
  10. Adjust the amount of shared memory and L2 cache
  11. Avoid bank conflict
  12. Coalesced and aligned memory access.
  13. Make sure data size is a multiple of cache granularity.
  14. Concurrent GPU/CPU executions
  15. Concurrent GPU executions and data transfer
  16. Use constant memory for data that does not change over the course of a kernel execution.
  17. Loop Unrolling.
  18. Kernel Fusion.
  19. Dynamic Parallelism: Use dynamic parallelism to launch kernels from within other kernels where appropriate, reducing the need for CPU intervention and improving data locality.
  20. Efficient Use of Atomic Operations: Use atomic operations judiciously as they can serialize access to memory, but they are essential for certain operations like reductions and histograms.
  21. More computation per memory access
  22. Re-compute may be faster than re-loading data.
  23. Minimize memory transfers from host to device
  24. Check each metric with nvprof as much as possible.
  25. Turn on the MPS daemon.
  26. Ensure that no one else is using the GPU while you are.
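As one concrete instance of the list above (item 17, loop unrolling), a hedged sketch: `#pragma unroll` asks nvcc to unroll a fixed-trip-count loop, trading a few extra instructions for reduced branch overhead. The kernel name and layout are illustrative.

```cuda
// Each thread scales four consecutive elements; the fixed trip count
// lets the compiler fully unroll the loop.
__global__ void scale4(float *v, float s) {
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        v[base + i] *= s;
}
```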
