All threads in a warp execute the same instruction at the same time, so avoid having threads in a warp take different logic paths. If the threads of a warp diverge, the warp serially executes each branch path, disabling the threads that did not take that path.
Warp divergence (Example 1):
__global__ void warpDivergenceKernel(int *data) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Conditional statement causing warp divergence
    if (index % 2 == 0) {
        data[index] = index;  // Path for even indices
    } else {
        data[index] = 0;      // Path for odd indices
    }
}
No warp divergence (Example 1):
__global__ void optimizedKernel(int *data) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Using arithmetic operations to avoid conditional branching
    int isOdd = index % 2;  // 0 for even indices, 1 for odd
    data[index] = (1 - isOdd) * index;
}
Warp divergence (Example 2):
__global__ void mathKernel1(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.0f;
    if (tid % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}
No warp divergence (Example 2):
__global__ void mathKernel2(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.0f;
    // Branch condition is uniform within each warp: no divergence
    if ((tid / warpSize) % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}
Note that the two versions of Example 2 write the same set of values, but arranged differently across the elements of c: mathKernel1 alternates per thread, while mathKernel2 alternates per warp.
Warp divergence (Example 3):
int f(int x) {
    if (0 <= x && x <= 10) {
        return 1;
    } else {
        return 0;
    }
}
No warp divergence (Example 3):
int f(int x) {
    return (int)(0 <= x && x <= 10);
}
$\displaystyle\text{Warp occupancy}=\frac{\text{number of active warps per SM}}{\text{maximum active warps per SM}}$
$\text{Needed parallelism} = \text{throughput} \times\text{latency}$ (Little's law).
Instruction latency = 20 cycles
Throughput per SM = 32 operations/cycle
Thread parallelism $= 32\times 20 = 640$ operations.
Thus, if the number of operations in flight on an SM is less than 640, the SM will sometimes be idle.
$\displaystyle \text{Needed data} = \left(\frac{\text{memory bandwidth}}{\text{memory frequency}}\right)\times \text{instruction latency}$
Dividing the memory bandwidth by the memory frequency gives how much data can be moved in a single cycle:
$\displaystyle\frac{ 144 \text{ GB/sec} }{1.566 \text{ G cycle/sec}} \approx 92$ Bytes/cycle
With a memory latency of 800 cycles, data parallelism $= 800\times 92$ Bytes $\approx$ 74 KB.
Suppose each thread moves 4 bytes from global memory to the SM for computation; then we need at least
$\displaystyle\frac{74 \text{ KB}}{ 4 \text{ bytes/thread} } \approx$ 18,500 threads
to hide the memory latency, i.e. to keep enough requests in flight to saturate the memory bandwidth.
64-bit mode: $\displaystyle\text{Bank index}=\left(\frac{\text{byte address}}{8 \text{ bytes per bank}}\right)\%32\text{ banks}$
Use nvprof as much as possible.
sudo nvidia-cuda-mps-control -d          # start the MPS control daemon
ps -ef | grep mps                        # check MPS status
echo quit | nvidia-cuda-mps-control      # stop the MPS daemon
nvidia-smi --query | grep 'Compute Mode' # check the current compute mode
nvidia-smi -i 1 -c MODE                  # set compute mode MODE on device 1
MODE values: 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED