__syncthreads() waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
__threadfence_block()__threadfence_block ensures that the writes are visible to “other threads” in the same block before proceeding. However, this does not guarantee that all threads in the block have completed their writes to shared/global memory before the fence. This means if threads are executing at different rates, some might reach the fence and proceed to read memory that other threads have not yet written to, potentially leading to a race condition.__threadfence()__threadfence_system()t0) in the second half of the block starts executing B[idx] = A[(BLOCK_SIZE-1) - idx]; before another thread in the first half has executed its write to A[idx], the t0 thread might read an initial value in A[].
```cuda=
#define BLOCK_SIZE 512global void swap (int* A, int* B)
{
unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
// Reaplce “A[idx] = ( idx < BLOCK_SIZE/2 ) ? 1 : 2” by the line below to avoid warp divergence A[idx] = (idx-(idx%(BLOCK_SIZE/2))) / (BLOCK_SIZE/2) + 1;
__threadfence_block();
B[idx] = A[(BLOCK_SIZE-1) - idx]; }
# Example 2
1. thread 1 executes `writeXY()`, while thread 2 executes `readXY()`.
```cuda=
__device__ int X = 1, Y = 2;
__device__ void writeXY()
{
X = 10;
Y = 20;
}
__device__ void readXY()
{
int B = Y;
int A = X;
}
There are 24 possible memory access orderings.
| Column 1 | Access ordering* | Number of combinations | Results |
|---|---|---|---|
| Case 1 | A<X</br>B<Y | 6$\displaystyle=\frac{4!}{2!2!}$ | X=10</br>Y=20</br>A=1</br>B=2 |
| Case 2 | X<A</br>B<Y | 6 | X=10</br>Y=20</br>A=10</br>B=2 |
| Case 3 | A<X</br>Y<B | 6 | X=10</br>Y=20</br>A=1</br>B=20 |
| Case 4 | X<A</br>Y<B | 6 | X=10</br>Y=20</br>A=10</br>B=20 |
* A<X represents the thread 2 write A first, and then the thread 1 write X.
If we add memory fence as shown as below, we can only remove the case 3 but not also the case 1 since memory fence take effect only when initial writings have occurred.
__device__ int X = 1, Y = 2;
__device__ void writeXY()
{
X = 10;
__threadfence();
Y = 20;
}
__device__ void readXY()
{
int B = Y;
__threadfence(); // why we need this line?
int A = X;
}
__threadfence() in writeXY() ensures that the write to the global variable X is visible to all threads in the grid before the same thread proceeds to write the global variable Y.__threadfence() in readXY() ensures that the read of X in readXY() sees the most recent value written to global memory by any thread, including the thread executing writeXY().__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
volatile float* result)
{
// Each block sums a subset of the input array.
float partialSum = calculatePartialSum(array, N);
if (threadIdx.x == 0) {
// Thread 0 of each block stores the partial sum
// to global memory. The compiler will use
// a store operation that bypasses the L1 cache
// since the "result" variable is declared as
// volatile. This ensures that the threads of
// the last block will read the correct partial
// sums computed by all other blocks.
result[blockIdx.x] = partialSum;
// Thread 0 makes sure that the incrementation
// of the "count" variable is only performed after
// the partial sum has been written to global memory.
__threadfence();
// Thread 0 signals that it is done.
unsigned int value = atomicInc(&count, gridDim.x);
// Thread 0 determines if its block is the last
// block to be done.
isLastBlockDone = (value == (gridDim.x - 1));
}
// Synchronize to make sure that each thread reads
// the correct value of isLastBlockDone.
__syncthreads();
if (isLastBlockDone) {
// The last block sums the partial sums
// stored in result[0 .. gridDim.x-1]
float totalSum = calculateTotalSum(result);
if (threadIdx.x == 0) {
// Thread 0 of last block stores the total sum
// to global memory and resets the count
// varialble, so that the next kernel call
// works properly.
result[0] = totalSum;
count = 0;
}
}
}
result[blockIdx.x] = partialSum and going to move on to the next line, it typically means that the thread has just issued the command to write partialSum into memory. The write command would be placed in a queue to be executed by the memory subsystem. The thread itself does not wait for the write to physically complete in memory.__syncthreads() would merely synchronise threads in the current block only, without enforcing the global memory writes for other block.