Explicit Barrier

`__syncthreads()` waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to `__syncthreads()` are visible to all threads in the block.
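As a concrete illustration of why the barrier is needed (a minimal sketch; the kernel name, `BLOCK_SIZE`, and the staging-through-shared-memory scheme are our own assumptions), consider a block-level array reversal:

```cuda=
#define BLOCK_SIZE 256

__global__ void reverseInBlock(const int* in, int* out)
{
    __shared__ int tile[BLOCK_SIZE];

    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;

    // Each thread writes one element into shared memory.
    tile[threadIdx.x] = in[idx];

    // Without this barrier, a thread could read tile[] before the
    // thread responsible for that slot has written it.
    __syncthreads();

    // Each thread reads an element written by a *different* thread.
    out[idx] = tile[(BLOCK_SIZE - 1) - threadIdx.x];
}
```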

Memory Fence

Example 1

```cuda=
__global__ void swap(int* A, int* B)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;

    // Replace "A[idx] = (idx < BLOCK_SIZE/2) ? 1 : 2" by the line below
    // to avoid warp divergence.
    A[idx] = (idx - (idx % (BLOCK_SIZE/2))) / (BLOCK_SIZE/2) + 1;

    __threadfence_block();

    B[idx] = A[(BLOCK_SIZE-1) - idx];
}
```

Example 2
1. Thread 1 executes `writeXY()`, while thread 2 executes `readXY()`.

   ```cuda=
   __device__ int X = 1, Y = 2;
   
   __device__ void writeXY()
   {
       X = 10;
       Y = 20;
   }
   
   __device__ void readXY()
   {
       int B = Y;
       int A = X;
   }
   ```

With no ordering enforced, the four memory accesses can interleave in $4! = 24$ possible ways.

| Case | Access ordering* | Number of combinations | Results |
| --- | --- | --- | --- |
| Case 1 | A<X<br/>B<Y | $\displaystyle\frac{4!}{2!\,2!}=6$ | X=10<br/>Y=20<br/>A=1<br/>B=2 |
| Case 2 | X<A<br/>B<Y | 6 | X=10<br/>Y=20<br/>A=10<br/>B=2 |
| Case 3 | A<X<br/>Y<B | 6 | X=10<br/>Y=20<br/>A=1<br/>B=20 |
| Case 4 | X<A<br/>Y<B | 6 | X=10<br/>Y=20<br/>A=10<br/>B=20 |

\* A<X means that thread 2's read of X (into A) happens before thread 1's write to X.

If we add memory fences as shown below, only case 3 is eliminated, not case 1: a fence enforces ordering between accesses but does not force any access to happen, so thread 2 can still read both old values. The fence merely guarantees that if thread 2 observes Y = 20, it will also observe X = 10.

```cuda=
__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    __threadfence(); // keeps the read of Y ordered before the read of X,
                     // so observing Y = 20 implies X = 10 will be read
    int A = X;
}
```

Example 3

```cuda=
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
```

Discussion

`__syncthreads()` merely synchronises the threads of the current block; it gives no visibility guarantee for global memory writes made by other blocks, which is why Example 3 also needs `__threadfence()` and the atomic counter.
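When a true grid-wide barrier is needed instead of the fence-plus-atomic pattern of Example 3, CUDA's cooperative groups API provides one. The sketch below is a hypothetical two-phase kernel, not part of the example above; it requires launching with `cudaLaunchCooperativeKernel` on a device that supports cooperative launch.

```cuda=
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhase(float* data)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every block writes its part of "data".
    // ...

    // Grid-wide barrier: all threads of all blocks wait here,
    // and their prior global memory writes are visible afterwards.
    grid.sync();

    // Phase 2: any block may now safely read the phase-1 results.
    // ...
}
```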