
The __device__ and __host__ qualifiers can be used together, in which case the function is compiled for both the host and the device.

cudaMalloc:
cudaError_t cudaMalloc ( void** devPtr, size_t size )
cudaMemcpy:
cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
kind takes one of the following values:

cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

This function exhibits synchronous behavior: the host application blocks until cudaMemcpy returns and the transfer is complete.
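Taken together, a minimal allocate/copy/free round trip looks like the sketch below (error checking omitted for brevity; the array size N and the variable names are arbitrary choices for illustration):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int N = 16;
    size_t nBytes = N * sizeof(float);

    float h_A[N];                        // host source buffer
    for (int i = 0; i < N; ++i) h_A[i] = (float)i;

    float *d_A = NULL;                   // device buffer
    cudaMalloc((void**)&d_A, nBytes);    // allocate on the device

    // host -> device, then device -> host; both calls block
    // until the transfer is complete
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
    float h_B[N];
    cudaMemcpy(h_B, d_A, nBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);                       // release device memory
    printf("h_B[5] = %f\n", h_B[5]);
    return 0;
}
```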
cudaMemset initializes device memory; cudaFree releases it. __syncthreads() is a device-side barrier that synchronizes all threads within a block.

1D block: dim3 BlockDim(int Ntx)
2D block: dim3 BlockDim(int Ntx, int Nty)
3D block: dim3 BlockDim(int Ntx, int Nty, int Ntz)
1D grid: dim3 GridDim(int Nbx)
2D grid: dim3 GridDim(int Nbx, int Nby)
3D grid: dim3 GridDim(int Nbx, int Nby, int Nbz)
Nt[xyz] is the number of threads in x/y/z direction.
Nb[xyz] is the number of blocks in x/y/z direction.
__global__ void Kernel(argument list)
Kernel<<<dim3 GridDim, dim3 BlockDim, size_t Ns, cudaStream_t S>>>(argument list)
Ns specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated shared memory. This dynamically allocated memory is used by any variable declared as an extern array, as mentioned in shared. Ns is an optional argument that defaults to 0. S specifies the associated stream; the default is stream 0.
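As an illustration of the Ns argument, the sketch below shows a kernel that reverses each block's elements through a dynamically sized shared array, and the matching launch. The kernel name and variables are arbitrary choices for this example:

```cuda
// Kernel using dynamically sized shared memory: the extern array's
// size is determined by the Ns launch argument, not at compile time.
__global__ void ReverseBlock(float *d_out, const float *d_in) {
    extern __shared__ float tile[];          // sized by Ns at launch
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = d_in[base + t];
    __syncthreads();                         // all loads finish before reads
    d_out[base + t] = tile[blockDim.x - 1 - t];
}

// Launch: the third argument supplies Ns, the bytes of dynamic
// shared memory per block (here one float per thread).
// ReverseBlock<<<GridDim, BlockDim, Ntx * sizeof(float)>>>(d_out, d_in);
```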
:::info
A kernel call is asynchronous with respect to the host thread. After a kernel is invoked, control returns to the host side immediately. You can call the following function to force the host application to wait for all kernels to complete. cudaError_t cudaDeviceSynchronize(void)
:::
Kernel restrictions: void return type; no static variables.

Built-in variables: gridDim, blockIdx, blockDim, threadIdx, warpSize.

Error-checking macro:

#define CHECK(call) \
{ \
    const cudaError_t error = call; \
    if (error != cudaSuccess) \
    { \
        printf("Error: %s:%d, ", __FILE__, __LINE__); \
        printf("code:%d, reason: %s\n", error, cudaGetErrorString(error)); \
        exit(1); \
    } \
}
Built-in API usage:
CHECK(cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice));
Kernel call usage:
kernel_function<<<grid, block>>>(argument list);
CHECK(cudaDeviceSynchronize());
#include <sys/time.h>   // for gettimeofday

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
Timing kernel:
double iStart = cpuSecond();
kernel_name<<<grid, block>>>(argument list);
cudaDeviceSynchronize();
double iElaps = cpuSecond() - iStart;
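As an alternative to host-side wall-clock timing, CUDA events measure elapsed GPU time directly and do not require blocking the whole device between unrelated operations. A sketch using the standard event API (error checking omitted):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // record in the default stream
kernel_name<<<grid, block>>>(/* argument list */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // wait until stop is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```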