CUDA kernel coding tutorial
https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
Tutorial 01: Say Hello to CUDA
In CUDA, both the CPU and GPU are used. (Q: which CPU and GPU?)
CPU: the host / GPU: the device.
The __global__ specifier marks a function that runs on the device (GPU).
Device functions that can be called from host code are called "kernels".
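A minimal kernel sketch, based on the tutorial's hello-world example (the function name cuda_hello is illustrative):

    #include <stdio.h>

    // __global__ marks this as a kernel: it runs on the GPU
    // but is launched from host code.
    __global__ void cuda_hello() {
        printf("Hello World from GPU!\n");
    }

    int main() {
        cuda_hello<<<1, 1>>>();   // launch 1 block of 1 thread
        cudaDeviceSynchronize();  // wait for the GPU to finish
        return 0;
    }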
Compiler: use nvcc (ships with the NVIDIA CUDA Toolkit).
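Compile and run, e.g. (the file name hello.cu is just an example):

    nvcc hello.cu -o hello
    ./hello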
1. Allocate host memory and initialize host data
2. Allocate device memory
cudaMalloc(void **devPtr, size_t size) and
cudaFree(void *devPtr).
devPtr: pointer to the allocated device memory; size is in bytes.
3. Transfer input data from host to device memory
cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)
kind: direction of the copy, e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.
4. Execute kernels
5. Transfer output from device memory to host (full workflow sketched below)
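A sketch of all five steps, assuming the tutorial's vector_add kernel (names like h_a, d_a, and the size N are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    // Single-thread kernel; parallelized in Tutorial 02.
    __global__ void vector_add(float *out, float *a, float *b, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }

    int main() {
        // 1. Allocate host memory and initialize host data
        float *h_a   = (float*)malloc(N * sizeof(float));
        float *h_b   = (float*)malloc(N * sizeof(float));
        float *h_out = (float*)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // 2. Allocate device memory
        float *d_a, *d_b, *d_out;
        cudaMalloc((void**)&d_a,   N * sizeof(float));
        cudaMalloc((void**)&d_b,   N * sizeof(float));
        cudaMalloc((void**)&d_out, N * sizeof(float));

        // 3. Transfer input data from host to device
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

        // 4. Execute kernel
        vector_add<<<1, 1>>>(d_out, d_a, d_b, N);

        // 5. Transfer output from device back to host
        cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h_out[0] = %f\n", h_out[0]);

        // Clean up
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
        free(h_a); free(h_b); free(h_out);
        return 0;
    }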
Profiling performance:
time ./vector_add
or
nvprof ./vector_add
Tutorial 02: CUDA in Actions.
Kernel execution configuration <<<...>>>: tells CUDA how many threads to launch on the GPU.
Threads are grouped into "thread blocks".
<<< M, T >>>
The kernel launches with a grid of M thread blocks; each thread block has T parallel threads.
- threadIdx.x and blockDim.x
- threadIdx.x : the index of the thread within its block
- blockDim.x : the size of the thread block (number of threads per block)
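A sketch of how threadIdx.x and blockDim.x parallelize vector_add within a single block, as in the tutorial (the 256-thread launch is an assumption):

    __global__ void vector_add(float *out, float *a, float *b, int n) {
        int index  = threadIdx.x;  // this thread's index within the block
        int stride = blockDim.x;   // total number of threads in the block
        // Each thread handles elements index, index + stride, index + 2*stride, ...
        for (int i = index; i < n; i += stride) {
            out[i] = a[i] + b[i];
        }
    }

    // Launch with 1 block of 256 threads:
    // vector_add<<<1, 256>>>(d_out, d_a, d_b, N);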