
CUDA kernel coding tutorial

jyb0101 2025. 5. 5. 15:40

https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

Tutorial 01

In CUDA programming, both the CPU and the GPU are used. (Q. Which CPU and GPU?)

CPU: host / GPU: device.

The __global__ specifier indicates a function that runs on the device (GPU).

Such a function, which can be called from host code, is called a "kernel".
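
For example, a minimal kernel sketch along the lines of the tutorial's vector_add (this version runs the whole loop in a single thread):

__global__ void vector_add(float *out, float *a, float *b, int n) {
    // Runs on the device (GPU); callable from host code, so it is a kernel.
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}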

 

Compiler: use nvcc from the CUDA Toolkit.
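
For example (assuming the source file is named vector_add.cu):

nvcc vector_add.cu -o vector_add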

1. Allocate host memory and initialize host data (a full sketch of steps 1-5 follows this list)

2. Allocate device memory

cudaMalloc(void **devPtr, size_t count) and

cudaFree(void *devPtr).

devPtr: pointer to the allocated device memory.

3. Transfer input data from host to device memory

cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind), where kind is e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost

 

4. Execute kernels

5. Transfer output from device memory to host
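
A minimal host-side sketch tying steps 1-5 together (array size N and the lack of error checking are simplifications; the kernel is the single-threaded vector_add sketched above):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10000000

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) out[i] = a[i] + b[i];  // kernel from the sketch above
}

int main() {
    // 1. Allocate host memory and initialize host data
    float *a   = (float *)malloc(sizeof(float) * N);
    float *b   = (float *)malloc(sizeof(float) * N);
    float *out = (float *)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // 2. Allocate device memory
    float *d_a, *d_b, *d_out;
    cudaMalloc((void **)&d_a,   sizeof(float) * N);
    cudaMalloc((void **)&d_b,   sizeof(float) * N);
    cudaMalloc((void **)&d_out, sizeof(float) * N);

    // 3. Transfer input data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // 4. Execute the kernel (one block, one thread; parallelism comes in Tutorial 02)
    vector_add<<<1, 1>>>(d_out, d_a, d_b, N);

    // 5. Transfer output from device memory back to host
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", out[0]);

    // Free device memory with cudaFree, host memory with free
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(a); free(b); free(out);
    return 0;
}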

 

Profiling performance:

time ./vector_add

or

nvprof ./vector_add
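
Note: on recent GPUs nvprof has been superseded by Nsight Systems; the equivalent run would be:

nsys profile ./vector_add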

 

Tutorial 02: CUDA in Actions. 

The kernel execution configuration <<<...>>> tells CUDA how many threads to launch on the GPU.

Threads are grouped into units called "thread blocks".

<<< M, T >>>

The kernel launches with a grid of M thread blocks; each thread block has T parallel threads.

- threadIdx.x and blockDim.x

- threadIdx.x: the index of the thread within its block

- blockDim.x: the size of the thread block (number of threads per block)
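
Putting these together, a common indexing pattern gives each thread one array element (a sketch; it also uses blockIdx.x, the index of the block within the grid, which the notes above do not mention, and d_out, d_a, d_b stand for device pointers as in the earlier example):

__global__ void vector_add(float *out, float *a, float *b, int n) {
    // Global index of this thread across all M blocks of T threads.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid may contain more threads than array elements.
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Launch with enough blocks to cover n elements, e.g. T = 256:
//   int T = 256;
//   int M = (n + T - 1) / T;
//   vector_add<<<M, T>>>(d_out, d_a, d_b, n);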