
CUDA kernel coding tutorial

jyb0101 2025. 5. 5. 15:40

https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

Tutorial 01

In CUDA programming, both the CPU and the GPU are used. (Q. Which CPU and GPU?)

CPU: host / GPU: device.

The __global__ specifier indicates a function that runs on the device (GPU).

Such a function, which can be called from host code, is called a "kernel".
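
For example, a minimal kernel sketch along the lines of the tutorial's vector_add (this version runs the whole loop in a single thread):

__global__ void vector_add(float *out, float *a, float *b, int n) {
    // Runs on the device (GPU); callable from host code, so it is a kernel.
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}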

 

Compiler: use nvcc from the CUDA Toolkit.
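
For example (assuming the source file is named vector_add.cu):

nvcc vector_add.cu -o vector_add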

1. Allocate host memory and initialize host data (a full sketch of steps 1-5 follows this list)

2. Allocate device memory

cudaMalloc(void **devPtr, size_t count) and

cudaFree(void *devPtr).

devPtr: pointer to the allocated device memory.

3. Transfer input data from host to device memory

cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind), where kind is e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost

 

4. Execute kernels

5. Transfer output from device memory to host
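
A minimal host-side sketch tying steps 1-5 together (array size N and the lack of error checking are simplifications; the kernel is the single-threaded vector_add sketched above):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10000000

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) out[i] = a[i] + b[i];  // kernel from the sketch above
}

int main() {
    // 1. Allocate host memory and initialize host data
    float *a   = (float *)malloc(sizeof(float) * N);
    float *b   = (float *)malloc(sizeof(float) * N);
    float *out = (float *)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // 2. Allocate device memory
    float *d_a, *d_b, *d_out;
    cudaMalloc((void **)&d_a,   sizeof(float) * N);
    cudaMalloc((void **)&d_b,   sizeof(float) * N);
    cudaMalloc((void **)&d_out, sizeof(float) * N);

    // 3. Transfer input data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // 4. Execute the kernel (one block, one thread; parallelism comes in Tutorial 02)
    vector_add<<<1, 1>>>(d_out, d_a, d_b, N);

    // 5. Transfer output from device memory back to host
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", out[0]);

    // Free device memory with cudaFree, host memory with free
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(a); free(b); free(out);
    return 0;
}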

 

Profiling performance:

time ./vector_add

or

nvprof ./vector_add
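
Note: on recent GPUs nvprof has been superseded by Nsight Systems; the equivalent run would be:

nsys profile ./vector_add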

 

Tutorial 02: CUDA in Actions. 

The kernel execution configuration <<<...>>> tells CUDA how many threads to launch on the GPU.

Threads are grouped into units called "thread blocks".

<<< M, T >>>

The kernel launches with a grid of M thread blocks; each thread block has T parallel threads.

- threadIdx.x and blockDim.x

- threadIdx.x: the index of the thread within its block

- blockDim.x: the size of the thread block (number of threads per block)
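
Putting these together, a common indexing pattern gives each thread one array element (a sketch; it also uses blockIdx.x, the index of the block within the grid, which the notes above do not mention, and d_out, d_a, d_b stand for device pointers as in the earlier example):

__global__ void vector_add(float *out, float *a, float *b, int n) {
    // Global index of this thread across all M blocks of T threads.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid may contain more threads than array elements.
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Launch with enough blocks to cover n elements, e.g. T = 256:
//   int T = 256;
//   int M = (n + T - 1) / T;
//   vector_add<<<M, T>>>(d_out, d_a, d_b, n);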