GPGPU programming consists of writing two types of programs: host programs and kernels.
Host programs run on the CPU.
They manage kernels and memory transfers between the CPU RAM and the device global memory.
Here is an example of host code managing one kernel with SPOC:
let host_vec_add vec_size =
  let devices = Spoc.Devices.init()
  and a = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and b = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and res = Spoc.Vector.create Spoc.Vector.float32 vec_size in
  let threadsPerBlock = 256 in
  let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
  let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
  let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1} in
  Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);
  for i = 0 to vec_size - 1 do
    Printf.printf "res[%d] = %g\n" i res.[<i>]
  done
Let us look at it in detail. First, we need to initialise the SPOC library:
let devices = Spoc.Devices.init()
does this; devices is now an array containing every device on the system that is compatible with SPOC.
Spoc.Vector.create Spoc.Vector.float32 vec_size
creates a vector that can be transferred between host and device memory.
We must specify the vector's element type (here Spoc.Vector.float32) and its size (here vec_size).
let threadsPerBlock = 256 in
let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1}
Here we define the local and global dimensions of our problem.
When we launch our kernel, it will be executed with these parameters.
The grid represents the global dimension of the problem: it defines the number of blocks we will launch.
Blocks represent the local dimension of the problem: each block contains blockX*blockY*blockZ threads.
In total, launching the kernel starts (blockX*blockY*blockZ)*(gridX*gridY*gridZ) threads on the device, split into (gridX*gridY*gridZ) blocks of (blockX*blockY*blockZ) threads.
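To make the arithmetic concrete, here is a minimal sketch in plain OCaml (no SPOC required) that applies the same rounding-up formula to the vector size used in this example, 100000 elements with 256 threads per block. The helper names blocks_per_grid and total_threads are hypothetical and exist only for this illustration.

```ocaml
(* Hypothetical helpers for illustration; plain OCaml, SPOC is not needed. *)

(* Number of blocks needed to cover vec_size elements (integer division,
   rounded up), matching the blocksPerGrid formula in the host code. *)
let blocks_per_grid vec_size threads_per_block =
  (vec_size + threads_per_block - 1) / threads_per_block

(* Total number of threads started by a 1D launch. *)
let total_threads vec_size threads_per_block =
  blocks_per_grid vec_size threads_per_block * threads_per_block

let () =
  let vec_size = 100000 and threads_per_block = 256 in
  let blocks = blocks_per_grid vec_size threads_per_block in
  let threads = total_threads vec_size threads_per_block in
  (* prints "blocks = 391, threads = 100096, surplus = 96" *)
  Printf.printf "blocks = %d, threads = %d, surplus = %d\n"
    blocks threads (threads - vec_size)
```

Because the block count is rounded up, the launch starts slightly more threads than there are elements. This is presumably why vec_size is also passed to the kernel as an argument: it lets the surplus threads detect that they have no element to process.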
We can now launch our kernel:
Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);
This launches the kernel vec_add on the device devices.(0), with the block and grid defined earlier and with the parameters (a, b, res, vec_size).
We can now print the result computed in res:
for i = 0 to vec_size - 1 do
  Printf.printf "res[%d] = %g\n" i res.[<i>]
done
In this example there are no explicit memory transfers: SPOC handles them automatically, making each vector available on whichever hardware needs it (CPU or GPGPU device).
Using an external Kernel (Cuda or OpenCL)
To use an external Kernel we have to tell the host code where to look.
This is done using the following code:
kernel vec_add : Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> int -> unit
  = "kernels/Spoc_kernels" "vec_add"
This declares the kernel vec_add by its type, that is, its argument types followed by its return type (a kernel must always return unit).
We then have to tell SPOC where to find the kernel, by giving the relative path to the corresponding .ptx (for Cuda) or .cl (for OpenCL) file, without its extension, and the name of the function in that file that we want to associate with our kernel.
Here vec_add is an external kernel located in the file kernels/Spoc_kernels.(ptx/cl); it corresponds to the function vec_add and takes three Spoc.Vector.vfloat32 and an int as its arguments.
kernel vec_add : Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> int -> unit
  = "kernels/Spoc_kernels" "vec_add"

let host_vec_add vec_size =
  let devices = Spoc.Devices.init()
  and a = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and b = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and res = Spoc.Vector.create Spoc.Vector.float32 vec_size in
  let threadsPerBlock = 256 in
  let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
  let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
  let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1} in
  Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);
  for i = 0 to vec_size - 1 do
    Printf.printf "res[%d] = %g\n" i res.[<i>]
  done

let _ = host_vec_add 100000
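The source of the external kernel itself is not shown on this page. As a rough mental model only, the sketch below emulates on the CPU, in plain OCaml with ordinary float arrays, what a one-dimensional launch of such a kernel presumably does. It assumes that vec_add performs an element-wise addition (suggested by its name, not confirmed here). The function emulate_vec_add and its parameters are hypothetical and are not part of SPOC.

```ocaml
(* CPU-only emulation of a 1D kernel launch. All names are hypothetical;
   this is not SPOC code and not the actual kernel source. *)
let emulate_vec_add ~block_x ~grid_x a b res n =
  (* One iteration of the outer loop per block of the grid *)
  for block_idx = 0 to grid_x - 1 do
    (* One iteration of the inner loop per thread of the block *)
    for thread_idx = 0 to block_x - 1 do
      (* Global index of this emulated thread *)
      let i = block_idx * block_x + thread_idx in
      (* Surplus threads in the last block must not touch memory *)
      if i < n then res.(i) <- a.(i) +. b.(i)
    done
  done

let () =
  let n = 1000 in
  let a = Array.init n float_of_int in   (* a.(i) = i *)
  let b = Array.make n 1.0 in            (* b.(i) = 1 *)
  let res = Array.make n 0.0 in
  let block_x = 256 in
  let grid_x = (n + block_x - 1) / block_x in
  emulate_vec_add ~block_x ~grid_x a b res n;
  Printf.printf "res[0] = %g, res[%d] = %g\n" res.(0) (n - 1) res.(n - 1)
```

On a real device the two loops do not exist: each (block, thread) pair runs concurrently, and the kernel body only computes its own index and handles one element.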