A Simple Example

GPGPU programming consists of writing two types of programs:

host programs running on the CPU
device programs (called kernels), which run on the GPGPU device (for example a GPU)

Host Code

Host programs run on the CPU.

They are used to manage kernels and memory transfers between the CPU RAM and the Device global memory.

Here is an example of host code managing one kernel with SPOC:

let host_vec_add vec_size =
  let devices = Spoc.Devices.init()
  and a = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and b = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and res = Spoc.Vector.create Spoc.Vector.float32 vec_size in
  let threadsPerBlock = 256 in
  let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
  let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
  let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1} in
  Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);
  for i = 0 to vec_size - 1 do
    Printf.printf "res[%d] = %g\n" i res.[<i>]
  done

Let us look at this in detail. First, we need to initialise the SPOC library:

  let devices = Spoc.Devices.init()

This call initialises the library; devices is now an array containing every SPOC-compatible device on the system.
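The host code later indexes devices.(0) directly, and in OCaml indexing an empty array raises Invalid_argument. A minimal sketch of a guard follows, in plain OCaml. The helper first_device is hypothetical (it is not part of SPOC), and an array of strings stands in for the device array so that the sketch runs without the library.

```ocaml
(* Hypothetical helper, not part of SPOC: return the first element of a
   device array, or None when the array is empty. *)
let first_device devices =
  if Array.length devices = 0 then None else Some devices.(0)

let () =
  (* Stand-in for the array returned by Spoc.Devices.init ();
     strings are used only so that this sketch runs without SPOC. *)
  let devices = [| "device 0"; "device 1" |] in
  match first_device devices with
  | Some d -> Printf.printf "using %s\n" d
  | None -> prerr_endline "no compatible device found"
```

With a real device array, the same check would let the program report a missing device instead of failing on the array access.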

Spoc.Vector.create Spoc.Vector.float32 vec_size

creates a vector that can be transferred between host and device memory.

We must specify the vector's data type (here Spoc.Vector.float32) and its size (here vec_size).

let threadsPerBlock = 256 in
let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1}

Here we define the local and global dimensions of our problem.

When we launch our kernel, it will be executed with these parameters.

The grid represents the global dimension of the problem (a grid defines the number of blocks we will launch).

Blocks represent the local dimension of the problem (each block contains blockX*blockY*blockZ threads).

In total, launching the kernel will start (blockX*blockY*blockZ)*(gridX*gridY*gridZ) threads on the device, split into (gridX*gridY*gridZ) blocks of (blockX*blockY*blockZ) threads each.
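To make the arithmetic concrete, here is a small worked example in plain OCaml (no SPOC required). It applies the same rounding-up division as blocksPerGrid to the values used in this example: 100000 elements and 256 threads per block. The function name blocks_needed is only illustrative.

```ocaml
(* Ceiling division: the smallest number of blocks whose threads
   cover n elements. *)
let blocks_needed ~threads_per_block n =
  (n + threads_per_block - 1) / threads_per_block

let () =
  let vec_size = 100_000 and threads_per_block = 256 in
  let blocks_per_grid = blocks_needed ~threads_per_block vec_size in
  let total_threads = blocks_per_grid * threads_per_block in
  (* prints "blocks = 391, threads = 100096, surplus = 96" *)
  Printf.printf "blocks = %d, threads = %d, surplus = %d\n"
    blocks_per_grid total_threads (total_threads - vec_size)
```

Because the division rounds up, the launch starts 96 more threads than there are elements. This is presumably why vec_size is passed to the kernel: each thread can compare its index against the size and do nothing when it falls outside the vector.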

We can now launch our kernel:

Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);

This will launch the kernel vec_add on the device devices.(0), with the block and grid defined earlier and with the parameters (a, b, res, vec_size).

We can now print the result computed in res:

for i = 0 to vec_size - 1 do 
  Printf.printf "res[%d] = %g\n" i res.[<i>];
done

In this example there are no explicit memory transfers, as SPOC handles them automatically, making each vector available on whichever hardware needs it (CPU or GPGPU device).

Kernel Code

Using an external Kernel (Cuda or OpenCL)

To use an external Kernel we have to tell the host code where to look.

This is done using the following code:

kernel vec_add : Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> int -> unit = "kernels/Spoc_kernels" "vec_add"

This defines the kernel vec_add by its type: its argument types followed by its return type (a kernel must always return unit).

We then have to tell SPOC where to find the kernel. This is done by giving the relative path to the corresponding .ptx (for Cuda) or .cl (for OpenCL) file, without its extension, followed by the name of the function in that file that we want to associate with our kernel.

Here, vec_add is an external kernel located in the file kernels/Spoc_kernels.(ptx/cl). It corresponds to the function vec_add in that file and takes three Spoc.Vector.vfloat32 arguments and one int.
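The body of the external vec_add function is not shown here. Assuming it performs an element-wise addition with one thread per element and a bounds check against the size argument, its behaviour can be emulated on the CPU in plain OCaml. This is a sketch for illustration only: the names vec_add_thread and emulate_vec_add are hypothetical, ordinary float arrays stand in for SPOC vectors, and the threads run sequentially rather than in parallel.

```ocaml
(* Hypothetical per-thread body: each thread handles at most one element.
   The global index is derived from the block index and the thread index,
   and threads past the end of the vector do nothing. *)
let vec_add_thread a b res n ~block_idx ~block_dim ~thread_idx =
  let i = block_idx * block_dim + thread_idx in
  if i < n then res.(i) <- a.(i) +. b.(i)

(* CPU emulation of a 1D launch: run every thread of every block in turn. *)
let emulate_vec_add ~block_dim a b res n =
  let grid_dim = (n + block_dim - 1) / block_dim in
  for block_idx = 0 to grid_dim - 1 do
    for thread_idx = 0 to block_dim - 1 do
      vec_add_thread a b res n ~block_idx ~block_dim ~thread_idx
    done
  done

let () =
  let n = 10 in
  let a = Array.init n float_of_int   (* 0., 1., ..., 9. *)
  and b = Array.make n 1.0
  and res = Array.make n 0.0 in
  (* 3 blocks of 4 threads: 12 threads for 10 elements *)
  emulate_vec_add ~block_dim:4 a b res n;
  Array.iteri (fun i x -> Printf.printf "res[%d] = %g\n" i x) res
```

A reference implementation of this kind can also be used to check the values printed by the host code against a CPU computation.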

Full Code

vec_add.ml

kernel vec_add : Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> Spoc.Vector.vfloat32 -> int -> unit = "kernels/Spoc_kernels" "vec_add"
 
 
let host_vec_add vec_size =
  let devices = Spoc.Devices.init()
  and a = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and b = Spoc.Vector.create Spoc.Vector.float32 vec_size
  and res = Spoc.Vector.create Spoc.Vector.float32 vec_size in
  let threadsPerBlock = 256 in
  let blocksPerGrid = (vec_size + threadsPerBlock - 1) / threadsPerBlock in
  let block = {Spoc.Kernel.blockX = threadsPerBlock; Spoc.Kernel.blockY = 1; Spoc.Kernel.blockZ = 1} in
  let grid = {Spoc.Kernel.gridX = blocksPerGrid; Spoc.Kernel.gridY = 1; Spoc.Kernel.gridZ = 1} in
  Spoc.Kernel.run devices.(0) (block,grid) vec_add (a, b, res, vec_size);
  for i = 0 to vec_size - 1 do
    Printf.printf "res[%d] = %g\n" i res.[<i>]
  done
 
let _ = host_vec_add 100000