OpenCL is an effective tool to improve the performance of our programs. However, there are far fewer references for OpenCL than for CUDA, which makes it harder to get started with.
Recently, I used OpenCL to write a kernel in a deep-learning framework. Since there was already a CUDA version of this kernel, I directly translated it from CUDA to OpenCL. I was familiar with neither CUDA nor OpenCL at first, so I'm writing down here how to map your code from CUDA to OpenCL. I will update this blog if I work on more complex OpenCL issues in the future.
The mapping of basic variables from CUDA to OpenCL
In CUDA, there are three important and common built-in variables: threadIdx, blockIdx and blockDim.
In OpenCL, instead, we use APIs to get similar information: get_local_id(), get_local_size(), get_group_id() and get_num_groups().
Note: here I assume you have some essential knowledge of CUDA. If not, you may want to look up the related concepts in your search engine first.
The mapping is simple, as listed below.
| CUDA | OpenCL |
| --- | --- |
| threadIdx | get_local_id() |
| blockIdx | get_group_id() |
| blockDim | get_local_size() |
| threadIdx.x | get_local_id(0) |
| blockIdx.y | get_group_id(1) |
| blockDim.z | get_local_size(2) |
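To see the table in action, here is a minimal vector-add kernel sketched in both languages (the kernel name and signature are my own illustration, not from the framework mentioned above):

```c
// CUDA version
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    // global index = block index * block size + thread index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```

```c
// OpenCL version: same index computed with the API calls from the table
__kernel void vec_add(__global const float *a, __global const float *b,
                      __global float *c, int n) {
    int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
    if (i < n) c[i] = a[i] + b[i];
}
```

When the global work offset is zero, OpenCL also provides get_global_id(0) as a shortcut for this whole expression.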
The API get_num_groups(), as its name suggests, returns how many groups (blocks) there are in each dimension. Its CUDA counterpart is the built-in variable gridDim, e.g. gridDim.x maps to get_num_groups(0).
The mapping of parameter settings from CUDA to OpenCL
The parameter settings here mainly refer to the setting of block size, block count and global size.
In CUDA, we use the <<<...>>> syntax to set these parameters. For example, forward_kernel<<<dim3(bx, by), dim3(tx, ty), 0, stream>>> sets the block count to (bx, by) and the block size to (tx, ty) on two dimensions.
In OpenCL, we use the API below to set them.
```c
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel        kernel,
    cl_uint          work_dim,                 // set the dim
    const size_t    *global_work_offset,
    const size_t    *global_work_size,         // set the global size
    const size_t    *local_work_size,          // set the block size
    cl_uint          num_events_in_wait_list,
    const cl_event  *event_wait_list,
    cl_event        *event);
```
The difference is that in CUDA we specify the block size and the block count, while in OpenCL we specify the block size and the global size.
So what we need to do is compute the global size manually in our code before feeding it to the OpenCL API.
I hope this helps if you are trying to convert a CUDA kernel to OpenCL with little knowledge of either, just like me. At least I succeeded :)