As we know, it is faster to use local memory or private memory in OpenCL. To compare them, I ran a test. Before doing that, let's look at the elapsed time of each part:

  • OpenCL part: about 0.9s (looks fair)
  • Merging semiResult into finalResult: 18s (too bad)

It is clear that the bottleneck is merging semiResult into finalResult.

For comparison, the Python Imaging Library's histogram function takes only 1.5s on the same picture, images/7728x4354.jpg.

Private Memory

For the private-memory test, it is still worth using private memory in the kernel code without changing anything else. The commit is a test of private memory. Since we need to save the result to the global memory result anyway, we can accumulate into private memory first and then copy it out to global memory. If we look closely at the number of global memory accesses, it seems to stay about the same as when using global memory directly. So the kernel code looks like:

// 1024 pixels per work item.
#define BIN_SIZE 1024
#define RESULT_SIZE 768
 
__kernel void histogram(__global unsigned char* bytes, __global unsigned int* pixelCount,
                        __global unsigned int* result)
{
  unsigned int lid = get_local_id(0);
  unsigned int gid = get_group_id(0);
  unsigned int gsize = get_local_size(0);
  unsigned int globalId = get_global_id(0);
 
  unsigned int i, bValue;
  unsigned int basePixelIdx = lid * BIN_SIZE + gid * gsize * BIN_SIZE;
  unsigned int baseResultIdx = globalId * RESULT_SIZE;
  unsigned int maxPixel = *pixelCount;
  unsigned int privateBuffer[RESULT_SIZE];
 
  for (i = 0; i < RESULT_SIZE; i++) {
    privateBuffer[i] = 0;  // zero the private bins before counting
  }
 
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    // data partition of bytes is RGBRGBRGB....
    bValue = bytes[basePixelIdx * 3 + processIndex * 3];
    // result partition is RR..RRGG..GGBB..BB.
    privateBuffer[bValue]++;
    // G
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 1];
    privateBuffer[256 + bValue]++;
    // B
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 2];
    privateBuffer[512 + bValue]++;
    processIndex++;
  }
 
  for (i = 0; i < RESULT_SIZE; i++) {
    result[baseResultIdx + i] = privateBuffer[i];
  }
}

I created a 768-entry unsigned int array as temporary storage and copied all of it to the global memory result. The improvement is not noticeable: the elapsed time of the OpenCL part stays about the same, around 0.9s.

First Attempt to Move the Merging Code to OpenCL

Since we already knew the bottleneck, I moved the merging code from the sequential CPU side to OpenCL. I still kept a huge temporary result, 768 unsigned integers per work item, but the difference is that the final result is now computed in OpenCL. I used the last 768 work items to copy the temporary results into the final result. If we have 100 work items, the merge costs 768 * 100 read operations and 768 write operations on global memory. In the previous version, only 768 write operations on global memory were needed. We still need to loop over all of the pixels, and that part is unchanged. The following is part of the kernel code of this version:

  // Only use the last 768 work items to copy the data. We assume that the
  // last 768 work items belong to the last group to execute.
  unsigned int lastGroup = totalTasks - RESULT_SIZE;
 
  for (i = 0; i < RESULT_SIZE; i++) {
    privateBuffer[i] = 0;
  }
  if (globalId >= lastGroup) {
    finalResult[globalId] = 0;
  }
 
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    ...(the same)
  }
 
  for (i = 0; i < RESULT_SIZE; i++) {
    tempResult[baseResultIdx + i] = privateBuffer[i];
  }
 
  // Note: barrier only synchronizes work items within the same work group,
  // not across groups.
  barrier(CLK_GLOBAL_MEM_FENCE);
 
  if (globalId >= lastGroup) {
    for (i = 0; i < totalTasks; i++) {
      finalResult[globalId - lastGroup] += tempResult[globalId - lastGroup + i * RESULT_SIZE];
    }
  }
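The access counts above can be tallied with a quick calculation. `merge_ops` is my own illustrative helper, not code from this project; it counts only the merge phase, not the per-pixel reads, and counts writes the same way the text does (one write per final bin):

```python
RESULT_SIZE = 768  # 256 bins for each of R, G, B

def merge_ops(num_work_items):
    """Global memory traffic of the device-side merge."""
    reads = RESULT_SIZE * num_work_items   # every temporary entry is read once
    writes = RESULT_SIZE                   # every final bin is written once
    return reads, writes

print(merge_ops(100))  # (76800, 768)
```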

This change improved the speed significantly. Since the merging code is no longer in Python, the OpenCL part takes about 2.5s. It runs faster than before but is still slower than the Python Imaging Library.

Conclusion

There are two things that can be improved:

  • try using local memory to store the temporary result
  • use work groups and test whether any improvement is gained

OpenCL is an open specification for heterogeneous parallel programming. As a front-end engineer, it is not really related to my field, but it's so interesting that I want to learn more about it. This is my first OpenCL program.

Before writing any code, I searched for a scripting-language binding of OpenCL. It's a pity that I couldn't find a good JavaScript binding, since JavaScript is my primary programming language at Mozilla. So the alternative is Python. The Python binding of OpenCL is pyopencl. You can find a lot of articles on Google about how to install it.

After googling, a good example can be found here. So, I'll try to share my own code. If you know other good examples, please let me know; I will be very happy to list yours on my site.

My first example is “histogram”. Before discussing it, let me explain a little bit about histograms in case you don't know them. A histogram shows the distribution of pixel values per channel, and there are many kinds of histograms. This one is an R, G, B histogram, as opposed to a gray-level histogram or a histogram in another color space.
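To make the target concrete, here is a pure-Python reference of the same R, G, B histogram. `reference_histogram` is my own illustrative helper, not code from this project; it assumes the RGBRGB byte layout and the RR…GG…BB output layout used below:

```python
def reference_histogram(pixels):
    """Count R, G, B values of an RGBRGB... byte string into a
    768-entry list laid out as RR..RGG..GBB..B (256 bins per channel)."""
    result = [0] * 768
    for i in range(0, len(pixels) - 2, 3):
        result[pixels[i]] += 1            # R bins: 0..255
        result[256 + pixels[i + 1]] += 1  # G bins: 256..511
        result[512 + pixels[i + 2]] += 1  # B bins: 512..767
    return result

# e.g. two pure-red pixels and one pure-blue pixel
h = reference_histogram(bytes([255, 0, 0, 255, 0, 0, 0, 0, 255]))
```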

To get started, I tried to write a small piece of OpenCL code. The first part moved to OpenCL is the R, G, B counting. The program looks like:

// 1024 pixels per work item.
#define BIN_SIZE 1024
#define RESULT_SIZE 768
 
__kernel void histogram(__global unsigned char* bytes, __global unsigned int* pixelCount,
                        __global unsigned int* result)
{
  unsigned int lid = get_local_id(0);
  unsigned int gid = get_group_id(0);
  unsigned int gsize = get_local_size(0);
  unsigned int globalId = get_global_id(0);
 
  unsigned int i, bValue;
  unsigned int basePixelIdx = lid * BIN_SIZE + gid * gsize * BIN_SIZE;
  unsigned int baseResultIdx = globalId * RESULT_SIZE;
  unsigned int maxPixel = *pixelCount;
 
  for (i = 0; i < RESULT_SIZE; i++) {
    result[baseResultIdx + i] = 0;
  }
 
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    // data partition of bytes is RGBRGBRGB....
    bValue = bytes[basePixelIdx * 3 + processIndex * 3];
    // result partition is RR..RRGG..GGBB..BB.
    result[baseResultIdx + bValue]++;
    // G
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 1];
    result[baseResultIdx + 256 + bValue]++;
    // B
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 2];
    result[baseResultIdx + 512 + bValue]++;
    processIndex++;
  }
}

As you can see, each work item processes 1024 pixels. The bytes argument is the raw pixel data, laid out as RGBRGBRGB…RGB with one byte per channel. We have to check pixelCount because the last work item may not have a full 1024 pixels to process; that's what (basePixelIdx + processIndex < maxPixel) checks. The result argument holds a temporary result for each work item, so the array is huge: with 1000 work items, we have 768,000 unsigned integers in this argument. I know the performance may not be good, but it's nice to keep the logic simple in a first example. The result buffer is partitioned as RRR…RRRGGG…GGGBBB…BBB, for compatibility with the Python image module.
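The indexing described above can be sanity-checked on the host. This is a small sketch with my own helper names (base_pixel_idx, covered_pixels); it mirrors the kernel's basePixelIdx formula and verifies that every pixel is handled exactly once:

```python
BIN_SIZE = 1024  # pixels per work item, matching the kernel

def base_pixel_idx(global_id, group_size):
    # Mirrors the kernel: lid * BIN_SIZE + gid * gsize * BIN_SIZE
    lid = global_id % group_size
    gid = global_id // group_size
    return lid * BIN_SIZE + gid * group_size * BIN_SIZE

def covered_pixels(num_work_items, group_size, max_pixel):
    """List every pixel index some work item would process."""
    covered = []
    for wi in range(num_work_items):
        base = base_pixel_idx(wi, group_size)
        for offset in range(BIN_SIZE):
            if base + offset < max_pixel:
                covered.append(base + offset)
    return covered

# 8 work items in groups of 4 cover 8000 pixels with no gaps or overlaps
pixels = covered_pixels(8, 4, 8000)
```

Note that basePixelIdx reduces to globalId * BIN_SIZE; the kernel just spells it out via local and group ids.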

The python code to invoke this kernel code is:

...
def opencl_histogram(pixels):
  # format of pixels is RGBRGBRGB each of character in a byte
  # calculate buffer size
  groupSize = 4
  binSize = 1024
  pixelSize = len(pixels) // 3  # integer number of pixels
  trunkSize = int(math.ceil(math.ceil(pixelSize / groupSize) / binSize))
  globalSize = int(math.ceil(pixelSize / binSize))
  globalSize += (groupSize - globalSize % groupSize)
  # buffer size is 768(whole space) * group size * trunk size
  outputBufSize = 768 * groupSize * trunkSize
  #create context/queue
  clContext = cl.create_some_context()
  clQueue = cl.CommandQueue(clContext)
  f = open('histogram.c', 'r')
  fstr = ''.join(f.readlines())
  # create the program
  clProgram = cl.Program(clContext, fstr).build()
  # create buffers
  mf = cl.mem_flags
  bufPixels = cl.Buffer(clContext, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=pixels)
  bufPixelSize = cl.Buffer(clContext, mf.READ_ONLY | mf.USE_HOST_PTR, size=4, hostbuf=numpy.asarray([pixelSize]).astype(numpy.uint32))
  bufOutput = cl.Buffer(clContext, mf.WRITE_ONLY, size=outputBufSize * 4, hostbuf=None)
  clProgram.histogram(clQueue, (globalSize, ), (groupSize, ), bufPixels, bufPixelSize, bufOutput)
  semiFinal = numpy.zeros(outputBufSize, dtype=numpy.uint32)
  evt = cl.enqueue_copy(clQueue, semiFinal, bufOutput)  # read results back
  evt.wait()
 
  finalResult = [0] * 768
  for i in range(outputBufSize):
    finalResult[i % 768] += semiFinal[i]
 
  return finalResult
 
parser = argparse.ArgumentParser(description='Dump histogram data.')
parser.add_argument('--input', help='the input image')
parser.add_argument('--dump', help='dump the histogram if specify any value')
 
args = parser.parse_args()
 
if args.input is None:
  parser.print_help()
  sys.exit(1)
print ('trying to build histogram data for {}'.format(args.input))
 
image = Image.open(args.input)
 
...
print ('-' * 20)
 
start_time = time()
histogramG = opencl_histogram(image.tobytes())
end_time = time()
print ('time elapsed with open cl: {0}s'.format(end_time - start_time))
...

After the OpenCL computation, we need to merge each work item's results into a single 768-entry array. The performance of this code is very poor because that merge runs as a plain Python loop over every element.
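One way to shrink that Python-side merge without touching the kernel would be to let numpy do it in C. A sketch, assuming semiFinal is the flat uint32 array read back from the device:

```python
import numpy

def merge_semifinal(semiFinal):
    # Reshape the flat per-work-item results into (num_work_items, 768)
    # and sum down the columns in C instead of a Python loop.
    return semiFinal.reshape(-1, 768).sum(axis=0)

# two fake work-item buffers: merging sums them bin by bin
semi = numpy.arange(768 * 2, dtype=numpy.uint32)
merged = merge_semifinal(semi)
```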

I have another branch with better performance but some known issues and limitations. I will write another article about how I did it.

BTW, I am reading “OpenCL Programming by Example” now, but have only finished 4 chapters. I will keep updating my code to improve the performance.

This is my first time building a robot with LEGO EV3. It's a whole new, great experience. The modules are very well designed and the user interface is great. That explains why LEGO has the largest community in the world, even larger than Mozilla's. Here is my TRACK3R:

Are you playing Real Basketball? Whether or not you are, I am. It's an awesome game; if you don't play it, you should try it. But honestly, I am not a good player, so I wanted to hire someone to swipe the phone for me. It sounds like nonsense to hire a real person for that, so I wrote a small shell script to do it instead. I call it RealBasketballBot. You can find it in my GitHub repo: https://github.com/john-hu/RealBasketballBot. It's easy to use: just plug in the USB cable and type ./throw_ball.sh 400 1280 400 600 0.2. It works pretty well. BTW, you should install the Android USB driver if you use a Windows box.

miidio Services on Banana Pi

John Hu wrote some services running on Banana Pi a few months ago. It is incredible how stable the Banana Pi is: the services have run for about 3 months and were only down a few times because of network issues. One of the services is an audio conversion service that supports miidio space. Currently, there are 5 services running on the Banana Pi: 3 of them run daily and 2 of them run hourly. This hardware is very stable.

John Hu also tried to run a coin miner on the Banana Pi, but its CPU power is weak compared to a normal PC, and there is no GPU miner for this kind of hardware. So I turned it off without earning any coins. That's a pity. I think they will open-source the GPU driver, and then such a miner would be possible.

Official website of Banana Pi: http://www.bananapi.org/