OpenCL is an open specification for heterogeneous parallel programming. As a front-end engineer, it is not really related to my field, but it is so interesting that I want to learn more about it. Here is my first OpenCL program.

Before writing any code, I searched for a scripting-language binding for OpenCL. It’s a pity that I could not find a good JavaScript binding, since JavaScript is my primary programming language at Mozilla. So the alternative is Python. The Python binding for OpenCL is pyopencl. You can find lots of articles on Google about how to install it.

After googling, a good example can be found here. So, I decided to share my own code as well. If you know other good examples, please let me know. I will be very happy to list yours on my site.

My first example is a “histogram”. Before discussing it, let me explain a little about histograms in case you don’t know them. A histogram here is the distribution of R, G, B values in an image: for each channel, it counts how many pixels have each value from 0 to 255. There are many types of histogram. This one is an R, G, B histogram rather than a gray-level histogram or a histogram in another color space.
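As a point of reference, the same R, G, B histogram can be computed on the CPU with NumPy. This is only a sketch for checking results, not part of the OpenCL program; the function name rgb_histogram is made up for illustration, and it assumes the input is interleaved RGBRGB... bytes.

```python
import numpy

def rgb_histogram(pixels):
    """Return 768 counts: 256 for R, then 256 for G, then 256 for B."""
    data = numpy.frombuffer(pixels, dtype=numpy.uint8)
    # drop trailing bytes that do not form a whole RGB pixel
    data = data[:len(data) - len(data) % 3].reshape(-1, 3)
    result = []
    for channel in range(3):
        counts = numpy.bincount(data[:, channel], minlength=256)
        result.extend(int(c) for c in counts)
    return result

# two pixels: pure red (255, 0, 0) and mid grey (128, 128, 128)
hist = rgb_histogram(bytes([255, 0, 0, 128, 128, 128]))
print(hist[255], hist[256 + 0], hist[512 + 128])
```

Each channel occupies its own block of 256 entries, which is the same RR..GG..BB layout the kernel below produces.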

To get started with OpenCL, I wrote a small piece of OpenCL code. The first part I moved to OpenCL is the R, G, B counting. The program looks like:

// 1024 pixels per work item.
#define BIN_SIZE 1024
#define RESULT_SIZE 768
__kernel void histogram(__global unsigned char* bytes, __global unsigned int* pixelCount,
                        __global unsigned int* result)
{
  unsigned int lid = get_local_id(0);
  unsigned int gid = get_group_id(0);
  unsigned int gsize = get_local_size(0);
  unsigned int globalId = get_global_id(0);
  unsigned int i, bValue;
  unsigned int basePixelIdx = lid * BIN_SIZE + gid * gsize * BIN_SIZE;
  unsigned int baseResultIdx = globalId * RESULT_SIZE;
  unsigned int maxPixel = *pixelCount;
  // clear this work item's partial histogram
  for (i = 0; i < RESULT_SIZE; i++) {
    result[baseResultIdx + i] = 0;
  }
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    // data partition of bytes is RGBRGBRGB....
    // R
    bValue = bytes[basePixelIdx * 3 + processIndex * 3];
    // result partition is RR..RRGG..GGBB..BB.
    result[baseResultIdx + bValue]++;
    // G
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 1];
    result[baseResultIdx + 256 + bValue]++;
    // B
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 2];
    result[baseResultIdx + 512 + bValue]++;
    // move on to the next pixel of this work item
    processIndex++;
  }
}

As you can see, each work item processes 1024 pixels. The bytes argument is the raw pixel data, laid out as RGBRGBRGB…RGB with one byte per channel. We have to check pixelCount because the last work item may not have a full 1024 pixels to process, so I use (basePixelIdx + processIndex < maxPixel) for the check. The result argument holds the temporary result of every work item, so the array is huge: with 1000 work items, it contains 768,000 unsigned integers. I know the performance may not be good, but it’s nice to have simple logic in the first example. The result is laid out as RRR…RRRGGG…GGGBBB…BBB, to be compatible with the Python image module.
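To make the sizes concrete, here is a small Python sketch of the partitioning described above. The helper names partial_buffer_layout and work_item_range are hypothetical, and the group size of 4 is only an example value; the sketch just does the arithmetic on the host and does not touch OpenCL.

```python
import math

BIN_SIZE = 1024      # pixels per work item, as in the kernel
RESULT_SIZE = 768    # 256 bins each for R, G and B

def partial_buffer_layout(pixel_count, group_size=4):
    """Return (work_items, uint32_entries, bytes) for the temporary result buffer."""
    work_items = math.ceil(pixel_count / BIN_SIZE)
    # round up so the global size is a multiple of the work-group size
    work_items = math.ceil(work_items / group_size) * group_size
    entries = work_items * RESULT_SIZE
    return work_items, entries, entries * 4

def work_item_range(global_id, pixel_count):
    """Pixel index range [start, end) handled by one work item."""
    start = min(global_id * BIN_SIZE, pixel_count)
    end = min(start + BIN_SIZE, pixel_count)
    return start, end

# a 1,024,000-pixel image needs exactly 1000 work items
print(partial_buffer_layout(1024000))
# with 24 pixels fewer, the last work item only gets 1000 pixels
print(work_item_range(999, 1024000 - 24))
```

The second call shows why the kernel needs the pixelCount check: the last work item stops before it reaches a full 1024 pixels.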

The python code to invoke this kernel code is:

import argparse
import math
from time import time

import numpy
import pyopencl as cl

def opencl_histogram(pixels):
  # pixels is a byte string laid out as RGBRGBRGB..., one byte per channel
  groupSize = 4
  binSize = 1024
  pixelSize = len(pixels) // 3
  # one work item per 1024 pixels, rounded up to a multiple of the group size
  globalSize = int(math.ceil(pixelSize / float(binSize)))
  globalSize = int(math.ceil(globalSize / float(groupSize))) * groupSize
  # every work item writes its own 768-entry partial histogram
  outputBufSize = 768 * globalSize
  # create context/queue
  clContext = cl.create_some_context()
  clQueue = cl.CommandQueue(clContext)
  with open('histogram.c', 'r') as f:
    fstr = f.read()
  # create the program
  clProgram = cl.Program(clContext, fstr).build()
  # create buffers
  mf = cl.mem_flags
  bufPixels = cl.Buffer(clContext, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=pixels)
  pixelCount = numpy.asarray([pixelSize]).astype(numpy.uint32)
  bufPixelSize = cl.Buffer(clContext, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=pixelCount)
  bufOutput = cl.Buffer(clContext, mf.WRITE_ONLY, size=outputBufSize * 4)
  clProgram.histogram(clQueue, (globalSize, ), (groupSize, ), bufPixels, bufPixelSize, bufOutput)
  semiFinal = numpy.zeros(outputBufSize, dtype=numpy.uint32)
  # enqueue_copy replaces the older enqueue_read_buffer and blocks until the data arrives
  cl.enqueue_copy(clQueue, semiFinal, bufOutput)
  # merge the partial histograms of all work items
  finalResult = [0] * 768
  for i in range(outputBufSize):
    finalResult[i % 768] += semiFinal[i]
  return finalResult
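parser = argparse.ArgumentParser(description='Dump histogram data.')
parser.add_argument('--input', help='the input image')
parser.add_argument('--dump', help='dump the histogram if specify any value')
args = parser.parse_args()
if args.input is None:
  # the body of this check is missing from the listing; printing the usage
  # and exiting via parser.error is an assumption
  parser.error('--input is required')
print('trying to build histogram data for {}'.format(args.input))
# the right-hand side of this assignment is missing from the listing; loading
# the file with PIL as an RGB image is an assumption
from PIL import Image
image = Image.open(args.input).convert('RGB')
print('-' * 20)
start_time = time()
histogramG = opencl_histogram(image.tobytes())
end_time = time()
print('time elapsed with open cl: {0}s'.format(end_time - start_time))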

After the OpenCL computation, we need to merge every work item’s result into a single 768-entry array.

The performance of this code is very poor, since we need so many extra loop iterations in Python to merge every work item’s result into a single 768-entry array.
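One possible way to cut the cost of the merge step, without changing the kernel, is to let NumPy sum the partial results instead of looping in Python. This is a sketch, not the approach used in the other branch; merge_partial_histograms is a made-up name, and it assumes the buffer is a flat array of 768-entry blocks, one per work item.

```python
import numpy

def merge_partial_histograms(semi_final):
    """Sum per-work-item 768-entry histograms into one 768-entry list."""
    # widen to 64 bits so the sums cannot overflow
    partials = numpy.asarray(semi_final, dtype=numpy.uint64).reshape(-1, 768)
    return partials.sum(axis=0).tolist()

# three fake work-item results, each counting hits in a different place
fake = numpy.zeros(3 * 768, dtype=numpy.uint32)
fake[0] = 1              # work item 0, R bin 0
fake[768 + 0] = 2        # work item 1, R bin 0
fake[2 * 768 + 767] = 5  # work item 2, B bin 255
merged = merge_partial_histograms(fake)
print(merged[0], merged[767])
```

The reshape turns the flat buffer into one row per work item, so a single column-wise sum does the same job as the i % 768 loop.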

I have another branch that performs better but has some known issues and limitations. I will write another article to explain how I did it.

BTW, I am reading “OpenCL Programming by Example” now, but have only finished 4 chapters. I will keep updating my code to improve the performance.

This is my first time building a robot with LEGO EV3. It’s a whole new, great experience. The modules are very well designed, and the user interface is great. That explains why LEGO has the largest community in the world, even larger than Mozilla’s. Here is my TRACK3R:

Are you playing Real Basketball? Whether you are or not, I am. It is an awesome game, and if you haven’t played it, you should give it a try. Honestly, though, I am not a good player, so I wanted to hire someone to swipe the phone for me. It sounds like nonsense to hire a real person for that, so I wrote a small shell script to do it for me. I call it RealBasketballBot. You may find it at my github repo: It’s easy to use: just plug in the USB cable and type the command ./ 400 1280 400 600 0.2 . It works pretty well. BTW, you should install the Android USB driver if you use a Windows box.
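The script itself is not shown here, so the following Python sketch is only a guess at how such a bot could work. It assumes the five numbers are the start x, start y, end x, end y of the swipe and the pause in seconds, and that the phone is driven through adb's "input swipe" command; the function names are hypothetical.

```python
import subprocess
import time

def build_swipe_command(x1, y1, x2, y2):
    """Build the adb command line for one swipe gesture."""
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2)]

def swipe_repeatedly(x1, y1, x2, y2, interval, run=subprocess.call, repeat=None):
    """Send a swipe every `interval` seconds; repeat=None means no limit."""
    count = 0
    while repeat is None or count < repeat:
        run(build_swipe_command(x1, y1, x2, y2))
        time.sleep(interval)
        count += 1
    return count

# show the command that would be sent, without needing a phone attached
print(" ".join(build_swipe_command(400, 1280, 400, 600)))
```

Calling swipe_repeatedly(400, 1280, 400, 600, 0.2) would then keep swiping upward until the program is interrupted.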

miidio Services on Banana Pi

John Hu wrote some services that have been running on a Banana Pi since a few months ago. It is incredible how stable the Banana Pi is. The services have run for about 3 months and were down only because of a few network issues. One of the services is an audio conversion service that supports miidio space. Currently, there are 5 services running on the Banana Pi: 3 of them run daily and 2 of them run hourly. This hardware is really stable.

John Hu also tried to run a coin miner on the Banana Pi. But its CPU is very weak compared to a normal PC, and there is no GPU miner for this kind of hardware. So, I turned it off without mining a single coin. That’s a pity. I think they will open the source of their GPU driver, and then it will be possible to build such a miner.

Official website of Banana Pi:

What Is the Visibility Monitor?

An app with lots of images brings memory pressure in the mobile world. Android tries to recycle existing UI elements to show images. The original image is freed automatically when the UI element’s image source is changed. That’s handled by Java’s garbage collection, IIRC.

We have similar tricks in Firefox OS. Gecko, the core engine of Firefox, has an image visibility monitor. It releases the uncompressed image when an image is off screen, but keeps its DOM structure for other updates. That gives you a fast and easy way to manage your gallery view.

Besides Gecko, Gaia, the code name of the app layer of Firefox OS, also has a visibility monitor. It tells the app which UI elements are on or off screen. We can use it to load images into img elements or free them. It helps the Gallery app avoid memory pressure issues.

The difference between the two is that the Gecko version releases the uncompressed image but keeps the raw image, which is the compressed image downloaded from the server, like JPEG or PNG. The Gaia version takes images off the UI element, which releases both the uncompressed and the compressed images. The memory usage of the Gaia version is therefore lower than that of the Gecko version. But the scrolling performance (FPS) of the Gaia version is worse, because with the Gaia version Gecko has to load the images again from the server or file system. This document focuses on the Gaia version.

How Does It Work?

The example of visibility monitor

The image above defines each area of a scrollable frame. The whole frame consists of the pre-rendered area, the display port area, the margin area, and the non-rendered area. Images in the pre-rendered area are loaded, uncompressed, and rendered. The height of the pre-rendered area is the height of the display port + 2 x margin. The display port is the area actually shown on the screen. The margin area provides a buffer that makes scrolling smoother. A larger margin area brings higher FPS and larger memory consumption. The default value of the margin in the Gecko version is the size of the display port area. There is no default value in the Gaia version.
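The areas can be expressed as simple interval checks. The Python sketch below is a toy model with made-up function names and hypothetical pixel values; it is not how Gecko or Gaia implement it, it only restates the definitions above.

```python
def prerendered_range(scroll_top, port_height, margin):
    """Return (top, bottom) of the pre-rendered area in container coordinates."""
    return scroll_top - margin, scroll_top + port_height + margin

def classify(item_top, item_height, scroll_top, port_height, margin):
    """Name the area an item falls in: 'display port', 'margin' or 'non-rendered'."""
    item_bottom = item_top + item_height
    if item_bottom > scroll_top and item_top < scroll_top + port_height:
        return "display port"
    top, bottom = prerendered_range(scroll_top, port_height, margin)
    if item_bottom > top and item_top < bottom:
        return "margin"
    return "non-rendered"

# hypothetical numbers: a 480px display port scrolled to 1000px with a 360px margin
print(prerendered_range(1000, 480, 360))
print(classify(1100, 100, 1000, 480, 360))
print(classify(700, 100, 1000, 480, 360))
print(classify(100, 100, 1000, 480, 360))
```

With these numbers the pre-rendered area spans 1200px, which is the display port height plus twice the margin.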

Visibility Monitor API

The tag visibility monitor has only one function, monitorTagVisibility. It takes 6 arguments:

  1. container: the scrollable element whose children we want to monitor
  2. tag: the tag to monitor
  3. scrollMargin: the size of margin area
  4. scrollDelta: how much the container needs to be scrolled before onscreen and offscreen are recalculated. A higher value means callbacks are fired less frequently, but there are more of them when they are fired.
  5. onscreenCallback: called with the element that is now onscreen
  6. offscreenCallback: called with the element that is now offscreen
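To show how these arguments interact, here is a toy model in Python. It is not the Gaia implementation: the class TagVisibilityModel, its scroll_to method, and the item positions are all hypothetical, and an item counts as onscreen here when it overlaps the display port plus the margin.

```python
class TagVisibilityModel:
    """Toy model of the monitor: tracks which items are inside the margin band."""

    def __init__(self, items, port_height, scroll_margin, scroll_delta,
                 onscreen, offscreen):
        self.items = items              # {name: (top, height)}
        self.port_height = port_height
        self.scroll_margin = scroll_margin
        self.scroll_delta = scroll_delta
        self.onscreen = onscreen
        self.offscreen = offscreen
        self.visible = set()
        self.last_scroll_top = None

    def scroll_to(self, scroll_top):
        # skip the work until the container has moved at least scroll_delta
        if (self.last_scroll_top is not None and
                abs(scroll_top - self.last_scroll_top) < self.scroll_delta):
            return False
        self.last_scroll_top = scroll_top
        top = scroll_top - self.scroll_margin
        bottom = scroll_top + self.port_height + self.scroll_margin
        for name, (item_top, height) in sorted(self.items.items()):
            inside = item_top + height > top and item_top < bottom
            if inside and name not in self.visible:
                self.visible.add(name)
                self.onscreen(name)
            elif not inside and name in self.visible:
                self.visible.remove(name)
                self.offscreen(name)
        return True

events = []
items = {"thumb%02d" % i: (i * 100, 100) for i in range(30)}
model = TagVisibilityModel(items, 480, 360, 50,
                           lambda n: events.append(("on", n)),
                           lambda n: events.append(("off", n)))
model.scroll_to(0)       # first call always recalculates
model.scroll_to(20)      # less than scroll_delta: ignored
model.scroll_to(500)     # recalculated: some items enter, some leave
print(len(model.visible), events[-1])
```

A larger scroll_delta makes scroll_to return early more often, so fewer recalculations happen but each one reports more items at once.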

An example can be found in the Gallery app in the v1.3t branch:

  var visibilityMargin = 360;
  var minimumScrollDelta = 1;
  visibilityMonitor =
    monitorTagVisibility(thumbnails, 'li',
                         visibilityMargin, // extra space top and bottom
                         minimumScrollDelta, // min scroll before we do work
                         thumbnailOnscreen, // set background image
                         thumbnailOffscreen); // remove background image

thumbnails is the DOM container element that hosts all thumbnail items. The tag name of a thumbnail item is “li” in this example. We use 360 as the margin, which is 3/4 of the screen height, and 1 as the scrollDelta. We just set or remove the background image in the onscreen/offscreen callbacks.

The Performance of Visibility Monitor

We chose a low-end device, Tarako, to show its performance.

The memory consumption of the Music app without the visibility monitor is:

├──23.48 MB (41.04%) -- images
│  ├──23.48 MB (41.04%) -- content
│  │  ├──23.48 MB (41.04%) -- used
│  │  │  ├──17.27 MB (30.18%) ── uncompressed-nonheap
│  │  │  ├───6.10 MB (10.66%) ── raw
│  │  │  └───0.12 MB (00.20%) ── uncompressed-heap
│  │  └───0.00 MB (00.00%) ++ unused
│  └───0.00 MB (00.00%) ++ chrome

The memory consumption of the Music app with the visibility monitor is:

├───6.75 MB (16.60%) -- images
│   ├──6.75 MB (16.60%) -- content
│   │  ├──5.77 MB (14.19%) -- used
│   │  │  ├──3.77 MB (09.26%) ── uncompressed-nonheap
│   │  │  ├──1.87 MB (04.59%) ── raw
│   │  │  └──0.14 MB (00.34%) ── uncompressed-heap
│   │  └──0.98 MB (02.41%) ++ unused
│   └──0.00 MB (00.00%) ++ chrome

The memory used by images drops from 23.48 MB to 6.75 MB, which is a large improvement in memory consumption.