System App Peer

John Hu has become a peer of System App. I am responsible for the TV system app since I worked on the first version of TV Gaia. It’s my honor to be part of System App.

It’s so nice to see that Android supports keyboard shortcuts when a keyboard is attached. But it can be a little annoying when we want those keys to do something else. John Hu created another open source project called shun-feng-er, which is a tool for visually impaired people. For this project, we need to override the behavior of those shortcuts.

As per the discussions I found, it is not possible to disable or override them at the app level. The solution I found is to use the Xposed Framework to inject code at WindowManagerPolicy. This requires a rooted device, but it doesn’t make sense to ask a user with a hardware keyboard to have a rooted device.

After some investigation, I found that there is a keyboard layout config called “Apple Wireless Keyboard”. It changes the META key to the HOME key. The most magical part is that all shortcuts related to the META key are disabled, like META + C, META + TAB, etc. That’s amazing, so I tried to find out how it does it.

The main part is remapping the keys: the Apple Wireless Keyboard layout only remaps the META key to the HOME key. My configuration follows the same trick but remaps the META keys to F12 and the ALT keys to F11:

# remap the META keys (scan codes 125/126 are left/right META) to F12
map key 125 F12
map key 126 F12

# remap the ALT keys (scan codes 56/100 are left/right ALT) to F11
map key 56  F11
map key 100 F11

We need to change the keyboard layout once the keyboard is connected. It can be done in the Settings app -> Language and input -> Physical keyboard -> (tap the keyboard and choose the new mapping config).

Please find the whole patch at https://github.com/john-hu/shun-feng-er/commit/a2baf65e0407f294534560278b72474cb9f9dc13.

While writing programs in pyopencl, we face the different indentation conventions of Python and CL. By the usual coding conventions, Python uses 4 spaces for indentation and C uses 2 spaces. So, we may want to configure our IDE to use different indentation for different programming languages.

As a Sublime Text user, I found that the Syntax Specific settings are the correct place to configure it. If we want C to use 2 spaces as its default indentation, we can do the following:

  1. Open or create a C file
  2. Open the Syntax Specific settings at Preferences -> Settings - More -> Syntax Specific
  3. Use the following content in the opened file, which should be C++.sublime-settings:
{
  "tab_size": 2,
  "translate_tabs_to_spaces": true
}
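
Similarly, if we want to keep Python at its conventional 4 spaces, we can open a Python file and repeat the same steps; this time the opened file should be Python.sublime-settings:

{
  "tab_size": 4,
  "translate_tabs_to_spaces": true
}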

It’s so cool to have that.

One of my workmates has experience with OpenCL. We had a discussion about improving the histogram, and I learned that we should calculate the cost of OpenCL before applying it to a problem. The cost of OpenCL is the time we spend preparing everything for running OpenCL. If that cost is larger than the run time of the sequential code, we will never get any improvement from OpenCL. But in most cases, we move enough calculation to OpenCL that the cost pays off. So, I wrote a program to measure the cost.

The program is pretty simple. We only measure the following:

  1. creating context and queue
  2. reading file from disk
  3. building the kernel program
  4. creating buffer
  5. running the kernel program

The Python code looks like:

  # module-level imports assumed: `import pyopencl as cl`,
  # `from array import array`, and `from time import time`.
  def run(self):
    # measure elapsed time for creating context and queue
    tQueue = time()
    clContext = cl.create_some_context()
    clQueue = cl.CommandQueue(clContext)
    self.result.append({'type': 'create queue', 'time': (time() - tQueue)})
    # measure elapsed time for reading kernel program from file system
    tReadFile = time()
    with open('baseline.c', 'r') as f:
      fstr = f.read()
    self.result.append({'type': 'read file', 'time': (time() - tReadFile)})
    # measure elapsed time for building the kernel program.
    tProgram = time()
    clProgram = cl.Program(clContext, fstr).build()
    self.result.append({'type': 'build program', 'time': (time() - tProgram)})
    # measure elapsed time for creating buffer.
    mf = cl.mem_flags
    pyBuffer = array('i', [0] * self.options.buffer_size)
    tBuffer = time()
    clBuffer = cl.Buffer(clContext, mf.READ_WRITE | mf.USE_HOST_PTR, hostbuf=pyBuffer)
    cl.enqueue_write_buffer(clQueue, clBuffer, pyBuffer).wait()
    self.result.append({'type': 'create buffer', 'time': (time() - tBuffer)})
    # measure elapsed time for running the program
    tRun = time()
    clProgram.baseline(clQueue, (self.options.buffer_size, ), None, clBuffer)
    self.result.append({'type': 'run program', 'time': (time() - tRun)})
    # we don't need to measure the time for reading data back because we use
    # `USE_HOST_PTR` which creates the buffer in main memory instead of GPU's.
    return self.result

The kernel program only does an atom_inc function call:

__kernel void baseline(__global unsigned int* data)
{
  atom_inc(&data[get_global_id(0)]);
}
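
To print the measurements in the format shown below, a small driver like the following sketch would do; the `runner` variable here is a hypothetical instance of the class that owns run(), not the actual code in the repository:

# `runner` is a hypothetical instance of the class that owns run().
results = runner.run()
total = 0
for entry in results:
  print('%s: %fs' % (entry['type'], entry['time']))
  total += entry['time']
print('total: %fs' % total)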

The costs of OpenCL on my MacBook Pro are:

Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz
========================================
buffer size (in int): 1048576
create queue: 0.006128s
read file: 0.000031s
build program: 0.006540s
create buffer: 0.000191s
run program: 0.000128s
----------------------------------------
total: 0.013018s
========================================

HD Graphics 4000 GPU
========================================
buffer size (in int): 1048576
create queue: 0.010630s
read file: 0.000050s
build program: 0.006650s
create buffer: 0.001557s
run program: 0.000194s
----------------------------------------
total: 0.019081s
========================================

That doesn’t look bad.

The URL of this program is: https://github.com/john-hu/pyopencl-example/tree/master/baseline

As we know, it is faster to use local memory or private memory in OpenCL. To compare them, I made a test. Before doing that, let’s look at the elapsed time of each part of the histogram program:

  • OpenCL usage: about 0.9s (looks fair)
  • merging semiResult into finalResult: 18s (too bad)

It is very clear that the bottleneck is merging semiResult into finalResult.
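
The Python merging code is roughly shaped like the following sketch; the variable names follow the post, but the loop itself is my reconstruction, assuming each work item produces 768 bins in semiResult:

# Hypothetical reconstruction of the slow Python-side merge: fold each work
# item's 768-bin partial histogram (semiResult) into the 768-bin finalResult.
RESULT_SIZE = 768
finalResult = [0] * RESULT_SIZE
for task in range(totalTasks):
  base = task * RESULT_SIZE
  for i in range(RESULT_SIZE):
    finalResult[i] += semiResult[base + i]

A pure-Python double loop over every work item’s 768 bins is exactly the kind of code that explains the 18s.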

If we use the Python Imaging Library’s histogram function, it only takes 1.5s on the same picture, images/7728x4354.jpg.
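
For reference, that PIL baseline can be measured with something like this; the timing harness is my sketch, while the image path comes from the post:

from time import time
from PIL import Image

img = Image.open('images/7728x4354.jpg')
start = time()
hist = img.histogram()  # 768 ints for an RGB image: 256 bins each for R, G, B
print('PIL histogram: %fs' % (time() - start))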

Private Memory

For the private memory test, it is still worth using private memory in the kernel code without changing anything else. The commit is a test of private memory. Since we ultimately need to save the result to the global memory buffer result, we accumulate into private memory first and then copy it to global memory. If we look closely at the number of global memory accesses, it stays roughly the same as when using global memory directly. So, the kernel code looks like:

// 1024 pixels per work item.
#define BIN_SIZE 1024
#define RESULT_SIZE 768
 
__kernel void histogram(__global unsigned char* bytes, __global unsigned int* pixelCount,
                        __global unsigned int* result)
{
  unsigned int lid = get_local_id(0);
  unsigned int gid = get_group_id(0);
  unsigned int gsize = get_local_size(0);
  unsigned int globalId = get_global_id(0);
 
  unsigned int i, bValue;
  unsigned int basePixelIdx = lid * BIN_SIZE + gid * gsize * BIN_SIZE;
  unsigned int baseResultIdx = globalId * RESULT_SIZE;
  unsigned int maxPixel = *pixelCount;
  unsigned int privateBuffer[RESULT_SIZE];
 
  // clear the private histogram before accumulating; the global result is
  // overwritten by the copy loop at the end, so it does not need clearing.
  for (i = 0; i < RESULT_SIZE; i++) {
    privateBuffer[i] = 0;
  }
 
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    // data partition of bytes is RGBRGBRGB....
    bValue = bytes[basePixelIdx * 3 + processIndex * 3];
    // result partition is RR..RRGG..GGBB..BB.
    privateBuffer[bValue]++;
    // G
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 1];
    privateBuffer[256 + bValue]++;
    // B
    bValue = bytes[basePixelIdx * 3 + processIndex * 3 + 2];
    privateBuffer[512 + bValue]++;
    processIndex++;
  }
 
  for (i = 0; i < RESULT_SIZE; i++) {
    result[baseResultIdx + i] = privateBuffer[i];
  }
}
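
A hedged sketch of how this kernel might be launched from pyopencl follows; the buffer variables and num_pixels are assumptions based on the kernel signature, not the exact code from the commit:

# `program`, `queue`, and the buffers are assumed to be set up as in the
# baseline example; `num_pixels` is a hypothetical pixel count of the image.
BIN_SIZE = 1024  # pixels handled per work item
global_size = (num_pixels + BIN_SIZE - 1) // BIN_SIZE  # round up
program.histogram(queue, (global_size, ), None,
                  bytes_buf, pixel_count_buf, result_buf)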

I created a 768-element unsigned int array as temporary storage and copied all of it to the global memory result. The improvement is not noticeable: the elapsed time of the OpenCL part looks the same, about 0.9s.

First Attempt to Move the Merging Code to Open CL

Since we already knew the bottleneck, I moved the merging code from the sequential CPU side to OpenCL. I still kept a big temporary result, 768 unsigned integers per work item, but the difference is that the final result is now calculated in OpenCL. I used the last 768 work items to merge the temporary results into the final result. If we have N work items, the merge performs 768 * N read operations and 768 write operations on global memory; in the previous version, each work item only needed its 768 write operations because the merging happened in Python. We still need to loop over all of the pixels, and that part is the same. The following is the partial kernel code of this version:

  // Only use the last 768 work items to merge the data. We assume that the
  // last 768 work items belong to the groups that execute last.
  unsigned int lastGroup = totalTasks - RESULT_SIZE;
 
  for (i = 0; i < RESULT_SIZE; i++) {
    privateBuffer[i] = 0;
  }
  if (globalId >= lastGroup) {
    // each of the last RESULT_SIZE work items owns one bin of finalResult
    finalResult[globalId - lastGroup] = 0;
  }
 
  unsigned int processIndex = 0;
  while (processIndex < BIN_SIZE && (basePixelIdx + processIndex < maxPixel)) {
    ...(the same)
  }
 
  for (i = 0; i < RESULT_SIZE; i++) {
    tempResult[baseResultIdx + i] = privateBuffer[i];
  }
 
  // NOTE: barrier() only synchronizes work items within one work group, so
  // this relies on the scheduling assumption described above.
  barrier(CLK_GLOBAL_MEM_FENCE);
 
  if (globalId >= lastGroup) {
    for (i = 0; i < totalTasks; i++) {
      finalResult[globalId - lastGroup] += tempResult[globalId - lastGroup + i * RESULT_SIZE];
    }
  }

The result shows a significant improvement in speed. Since we no longer have the merging code in Python, the OpenCL part takes about 2.5s. It runs faster than before but is still slower than the Python Imaging Library.

Conclusion

There are two things that can be improved:

  • try to use local memory to save the temporary result (see the sketch below)
  • use work groups to test whether any improvement can be gained
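
As a starting point for the first item, pyopencl can pass a per-work-group scratch buffer through cl.LocalMemory. This is only a sketch of the argument plumbing, assuming a hypothetical kernel that takes an extra __local parameter:

import pyopencl as cl

# A 768-bin histogram of unsigned ints needs 768 * 4 bytes of local memory
# per work group. The kernel signature with the extra __local argument is
# hypothetical.
local_hist = cl.LocalMemory(768 * 4)
# program.histogram(queue, (global_size, ), (local_size, ), bytes_buf,
#                   pixel_count_buf, local_hist, result_buf)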