Opencl improvements #579


Closed
wants to merge 7 commits into from

Conversation

kohrt
Contributor

@kohrt kohrt commented Feb 18, 2016

What this PR includes:

  • Custom Allocator and Frames for OpenCL using pinned memory
  • Removed arrays for tables, instead data is directly written to the OpenCL buffers
  • OpenCL buffers are now created only once on initialization
  • Added definition to enable profiling for the OpenCL kernels
  • Minor improvements for the first stage

With these improvements, the OpenCL depth processor takes between 1.09 ms and 1.11 ms on my system instead of ~2 ms, mostly due to the pinned memory.
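The buffer-reuse idea from the list above (allocate once at initialization, hand buffers out per frame) can be sketched roughly as follows. This is a minimal illustration with made-up names; the real PR pins the memory through OpenCL (e.g. `CL_MEM_ALLOC_HOST_PTR`), which plain heap vectors stand in for here:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a buffer-recycling allocator, loosely modeled on the idea in
// this PR: buffers are created once up front and handed out repeatedly,
// instead of being allocated per frame. The actual implementation uses
// OpenCL pinned memory; ordinary heap memory is used here as a stand-in.
class PooledAllocator
{
public:
  PooledAllocator(std::size_t buffer_size, std::size_t count)
  {
    for (std::size_t i = 0; i < count; ++i)
      free_.push_back(std::vector<unsigned char>(buffer_size));
  }

  // Hand out a preallocated buffer; returns an empty vector if exhausted.
  std::vector<unsigned char> acquire()
  {
    if (free_.empty())
      return std::vector<unsigned char>();
    std::vector<unsigned char> b = std::move(free_.back());
    free_.pop_back();
    return b;
  }

  // Return a buffer to the pool for reuse by the next frame.
  void release(std::vector<unsigned char> b)
  {
    free_.push_back(std::move(b));
  }

  std::size_t available() const { return free_.size(); }

private:
  std::vector<std::vector<unsigned char>> free_;
};
```

The point of the pattern is that the per-frame hot path only moves buffers between pools and never touches the allocator of the underlying (pinned) memory.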

@kohrt
Contributor Author

kohrt commented Feb 18, 2016

Having two tables instead of one is actually slower than computing the values directly. We talked about it here: #222. Or did you mean something else?
Edit: I see, you meant the first commit, but I reverted it later because this improvement wasn't an improvement at all.
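For reference, the identity under discussion (precomputing sin/cos of the fixed phase offsets so the per-pixel sum never needs its own `sin` call) can be sketched like this. Names and table layout are illustrative only, and per the comment above this approach ended up being reverted:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the sin(a+b) = sin(a)*cos(b) + cos(a)*sin(b) rewrite:
// sin/cos of the fixed offsets b are precomputed once into a LUT, so at
// runtime only sin(a)/cos(a) of the varying angle must be combined.
struct TrigLUT
{
  std::vector<float> sin_b, cos_b;

  explicit TrigLUT(const std::vector<float> &offsets)
  {
    for (std::size_t i = 0; i < offsets.size(); ++i)
    {
      sin_b.push_back(std::sin(offsets[i]));
      cos_b.push_back(std::cos(offsets[i]));
    }
  }

  // Computes sin(a + offsets[i]) without evaluating sin of the sum.
  float sinSum(float sin_a, float cos_a, std::size_t i) const
  {
    return sin_a * cos_b[i] + cos_a * sin_b[i];
  }
};
```

Whether this beats calling `sin` directly depends on the hardware's transcendental throughput versus its memory bandwidth, which is presumably why it was a win on some setups and not here.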

@kohrt
Contributor Author

kohrt commented Feb 18, 2016

I will update the PR with the suggested changes tomorrow.

@xlz
Member

xlz commented Feb 18, 2016

There are two other things. One is that PacketProcessor::good() is now the API to propagate errors within packet processors. The specific semantics are not specified yet, but generally when good() returns false, the processor will not be touched anymore.
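A minimal sketch of what that error-propagation contract could look like on the processor side. Only `good()` is from the actual API described above; every other name here is hypothetical:

```cpp
// Sketch of the good() error-propagation idea: a processor records an
// unrecoverable failure internally, and callers check good() before
// touching the processor again. Illustrative only.
class PacketProcessor
{
public:
  virtual ~PacketProcessor() {}

  // Reports whether the processor is still usable.
  bool good() const { return good_; }

protected:
  // Called by a subclass when an unrecoverable error occurs.
  void markFailed() { good_ = false; }

private:
  bool good_ = true;
};

class ExampleProcessor : public PacketProcessor
{
public:
  void process(bool simulate_error)
  {
    if (!good())
      return;            // caller contract: do not touch a bad processor
    if (simulate_error)
      markFailed();      // propagate the failure via good()
  }
};
```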

The other one is: please use `opencl: Verb some short description` as the commit subject, then put details in the body.

@xlz
Member

xlz commented Feb 18, 2016

When you're done with the PR, please also provide some benchmark results using steps in https://github.com/OpenKinect/libfreenect2/wiki/Performance

Results so far:

Probably all of it has to be redone after your OpenCL optimization.

Thiemo Wiedemeyer added 6 commits February 19, 2016 12:51
…ses on the GPU, they are now precomputed once on the CPU.

Details: Replaced sin(a+b) by sin(a)*cos(b)+cos(a)*sin(b), where sin(a),cos(b),cos(a),sin(b) are stored in a LUT.
Simplified processPixelStage1 code and removed processMeasurementTriple.
Moved one if from decodePixelMeasurement to processPixelStage1.
Removed the first part of `valid && any(...)` because valid has been checked before.
…ion.

loadXZTables, loadLookupTable and loadP0TablesFromCommandResponse will now directly write to the OpenCL buffers.
Reverted back to calculating sine and cosine on the GPU.
Usage of LIBFREENECT2_WITH_PROFILING.
Changed CHECK_CL macros.
OpenCLAllocator can now be used for input and output buffers.
OpenCLFrame now uses OpenCLBuffer from allocator.
IMAGE_SIZE and LUT_SIZE as static const.
Added Allocators for input and output buffers.
Moved allocate_opencl to top.
Added good method.
@kohrt
Contributor Author

kohrt commented Feb 19, 2016

Here are the benchmark results.

Feb 19, 2016: Intel i7-4770K (@4.1GHz), GTX 980Ti; Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.5

| Configuration | Depth (min, 5%, median, 95%, max, mean, std) [ms] | RGB (min, 5%, median, 95%, max, mean, std) [ms] | Thread per core usage |
| --- | --- | --- | --- |
| CPU/TurboJPEG | 194.328 196.617 200.837 212.289 225.055 mean=201.911 std=4.65475 | 12.0173 12.2656 13.1827 19.4198 22.4747 mean=13.6461 std=1.78962 | CPU:90% TurboJPEG:40% USB:5% Reg:3% |
| Nvidia-OpenGL/TurboJPEG | 3.22797 3.36801 8.02571 9.06995 108.705 mean=7.09752 std=2.96411 | 11.9735 12.2769 13.5689 19.4342 28.3156 mean=14.3881 std=2.23828 | OpenGL:26% TurboJPEG:44% USB:6% Reg:20% |
| Nvidia-OpenCL/VAAPI | 1.07144 1.08136 1.0924 1.145 2.46014 mean=1.1035 std=0.0599953 | 4.14765 4.1658 4.66865 7.72171 11.1335 mean=4.98485 std=1.18519 | OpenCL:3% VAAPI:2% USB:6% Reg:18% |
| CUDA/VAAPI | 0.857415 0.861542 0.868286 0.924719 3.31855 mean=0.882014 std=0.0699696 | 4.12401 4.14701 4.6825 10.9971 11.2794 mean=5.18491 std=1.60745 | CUDA:5% VAAPI:2% USB:6% Reg:22% |

Enabling profiling in OpenCL affects performance, so when profiling libfreenect2's processors it should be disabled, and it should only be used when testing improvements to the OpenCL code itself.
@kohrt
Contributor Author

kohrt commented Feb 19, 2016

I added all the recommended changes except making the p0_table array a class member; I changed it to dynamically allocated memory instead, because p0_table is used just once and then never again.
The rebase of the fork onto the upstream master seems to have removed the comments on the commits, sorry for that.

@xlz
Member

xlz commented Feb 19, 2016

Which GPU (Intel or Nvidia) did the OpenCL case use? (Hopefully both Intel and Nvidia GPUs can be tested with OpenCL)

@kohrt
Contributor Author

kohrt commented Feb 19, 2016

Nvidia; I edited the previous post.

@xlz
Member

xlz commented Feb 19, 2016

I reviewed some of the changes. There is a lot of back and forth so it's no longer practical to reorganize this into standalone functional patches. I'll merge the code as-is but will format the subjects and commit messages.

Do you want to add something in CMakeLists.txt to turn on LIBFREENECT2_WITH_PROFILING_CL more easily, or revert it to a file-level macro like #ifdef PROFILE_CL which can only be enabled by a developer editing CMakeLists.txt? I suppose you didn't turn it on during the benchmarking?
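If the switch were exposed through CMake, it might look roughly like this. This is a hedged sketch, not part of the merged build system; the option name is made up, and only the macro name comes from the discussion above:

```cmake
# Hypothetical CMake switch for the OpenCL kernel profiling macro.
# OFF by default so normal benchmarking is unaffected.
option(ENABLE_PROFILING_CL "Enable OpenCL kernel profiling" OFF)
if(ENABLE_PROFILING_CL)
  add_definitions(-DLIBFREENECT2_WITH_PROFILING_CL)
endif()
```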

@xlz
Member

xlz commented Feb 19, 2016

I have added your results to the wiki (Do you have permissions to edit it?).

Also after reviewing your patches I realize the CUDA processor still lacks some of the improvement you added here, good catches.

@kohrt
Contributor Author

kohrt commented Feb 20, 2016

The back and forth wasn't intended.
We could add setting the definition to the CMakeLists, but I don't think it is important, because nearly nobody will use it. I just wanted the code to be there in case of further improvements. I am OK with putting a #define LIBFREENECT2_WITH_PROFILING_CL (or however we name it) in the source code to enable profiling. And yes, I disabled the CL profiling when profiling the processors.
I think I can edit the wiki; there is an edit button, but I haven't tried to change anything yet. I can also benchmark some other combinations next week if needed. It would also be great if the profiling measured the CPU utilization of the threads itself, rather than trying to find a good average from the `top` output.

@xlz
Member

xlz commented Feb 20, 2016

You can add the benchmark results directly to the wiki next time.

I'll try to add the per core usage stats programmatically.

@xlz
Member

xlz commented Feb 21, 2016

Well, per core usage is not that important. Adding the stats is quite intrusive so I give up.

@xlz
Member

xlz commented Feb 21, 2016

Manually merged.

If you don't feel like it, just skip collecting the per-core usage. So far it has only been useful in identifying that memory access on Tegra is extremely slow.

@xlz xlz closed this Feb 21, 2016
@xlz
Member

xlz commented Feb 22, 2016

@wiedemeyer Can you run the CUDA test again? Hopefully VAAPI will have smaller variance.

@kohrt kohrt deleted the opencl_improvements branch February 23, 2016 08:38
@kohrt
Contributor Author

kohrt commented Feb 23, 2016

Yes, I will run it this afternoon.

@kohrt
Contributor Author

kohrt commented Feb 23, 2016

@xlz
I profiled all combinations and put the results into the wiki. I also created the gnuplot image, but I don't know where and how to replace the current one.

But there is something wrong with the CUDA processor. When I run VAAPI with CUDA, Protonect uses ~70% CPU, which is a lot compared to OpenCL with only ~33% CPU or OpenGL with ~41% CPU. The CPU usage refers to the usage of the whole process as shown by htop. You can also see that in the results from the profiling in the wiki.

@xlz
Member

xlz commented Feb 23, 2016

I'll render the image.

According to your usage data, only the registration in the main thread has unusual CPU usage. It's weird; there really is nothing special in the registration routines. I'll take a look at what's going on on my side. You can check too.

@kohrt
Contributor Author

kohrt commented Feb 24, 2016

I ran kinect2_bridge and subscribed to /kinect2/sd/camera_info. This way kinect2_bridge does nearly nothing except start capturing data through libfreenect2 and publish the camera_info. There it made no difference between using OpenCL or CUDA; the CPU usage of the whole process was just 12%. Why would Protonect without a viewer take so many more resources? Could it be the profiling? I don't have much time to look into it now, but maybe later.

@xlz
Member

xlz commented Feb 24, 2016

The slowness in Registration is caused by a wrong write-combined flag used in CUDA. Fixed in 18d1cff.

This fix doesn't seem to have significant effect on performance in CUDA or VAAPI.

@kohrt
Contributor Author

kohrt commented Feb 25, 2016

But it fixed the high CPU load with CUDA. I updated the CPU usage in my wiki performance entries.
