
Add CUDA depth processor #222

Closed

wants to merge 14 commits into from

Conversation

@xlz (Member) commented May 4, 2015

Performance:

  • GeForce GT 750M - 250Hz
  • Tegra K1 - 90Hz

On Tegra K1, numeric optimization improved throughput from 57Hz to 71Hz, and pinned memory raised it further to 90Hz. Pinned memory required some modifications to Frame and DoubleBuffer.

This PR needs coordination with #210 because of dependencies.

@xlz mentioned this pull request May 5, 2015
@floe added this to the 0.2 milestone May 22, 2015
@larshg (Contributor) commented May 26, 2015

Tried this on my Windows 10 machine (i7 950, GeForce GTX 480) and I get about 2 ms processing time, ~500 Hz.

It runs a lot faster than the OpenGL/OpenCL versions.

@gaborpapp (Contributor) commented:

I tried building this on OS X, but got linker errors for symbols referenced from the CUDA object file:

Linking CXX shared library ../lib/libfreenect2.dylib
Undefined symbols for architecture x86_64:
  "std::basic_ios<char, std::char_traits<char> >::widen(char) const", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream::put(char)", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream::flush()", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long)", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
...

@xlz (Member, Author) commented Jun 8, 2015

This use of std::cout is about as innocuous as it gets. Perhaps there is something wrong with your CUDA toolchain.

I don't have a Mac with an Nvidia card to test on. Apparently you were able to build this despite the merge conflicts, so maybe you can come up with the solution.

@gaborpapp (Contributor) commented:

Yes, I had to merge the CMake file manually.

It seems like a standard C++ library mixup, libstdc++ vs libc++. I have CUDA 6.5, which seems to require libstdc++, while the other libraries are compiled with the default libc++. I tried to build Protonect with libstdc++ instead: then the linker finds the missing CUDA symbols above, but the OpenCV ones go missing for the same reason. Recompiling everything seems like a big hassle.

I am going to try CUDA 7.0 instead, which might work with libc++.

@gaborpapp (Contributor) commented:

It works with CUDA 7.0. I had to disable OpenCL, otherwise I got a duplicate symbol linker error. This might also affect other platforms.

Linking CXX shared library ../lib/libfreenect2.dylib
duplicate symbol __ZN12libfreenect223loadBufferFromResourcesERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEEPhm in:
    CMakeFiles/freenect2.dir/src/opencl_depth_packet_processor.cpp.o
    CMakeFiles/freenect2.dir/src/cuda_depth_packet_processor.cpp.o
ld: 1 duplicate symbol for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

@xlz (Member, Author) commented Jun 8, 2015

OK, this is a legitimate bug: loadBufferFromResources() is duplicated in both processors. I'll fix it.

@xlz (Member, Author) commented Jun 9, 2015

Work in progress until PRs #276, #278, and one more PR are finished.

@kohrt (Contributor) commented Jun 11, 2015

@xlz I read that the CUDA implementation is even faster than the OpenCL implementation, so I took a quick look at the code, and it looks pretty similar to the OpenCL code. Did you make some modifications so that it runs faster, like using float4 instead of float3, or anything else? Maybe we could also adopt those changes in the OpenCL processor. Or is it just that Nvidia optimizes more for CUDA?

@xlz (Member, Author) commented Jun 11, 2015

I don't think they are directly comparable. AMD doesn't support CUDA, and Nvidia's OpenCL is obviously worse than CUDA because of API abstraction, unoptimized code, etc.

Edit: For comparison purposes, you can still see how well OpenCL works with Nvidia by commenting out CL_DEVICE_TYPE_CUSTOM and adding $ENV{CUDA_ROOT} accordingly in FindOpenCL.cmake. It is a trivial change, but I don't think anyone will use OpenCL with Nvidia given that we have CUDA for Nvidia devices.

I have some math optimization here: https://github.com/xlz/libfreenect2/commit/8e5a7c8b3353dd5a3f1446dbe5cacc88504370c3. Most of the improvement is attributable to -use_fast_math, but the OpenCL counterpart of this didn't seem very effective in my previous testing. There are also some minor math tweaks in that commit that improve things a little.
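
For what it's worth, -use_fast_math works by substituting reduced-precision hardware intrinsics for the standard math calls; the same trade can be made selectively with explicit intrinsics (a sketch, not code from the commit):

```cpp
// __sinf/__cosf are the fast, less accurate intrinsics that
// -use_fast_math substitutes for sinf/cosf. Calling them directly
// confines the precision loss to chosen hot spots in a kernel.
__device__ float2 fast_sincos(float x)
{
  return make_float2(__sinf(x), __cosf(x));
}
```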

The other optimization is memory access. A CUDA processor with unoptimized memory access works like this: copy the depth packet from the user-provided pointer to a page-locked area, send it to the GPU, and finally send the result back to the CPU.

  • The first copy can be removed by making the memory of the depth packet page-locked in the first place with CUDA's "pinned memory": https://github.com/xlz/libfreenect2/commit/1024bc87b0040ad12323af188985d97db0b28523 (see the sketch after this list). OpenCL has an equivalent for this. It requires modifications to Frame and DoubleBuffer to allow a custom GPU-aware memory allocator.
  • The copy from CPU to GPU can be avoided with mapped memory, but only on integrated platforms where the CPU and GPU share the same memory; PCIe bus transmission has to happen otherwise. However, mapping memory on Tegra seems to reduce the performance of its JPEG decoder, so I didn't post that commit here. OpenCL on Intel might benefit from this.
  • Finally, the result doesn't need to be sent back to the CPU at all if it is only used for display or further computation on the same GPU.
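
A minimal standalone sketch of the pinned-memory idea from the first bullet (plain CUDA runtime calls, not the PR's actual Frame/DoubleBuffer changes):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  const std::size_t packet_size = 848 * 1024;  // one depth packet
  unsigned char *packet_host = 0;              // page-locked host buffer
  unsigned char *packet_dev = 0;               // device buffer

  // Allocating the packet buffer as pinned memory up front lets the
  // DMA engine read it directly: no host-side staging copy needed.
  cudaHostAlloc(reinterpret_cast<void **>(&packet_host), packet_size,
                cudaHostAllocDefault);
  cudaMalloc(reinterpret_cast<void **>(&packet_dev), packet_size);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // ... the stream parser would write the raw depth packet into
  // packet_host here ...

  // A single host-to-device transfer; async, so it can overlap other work.
  cudaMemcpyAsync(packet_dev, packet_host, packet_size,
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(packet_dev);
  cudaFreeHost(packet_host);
  return 0;
}
```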

float3 is internally float4, so I don't think that matters.

@xlz (Member, Author) commented Jun 26, 2015

Current status of this PR: the CUDA runtime setup, CUDA kernels, and memory management are complete. It is broken by recent changes in the CMake build system (change of context, addition of static/shared build configurations); this can be fixed more cleanly once the build system is stable. I haven't had a Kinect 2 to test with recently, so anyone is welcome to build upon the current code.

There is still room for improvement in how to best interface with the other parts while keeping memory copies minimal. In this PR, https://github.com/xlz/libfreenect2/commit/de911375b4d63dc7c909cc6d622ee2c1c5db2087 ("Allow processors to define custom Frame classes") passes CUDA "pinned memory" to the frame listener to avoid a memory copy internal to CUDA. It might be better to implement zero-copy between the depth processor and the frame listener (listener.waitForNewFrame()) by passing a pointer to a fixed buffer defined by the processor. That would avoid allocating, page-locking, and freeing 848KB 30 times per second. The problems with this:

  • It requires a double buffer structure between the depth processor and the frame listener, plus changes to Protonect.cpp and iai_kinect2.
  • Coordination with the OpenCL processor: it currently does not use zero-copy, so this interface change might also impact it.
  • Coordination with the VAAPI color processor.
  • Coordination with the Tegra color processor.
  • In my previous testing on Jetson, zero-copy did not seem vastly better than one-copy, and zero-copy seemed to interfere with Tegra's libjpeg. This might be different on other platforms.
  • Optionally, the GPU-to-CPU copy inside the depth processor could also be avoided if its output is only for display or further GPU-only computation (Intel: OpenCL + VA-API; Nvidia: CUDA + VDPAU/VA-API).

To help with understanding the data flow, it can be illustrated as:

libusb -> stream parsers -> packet processors -> frame listeners

@dbworth commented Jun 27, 2015

Hi all,

I rebased xlz's "cuda-depth-processor" branch of libfreenect2 here:
https://github.com/dbworth/libfreenect2/commits/cuda-depth-processor

Then I made 4 changes to libfreenect2/examples/protonect/CMakeLists.txt to bring it into line with the same file in the upstream master branch.

One more change makes things more user-friendly: CUDA is disabled by default, and CMake will abort if you try to compile with OpenCL and CUDA at the same time.

(In my opinion, OpenCL should also be disabled by default, because it is optional and I know some people have problems getting OpenCL working.)

The performance results [1] are quite interesting:

Protonect with CPU only:
  CpuDepth approx. 160 ms (6 Hz)
  TurboJpeg approx. 13 ms (73 Hz)
  8% GPU utilization, 7% memory used (201 MB)
  2% PCIe bandwidth utilization

Protonect with OpenGL:
  OpenGLDepth approx. 8 ms (127 Hz)
  TurboJpeg approx. 12 ms (82 Hz)
  18% GPU utilization, 8% memory used (248 MB)
  10% PCIe bandwidth utilization

Protonect with OpenCL:
  OpenCLDepth approx. 2.3 ms (434 Hz)
  TurboJpeg approx. 12.4 ms (81 Hz)
  5% GPU utilization, 10% memory used (293 MB)
  2% PCIe bandwidth utilization

Protonect with CUDA:
  CudaDepth approx. 1.1 ms (870 Hz)
  TurboJpeg approx. 12.6 ms (79 Hz)
  3% GPU utilization, 10% memory used (294 MB)
  1% PCIe bandwidth utilization

[1] These figures are approximate, measured on an Ubuntu desktop. In the Protonect executable, all imshow() calls were disabled and registration->apply() was enabled.

System specs:
Ubuntu 14.04.2 LTS
Linux 3.16.0-41-generic
Intel quad-core i7-4790k @ 4.0GHz
16GB RAM
NVidia GeForce GTX 780, 2304 cores, 3GB memory, PCIe Gen3 x16 8GT/s
Intel xHCI USB controller 8CB1, 9 series chipset

There is some fluctuation in depth packet processing times, around 0.5 ms. Depth packet processing is often faster by 2 ms when Protonect is first started, so I wait at least 3 minutes before reading these figures. Measured times can also vary by as much as 1.5 ms between test runs.

@xlz (Member, Author) commented Jun 27, 2015

> (In my opinion OpenCL should also be disabled by default because it is optional and I know some people have problems getting OpenCL working.)

It should still be enabled by default, because at this development stage we want to make it easier to test. If there is a bug, we should fix it to make it work.

> Protonect with CPU only
> 8% GPU utilization, 7% memory used (201 MB)

Why does this have GPU usage at all?

> Protonect with OpenGL
> Protonect with OpenCL

Overhead is expected.

@christiankerl (Contributor) commented:

From a design point of view, I would think it's better to add a method getAllocator() to PacketProcessor, which returns an allocator object similar to std::allocator. This allocator would be passed to the ctors of DoubleBuffer and Frame. The getAllocator() method can have a default implementation in PacketProcessor. An "advanced" allocator implementation could use a memory pooling strategy to avoid the 30 (de)allocations per second for the frame memory. A rough sketch of the idea is below.

The Frame class could be changed to allow access to the buffer only through methods. The GPU->CPU transfer could then be delayed until one of the CPU memory access methods is called.
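
Something like this, perhaps (made-up names, not actual libfreenect2 classes):

```cpp
#include <cstddef>
#include <cstdlib>

// std::allocator-like interface a processor can provide.
class Allocator
{
public:
  virtual ~Allocator() {}
  virtual unsigned char *allocate(std::size_t size)
  {
    return static_cast<unsigned char *>(std::malloc(size));
  }
  virtual void deallocate(unsigned char *buffer)
  {
    std::free(buffer);
  }
};

class PacketProcessor
{
public:
  virtual ~PacketProcessor() {}
  // Default: plain heap allocator. A CUDA processor would return one
  // backed by cudaHostAlloc/cudaFreeHost; a pooling implementation
  // could recycle buffers instead of freeing them.
  virtual Allocator *getAllocator()
  {
    static Allocator default_allocator;
    return &default_allocator;
  }
};
```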

What do you think?

@xlz (Member, Author) commented Jun 27, 2015

I think memory pool allocators are just implicit double buffers with extra complexity.

Suppose a double buffer is to be used between the packet processor and the frame listener. Actually, the modification to DoubleBuffer in this PR is outdated, as I intended for the VA-API PR #210 to be merged first.

https://github.com/xlz/libfreenect2/commit/89947ab500af7a7de1cca1ea24cd79a58565a28a is the newer modification to DoubleBuffer. It allows inheritance by providing a method virtual libfreenect2::DoubleBuffer *getPacketBuffer() on PacketProcessor, because VA-API needs more flexibility than a custom memory allocator with DoubleBuffer can offer. I think this commit basically does what you said (adds inheritance keywords, defines defaults, moves code around a bit, removes redundant member definitions), and is much less messy than in this PR.

The usage pattern is that each provider of DoubleBuffer defines a subclass that implements its own allocator and custom methods. Example: https://github.com/xlz/libfreenect2/commit/eaaa7bcfaa804d6744c6894d2fb045c833a7f893#diff-24ed06cdc0ce177f67a0567e42d2b09fR85. VA-API allocates memory handles that are specific to its own API instead of addressable pointers, and users of DoubleBuffer need to access those memory handles for mapping and unmapping. The allocator cannot be stateless if it stores the memory handles. And because mapping and unmapping deal with the whole allocated memory, DoubleBuffer needs to hold two allocations instead of one allocation split into two halves.

@kohrt (Contributor) commented Jun 27, 2015

I am against disabling OpenCL by default, because it is supported by most hardware. The problem people have is setting up OpenCL correctly on their system, and that is not an issue with the OpenCL code in libfreenect2. If needed, one can always disable it manually with the CMake flags.

@xlz I already looked into pinned memory for the OpenCL processor, but haven't found the time to implement it.

If a pull request or a branch with the changed packet buffer interface is created, I could update the OpenCL processor accordingly and prepare a changed version of iai_kinect2.

@christiankerl (Contributor) commented:

@xlz OK, your update is fine with me.

However, as an idea for future improvements: maybe RgbPacket / DepthPacket should directly provide methods to access a buffer into which the stream parsers copy the data. CUDA/VA-API processors could subclass RgbPacket/DepthPacket to implement their custom allocation and store e.g. buffer IDs. Each processor provides a method allocatePacket(), which creates a packet instance that can be handled by its process() method. DoubleBuffer would then reduce to a class managing two pointers to the respective packet instances (front and back).

@xlz (Member, Author) commented Jun 27, 2015

@christiankerl
I think your point is to avoid duplicating the allocation code for the two buffers in each subclass of DoubleBuffer. Though it does not look very elegant to me to put Buffer into Packet.

It seems your first comment about allocators was actually suggesting a factory pattern: add virtual const BufferFactory *getBufferFactory() to PacketProcessor, and let DoubleBuffer call Buffer *BufferFactory::create().

I'll write up a draft of this design.

@christiankerl (Contributor) commented:

@xlz My point is more like: with your changes, the processor needs to know how the parser works, e.g. that it uses a DoubleBuffer, or that process() is called on the back buffer.

Previously the parser managed the memory and just told the processor: hey, there is a packet with data at a certain memory location, please process it.

I agree that for implementation/efficiency reasons the processor has to decide how to allocate this memory.

My proposal suggesting the allocators was a bit short-sighted. I think it would be better if the parser asks the processor to create an appropriate packet instance and then just populates its contents with the incoming data. Afterwards the packet is passed to the process() method. This way the parser can request as many packet instances as necessary, e.g. two to implement double buffering.
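
In rough code, the shape might be (hypothetical names, just to make it concrete):

```cpp
#include <cstddef>

// Parser-facing view of a packet; the concrete processor decides how
// the underlying memory is allocated (heap, pinned memory, VA-API...).
class DepthPacket
{
public:
  virtual ~DepthPacket() {}
  virtual unsigned char *data() = 0;        // where the parser writes
  virtual std::size_t capacity() const = 0;
};

class DepthPacketProcessor
{
public:
  virtual ~DepthPacketProcessor() {}
  // The parser asks the processor for packet instances instead of
  // allocating memory itself; requesting two of them implements
  // double buffering.
  virtual DepthPacket *allocatePacket() = 0;
  virtual void process(DepthPacket *packet) = 0;
};
```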

@xlz (Member, Author) commented Jul 1, 2015

@christiankerl
How about this: https://github.com/xlz/libfreenect2/commit/b2490af7e462e819bbd3815aa23dc6c1ae990467? A bit different from your suggestion: Packet itself will not be polymorphic; instead, the packet buffer embedded in Packet will be. CUDA/VA-API will inherit from libfreenect2::Buffer, because that's where the core difference lies.

It compiles, but I haven't tested further.
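
In outline, the shape described above is roughly (a hypothetical simplification, not the commit itself):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Packet stays concrete; only the embedded buffer is polymorphic.
struct Buffer
{
  unsigned char *data;
  std::size_t capacity;
  Buffer() : data(0), capacity(0) {}
  virtual ~Buffer() {}
};

// CUDA variant owning pinned memory.
struct CudaBuffer : Buffer
{
  explicit CudaBuffer(std::size_t size)
  {
    cudaHostAlloc(reinterpret_cast<void **>(&data), size,
                  cudaHostAllocDefault);
    capacity = size;
  }
  virtual ~CudaBuffer() { cudaFreeHost(data); }
};

struct DepthPacket
{
  unsigned int sequence;
  Buffer *memory;  // plain Buffer, CudaBuffer, a VA-API buffer, ...
};
```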

@christiankerl (Contributor) commented:

@xlz Looks good. I added some minor comments.

@RyanGordon (Contributor) commented:

Now that 0.1.0 is out, any chance we could get this and #210 rebased on master @xlz? And hopefully merged in soon!

@xlz (Member, Author) commented Jan 21, 2016

100% chance, but takes time.

@xlz (Member, Author) commented Feb 14, 2016

A new PR for CUDA is expected next week after the VAAPI one is merged.

@xlz closed this Feb 14, 2016
@xlz (Member, Author) commented Feb 16, 2016

I directly merged the cleaned-up patches into master after testing on Ubuntu and Tegra.

@xlz deleted the cuda-depth-processor branch February 16, 2016 00:49
@xlz (Member, Author) commented Feb 16, 2016

@wiedemeyer

Looking at https://github.com/OpenKinect/libfreenect2/blob/master/src/opencl_depth_packet_processor.cl#L48, could the cos and sin values be precomputed on the CPU? Maybe this would save some time on the GPU.

@kohrt (Contributor) commented Feb 16, 2016

@xlz
True, one could compute those values for the complete p0 table. I created a float3 buffer with 512 * 424 * 6 entries (cos and sin for each of the 3 phases), but it seems that computing them on the GPU is actually faster than the lookup: the time increased from around 1.7 ms to 1.9 ms. Did you try it on the CUDA implementation?
Other improvements are to use sincos to compute both at once, or native_sin and native_cos. I think the latter is the fastest, but both were around 1.7 to 1.75 ms, so no significant difference. Maybe I can check later how it differs on slower GPUs.
By the way, the CUDA processor is running smoothly at 1 ms, but the really great part is the VAAPI JPEG decoding: 5 ms instead of 15 ms. That is an awesome improvement, great job!

@xlz (Member, Author) commented Feb 16, 2016

I believe you only need one pair of cos and sin values for the 3 phases. PHASE_IN_RAD0 is 0, PHASE_IN_RAD1 is 2pi/3, PHASE_IN_RAD2 is 4pi/3. You can expand cos(p0 + PHASE_IN_RADx), and the result is something like r=(v.z-v.y)*SQRT3_DIV_2; s=(v.x-v.y+v.x-v.z)*0.5f; make_float2(sinp0*r+s*cosp0, cosp0*r+s*sinp0) (the order and signs may be wrong).
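
Written out (a sketch; I derived the signs from the standard angle-addition identities, so they may differ from the kernel's conventions):

```cpp
// With phases 0, 2pi/3, 4pi/3:
//   sum_i v_i * cos(p0 + phi_i) = cos(p0)*s - sin(p0)*r
//   sum_i v_i * sin(p0 + phi_i) = sin(p0)*s + cos(p0)*r
// where s = sum_i v_i*cos(phi_i) and r = sum_i v_i*sin(phi_i).
__device__ float2 phase_sum(float3 v, float sinp0, float cosp0)
{
  const float SQRT3_DIV_2 = 0.86602540f;     // sqrt(3)/2
  float s = (v.x - v.y + v.x - v.z) * 0.5f;  // cos(phi_i): 1, -1/2, -1/2
  float r = (v.y - v.z) * SQRT3_DIV_2;       // sin(phi_i): 0, +/-sqrt(3)/2
  return make_float2(cosp0 * s - sinp0 * r,  // cosine component
                     sinp0 * s + cosp0 * r); // sine component
}
```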

@kohrt (Contributor) commented Feb 17, 2016

I have now created two LUTs for p0, one for the sines and one for the cosines, and replaced the constants PHASE_IN_RADx by PHASE_IN_RADx_SIN and PHASE_IN_RADx_COS. Then I used sin(p0 + PHASE_IN_RADx) = sin(p0) * cos(PHASE_IN_RADx) + cos(p0) * sin(PHASE_IN_RADx) and cos(p0 + PHASE_IN_RADx) = cos(p0) * cos(PHASE_IN_RADx) - sin(p0) * sin(PHASE_IN_RADx), replacing sin(p0), cos(p0) with the values from the LUTs and sin(PHASE_IN_RADx), cos(PHASE_IN_RADx) with the constants PHASE_IN_RADx_SIN, PHASE_IN_RADx_COS. This should also work if the parameters for the phases are changed (for whatever reason).

@xlz (Member, Author) commented Feb 17, 2016

Come on, PHASE_IN_RAD0 is 0, PHASE_IN_RAD1 is 2pi/3, PHASE_IN_RAD2 is 4pi/3. These are constants that can never change: there are 3 phases that cover 2pi. If they changed, it would mean they no longer cover 2pi equally, or that there are 2 or 4 phases.

@xlz (Member, Author) commented Feb 17, 2016

Indeed, I tried it with the formula I described and it was slower than sincos(). So this is not a good optimization.

@kohrt (Contributor) commented Feb 18, 2016

I found the time to integrate pinned memory (and other things) into the OpenCL processor. It is now nearly as fast as the CUDA one, ~0.2 ms difference on my system.

@kohrt mentioned this pull request Feb 18, 2016