
Add CUDA depth processor #222

Closed

wants to merge 14 commits into from

Conversation

@xlz (Member) commented May 4, 2015

Performance:

  • GeForce GT 750M - 250Hz
  • Tegra K1 - 90Hz

On Tegra K1, numeric optimization improved throughput from 57Hz to 71Hz, and pinned memory raised it further to 90Hz. Pinned memory required some modifications to Frame and DoubleBuffer.

This PR needs coordination with #210 because of dependencies.

@xlz mentioned this pull request May 5, 2015
@floe added this to the 0.2 milestone May 22, 2015
@larshg (Contributor) commented May 26, 2015

Tried this on my Windows 10 machine (i7 950, GeForce GTX 480) and I get about 2 ms processing time, ~500 Hz.

It runs a lot faster than the OpenGL/OpenCL versions.

@gaborpapp (Contributor) commented:

I tried building this on OS X, but got linker errors for symbols referenced from the CUDA object file:

Linking CXX shared library ../lib/libfreenect2.dylib
Undefined symbols for architecture x86_64:
  "std::basic_ios<char, std::char_traits<char> >::widen(char) const", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream::put(char)", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream::flush()", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
  "std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long)", referenced from:
      libfreenect2::CudaDepthPacketProcessorKernelImpl::initDevice(int, unsigned long, unsigned long) in cuda_compile_generated_cuda_depth_packet_processor.cu.o
...

@xlz (Member, Author) commented Jun 8, 2015

This use of std::cout is about as innocuous as it gets. Perhaps there is something wrong with your CUDA toolchain.

I don't have a Mac with an Nvidia card to test on. Apparently you were able to build this despite the merge conflicts, so maybe you can come up with the solution.

@gaborpapp (Contributor) commented:

Yes, I had to merge the CMake file manually.

It seems like a standard C++ library mixup, libstdc++ vs libc++. I have CUDA 6.5, which seems to require libstdc++, while the other libraries are compiled with the default libc++. I tried to build Protonect with libstdc++ instead: then the linker finds the missing CUDA symbols above, but the OpenCV ones go missing for the same reason. Recompiling everything seems like a big hassle.

I am going to try CUDA 7.0 instead, which might work with libc++.

@gaborpapp (Contributor) commented:

It works with CUDA 7.0. I had to disable OpenCL, otherwise I got a duplicate symbol linker error. This might also affect other platforms.

Linking CXX shared library ../lib/libfreenect2.dylib
duplicate symbol __ZN12libfreenect223loadBufferFromResourcesERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEEPhm in:
    CMakeFiles/freenect2.dir/src/opencl_depth_packet_processor.cpp.o
    CMakeFiles/freenect2.dir/src/cuda_depth_packet_processor.cpp.o
ld: 1 duplicate symbol for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

@xlz (Member, Author) commented Jun 8, 2015

OK, this is a legitimate bug: loadBufferFromResources() is duplicated in both processors. I'll fix it.

@xlz (Member, Author) commented Jun 9, 2015

Work in progress until PRs #276, #278, and one more PR are finished.

@kohrt (Contributor) commented Jun 11, 2015

@xlz I read that the CUDA implementation is even faster than the OpenCL implementation, so I took a quick look at the code, and it looks pretty similar to the OpenCL code. Did you make some modifications so that it runs faster, like using float4 instead of float3, or anything else? Maybe we could also adopt those changes in the OpenCL processor. Or is it just that Nvidia optimizes more for CUDA?

@xlz (Member, Author) commented Jun 11, 2015

I don't think they are directly comparable. AMD doesn't support CUDA, and Nvidia's OpenCL is obviously worse than CUDA because of API abstraction, unoptimized code, etc.

Edit: For comparison purposes, you can still see how well OpenCL works with Nvidia by commenting out CL_DEVICE_TYPE_CUSTOM and adding $ENV{CUDA_ROOT} accordingly in FindOpenCL.cmake. It is a trivial change, but I don't think anyone will use OpenCL with Nvidia given that we have CUDA for Nvidia devices.

I have some math optimization here: https://github.com/xlz/libfreenect2/commit/8e5a7c8b3353dd5a3f1446dbe5cacc88504370c3. Most of the improvement is attributable to -use_fast_math, but the OpenCL counterpart of this didn't seem very effective in my previous testing. There are also some minor math tweaks in that commit that improve things a little.
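
For what it's worth, -use_fast_math works by substituting reduced-precision hardware intrinsics for the standard math calls; the same trade can be made selectively with explicit intrinsics (a sketch, not code from the commit):

```cpp
// __sinf/__cosf are the fast, less accurate intrinsics that
// -use_fast_math substitutes for sinf/cosf. Calling them directly
// confines the precision loss to chosen hot spots in a kernel.
__device__ float2 fast_sincos(float x)
{
  return make_float2(__sinf(x), __cosf(x));
}
```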

The other optimization is memory access. A CUDA processor with unoptimized memory access works like this: copy the depth packet from the user-provided pointer to a page-locked area, send it to the GPU, and finally send the result back to the CPU.

  • The first copy can be removed by making the memory of the depth packet page-locked in the first place with CUDA's "pinned memory": https://github.com/xlz/libfreenect2/commit/1024bc87b0040ad12323af188985d97db0b28523 (see the sketch after this list). OpenCL has an equivalent for this. It requires modifications to Frame and DoubleBuffer to allow a custom GPU-aware memory allocator.
  • The copy from CPU to GPU can be avoided with mapped memory, but only on integrated platforms where the CPU and GPU share the same memory; PCIe bus transmission has to happen otherwise. However, mapping memory on Tegra seems to reduce the performance of its JPEG decoder, so I didn't post that commit here. OpenCL on Intel might benefit from this.
  • Finally, the result doesn't need to be sent back to the CPU at all if it is only used for display or further computation on the same GPU.
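
A minimal standalone sketch of the pinned-memory idea from the first bullet (plain CUDA runtime calls, not the PR's actual Frame/DoubleBuffer changes):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  const std::size_t packet_size = 848 * 1024;  // one depth packet
  unsigned char *packet_host = 0;              // page-locked host buffer
  unsigned char *packet_dev = 0;               // device buffer

  // Allocating the packet buffer as pinned memory up front lets the
  // DMA engine read it directly: no host-side staging copy needed.
  cudaHostAlloc(reinterpret_cast<void **>(&packet_host), packet_size,
                cudaHostAllocDefault);
  cudaMalloc(reinterpret_cast<void **>(&packet_dev), packet_size);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // ... the stream parser would write the raw depth packet into
  // packet_host here ...

  // A single host-to-device transfer; async, so it can overlap other work.
  cudaMemcpyAsync(packet_dev, packet_host, packet_size,
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(packet_dev);
  cudaFreeHost(packet_host);
  return 0;
}
```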

float3 is internally float4, so I don't think that matters.

@xlz (Member, Author) commented Jun 26, 2015

Current status of this PR: the CUDA runtime setup, CUDA kernels, and memory management are complete. It is broken by recent changes in the CMake build system (change of context, addition of static/shared build configurations); this can be fixed more cleanly once the build system is stable. I haven't had a Kinect 2 to test with recently, so anyone is welcome to build upon the current code.

There is still room for improvement in how to best interface with the other parts while keeping memory copies minimal. In this PR, https://github.com/xlz/libfreenect2/commit/de911375b4d63dc7c909cc6d622ee2c1c5db2087 ("Allow processors to define custom Frame classes") passes CUDA "pinned memory" to the frame listener to avoid a memory copy internal to CUDA. It might be better to implement zero-copy between the depth processor and the frame listener (listener.waitForNewFrame()) by passing a pointer to a fixed buffer defined by the processor. That would avoid allocating, page-locking, and freeing 848KB 30 times per second. The problems with this:

  • It requires a double buffer structure between the depth processor and the frame listener, plus changes to Protonect.cpp and iai_kinect2.
  • Coordination with the OpenCL processor: it currently does not use zero-copy, so this interface change might also impact it.
  • Coordination with the VAAPI color processor.
  • Coordination with the Tegra color processor.
  • In my previous testing on Jetson, zero-copy did not seem vastly better than one-copy, and zero-copy seemed to interfere with Tegra's libjpeg. This might be different on other platforms.
  • Optionally, the GPU-to-CPU copy inside the depth processor could also be avoided if its output is only for display or further GPU-only computation (Intel: OpenCL + VA-API; Nvidia: CUDA + VDPAU/VA-API).

To help with understanding the data flow, it can be illustrated as:

libusb -> stream parsers -> packet processors -> frame listeners

@dbworth commented Jun 27, 2015

Hi all,

I rebased xlz's "cuda-depth-processor" branch of libfreenect2 here:
https://github.com/dbworth/libfreenect2/commits/cuda-depth-processor

Then I made 4 changes to libfreenect2/examples/protonect/CMakeLists.txt to bring it into line with the same file in the upstream master branch.

One more change makes things more user-friendly: CUDA is disabled by default, and CMake will abort if you try to compile with OpenCL and CUDA at the same time.

(In my opinion, OpenCL should also be disabled by default, because it is optional and I know some people have problems getting OpenCL working.)

The performance results [1] are quite interesting:

Protonect with CPU only:
  CpuDepth approx. 160 ms (6 Hz)
  TurboJpeg approx. 13 ms (73 Hz)
  8% GPU utilization, 7% memory used (201 MB)
  2% PCIe bandwidth utilization

Protonect with OpenGL:
  OpenGLDepth approx. 8 ms (127 Hz)
  TurboJpeg approx. 12 ms (82 Hz)
  18% GPU utilization, 8% memory used (248 MB)
  10% PCIe bandwidth utilization

Protonect with OpenCL:
  OpenCLDepth approx. 2.3 ms (434 Hz)
  TurboJpeg approx. 12.4 ms (81 Hz)
  5% GPU utilization, 10% memory used (293 MB)
  2% PCIe bandwidth utilization

Protonect with CUDA:
  CudaDepth approx. 1.1 ms (870 Hz)
  TurboJpeg approx. 12.6 ms (79 Hz)
  3% GPU utilization, 10% memory used (294 MB)
  1% PCIe bandwidth utilization

[1] These figures are approximate, measured on an Ubuntu desktop. In the Protonect executable, all imshow() calls were disabled and registration->apply() was enabled.

System specs:
Ubuntu 14.04.2 LTS
Linux 3.16.0-41-generic
Intel quad-core i7-4790k @ 4.0GHz
16GB RAM
NVidia GeForce GTX 780, 2304 cores, 3GB memory, PCIe Gen3 x16 8GT/s
Intel xHCI USB controller 8CB1, 9 series chipset

There is some fluctuation in depth packet processing times, around 0.5 ms. Depth packet processing is often faster by 2 ms when Protonect is first started, so I wait at least 3 minutes before reading these figures. Measured times can also vary by as much as 1.5 ms between test runs.

@xlz (Member, Author) commented Jun 27, 2015

> (In my opinion OpenCL should also be disabled by default because it is optional and I know some people have problems getting OpenCL working.)

It should still be enabled by default, because at this development stage we want to make it easier to test. If there is a bug, we should fix it to make it work.

> Protonect with CPU only
> 8% GPU utilization, 7% memory used (201 MB)

Why does this have GPU usage at all?

> Protonect with OpenGL
> Protonect with OpenCL

Overhead is expected.

@christiankerl (Contributor) commented:

From a design point of view, I would think it's better to add a method getAllocator() to PacketProcessor, which returns an allocator object similar to std::allocator. This allocator would be passed to the ctors of DoubleBuffer and Frame. The getAllocator() method can have a default implementation in PacketProcessor. An "advanced" allocator implementation could use a memory pooling strategy to avoid the 30 (de)allocations per second for the frame memory. A rough sketch of the idea is below.

The Frame class could be changed to allow access to the buffer only through methods. The GPU->CPU transfer could then be delayed until one of the CPU memory access methods is called.
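
Something like this, perhaps (made-up names, not actual libfreenect2 classes):

```cpp
#include <cstddef>
#include <cstdlib>

// std::allocator-like interface a processor can provide.
class Allocator
{
public:
  virtual ~Allocator() {}
  virtual unsigned char *allocate(std::size_t size)
  {
    return static_cast<unsigned char *>(std::malloc(size));
  }
  virtual void deallocate(unsigned char *buffer)
  {
    std::free(buffer);
  }
};

class PacketProcessor
{
public:
  virtual ~PacketProcessor() {}
  // Default: plain heap allocator. A CUDA processor would return one
  // backed by cudaHostAlloc/cudaFreeHost; a pooling implementation
  // could recycle buffers instead of freeing them.
  virtual Allocator *getAllocator()
  {
    static Allocator default_allocator;
    return &default_allocator;
  }
};
```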

What do you think?

@xlz (Member, Author) commented Jun 27, 2015

I think memory pool allocators are just implicit double buffers with extra complexity.

Suppose a double buffer is to be used between the packet processor and the frame listener. Actually, the modification to DoubleBuffer in this PR is outdated, as I intended for the VA-API PR #210 to be merged first.

https://github.com/xlz/libfreenect2/commit/89947ab500af7a7de1cca1ea24cd79a58565a28a is the newer modification to DoubleBuffer. It allows inheritance by providing a method virtual libfreenect2::DoubleBuffer *getPacketBuffer() on PacketProcessor, because VA-API needs more flexibility than a custom memory allocator with DoubleBuffer can offer. I think this commit basically does what you said (adds inheritance keywords, defines defaults, moves code around a bit, removes redundant member definitions), and is much less messy than in this PR.

The usage pattern is that each provider of DoubleBuffer defines a subclass that implements its own allocator and custom methods. Example: https://github.com/xlz/libfreenect2/commit/eaaa7bcfaa804d6744c6894d2fb045c833a7f893#diff-24ed06cdc0ce177f67a0567e42d2b09fR85. VA-API allocates memory handles that are specific to its own API instead of addressable pointers, and users of DoubleBuffer need to access those memory handles for mapping and unmapping. The allocator cannot be stateless if it stores the memory handles. And because mapping and unmapping deal with the whole allocated memory, DoubleBuffer needs to hold two allocations instead of one allocation split into two halves.

@kohrt (Contributor) commented Jun 27, 2015

I am against disabling OpenCL by default, because it is supported by most hardware. The problem people have is setting up OpenCL correctly on their system, and that is not an issue with the OpenCL code in libfreenect2. If needed, one can always disable it manually with the CMake flags.

@xlz I already looked into pinned memory for the OpenCL processor, but haven't found the time to implement it.

If a pull request or a branch with the changed packet buffer interface is created, I could update the OpenCL processor accordingly and prepare a changed version of iai_kinect2.

@christiankerl (Contributor) commented:

@xlz OK, your update is fine with me.

However, as an idea for future improvements: maybe RgbPacket / DepthPacket should directly provide methods to access a buffer into which the stream parsers copy the data. CUDA/VA-API processors could subclass RgbPacket/DepthPacket to implement their custom allocation and store e.g. buffer IDs. Each processor provides a method allocatePacket(), which creates a packet instance that can be handled by its process() method. DoubleBuffer would then reduce to a class managing two pointers to the respective packet instances (front and back).

@xlz (Member, Author) commented Jun 27, 2015

@christiankerl
I think your point is to avoid duplicating the allocation code for the two buffers in each subclass of DoubleBuffer. Though it does not look very elegant to me to put Buffer into Packet.

It seems your first comment about allocators was actually suggesting a factory pattern: add virtual const BufferFactory *getBufferFactory() to PacketProcessor, and let DoubleBuffer call Buffer *BufferFactory::create().

I'll write up a draft of this design.

@christiankerl (Contributor) commented:

@xlz My point is more like: with your changes, the processor needs to know how the parser works, e.g. that it uses a DoubleBuffer, or that process() is called on the back buffer.

Previously the parser managed the memory and just told the processor: hey, there is a packet with data at a certain memory location, please process it.

I agree that for implementation/efficiency reasons the processor has to decide how to allocate this memory.

My proposal suggesting the allocators was a bit short-sighted. I think it would be better if the parser asks the processor to create an appropriate packet instance and then just populates its contents with the incoming data. Afterwards the packet is passed to the process() method. This way the parser can request as many packet instances as necessary, e.g. two to implement double buffering.
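
In rough code, the shape might be (hypothetical names, just to make it concrete):

```cpp
#include <cstddef>

// Parser-facing view of a packet; the concrete processor decides how
// the underlying memory is allocated (heap, pinned memory, VA-API...).
class DepthPacket
{
public:
  virtual ~DepthPacket() {}
  virtual unsigned char *data() = 0;        // where the parser writes
  virtual std::size_t capacity() const = 0;
};

class DepthPacketProcessor
{
public:
  virtual ~DepthPacketProcessor() {}
  // The parser asks the processor for packet instances instead of
  // allocating memory itself; requesting two of them implements
  // double buffering.
  virtual DepthPacket *allocatePacket() = 0;
  virtual void process(DepthPacket *packet) = 0;
};
```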

@xlz (Member, Author) commented Jul 1, 2015

@christiankerl
How about this: https://github.com/xlz/libfreenect2/commit/b2490af7e462e819bbd3815aa23dc6c1ae990467? A bit different from your suggestion: Packet itself will not be polymorphic; instead, the packet buffer embedded in Packet will be. CUDA/VA-API will inherit from libfreenect2::Buffer, because that's where the core difference lies.

It compiles, but I haven't tested further.
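
In outline, the shape described above is roughly (a hypothetical simplification, not the commit itself):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Packet stays concrete; only the embedded buffer is polymorphic.
struct Buffer
{
  unsigned char *data;
  std::size_t capacity;
  Buffer() : data(0), capacity(0) {}
  virtual ~Buffer() {}
};

// CUDA variant owning pinned memory.
struct CudaBuffer : Buffer
{
  explicit CudaBuffer(std::size_t size)
  {
    cudaHostAlloc(reinterpret_cast<void **>(&data), size,
                  cudaHostAllocDefault);
    capacity = size;
  }
  virtual ~CudaBuffer() { cudaFreeHost(data); }
};

struct DepthPacket
{
  unsigned int sequence;
  Buffer *memory;  // plain Buffer, CudaBuffer, a VA-API buffer, ...
};
```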

@christiankerl (Contributor) commented:

@xlz Looks good. I added some minor comments.

@RyanGordon (Contributor) commented:

Now that 0.1.0 is out, any chance we could get this and #210 rebased on master @xlz? And hopefully merged in soon!

@xlz (Member, Author) commented Jan 21, 2016

100% chance, but takes time.

@xlz (Member, Author) commented Feb 14, 2016

A new PR for CUDA is expected next week after the VAAPI one is merged.

@xlz closed this Feb 14, 2016
@xlz (Member, Author) commented Feb 16, 2016

I directly merged the cleaned-up patches into master after testing on Ubuntu and Tegra.

@xlz deleted the cuda-depth-processor branch February 16, 2016 00:49
@xlz (Member, Author) commented Feb 16, 2016

@wiedemeyer

Looking at https://github.com/OpenKinect/libfreenect2/blob/master/src/opencl_depth_packet_processor.cl#L48, could the cos and sin values be precomputed on the CPU? Maybe this would save some time on the GPU.

@kohrt (Contributor) commented Feb 16, 2016

@xlz
True, one could compute those values for the complete p0 table. I created a float3 buffer with 512 * 424 * 6 entries (cos and sin for each of the 3 phases), but it seems that computing them on the GPU is actually faster than the lookup: the time increased from around 1.7 ms to 1.9 ms. Did you try it on the CUDA implementation?
Other improvements are to use sincos to compute both at once, or native_sin and native_cos. I think the latter is the fastest, but both were around 1.7 to 1.75 ms, so no significant difference. Maybe I can check later how it differs on slower GPUs.
By the way, the CUDA processor is running smoothly at 1 ms, but the really great part is the VAAPI JPEG decoding: 5 ms instead of 15 ms. That is an awesome improvement, great job!

@xlz (Member, Author) commented Feb 16, 2016

I believe you only need one pair of cos and sin values for the 3 phases. PHASE_IN_RAD0 is 0, PHASE_IN_RAD1 is 2pi/3, PHASE_IN_RAD2 is 4pi/3. You can expand cos(p0 + PHASE_IN_RADx), and the result is something like r=(v.z-v.y)*SQRT3_DIV_2; s=(v.x-v.y+v.x-v.z)*0.5f; make_float2(sinp0*r+s*cosp0, cosp0*r+s*sinp0) (the order and signs may be wrong).
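
Written out (a sketch; I derived the signs from the standard angle-addition identities, so they may differ from the kernel's conventions):

```cpp
// With phases 0, 2pi/3, 4pi/3:
//   sum_i v_i * cos(p0 + phi_i) = cos(p0)*s - sin(p0)*r
//   sum_i v_i * sin(p0 + phi_i) = sin(p0)*s + cos(p0)*r
// where s = sum_i v_i*cos(phi_i) and r = sum_i v_i*sin(phi_i).
__device__ float2 phase_sum(float3 v, float sinp0, float cosp0)
{
  const float SQRT3_DIV_2 = 0.86602540f;     // sqrt(3)/2
  float s = (v.x - v.y + v.x - v.z) * 0.5f;  // cos(phi_i): 1, -1/2, -1/2
  float r = (v.y - v.z) * SQRT3_DIV_2;       // sin(phi_i): 0, +/-sqrt(3)/2
  return make_float2(cosp0 * s - sinp0 * r,  // cosine component
                     sinp0 * s + cosp0 * r); // sine component
}
```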

@kohrt (Contributor) commented Feb 17, 2016

I have now created two LUTs for p0, one for the sines and one for the cosines, and replaced the constants PHASE_IN_RADx by PHASE_IN_RADx_SIN and PHASE_IN_RADx_COS. Then I used sin(p0 + PHASE_IN_RADx) = sin(p0) * cos(PHASE_IN_RADx) + cos(p0) * sin(PHASE_IN_RADx) and cos(p0 + PHASE_IN_RADx) = cos(p0) * cos(PHASE_IN_RADx) - sin(p0) * sin(PHASE_IN_RADx), replacing sin(p0), cos(p0) with the values from the LUTs and sin(PHASE_IN_RADx), cos(PHASE_IN_RADx) with the constants PHASE_IN_RADx_SIN, PHASE_IN_RADx_COS. This should also work if the parameters for the phases are changed (for whatever reason).

@xlz (Member, Author) commented Feb 17, 2016

Come on, PHASE_IN_RAD0 is 0, PHASE_IN_RAD1 is 2pi/3, PHASE_IN_RAD2 is 4pi/3. These are constants that can never change: there are 3 phases that cover 2pi. If they changed, it would mean they no longer cover 2pi equally, or that there are 2 or 4 phases.

@xlz (Member, Author) commented Feb 17, 2016

Indeed, I tried it with the formula I described and it was slower than sincos(). So this is not a good optimization.

@kohrt (Contributor) commented Feb 18, 2016

I found the time to integrate pinned memory (and other things) into the OpenCL processor. It is now nearly as fast as the CUDA one, ~0.2 ms difference on my system.

@kohrt mentioned this pull request Feb 18, 2016