Add CUDA depth processor #222
Conversation
Tried this on my Windows 10 i7-950 with a GeForce GTX 480 and I get about 2 ms processing time, ~500 Hz. Runs a lot faster than the OpenGL/OpenCL versions.
I tried building this on OS X, but got linker errors for the CUDA object functions:
std::cout here is the most innocuous usage. Perhaps there is something wrong in your CUDA toolchain. I don't have a Mac with an Nvidia card to test. Apparently you are able to build this despite merge conflicts. Maybe you can come up with the solution.
Yes, I had to merge the CMake file manually. It seems like a C++ standard library mixup, libstdc++ vs libc++. I have CUDA 6.5, which seems to require libstdc++, while the other libraries are compiled against the default libc++. I tried to build Protonect with libstdc++ instead. Then the linker finds the missing CUDA functions above, but misses the OpenCV ones for the same reason. Recompiling everything seems like a big hassle. I am going to try CUDA 7.0 instead, which might work with libc++.
It works with CUDA 7.0. I had to disable OpenCL, otherwise I got a duplicate symbol linker error. It might also affect other platforms.
OK, this is a legitimate bug. loadBufferFromResources() is duplicated in both processors. I'll fix it.
The CUDA depth processor will also use this function.
Simplify math and turn on nvcc -use_fast_math. Performance improved from 57 Hz to 71 Hz on Tegra K1.
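For context, -use_fast_math makes nvcc substitute the fast hardware intrinsics for the standard single-precision math calls. The snippet below is only an illustration of that substitution, not code from this PR:

```cuda
// Illustration only: -use_fast_math effectively rewrites sinf()/cosf()
// into the __sinf()/__cosf() intrinsics, which run on the special
// function units and trade accuracy for throughput. The same trade
// can also be made explicitly per call site:
__device__ float phase_sample(float a, float b)
{
  // precise version:  return sinf(a) * cosf(b);
  return __sinf(a) * __cosf(b); // fast, lower-precision intrinsics
}
```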
This improves memory access by avoiding an extra copy.
Work in progress until PR #276, #278 and one more PR finish.
@xlz I read that the CUDA implementation is even faster than the OpenCL implementation, so I just took a quick look at the code, and it looks pretty similar to the OpenCL code. Did you make some modifications so that it runs faster, like using float4 instead of float3 or anything else? Maybe we could adopt those changes in the OpenCL processor as well. Or is it just that Nvidia optimizes more for CUDA?
I don't think they are directly comparable. AMD doesn't support CUDA, and Nvidia's OpenCL is obviously worse than CUDA because of API abstraction, unoptimized code, etc. (Edit: for comparison purposes, you can still see how well OpenCL works with Nvidia by just commenting out ...)

I have some math optimization here: https://github.com/xlz/libfreenect2/commit/8e5a7c8b3353dd5a3f1446dbe5cacc88504370c3. Most of the improvement is attributed to -use_fast_math.

The other optimization is memory access. A CUDA processor with unoptimized memory access works like this: copy the depth packet from the user-provided pointer to a page-locked area, then send it to the GPU.
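A minimal sketch of the pinned-memory variant (buffer names here are hypothetical): instead of letting the runtime stage the packet through its own page-locked area, the parser writes incoming data directly into a cudaHostAlloc'd buffer, so the upload needs no extra copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical names; sketch only. The stream parser fills 'pinned'
// directly, so the host-to-device copy needs no staging buffer.
unsigned char *pinned, *d_packet;

void setup(size_t size)
{
  cudaHostAlloc((void**)&pinned, size, cudaHostAllocDefault); // page-locked
  cudaMalloc((void**)&d_packet, size);
}

void upload(size_t size, cudaStream_t stream)
{
  // With pageable memory the runtime would first copy into its own
  // page-locked area; from pinned memory the DMA transfer is direct.
  cudaMemcpyAsync(d_packet, pinned, size, cudaMemcpyHostToDevice, stream);
}
```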
float3 is internally float4, so I don't think that matters.
Current status of this PR: the CUDA runtime setup, CUDA kernels, and memory management are complete. It is broken by recent changes in the CMake build system (change of context, addition of static/shared build configurations). This can be fixed better after the build system is stable. But I haven't had a Kinect 2 to test with recently, and anyone is welcome to build upon the current code. There is still room for improvement in how to best interface with the other parts involved, with minimal memory copying. In this PR: https://github.com/xlz/libfreenect2/commit/de911375b4d63dc7c909cc6d622ee2c1c5db2087
To help with understanding the data flow, it can be illustrated as:
usb transfer → DepthPacketStreamParser → pinned DoubleBuffer → cudaMemcpyAsync → CUDA kernels → Frame → FrameListener
Hi all, I rebased xlz's "cuda-depth-processor" branch of libfreenect2 here:

Then I made 4 changes to ...

And one more change is to make things more user-friendly, so CUDA is disabled by default, and CMake will abort if you try to compile with OpenCL and CUDA at the same time. (In my opinion OpenCL should also be disabled by default, because it is optional and I know some people have problems getting OpenCL working.)

The performance results[1] are quite interesting:

- Protonect with CPU only
- Protonect with OpenGL
- Protonect with OpenCL
- Protonect with CUDA

[1] These figures are approximate. Running Ubuntu desktop. System specs: ... There is more fluctuation in depth packet processing, say 0.5 ms.
It should still be enabled by default, because at this development stage we want to make it easier to test. If there is a bug, we should fix it to make it work.
Why does this have GPU usage at all?
Overhead is expected.
From a design point of view I would think it's better to add a method ... The Frame class could be changed to allow access to the buffer only through methods. The GPU->CPU transfer could then be delayed until one of the CPU memory access methods is called. What do you think?
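A minimal sketch of that idea, under stated assumptions (LazyFrame and its members are hypothetical names, not libfreenect2 classes): the frame keeps data on the GPU and performs the device-to-host copy lazily, on first CPU access.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical lazy-download frame: the GPU->CPU copy happens only
// when a CPU access method is actually called.
class LazyFrame
{
public:
  LazyFrame(unsigned char *gpu_data, size_t size)
    : gpu_data_(gpu_data), size_(size), host_data_(NULL) {}

  ~LazyFrame()
  {
    if (host_data_) cudaFreeHost(host_data_);
  }

  unsigned char *data() // CPU access; triggers the download on first use
  {
    if (host_data_ == NULL)
    {
      cudaHostAlloc((void**)&host_data_, size_, cudaHostAllocDefault);
      cudaMemcpy(host_data_, gpu_data_, size_, cudaMemcpyDeviceToHost);
    }
    return host_data_;
  }

private:
  unsigned char *gpu_data_;
  size_t size_;
  unsigned char *host_data_;
};
```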
I think memory pool allocators are just implicit double buffers with extra complexity. Suppose a double buffer is to be used between the packet processor and the frame listener.

OK, the modification to DoubleBuffer in this PR is actually outdated, as I intended for the VA-API PR #210 to be merged first. https://github.com/xlz/libfreenect2/commit/89947ab500af7a7de1cca1ea24cd79a58565a28a is the newer modification to DoubleBuffer that allows inheritance by providing a method ...

The usage pattern is that each provider of DoubleBuffer defines a subclass that implements its own allocator and custom methods. Example: https://github.com/xlz/libfreenect2/commit/eaaa7bcfaa804d6744c6894d2fb045c833a7f893#diff-24ed06cdc0ce177f67a0567e42d2b09fR85. VA-API allocates memory handles that are specific to its own API instead of addressable pointers, and users of DoubleBuffer need to access the memory handles for mapping and unmapping. The allocator cannot be stateless if it stores the memory handles. And because mapping and unmapping deal with the whole allocated memory, DoubleBuffer needs to hold two allocations instead of one allocation split into two halves.
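A sketch of that usage pattern (class and method names here are illustrative, not the actual code in the linked commits): a CUDA-specific subclass owns two separate pinned allocations, mirroring the "two allocations instead of two halves" point above.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative subclass: each provider of the double buffer implements
// its own allocation. Both halves are separate pinned allocations.
class PinnedDoubleBuffer
{
public:
  explicit PinnedDoubleBuffer(size_t size) : size_(size), front_(0)
  {
    cudaHostAlloc((void**)&buffer_[0], size, cudaHostAllocDefault);
    cudaHostAlloc((void**)&buffer_[1], size, cudaHostAllocDefault);
  }
  ~PinnedDoubleBuffer()
  {
    cudaFreeHost(buffer_[0]);
    cudaFreeHost(buffer_[1]);
  }
  unsigned char *front() { return buffer_[front_]; }     // being processed
  unsigned char *back()  { return buffer_[1 - front_]; } // being filled
  void swap()            { front_ = 1 - front_; }
private:
  unsigned char *buffer_[2];
  size_t size_;
  int front_;
};
```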
I am against disabling OpenCL by default, because it is supported by most hardware. The problem people have is setting up OpenCL correctly on their system, and that is not an issue of the OpenCL code in libfreenect2. If needed, one can always disable it manually with the CMake flags. @xlz I already looked into pinned memory for the OpenCL processor, but didn't find the time to implement it. If a pull request or a branch with the changed packet buffer interface is created, then I could update the OpenCL processor accordingly and prepare a changed version of iai_kinect2.
@xlz OK, your update is fine for me. However, as an idea for future improvements: maybe RgbPacket/DepthPacket should directly provide methods to access a buffer into which the stream parsers copy the data. CUDA/VA-API processors could subclass RgbPacket/DepthPacket to implement their custom allocation and store e.g. buffer ids. Each processor provides a method to create the packet instances it can process.
@christiankerl It seems your first comment about allocators was actually suggesting a factory pattern: add a packet factory method to the processor. I'll write some draft about this design.
@xlz My point is more like: with your changes the processor needs to know how the parser works, e.g. that it uses a double buffer, or that process() is called on the back buffer. Previously the parser managed the memory and just told the processor: hey, there is a packet with data at a certain memory location, please process it. I agree that for implementation/efficiency reasons the processor has to decide how to allocate this memory. My proposal suggesting the allocators was a bit short-sighted. I think it would be better if the parser asks the processor to create an appropriate packet instance and then just populates its contents with the incoming data. Afterwards this packet is passed to the process method. This way the parser could request as many packet instances as necessary, to implement double buffering for example.
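A minimal sketch of that factory pattern (createPacket(), CudaDepthPacket, and the other names here are hypothetical, not libfreenect2 API):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

struct DepthPacket
{
  unsigned char *buffer;
  size_t length;
  virtual ~DepthPacket() {}
};

// CUDA-specific packet: this processor's packets live in pinned memory.
struct CudaDepthPacket : DepthPacket
{
  explicit CudaDepthPacket(size_t size)
  {
    cudaHostAlloc((void**)&buffer, size, cudaHostAllocDefault);
    length = size;
  }
  ~CudaDepthPacket() { cudaFreeHost(buffer); }
};

class DepthPacketProcessor
{
public:
  virtual ~DepthPacketProcessor() {}
  // The parser asks the processor for packets instead of managing memory:
  virtual DepthPacket *createPacket(size_t size) = 0;
  virtual void process(DepthPacket *packet) = 0;
};

// Parser side: request as many packets as needed (e.g. two for double
// buffering), fill packet->buffer with incoming data, then hand the
// packet back via process().
```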
@christiankerl It compiles, but I haven't tested further. |
@xlz looks good. I added some minor comments.
100% chance, but takes time. |
A new PR for CUDA is expected next week after the VA-API one is merged.
I directly merged the cleaned up patches into master after testing on Ubuntu and Tegra. |
@wiedemeyer It looks like https://github.com/OpenKinect/libfreenect2/blob/master/src/opencl_depth_packet_processor.cl#L48
@xlz |
I believe you only need one pair of cos and sin values for the 3 phases. |
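One way to read that suggestion (a sketch only; the phase constants come from the discussion below, the function name is made up): compute sincos once and derive the other two phases by the angle-addition formulas.

```cuda
// Sketch of the one-sincos idea: sin/cos of (x + 2*pi/3) and (x + 4*pi/3)
// follow from sin/cos of x via angle addition, using the fixed values
// cos(2*pi/3) = cos(4*pi/3) = -0.5 and sin(2*pi/3) = -sin(4*pi/3).
#define SIN_2PI_3 0.86602540378f // sin(2*pi/3) = sqrt(3)/2

__device__ void three_phase_sincos(float x, float s[3], float c[3])
{
  sincosf(x, &s[0], &c[0]);               // phase 0
  c[1] = -0.5f * c[0] - SIN_2PI_3 * s[0]; // cos(x + 2*pi/3)
  s[1] = -0.5f * s[0] + SIN_2PI_3 * c[0]; // sin(x + 2*pi/3)
  c[2] = -0.5f * c[0] + SIN_2PI_3 * s[0]; // cos(x + 4*pi/3)
  s[2] = -0.5f * s[0] - SIN_2PI_3 * c[0]; // sin(x + 4*pi/3)
}
```

(As noted further down in the thread, on the hardware tested this turned out slower than just calling sincos().)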
I have now created 2 LUTs for p0, one for the sines and one for the cosines, and replaced the constants.
Come on, PHASE_IN_RAD0 is 0, PHASE_IN_RAD1 is 2π/3, PHASE_IN_RAD2 is 4π/3. These are constants that can never change. There are 3 phases that cover 2π. If they change, it means they no longer cover 2π equally, or there are 2 or 4 phases.
Indeed, I tried it with the formula I described and it was slower than sincos(). So this is not a good optimization.
I found the time to integrate the pinned memory (and other things) into the OpenCL processor. It is now nearly as fast as the CUDA one, ~0.2 ms difference on my system.
Performance:
On Tegra K1, numeric optimization improved from 57 Hz to 71 Hz, and pinned memory improved to 90 Hz. Pinned memory required some modifications to Frame and DoubleBuffer. This PR needs coordination with #210 because of dependencies.