
Add VA-API JPEG decoder #210


Closed · xlz wants to merge 9 commits from the vaapi branch

Conversation

@xlz (Member) commented Apr 18, 2015

This adds VA-API support for Intel GPUs under Linux.

Combined with OpenCL, the performance on an Intel i7-4600U/HD Graphics 4400:

[OpenCLDepthPacketProcessor] avg. time: 14.1288ms -> ~70.7774Hz
[VaapiJpegRgbPacketProcessor] avg. time: 5.07581ms -> ~197.013Hz

JPEG decoding consumes less than 10% of a single core.

This pull request can't be merged as-is. It depends on #221. Once the dependency PRs are merged, their commits will be cleaned out of this PR.

  • v2:

Use memory-mapped buffers for input and output to avoid an extra memory copy. This requires modifications to the input and output buffer structures Frame and DoubleBuffer; a rough sketch of the idea follows the test notes below.

  • Test instructions:

To test this branch, you must have i965-va-driver installed. To avoid degraded color decoding, you can update libva and i965-va-driver to 1.5.0 by temporarily adding vivid to /etc/apt/sources.list, and then update only i965-va-driver and libva-dev.

You may see this warning on Ubuntu 14.04:

[VaapiJpegRgbPacketProcessor::initializeVaapi] warning: YUV444 not supported by libva, chroma will be halved

This is because libva on Ubuntu 14.04 is not new enough. Follow the instructions above to update libva and i965-va-driver.
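As mentioned in the v2 note above, the input side of the zero-copy change boils down to handing the stream parser a pointer into a VA-API buffer instead of a separately allocated Frame. The sketch below only illustrates that idea; the function and variable names are placeholders, not the actual code in this branch.

#include <cstddef>
#include <va/va.h>

// Sketch only: create a VA-API slice-data buffer and map it, so the RGB
// stream parser can write the incoming JPEG packet directly into
// driver-owned memory instead of going through an extra malloc/memcpy.
static unsigned char *create_mapped_input_buffer(VADisplay display,
                                                 VAContextID context,
                                                 unsigned int max_packet_size,
                                                 VABufferID *buffer_out)
{
  if (vaCreateBuffer(display, context, VASliceDataBufferType,
                     max_packet_size, 1, NULL, buffer_out) != VA_STATUS_SUCCESS)
    return NULL;

  void *mapped = NULL;
  if (vaMapBuffer(display, *buffer_out, &mapped) != VA_STATUS_SUCCESS)
  {
    vaDestroyBuffer(display, *buffer_out);
    return NULL;
  }

  // The parser fills this memory; it stays valid until vaUnmapBuffer().
  return static_cast<unsigned char *>(mapped);
}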

@larshg (Contributor) commented Apr 21, 2015

Nice work @xlz - but why do you have more or less all the changes in one pull request? Even some that you have in a separate pull request, i.e. #207.

It also makes it difficult to do a code review, since you basically change:

  1. Some explicit template instantiation that was previously reported as a problem (but it seems to have been corrected on the MS side, as I can't reproduce the error now; see #207, "Fix FTBFS on ARM introduced in PR #103")
  2. Modifying the CMakeLists for the libusb and Intel OpenCL environment variable search paths
  3. Adding the VA-API JPEG decoder
  4. Rewrite of the depth stream parser

which I personally think should be separate pull requests, each dealing with a specific feature :)

I have a hard time figuring out whether VA-API is only available on Linux, or whether there is a link for Windows (for ATI or i7), as I would like to try it out :)

At least you get some response now :D

@xlz (Member Author) commented Apr 21, 2015

As I said, this is just a request for comments on a specific feature, not a merge request. I put all the commits for this feature in one RFC pull request ready for testing; otherwise I would have to wait for months until the previous pull requests this feature depends on get merged.

I have submitted other pull requests containing these commits:

Without these, the current stream parsers are too badly broken to evaluate the feature:

  • Add detailed RGB stream checking … ad79c2f
  • Rewrite depth stream parser … 1f55c9f

The feature:

  • Add VA-API JPEG decoder fb33833
  • Add build support for VA-API JPEG decoder 6971eec
  • Remove a 8MB memcpy to improve VA-API performance … 3bd1f5c

@xlz (Member Author) commented Apr 21, 2015

@larshg VA-API is Linux only and Intel only.

I have tried GPU decoding of JPEG. It works, but the performance is not good (< 60Hz) even with powerful GPUs. The bottleneck is Huffman decoding, which is sequential and hard to parallelize. To achieve 60+Hz JPEG decoding, there must be a specialized chip other than the GPU doing the Huffman decoding, and there must be a (usually platform-dependent) video codec acceleration interface exposing that hardware.

@larshg (Contributor) commented Apr 22, 2015

Ah okay, I guess I read the first post a bit too fast and jumped right into the code :)

@rastaxe commented Apr 30, 2015

I am trying the VA support (3bd1f5c) on an Intel NUC (Ubuntu 14.04, kernel 3.16), but I have this problem:

libva info: VA-API version 0.35.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: va_openDriver() returns -1
terminate called after throwing an instance of 'std::runtime_error'
  what():  unknown libva error

@xlz (Member Author) commented Apr 30, 2015

lspci | grep VGA ?

Does /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so exist?

@rastaxe commented Apr 30, 2015

lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09)

That file does not exist. I have i915_dri.so, i965_dri.so, and others...

@xlz (Member Author) commented Apr 30, 2015

apt-get install i965-va-driver

@xlz (Member Author) commented May 5, 2015

I have updated the branch according to the other PRs. The commits relevant to this PR are the last three. The previous commits in this branch consist exactly of PRs #221 and #222, which are dependencies of this PR.

When the two dependency PRs are merged, I will clean them out of this PR.

To test this branch, you must have i965-va-driver installed. To avoid degraded color decoding, you can update libva and i965-va-driver to 1.5.0 by temporarily adding vivid to /etc/apt/sources.list, and then update only i965-va-driver and libva-dev.

@xlz changed the title from "[RFC] Add VA-API JPEG decoding support" to "Add VA-API JPEG decoder" on May 5, 2015
@tlind commented May 11, 2015

I tried this on my i5-3320M on Ubuntu 14.04 and I get significantly better performance with VA-API at lower overall CPU usage (incl. OpenCLDepthPacketProcessor) compared to TurboJPEG:

TurboJPEG    200% CPU    27-35 Hz
VA-API       140% CPU    70-110 Hz

However, there seems to be a massive memory leak that does not occur in the master branch. Protonect is eating up about 0.5% of my memory every second. This happens both with the original libva-dev and i965-va-driver and with the updated ones from Vivid. A quick valgrind run shows the following:

==17538== LEAK SUMMARY:
==17538==    definitely lost: 8,720,841 bytes in 17 blocks
==17538==    indirectly lost: 3,477,369 bytes in 14 blocks
==17538==      possibly lost: 7,551,028 bytes in 445 blocks
==17538==    still reachable: 926,276 bytes in 8,639 blocks
==17538==         suppressed: 0 bytes in 0 blocks
==17538== Reachable blocks (those to which a pointer was found) are not shown.
==17538== To see them, rerun with: --leak-check=full --show-leak-kinds=all

However, the leak must be much larger than the 8 MB of memory mentioned there. Some VA-related leaks I found in the report include:

==17538== 8,388,608 bytes in 1 blocks are definitely lost in loss record 5,316 of 5,316
==17538==    at 0x18677B20: ??? (in /usr/lib/x86_64-linux-gnu/libdrm_intel.so.1.0.0)
==17538==    by 0x1CDBD3E1: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so)
==17538==    by 0x70E02F2: vaMapBuffer (in /usr/lib/x86_64-linux-gnu/libva.so.1.3700.0)
==17538==    by 0x4E78EFD: libfreenect2::VaapiJpegRgbPacketProcessor::process(libfreenect2::RgbPacket const&) (vaapi_jpeg_rgb_packet_processor.cpp:71)
...
==17538== 442 (24 direct, 418 indirect) bytes in 1 blocks are definitely lost in loss record 5,106 of 5,316
==17538==    at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17538==    by 0x1CDBFEC5: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so)
==17538==    by 0x70E01C2: vaCreateBuffer (in /usr/lib/x86_64-linux-gnu/libva.so.1.3700.0)
==17538==    by 0x4E7A8CF: libfreenect2::VaapiJpegRgbPacketProcessorImpl::createParameters(jpeg_decompress_struct&, unsigned int*, unsigned int*) (vaapi_jpeg_rgb_packet_processor.cpp:225)
==17538==    by 0x4E78E0D: libfreenect2::VaapiJpegRgbPacketProcessor::process(libfreenect2::RgbPacket const&) (vaapi_jpeg_rgb_packet_processor.cpp:319)

Unfortunately, I don't have time right now to dig further into this.

@xlz (Member Author) commented May 11, 2015

I think I have run valgrind on this code.

Protonect is eating up 0.5% of my memory approx. every second.

What does this mean? Does Protonect use 100% memory after 200 seconds?

8,388,608 bytes in 1 blocks are definitely lost by 0x70E02F2: vaMapBuffer

This is a single frame of ~1920*1080*4 bytes. OK, I should delete VaapiFrame *frame; in the VaapiJpegRgbPacketProcessor destructor to make valgrind happy, but TurboJpegRgbPacketProcessor does the same thing.

But it should definitely be freed by listener.release(frames); in main() and not accumulate. Does it accumulate?

442 (24 direct, 418 indirect) bytes in 1 blocks are definitely lost in loss record 5,106 of 5,316 by 0x70E01C2: vaCreateBuffer

vaCreateBuffer creates buffers for vaRenderPicture, and "buffers are automatically destroyed afterwards" by vaRenderPicture. This is some leftover from a previous GC cycle during exit, but it doesn't matter.
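For what it's worth, the destructor change being described amounts to something like the sketch below; the class and member names are hypothetical stand-ins, not the actual code.

#include <cstddef>

// Sketch only: a processor that keeps the most recently decoded frame has to
// free it on teardown, otherwise valgrind reports that last ~1920*1080*4-byte
// block as definitely lost. All names here are hypothetical.
struct FrameSketch
{
  unsigned char *data;
  explicit FrameSketch(std::size_t n) : data(new unsigned char[n]) {}
  ~FrameSketch() { delete[] data; }
};

struct ProcessorSketch
{
  FrameSketch *frame;                   // last decoded frame, handed out via the listener
  ProcessorSketch() : frame(NULL) {}
  ~ProcessorSketch() { delete frame; }  // the missing cleanup being discussed
};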

@tlind commented May 11, 2015

With the VA-API Protonect, my system gets slower and slower as time passes. After about 30 seconds, the depth and RGB frame rates drop noticeably. After 2 minutes, Protonect uses over 2 GB of memory (28%, which is VIRT 3005M, RES 2117M, SHR 2047M).

Then the entire system comes to a halt and unrelated processes get killed because the system runs out of memory. This has been reproducible every time so far.

If, instead, I comment out the following in createRgbPacketProcessor() in packet_pipeline.cpp (on your branch) so that it uses the TurboJpegRgbPacketProcessor,

//#ifdef LIBFREENECT2_WITH_VAAPI_SUPPORT
//  return new VaapiJpegRgbPacketProcessor();
//#endif

then Protonect has a constant memory usage of 1.3% even after running for 10 minutes.

@xlz (Member Author) commented May 11, 2015

Interesting. I was able to reproduce the shared memory leak you reported, but only once; since then I have been unable to reproduce it.

Btw, you can toggle features by cmake -DENABLE_VAAPI:BOOL=OFF

@xlz (Member Author) commented May 11, 2015

OK. I am able to reproduce it after letting the machine stay on for several hours.

@tlind commented May 12, 2015

OK, good to hear that you are able to reproduce this. I'm wondering why it takes so long on your machine, while on my laptop the problem appears almost instantly. Could this somehow be related to frame dropping? I observed that my laptop drops frames quite often; maybe some internal VA-API data structure is not being freed properly in that case? I'm just guessing here, though, since I'm not that deep into the internals of libfreenect2.

If you have an idea how to narrow this down further, let me know. It seems that valgrind is not of much help, since it claims that only 8 MB were lost.

@xlz (Member Author) commented May 12, 2015

I used valgrind --tool=massif --pages-as-heap=yes and found the "leak" is happening in libdrm. But I still can't reliably reproduce it.

libdrm allocates memory via ioctl, which valgrind can't track. There is some kind of caching system that maintains a list of allocated memory. It seems libdrm is more likely to miss the cache and allocate new memory when there is heavy memory activity going on elsewhere.

@xlz (Member Author) commented May 12, 2015

I posted a simpler reproducer upstream: https://bugs.freedesktop.org/show_bug.cgi?id=90429

xlz added 3 commits May 13, 2015 09:37
  • Inspect the magic markers at the end of a JPEG frame and match the sequence number and length. Find out the exact size of the JPEG image for decoders that can't handle garbage after JPEG EOI.
  • Remove magic footer scanning: may appear in the middle. Assume fixed packet size.
  • Pass timestamps and sequence numbers from {rgb,depth} stream processors to turbojpeg rgb processor and {cpu,opengl,opencl} depth processors, then to rgb and depth frames. This commit subsumes PR #71 by @hovren and #148 by @MasWag.
@xlz (Member Author) commented May 13, 2015

@tlind
The cause seems to be vaRenderPicture not actually doing its job of "buffers are automatically destroyed afterwards". I have to destroy the buffers explicitly, otherwise it causes leaks in the kernel.

I have pushed a fix to the vaapi branch. Please pull it and see if there is still any leak.

If this is correct, I'll eventually move to mmap to avoid buffer allocation altogether.
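For reference, explicitly destroying the buffers amounts to something like the sketch below, using libva's public API; the display, context, surface, and buffer ids are placeholders supplied by the caller, and this is not necessarily the exact code in the pushed fix.

#include <va/va.h>

// Sketch only: submit the JPEG decode, then destroy the parameter/data
// buffers ourselves instead of relying on vaRenderPicture to free them.
static VAStatus decode_and_destroy_buffers(VADisplay display, VAContextID context,
                                           VASurfaceID surface,
                                           VABufferID *buffers, int num_buffers)
{
  VAStatus status = vaBeginPicture(display, context, surface);
  if (status == VA_STATUS_SUCCESS)
    status = vaRenderPicture(display, context, buffers, num_buffers);
  if (status == VA_STATUS_SUCCESS)
    status = vaEndPicture(display, context);

  // Explicit cleanup; without it the driver-side allocations accumulate.
  for (int i = 0; i < num_buffers; ++i)
    vaDestroyBuffer(display, buffers[i]);

  return status;
}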

@tlind commented May 13, 2015

Thanks! I don't have access to the sensor right now, but I hope I can try this out on Friday!

@tlind commented May 15, 2015

This seems to have fixed it! I am now seeing constant memory usage, and Protonect ran stably for over 20 minutes. Looks good to me!

xlz added 4 commits May 15, 2015 17:14
  • Allow packet processors to define custom zero-copy packet buffers.
  • JPEG performance is improved from 8ms/frame (125Hz) to 5.2ms/frame (192Hz) on Intel i7-4600U/HD Graphics 4400.
  • Provide memory-mapped packet buffers allocated by VA-API to the RGB stream parser to save a 700KB malloc & memcpy.
  • Reuse decoding results from the first JPEG packet for all following packets, assuming JPEG coding parameters do not change based on some testing.
@xlz (Member Author) commented May 18, 2015

I have implemented memory-mapped buffer operations for input and output instead of explicitly destroying allocated buffers. This should be even better.
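One way the output side can be memory-mapped is vaDeriveImage plus vaMapBuffer, so the decoded surface is read in place. The sketch below assumes that approach and is not necessarily how this branch does it; error handling is trimmed and the names are placeholders.

#include <cstddef>
#include <va/va.h>

// Sketch only: map the decoded surface into CPU-visible memory so the RGB
// frame can alias it directly instead of being copied into a separate buffer.
static unsigned char *map_decoded_surface(VADisplay display, VASurfaceID surface,
                                          VAImage *image_out)
{
  if (vaSyncSurface(display, surface) != VA_STATUS_SUCCESS)
    return NULL;

  // Derive a VAImage that aliases the surface's memory (no copy).
  if (vaDeriveImage(display, surface, image_out) != VA_STATUS_SUCCESS)
    return NULL;

  void *mapped = NULL;
  if (vaMapBuffer(display, image_out->buf, &mapped) != VA_STATUS_SUCCESS)
  {
    vaDestroyImage(display, image_out->image_id);
    return NULL;
  }

  // Valid until vaUnmapBuffer()/vaDestroyImage() are called for this image.
  return static_cast<unsigned char *>(mapped);
}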

@xlz mentioned this pull request on May 18, 2015
@tlind commented May 20, 2015

Works fine for me. The maximum frame rate has not improved much compared to the previous version (still around 115 Hz), but it seems a bit more stable now (it doesn't go below 90; previously I sometimes saw 70).

@xlz (Member Author) commented Feb 7, 2016

Second attempt in #563.

@xlz deleted the vaapi branch February 12, 2016 20:53