
Add VA-API JPEG decoder #210


Closed · xlz wants to merge 9 commits from the vaapi branch

Conversation

@xlz (Member) commented Apr 18, 2015

This adds VA-API support for Intel GPUs under Linux.

Combined with OpenCL, the performance on an Intel i7-4600U/HD Graphics 4400:

[OpenCLDepthPacketProcessor] avg. time: 14.1288ms -> ~70.7774Hz
[VaapiJpegRgbPacketProcessor] avg. time: 5.07581ms -> ~197.013Hz

JPEG decoding consumes less than 10% of a single core.

This pull request can't be merged as-is. It depends on #221. Once the dependency PRs are merged, their commits will be cleaned out of this PR.

  • v2:

Use memory-mapped buffers for input and output to avoid an extra memory copy. This requires modifications to the input and output buffer structures Frame and DoubleBuffer; a rough sketch of the idea follows the test notes below.

  • Test instructions:

To test this branch, you must have i965-va-driver installed. To avoid degraded color decoding, you can update libva and i965-va-driver to 1.5.0 by temporarily adding vivid to /etc/apt/sources.list, and then update only i965-va-driver and libva-dev.

You may see this warning on Ubuntu 14.04:

[VaapiJpegRgbPacketProcessor::initializeVaapi] warning: YUV444 not supported by libva, chroma will be halved

This is because libva on Ubuntu 14.04 is not new enough. Follow the instructions above to update libva and i965-va-driver.
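As mentioned in the v2 note above, the input side of the zero-copy change boils down to handing the stream parser a pointer into a VA-API buffer instead of a separately allocated Frame. The sketch below only illustrates that idea; the function and variable names are placeholders, not the actual code in this branch.

#include <cstddef>
#include <va/va.h>

// Sketch only: create a VA-API slice-data buffer and map it, so the RGB
// stream parser can write the incoming JPEG packet directly into
// driver-owned memory instead of going through an extra malloc/memcpy.
static unsigned char *create_mapped_input_buffer(VADisplay display,
                                                 VAContextID context,
                                                 unsigned int max_packet_size,
                                                 VABufferID *buffer_out)
{
  if (vaCreateBuffer(display, context, VASliceDataBufferType,
                     max_packet_size, 1, NULL, buffer_out) != VA_STATUS_SUCCESS)
    return NULL;

  void *mapped = NULL;
  if (vaMapBuffer(display, *buffer_out, &mapped) != VA_STATUS_SUCCESS)
  {
    vaDestroyBuffer(display, *buffer_out);
    return NULL;
  }

  // The parser fills this memory; it stays valid until vaUnmapBuffer().
  return static_cast<unsigned char *>(mapped);
}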

@larshg (Contributor) commented Apr 21, 2015

Nice work @xlz - but why do you have more or less all the changes in one pull request? Even some that you have in a separate pull request, i.e. #207.

It also makes it difficult to do a code review, since you basically change:

  1. Some explicit template instantiation that was previously reported as a problem (but it seems to have been corrected on the MS side, as I can't reproduce the error now; see #207, "Fix FTBFS on ARM introduced in PR #103")
  2. Modifying the CMakeLists for the libusb and Intel OpenCL environment variable search paths
  3. Adding the VA-API JPEG decoder
  4. Rewrite of the depth stream parser

which I personally think should be separate pull requests, each dealing with a specific feature :)

I have a hard time figuring out whether VA-API is only available on Linux, or whether there is a link for Windows (for ATI or i7), as I would like to try it out :)

At least you get some response now :D

@xlz (Member Author) commented Apr 21, 2015

As I said, this is just a request for comments on a specific feature, not a merge request. I put all the commits for this feature in one RFC pull request ready for testing; otherwise I would have to wait for months until the previous pull requests this feature depends on get merged.

I have submitted other pull requests containing these commits:

Without these, the current stream parsers are too badly broken to evaluate the feature:

  • Add detailed RGB stream checking … ad79c2f
  • Rewrite depth stream parser … 1f55c9f

The feature:

  • Add VA-API JPEG decoder fb33833
  • Add build support for VA-API JPEG decoder 6971eec
  • Remove a 8MB memcpy to improve VA-API performance … 3bd1f5c

@xlz (Member Author) commented Apr 21, 2015

@larshg VA-API is Linux only and Intel only.

I have tried GPU decoding of JPEG. It works, but the performance is not good (< 60Hz) even with powerful GPUs. The bottleneck is Huffman decoding, which is sequential and hard to parallelize. To achieve 60+Hz JPEG decoding, there must be a specialized chip other than the GPU doing the Huffman decoding, and there must be a (usually platform-dependent) video codec acceleration interface exposing that hardware.

@larshg (Contributor) commented Apr 22, 2015

Ah okay, I guess I read the first post a bit too fast and jumped right into the code :)

@rastaxe commented Apr 30, 2015

I am trying the VA support (3bd1f5c) on an Intel NUC (Ubuntu 14.04, kernel 3.16), but I have this problem:

libva info: VA-API version 0.35.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: va_openDriver() returns -1
terminate called after throwing an instance of 'std::runtime_error'
  what():  unknown libva error

@xlz (Member Author) commented Apr 30, 2015

lspci | grep VGA ?

Does /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so exist?

@rastaxe commented Apr 30, 2015

lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09)

That file does not exist. I have i915_dri.so, i965_dri.so, and others...

@xlz (Member Author) commented Apr 30, 2015

apt-get install i965-va-driver

@xlz (Member Author) commented May 5, 2015

I have updated the branch according to the other PRs. The commits relevant to this PR are the last three. The previous commits in this branch consist exactly of PRs #221 and #222, which are dependencies of this PR.

When the two dependency PRs are merged, I will clean them out of this PR.

To test this branch, you must have i965-va-driver installed. To avoid degraded color decoding, you can update libva and i965-va-driver to 1.5.0 by temporarily adding vivid to /etc/apt/sources.list, and then update only i965-va-driver and libva-dev.

@xlz changed the title from "[RFC] Add VA-API JPEG decoding support" to "Add VA-API JPEG decoder" on May 5, 2015
@tlind commented May 11, 2015

I tried this on my i5-3320M on Ubuntu 14.04 and I get significantly better performance with VA-API at lower overall CPU usage (incl. OpenCLDepthPacketProcessor) compared to TurboJPEG:

TurboJPEG    200% CPU    27-35 Hz
VA-API       140% CPU    70-110 Hz

However, there seems to be a massive memory leak that does not occur in the master branch. Protonect is eating up about 0.5% of my memory every second. This happens both with the original libva-dev and i965-va-driver and with the updated ones from Vivid. A quick valgrind run shows the following:

==17538== LEAK SUMMARY:
==17538==    definitely lost: 8,720,841 bytes in 17 blocks
==17538==    indirectly lost: 3,477,369 bytes in 14 blocks
==17538==      possibly lost: 7,551,028 bytes in 445 blocks
==17538==    still reachable: 926,276 bytes in 8,639 blocks
==17538==         suppressed: 0 bytes in 0 blocks
==17538== Reachable blocks (those to which a pointer was found) are not shown.
==17538== To see them, rerun with: --leak-check=full --show-leak-kinds=all

However, the leak must be much larger than the 8 MB of memory mentioned there. Some VA-related leaks I found in the report include:

==17538== 8,388,608 bytes in 1 blocks are definitely lost in loss record 5,316 of 5,316
==17538==    at 0x18677B20: ??? (in /usr/lib/x86_64-linux-gnu/libdrm_intel.so.1.0.0)
==17538==    by 0x1CDBD3E1: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so)
==17538==    by 0x70E02F2: vaMapBuffer (in /usr/lib/x86_64-linux-gnu/libva.so.1.3700.0)
==17538==    by 0x4E78EFD: libfreenect2::VaapiJpegRgbPacketProcessor::process(libfreenect2::RgbPacket const&) (vaapi_jpeg_rgb_packet_processor.cpp:71)
...
==17538== 442 (24 direct, 418 indirect) bytes in 1 blocks are definitely lost in loss record 5,106 of 5,316
==17538==    at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17538==    by 0x1CDBFEC5: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so)
==17538==    by 0x70E01C2: vaCreateBuffer (in /usr/lib/x86_64-linux-gnu/libva.so.1.3700.0)
==17538==    by 0x4E7A8CF: libfreenect2::VaapiJpegRgbPacketProcessorImpl::createParameters(jpeg_decompress_struct&, unsigned int*, unsigned int*) (vaapi_jpeg_rgb_packet_processor.cpp:225)
==17538==    by 0x4E78E0D: libfreenect2::VaapiJpegRgbPacketProcessor::process(libfreenect2::RgbPacket const&) (vaapi_jpeg_rgb_packet_processor.cpp:319)

Unfortunately, I don't have time right now to dig further into this.

@xlz (Member Author) commented May 11, 2015

I think I have run valgrind on this code.

Protonect is eating up 0.5% of my memory approx. every second.

What does this mean? Does Protonect use 100% memory after 200 seconds?

8,388,608 bytes in 1 blocks are definitely lost by 0x70E02F2: vaMapBuffer

This is a single frame of ~1920*1080*4 bytes. OK, I should delete VaapiFrame *frame; in the VaapiJpegRgbPacketProcessor destructor to make valgrind happy, but TurboJpegRgbPacketProcessor does the same thing.

But it should definitely be freed by listener.release(frames); in main() and not accumulate. Does it accumulate?

442 (24 direct, 418 indirect) bytes in 1 blocks are definitely lost in loss record 5,106 of 5,316 by 0x70E01C2: vaCreateBuffer

vaCreateBuffer creates buffers for vaRenderPicture, and "buffers are automatically destroyed afterwards" by vaRenderPicture. This is some leftover from a previous GC cycle during exit, but it doesn't matter.
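For what it's worth, the destructor change being described amounts to something like the sketch below; the class and member names are hypothetical stand-ins, not the actual code.

#include <cstddef>

// Sketch only: a processor that keeps the most recently decoded frame has to
// free it on teardown, otherwise valgrind reports that last ~1920*1080*4-byte
// block as definitely lost. All names here are hypothetical.
struct FrameSketch
{
  unsigned char *data;
  explicit FrameSketch(std::size_t n) : data(new unsigned char[n]) {}
  ~FrameSketch() { delete[] data; }
};

struct ProcessorSketch
{
  FrameSketch *frame;                   // last decoded frame, handed out via the listener
  ProcessorSketch() : frame(NULL) {}
  ~ProcessorSketch() { delete frame; }  // the missing cleanup being discussed
};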

@tlind commented May 11, 2015

With the VA-API Protonect, my system gets slower and slower as time passes. After about 30 seconds, the depth and RGB frame rates drop noticeably. After 2 minutes, Protonect uses over 2 GB of memory (28%, which is VIRT 3005M, RES 2117M, SHR 2047M).

Then the entire system comes to a halt and unrelated processes get killed because the system runs out of memory. This has been reproducible every time so far.

If, instead, I comment out the following in createRgbPacketProcessor() in packet_pipeline.cpp (on your branch) so that it uses the TurboJpegRgbPacketProcessor,

//#ifdef LIBFREENECT2_WITH_VAAPI_SUPPORT
//  return new VaapiJpegRgbPacketProcessor();
//#endif

then Protonect has a constant memory usage of 1.3% even after running for 10 minutes.

@xlz (Member Author) commented May 11, 2015

Interesting. I was able to reproduce the shared memory leak you reported, but only once; since then I have been unable to reproduce it.

Btw, you can toggle features by cmake -DENABLE_VAAPI:BOOL=OFF

@xlz (Member Author) commented May 11, 2015

OK. I am able to reproduce it after letting the machine stay on for several hours.

@tlind commented May 12, 2015

OK, good to hear that you are able to reproduce this. I'm wondering why it takes so long on your machine, while on my laptop the problem appears almost instantly. Could this somehow be related to frame dropping? I observed that my laptop drops frames quite often; maybe some internal VA-API data structure is not being freed properly in that case? I'm just guessing here, though, since I'm not that deep into the internals of libfreenect2.

If you have an idea how to narrow this down further, let me know. It seems that valgrind is not of much help, since it claims that only 8 MB were lost.

@xlz (Member Author) commented May 12, 2015

I used valgrind --tool=massif --pages-as-heap=yes and found the "leak" is happening in libdrm. But I still can't reliably reproduce it.

libdrm allocates memory via ioctl, which valgrind can't track. There is some kind of caching system that maintains a list of allocated memory. It seems libdrm is more likely to miss the cache and allocate new memory when there is heavy memory activity going on elsewhere.

@xlz (Member Author) commented May 12, 2015

I posted a simpler reproducer upstream: https://bugs.freedesktop.org/show_bug.cgi?id=90429

xlz added 3 commits May 13, 2015 09:37
  • Inspect the magic markers at the end of a JPEG frame and match the sequence number and length. Find out the exact size of the JPEG image for decoders that can't handle garbage after JPEG EOI.
  • Remove magic footer scanning: may appear in the middle. Assume fixed packet size.
  • Pass timestamps and sequence numbers from {rgb,depth} stream processors to turbojpeg rgb processor and {cpu,opengl,opencl} depth processors, then to rgb and depth frames. This commit subsumes PR #71 by @hovren and #148 by @MasWag.
@xlz (Member Author) commented May 13, 2015

@tlind
The cause seems to be vaRenderPicture not actually doing its job of "buffers are automatically destroyed afterwards". I have to destroy the buffers explicitly, otherwise it causes leaks in the kernel.

I have pushed a fix to the vaapi branch. Please pull it and see if there is still any leak.

If this is correct, I'll eventually move to mmap to avoid buffer allocation altogether.
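For reference, explicitly destroying the buffers amounts to something like the sketch below, using libva's public API; the display, context, surface, and buffer ids are placeholders supplied by the caller, and this is not necessarily the exact code in the pushed fix.

#include <va/va.h>

// Sketch only: submit the JPEG decode, then destroy the parameter/data
// buffers ourselves instead of relying on vaRenderPicture to free them.
static VAStatus decode_and_destroy_buffers(VADisplay display, VAContextID context,
                                           VASurfaceID surface,
                                           VABufferID *buffers, int num_buffers)
{
  VAStatus status = vaBeginPicture(display, context, surface);
  if (status == VA_STATUS_SUCCESS)
    status = vaRenderPicture(display, context, buffers, num_buffers);
  if (status == VA_STATUS_SUCCESS)
    status = vaEndPicture(display, context);

  // Explicit cleanup; without it the driver-side allocations accumulate.
  for (int i = 0; i < num_buffers; ++i)
    vaDestroyBuffer(display, buffers[i]);

  return status;
}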

@tlind commented May 13, 2015

Thanks! I don't have access to the sensor right now, but I hope I can try this out on Friday!

@tlind commented May 15, 2015

This seems to have fixed it! I am now seeing constant memory usage, and Protonect ran stably for over 20 minutes. Looks good to me!

xlz added 4 commits May 15, 2015 17:14
  • Allow packet processors to define custom zero-copy packet buffers.
  • JPEG performance is improved from 8ms/frame (125Hz) to 5.2ms/frame (192Hz) on Intel i7-4600U/HD Graphics 4400.
  • Provide memory-mapped packet buffers allocated by VA-API to the RGB stream parser to save a 700KB malloc & memcpy.
  • Reuse decoding results from the first JPEG packet for all following packets, assuming JPEG coding parameters do not change based on some testing.
@xlz (Member Author) commented May 18, 2015

I have implemented memory-mapped buffer operations for input and output instead of explicitly destroying allocated buffers. This should be even better.
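One way the output side can be memory-mapped is vaDeriveImage plus vaMapBuffer, so the decoded surface is read in place. The sketch below assumes that approach and is not necessarily how this branch does it; error handling is trimmed and the names are placeholders.

#include <cstddef>
#include <va/va.h>

// Sketch only: map the decoded surface into CPU-visible memory so the RGB
// frame can alias it directly instead of being copied into a separate buffer.
static unsigned char *map_decoded_surface(VADisplay display, VASurfaceID surface,
                                          VAImage *image_out)
{
  if (vaSyncSurface(display, surface) != VA_STATUS_SUCCESS)
    return NULL;

  // Derive a VAImage that aliases the surface's memory (no copy).
  if (vaDeriveImage(display, surface, image_out) != VA_STATUS_SUCCESS)
    return NULL;

  void *mapped = NULL;
  if (vaMapBuffer(display, image_out->buf, &mapped) != VA_STATUS_SUCCESS)
  {
    vaDestroyImage(display, image_out->image_id);
    return NULL;
  }

  // Valid until vaUnmapBuffer()/vaDestroyImage() are called for this image.
  return static_cast<unsigned char *>(mapped);
}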

@xlz mentioned this pull request on May 18, 2015
@tlind commented May 20, 2015

Works fine for me. The maximum frame rate has not improved much compared to the previous version (still around 115 Hz), but it seems a bit more stable now (it doesn't go below 90; previously I sometimes saw 70).

@xlz (Member Author) commented Feb 7, 2016

Second attempt in #563.

@xlz deleted the vaapi branch February 12, 2016 20:53