Add VA-API JPEG decoder #210
Conversation
Nice work @xlz - but why do you have more or less all changes in one pull request? Even some which you already have in a separate pull request, i.e. #207. It also makes it difficult to do code review, since you basically change:
which I personally think should be separate pull requests - each dealing with a specific feature :) I have a hard time figuring out whether VA-API is only available on Linux, or do you have a link for Windows (for ATI or i7)? I would like to try it out :) At least you get some response now :D |
As I said, this is just a request for comments on a specific feature, not for merge. I have all commits for this feature in one RFC pull request, ready for testing. Otherwise I would have to wait for months before the previous pull requests this feature depends on get merged. I have submitted these commits as other pull requests:
The current stream parsers are badly broken, so the feature cannot be evaluated without these: The feature: |
@larshg VA-API is Linux only and Intel only. I have tried GPU decoding of JPEG. It works, but the performance is not good (< 60Hz) even on powerful GPUs. The bottleneck is Huffman decoding, which is sequential and hard to parallelize. To achieve 60+Hz performance for JPEG decoding, there must be some specialized chip other than the GPU doing the Huffman decoding, and there must be a (usually platform-dependent) video codec acceleration interface exposing that hardware. |
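To check whether a given machine exposes such an interface at all, a probe like the following can be used. This is a minimal sketch, not code from this PR; the render node path /dev/dri/renderD128 is an assumption and varies by system:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>
#include <va/va.h>
#include <va/va_drm.h>

int main()
{
  // Open a DRM render node (path is an assumption; varies by system).
  int drm_fd = open("/dev/dri/renderD128", O_RDWR);
  if (drm_fd < 0) { perror("open"); return 1; }

  VADisplay display = vaGetDisplayDRM(drm_fd);
  int major = 0, minor = 0;
  if (vaInitialize(display, &major, &minor) != VA_STATUS_SUCCESS)
    return 1;

  // A JPEG-capable driver advertises VAProfileJPEGBaseline together
  // with the VAEntrypointVLD (full-bitstream decode) entrypoint.
  std::vector<VAEntrypoint> entrypoints(vaMaxNumEntrypoints(display));
  int num = 0;
  if (vaQueryConfigEntrypoints(display, VAProfileJPEGBaseline,
                               entrypoints.data(), &num) == VA_STATUS_SUCCESS)
    for (int i = 0; i < num; ++i)
      if (entrypoints[i] == VAEntrypointVLD)
        std::printf("Hardware JPEG decoding available (libva %d.%d)\n",
                    major, minor);

  vaTerminate(display);
  close(drm_fd);
  return 0;
}
```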
Ah okay, I guess I read the first post a bit too fast and jumped right into the code :) |
I am trying the VA support (3bd1f5c) on an Intel NUC (Ubuntu 14.04, kernel 3.16), but I have this problem:
|
Does |
That file does not exist. I have i915_dri.so and i965_dri.so and others... |
|
I have updated the branch according to the other PRs. The commits relevant to this PR are the last three; the previous commits in this branch consist exactly of PRs #221 and #222, which are dependencies of this PR. When the two dependency PRs are merged, I will clean them out of this PR. To test this branch, you must have i965-va-driver installed. |
I tried this on my i5-3320M on Ubuntu 14.04 and I get significantly better performance with VA-API at lower overall CPU usage (incl. OpenCLDepthPacketProcessor) compared to TurboJPEG:
However, there seems to be a massive memory leak which does not occur in the master branch: Protonect is eating up about 0.5% of my memory every second. This happens with both the original
However, the leak must be much larger than the 8 MB of memory that the report mentions. Some VA-related leaks I found in the report include:
Unfortunately, I don't have time right now to dig further into this. |
I think I have run valgrind through this code.
What does this mean? Does Protonect use 100% memory after 200 seconds?
This is a single frame of ~ But it should definitely be freed in main()
vaCreateBuffer creates buffers for vaRenderPicture; the buffers are "automatically destroyed afterwards" by vaRenderPicture. This is some leftover from a previous GC cycle during exit, but it doesn't matter. |
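For context, this is roughly the submission pattern that comment refers to. A sketch only: display, context, surface, and the parameter payloads pic_param/slice_param are assumed to already exist, and error handling is omitted:

```cpp
// Package the JPEG picture/slice parameters and the compressed
// bitstream into VA buffers, then queue them for decoding.
VABufferID pic_buf, slice_buf, data_buf;

vaCreateBuffer(display, context, VAPictureParameterBufferType,
               sizeof(pic_param), 1, &pic_param, &pic_buf);
vaCreateBuffer(display, context, VASliceParameterBufferType,
               sizeof(slice_param), 1, &slice_param, &slice_buf);
vaCreateBuffer(display, context, VASliceDataBufferType,
               jpeg_size, 1, jpeg_data, &data_buf);

vaBeginPicture(display, context, surface);
VABufferID bufs[] = { pic_buf, slice_buf, data_buf };
vaRenderPicture(display, context, bufs, 3);
vaEndPicture(display, context);
// Per the libva documentation, buffers queued through vaRenderPicture
// are released by the driver after rendering; calling vaDestroyBuffer()
// explicitly is the defensive alternative when in doubt.
```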
With the VA-API Protonect, my system gets slower and slower as time passes. After about 30 seconds, the depth and RGB frame rates drop noticeably. After 2 minutes, Protonect uses over 2 GB of memory (28%; VIRT 3005M, RES 2117M, SHR 2047M). Then the entire system comes to a halt and unrelated processes get killed due to running out of memory. This has been reproducible every time so far. If instead in
then Protonect has a constant memory usage of 1.3%, even after running for 10 minutes. |
Interesting. I was able to reproduce the shared memory leak you reported once; since then I have been unable to reproduce it. Btw, you can toggle features by |
OK. I am able to reproduce it after letting the machine stay on for several hours. |
Ok, good to hear that you are able to reproduce this. I'm wondering why it takes so long on your machine, while on my laptop the problem appears almost instantly. Could this somehow be related to frame dropping? I observed that my laptop drops frames quite often, maybe some internal VA-API data structure is not being freed properly in that case? I'm just guessing now since I'm not so deep into the internals of libfreenect2, though. If you have an idea how to narrow this down further, let me know. It seems that valgrind is not of much help, since it claims that only 8 MB were lost. |
I used valgrind --tool=massif --pages-as-heap=yes and found that the "leak" is happening in libdrm, but I still can't reliably reproduce it. libdrm uses ioctl to allocate memory, which valgrind can't track, and it maintains some kind of cache of allocated memory. It seems libdrm is more likely to miss this cache and allocate new memory when heavy memory activity is going on elsewhere. |
I posted a simpler reproducer on upstream https://bugs.freedesktop.org/show_bug.cgi?id=90429 |
Inspect the magic markers at the end of a JPEG frame and match the sequence number and length. Find out the exact size of the JPEG image for decoders that can't handle garbage after the JPEG EOI.
Remove magic footer scanning: the footer bytes may also appear in the middle of a packet. Assume a fixed packet size instead.
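For illustration, here is a hypothetical helper of the kind the first commit message describes; the footer layout, the magic value, and the function name are placeholders, not the device's actual packet format. As the second message notes, the same byte pattern can occur inside the compressed data, so scanning the payload for it is unreliable, which is why the check was dropped in favor of a fixed packet size:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical footer layout and magic value for illustration only.
struct JpegFooter
{
  uint32_t magic;     // footer marker
  uint32_t sequence;  // must match the packet's sequence number
  uint32_t length;    // exact size of the JPEG image in bytes
};

static const uint32_t kFooterMagic = 0xDEADBEEF; // placeholder value

// Return the exact JPEG length if a valid footer terminates the buffer,
// or 0 otherwise. Searching the payload for these magic bytes is not
// reliable, because the pattern can also occur inside compressed data.
size_t jpeg_length_from_footer(const uint8_t *buf, size_t len,
                               uint32_t expected_sequence)
{
  if (len < sizeof(JpegFooter))
    return 0;

  JpegFooter footer;
  std::memcpy(&footer, buf + len - sizeof(JpegFooter), sizeof(footer));

  if (footer.magic != kFooterMagic ||
      footer.sequence != expected_sequence ||
      footer.length > len - sizeof(JpegFooter))
    return 0;

  return footer.length;
}
```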
@tlind I have pushed a fix to vaapi branch. Please pull and see if there is still any leak. If this is correct, I'll eventually move to mmap to avoid buffer allocation at all. |
Thanks! I don't have access to the sensor right now, but I hope I can try this out on Friday! |
This seems to have fixed it! I am now seeing constant memory usage and Protonect ran stable for over 20 minutes. Looks good to me! |
Allow packet processors to define custom zero-copy packet buffers.
JPEG performance is improved from 8 ms/frame (125 Hz) to 5.2 ms/frame (192 Hz) on an Intel i7-4600U/HD Graphics 4400.
Provide memory-mapped packet buffers allocated by VA-API to the RGB stream parser to save a 700 KB malloc & memcpy. Reuse decoding results from the first JPEG packet for all following packets, assuming (based on some testing) that the JPEG coding parameters do not change.
I have implemented memory-mapped buffer operations for input and output instead of explicitly destroying allocated buffers. This should be even better. |
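The pattern, roughly, is as follows. This is a sketch under the assumption that display, context, surface, and max_packet_size already exist; it is not the exact code in the branch:

```cpp
// Input: create an empty coded buffer and map it, so the stream parser
// writes the JPEG bytes directly into driver-owned memory (no memcpy).
VABufferID data_buf;
vaCreateBuffer(display, context, VASliceDataBufferType,
               max_packet_size, 1, NULL, &data_buf);

uint8_t *dst = NULL;
vaMapBuffer(display, data_buf, (void **)&dst);
// ... the USB transfer callback writes the compressed packet into dst ...
vaUnmapBuffer(display, data_buf);

// Output: derive an image from the decoded surface and map it, instead
// of copying the pixels out with vaGetImage().
VAImage image;
vaDeriveImage(display, surface, &image);
uint8_t *pixels = NULL;
vaMapBuffer(display, image.buf, (void **)&pixels);
// ... hand `pixels` to downstream processing as a zero-copy frame ...
vaUnmapBuffer(display, image.buf);
vaDestroyImage(display, image.image_id);
```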
Works fine for me. The maximum frame rate has not improved much compared to the previous version (still around 115 Hz), but it seems a bit more stable now (it doesn't go below 90; previously I sometimes saw 70). |
Second attempt in #563. |
This adds VA-API support for Intel GPUs under Linux.
Combined with OpenCL, the performance on an Intel i7-4600U/HD Graphics 4400:
JPEG decoding consumes less than 10% of a single core.
This pull request can't be merged as-is. It depends on #221 and #222. After those two are merged, the dependency commits will be cleaned out of this PR.
Use memory-mapped buffers for input and output to avoid an extra memory copy. This requires modifications to the input and output buffer structures Frame and DoubleBuffer.
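To make the shape of that change concrete, here is a hypothetical sketch of a frame whose storage is a mapped VA buffer rather than heap memory. The class name VaapiFrame and its VA plumbing are assumptions, not the PR's actual diff; the field names follow libfreenect2::Frame:

```cpp
#include <cstddef>
#include <va/va.h>

struct VaapiFrame
{
  size_t width, height, bytes_per_pixel;
  unsigned char *data;  // points into driver-owned, mapped memory
  VADisplay display;
  VABufferID buffer;

  VaapiFrame(VADisplay dpy, VABufferID buf,
             size_t w, size_t h, size_t bpp)
    : width(w), height(h), bytes_per_pixel(bpp), data(NULL),
      display(dpy), buffer(buf)
  {
    // Map once; the packet processor reads and writes through `data`
    // without an intermediate malloc + memcpy.
    vaMapBuffer(display, buffer, (void **)&data);
  }

  ~VaapiFrame()
  {
    vaUnmapBuffer(display, buffer);  // no free(): the driver owns it
  }
};
```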
To test this branch, you must have i965-va-driver installed. To avoid degraded color decoding, you can update libva and i965-va-driver to 1.5.0 by temporarily adding vivid to /etc/apt/sources.list and then updating only i965-va-driver and libva-dev.
You may see this warning on Ubuntu 14.04:
This is because libva on Ubuntu 14.04 is not new enough. Follow the instructions above to update libva and i965-va-driver.