Conversation
stephenyan1231 (Contributor) commented Aug 30, 2019

This PR implements a C++ video decoder, referred to below as the TorchVision (TV) video reader.

Main features

  • Decode both video frames and audio waveforms in a single pass.
  • Seek to a user-specified timestamp in both the video and audio streams and decode frames starting from there; an end timestamp at which decoding should stop can also be given.
  • For video decoding, support rescaling the height/width and a user-specified AVPixelFormat (default: AV_PIX_FMT_RGB24).
  • For audio decoding, support resampling to a user-specified sampling rate and channel count; the user can also specify the AVSampleFormat (default: AV_SAMPLE_FMT_FLT).
  • Support decoding only the presentation timestamps (pts) while the actual video/audio frame data is skipped. This is useful at dataset-initialization time, when an index of the video dataset must be built and only pts information is needed. A PyAV-based sketch of these operations follows this list.
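To make the feature set concrete, here is a minimal sketch of the same operations expressed with PyAV (the baseline this PR benchmarks against). It only illustrates seeking, range decoding, and a pts-only pass; it is not the TV reader's API, and the pixel-format choice is an assumption.

```python
import av

def decode_clip(path, start_sec, end_sec):
    """Seek to start_sec, then decode RGB frames until end_sec (both in seconds)."""
    container = av.open(path)
    stream = container.streams.video[0]
    # seek() takes an offset in stream.time_base units when a stream is passed
    container.seek(int(start_sec / stream.time_base), stream=stream)
    frames = []
    for frame in container.decode(stream):
        if frame.time is not None and frame.time > end_sec:
            break
        frames.append(frame.to_ndarray(format="rgb24"))  # pixel-format conversion happens here
    container.close()
    return frames

def video_pts_only(path):
    """Index-building pass: collect video pts without decoding any frame data."""
    container = av.open(path)
    stream = container.streams.video[0]
    pts = sorted(p.pts for p in container.demux(stream) if p.pts is not None)
    container.close()
    return pts
```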

APIs

The main APIs are (a rough PyAV analogue is sketched after this list):

  • FfmpegDecoder::decodeFile(....): decode frames from a given video file. This is useful for both OSS and FB research projects, where videos reside in local folders.
  • FfmpegDecoder::decodeMemory(....): decode frames from a given compressed video byte array. This is useful for decoding Everstore videos.
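The C++ signatures are not spelled out in this summary, so as a rough analogue (again with PyAV, not the TV reader itself): av.open accepts both a path and a file-like object, which mirrors the file vs. in-memory split.

```python
import io
import av

def open_from_file(path):
    # decodeFile analogue: the video lives on the local filesystem
    return av.open(path)

def open_from_memory(data: bytes):
    # decodeMemory analogue: the compressed video is already a byte array in memory
    return av.open(io.BytesIO(data))
```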

Sanity check

  • No memory leaks were detected.

Benchmark

We use several videos from HMDB-51, UCF-101 and Kinetics-400 for benchmarking and unit tests. The test videos are listed below.

  • RATRACE_wave_f_nm_np1_fr_goo_37.avi

    • source: hmdb51
    • video: DivX MPEG-4
      • fps: 30
    • audio: N/A
  • SchoolRulesHowTheyHelpUs_wave_f_nm_np1_ba_med_0.avi

    • source: hmdb51
    • video: DivX MPEG-4
      • fps: 30
    • audio: N/A
  • TrumanShow_wave_f_nm_np1_fr_med_26.avi

    • source: hmdb51
    • video: DivX MPEG-4
      • fps: 30
    • audio: N/A
  • v_SoccerJuggling_g23_c01.avi

    • source: ucf101
    • video: Xvid MPEG-4
      • fps: 29.97
    • audio: N/A
  • v_SoccerJuggling_g24_c01.avi

    • source: ucf101
    • video: Xvid MPEG-4
      • fps: 29.97
    • audio: N/A
  • R6llTwEh07w.mp4

    • source: kinetics-400
    • video: H.264 - MPEG-4 AVC (part 10) (avc1)
      • fps: 30
    • audio: MPEG AAC audio (mp4a)
      • sample rate: 44.1 kHz
  • SOX5yA1l24A.mp4

    • source: kinetics-400
    • video: H.264 - MPEG-4 AVC (part 10) (avc1)
      • fps: 29.97
    • audio: MPEG AAC audio (mp4a)
      • sample rate: 48 kHz
  • WUzgd7C1pWA.mp4

    • source: kinetics-400
    • video: H.264 - MPEG-4 AVC (part 10) (avc1)
      • fps: 29.97
    • audio: MPEG AAC audio (mp4a)
      • sample rate: 48 kHz

Unit test

  • We compare the decoding speed of the TorchVision video reader and PyAV in the following cases (a hedged timing sketch follows this list):
    • decoding the full video from file / memory
    • decoding a fixed number of frames (e.g. 4, 8, 16, 32, 64, 128) starting at a randomly selected timestamp
  • We test rescaling of video frames and resampling of audio waveforms.
  • We run a stress test that iteratively decodes videos to ensure there are no memory leaks.
  • We compare results between pts-only decoding and full decoding (pts plus video/audio frames), ensuring the returned pts are identical, and compare decoding efficiency to validate that pts-only decoding is faster.
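For reference, a sketch of the "fixed number of frames at a random timestamp" case on the PyAV side of the comparison. The TV-reader side runs through its own bindings and is not shown here; the file name is simply one of the test videos listed above.

```python
import random
import time
import av

def time_fixed_frames_pyav(path, num_frames):
    """Seek to a random timestamp and decode num_frames frames, returning elapsed seconds."""
    container = av.open(path)
    stream = container.streams.video[0]
    # stream.duration is in time_base units (it may be None for some containers)
    duration_sec = float(stream.duration * stream.time_base)
    start_sec = random.uniform(0, max(duration_sec - 1.0, 0.0))
    t0 = time.perf_counter()
    container.seek(int(start_sec / stream.time_base), stream=stream)
    decoded = 0
    for frame in container.decode(stream):
        frame.to_ndarray(format="rgb24")
        decoded += 1
        if decoded == num_frames:
            break
    elapsed = time.perf_counter() - t0
    container.close()
    return elapsed

for n in [4, 8, 16, 32, 64, 128]:
    print(n, time_fixed_frames_pyav("v_SoccerJuggling_g23_c01.avi", n))
```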

Results of the unit tests are attached: torchvision.video.reader.unit.test.log

Comparison with PyAV

  • When decoding all video/audio frames in a video, the TorchVision video reader is 1.2x - 6x faster, depending on the codec and video length.
  • When decoding a fixed number of video frames (e.g. 4, 8, 16, 32, 64, 128), the TorchVision video reader is about equally fast for small counts (4, 8, 16) and up to 3x faster for large counts (32, 64, 128).

video_reader_src,
include_dirs=[
video_reader_src_dir,
'/home/zyan3/local/anaconda3/envs/pytorch_py3/include',
The author (stephenyan1231) commented:

@fmassa, I will remove this line.

For the FFmpeg header files, we need to ensure they are installed in a default header search path.
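For illustration only, a sketch of how the extension might be declared without the hard-coded per-user include path. The module name, source layout, and library list below are assumptions, not the PR's actual setup.py.

```python
import glob
import os
from setuptools import Extension

# Assumed source layout; the real paths are defined elsewhere in setup.py.
video_reader_src_dir = os.path.join("torchvision", "csrc", "video_reader")
video_reader_src = glob.glob(os.path.join(video_reader_src_dir, "*.cpp"))

video_reader_ext = Extension(
    "torchvision.video_reader",
    sources=video_reader_src,
    include_dirs=[video_reader_src_dir],  # FFmpeg headers come from the default search path
    libraries=["avcodec", "avformat", "avutil", "swresample", "swscale"],
)
```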

stephenyan1231 force-pushed the torchvision_video_reader branch from 4eec101 to b3f2a6e on August 31, 2019 04:46
zyan3 and others added 9 commits August 30, 2019 21:55
* fixed typo

* fixed some more typos and grammar
* make shufflenet scriptable

* make resnet18 scriptable

* set downsample to identity instead of __constants__ api

* use __constants__ for downsample instead of identity

* import tensor to fix flake

* use torch.Tensor type annotation instead of import
fmassa self-requested a review on September 2, 2019 15:49
fmassa (Member) commented Sep 2, 2019

Thanks a lot for the PR Zhicheng!

The first thing I need to figure out before we can merge this is how we will be adding ffmpeg as a dependency for torchvision, and if it will be a soft or hard dependency.

A few options:

Also, what is the version of FFmpeg that we will be relying upon?

Another thing I need to do is to get CI working for Windows and OSX in torchvision, so that we can make sure that this PR compiles and works nicely on the other OSes that torchvision supports.

I'll be looking into both the CI and ffmpeg dependency from an OSS perspective.

soumith (Member) commented Sep 3, 2019

I think it might be a good idea to start with (1), i.e. the ffmpeg from conda-source or the system package manager (brew install ffmpeg / apt install ffmpeg). Also, by ffmpeg I presume you mean libav?

For binaries, we will figure out how to ship ffmpeg the right way ourselves. Just building ffmpeg from source is not sufficient btw, because you need to build it with codec support, and there are tons of codecs we need to build it with.

fmassa (Member) commented Sep 3, 2019

@soumith

> I think it might be a good idea to start with (1), i.e. the ffmpeg from conda-source or the system package manager (brew install ffmpeg / apt install ffmpeg).

Sounds good, I'll be looking into option (1) first (once I get full CI running).

> Also, by ffmpeg I presume you mean libav?

We need the underlying libraries that ffmpeg is composed of (they could also be called libav, but that name now points to a fork of ffmpeg with different functionality).

> For binaries, we will figure out how to ship ffmpeg the right way ourselves. Just building ffmpeg from source is not sufficient btw, because you need to build it with codec support, and there are tons of codecs we need to build it with.

Sounds good.

}
}

<<<<<<< Updated upstream
A reviewer commented:
It seems that you forgot some merge conflicts in this file

The author (stephenyan1231) replied:

Hey Lowik, it is fixed in the replacement PR (#1303)

bjuncek (Contributor) left a comment:

There are a few things to clean up but this looks very promising! Exciting stuff!!!


class BasicBlock(nn.Module):
expansion = 1
__constants__ = ['downsample']
A reviewer (Contributor) commented:

Can you separate the model changes into a separate PR to track them more easily?

The author (stephenyan1231) replied:

Correct. I messed up this PR. I will abandon it and create a cleaner PR.

audio_timebase = Fraction(0, 1)
if "audio_timebase" in info:
    audio_timebase = info["audio_timebase"]
audio_start_pts = pts_convert(
A reviewer (Contributor) commented:

I wonder if it makes sense to keep a global pts as opposed to doing this conversion?
If we have more than two streams, we'd have to add more of these if clauses at every iteration.

A reviewer (Member) replied:

I agree, I think it might be better to use a global metric, like seconds for example
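For context, the conversion under discussion boils down to pts times the stream's time_base; seconds work as a stream-independent unit because every stream's time_base is known. A minimal sketch (the time_base values below are examples, not the test videos' actual values):

```python
from fractions import Fraction

def pts_to_seconds(pts: int, time_base: Fraction) -> float:
    # A pts is an integer count of time_base ticks
    return float(pts * time_base)

def seconds_to_pts(seconds, time_base: Fraction) -> int:
    return int(round(seconds / time_base))

video_tb = Fraction(1, 30000)   # example video time_base
audio_tb = Fraction(1, 44100)   # example audio time_base

t = pts_to_seconds(45045, video_tb)       # 1.5015 s
audio_pts = seconds_to_pts(t, audio_tb)   # the same instant in audio ticks
```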

import collections
from common_utils import get_tmp_dir
from fractions import Fraction
import logging
A reviewer (Contributor) commented:

I haven't seen logging used in torchvision in general. What is the best practice for this @fmassa?

A reviewer (Member) replied:

We don't currently use logging in torchvision; we only emit deprecation warnings in a few places.

I'm not yet sure we want to add logging as of now; it might deserve a larger discussion.

The author (stephenyan1231) replied:

OK. Those logging calls are mostly there for my own development. I will remove them now.

stephenyan1231 (author) commented:

Abandoning this PR. Please move to the new PR:

#1303
