Add VideoClips and Kinetics dataset #1077
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #1077     +/-  ##
==========================================
+ Coverage   63.92%   65.74%   +1.81%
==========================================
  Files          68       70       +2
  Lines        5406     5512     +106
  Branches      829      851      +22
==========================================
+ Hits         3456     3624     +168
+ Misses       1707     1629      -78
- Partials      243      259      +16
Continue to review full report at Codecov.
bjuncek
left a comment
Looks good to me. If you can just clarify the few questions I posted inline and we are good to go.
@bjuncek for some reason I can't comment on your last message about "What would be the interface for this function, would it sample (as uniformly as possible)".
Probably because I accepted the PR? This would be primarily used for validation, I suspect.
stephenyan1231
left a comment
See my inline comments
torchvision/datasets/video_utils.py
    Recreating the clips for different clip lengths is fast, and can be done
    with the `compute_clips` method.
    """
    def __init__(self, video_paths, clip_length_in_frames=16, frames_between_clips=1):
Why not expose a `dilation` argument in the init function and pass it into the `compute_clips` method?
Good question. I didn't want to expose it because dilation was a hacky parameter to specify in a general VideoClips: it assumed that all the videos had the same original frame rate, and that the resulting dilation was an integer.
I've updated the PR with a (hopefully) more robust approach, where I perform an interpolation of the indices in the clip, so that we can support arbitrary conversions from input_fps to target_fps.
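For intuition, here is a minimal sketch of the index-interpolation idea (the helper name and signature are illustrative, not the exact code in this PR):

```python
import torch

def resample_clip_indices(num_frames, original_fps, target_fps):
    # Hypothetical helper: map `num_frames` frames sampled at `target_fps`
    # back to indices into a video recorded at `original_fps`, using
    # nearest (floor) interpolation of the fractional positions.
    step = original_fps / target_fps
    idxs = torch.arange(num_frames, dtype=torch.float32) * step
    return idxs.floor().to(torch.int64)

# e.g. 10 frames at 13 fps taken from a 30 fps video:
# resample_clip_indices(10, 30, 13)
# -> tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])
```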
I'd think it makes sense to expose it and set it to a default value; there is no harm in general and there are a few applications that don't like interpolation as it can introduce artifacts, and people tend to use it regardless of the actual fps :)
But it is conceptually wrong to just step every few frames if the user wants to reduce the fps in general, because different videos might have different fps.
Also note that the approach I'm following is as efficient as the dilation one when the step is an integer.
Actually, I take that back: the new resampling does the same thing dilation does when the step is an integer.
    same video is defined by `frames_between_clips`.

    Creating this instance the first time is time-consuming, as it needs to
    decode all the videos in `video_paths`. It is recommended that you
Since you mentioned that creating this object is time-consuming, do you want to implement methods such as `save` and `load` to support caching?
I think @fmassa was hoping to keep that separate from torchvision per se, and keep it in the examples?
Regardless of where we put it though, this is a great idea!
I was thinking that this is something that can be achieved by just saving the VideoClips object, and showcasing how to do it in the references/video_classification example that I'm planning to add.
So the user would just need to do

    torch.save(video_clips, 'cache.pth')
    ...
    video_clips = torch.load('cache.pth')

Thoughts?
I agree with that
@fmassa that sounds good.
torchvision/datasets/video_utils.py
                dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
            else:
                dilation = 1
            clips = unfold(video_pts, num_frames, step, dilation)
`unfold` will drop the last frames of the video if the last clip is smaller than the others.
It would be nice to have an option to pad the last clip.
This is a great point. We might add a drop_last option to VideoClips, like what we have in BatchSampler. But it will be slightly more annoying to implement, because we won't be able to use stride tricks anymore.
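As an aside, a rough sketch of what padding the last clip could look like in the non-overlapping case (illustrative only: the helper name and the repeat-last-frame policy are assumptions, it assumes the video has at least `size` frames, and it indeed has to fall back to concatenation rather than pure stride tricks):

```python
import torch

def pad_last_clip(tensor, size):
    # Illustrative sketch, not code from this PR: split a 1-D tensor into
    # non-overlapping clips of `size` elements, padding the final clip by
    # repeating its last element instead of dropping the remainder.
    clips = tensor.unfold(0, size, size)          # stride-trick clips, tail dropped
    remainder = tensor.numel() % size
    if remainder:
        tail = tensor[-remainder:]
        pad = tail[-1:].repeat(size - remainder)  # repeat the last frame index
        clips = torch.cat([clips, torch.cat([tail, pad]).unsqueeze(0)])
    return clips

# pad_last_clip(torch.arange(10), 4)
# -> tensor([[0, 1, 2, 3],
#            [4, 5, 6, 7],
#            [8, 9, 9, 9]])
```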
Created an issue to track this request in #1113
For now, I've mentioned in the documentation that the last frames of a video can potentially be dropped.
I'm going to be merging this as is for now, but I'll try to address this before the release.
torchvision/datasets/video_utils.py
        for video_pts, fps in zip(self.video_pts, self.video_fps):
            if frame_rate is not None:
                # divup, == int(ceil(fps / frame_rate))
                dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
I think you should also modify step, otherwise the step is still relative to the original frame_rate.
Unit test:

    video_pts = torch.arange(11)
    fps = 10
    frame_rate = 5
    num_frames = 3
    step = 3  # same as num_frames, so I expect non-overlapping clips

I expect to get [0, 2, 4], [6, 8, 10] (half the original frame rate, non-overlapping clips), but the current code gives [0, 2, 4], [3, 5, 7], [6, 8, 10].
Suggested fix:

    if frame_rate is not None:
        dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
        step = step * dilation
Good catch!
            step = int(step)
            return video[::step]
        idxs = torch.arange(self.num_frames, dtype=torch.float32) * step
        idxs = idxs.floor().to(torch.int64)
I think it doesn't match the way dilation is computed on line 83.
Unit test:

    video_pts = torch.arange(30)
    fps = 30
    frame_rate = 13.0
    size = 10

    dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)  # = 3
    clips = unfold(video_pts, size, size, dilation)
    # = tensor([[ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27]])

    step = fps / frame_rate
    idxs = torch.arange(size, dtype=torch.float32) * step
    idxs = idxs.floor().to(torch.int64)
    video_pts[idxs]
    # = tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20]), whereas it should match clips[0]

I think the current code works only if fps / frame_rate is (almost) an integer.
As dilation has to be an integer, it's equivalent to rounding frame_rate down to a value such that fps / frame_rate is an integer. The same should also be done in resample_video.
The best would be to deal with a non-integer dilation, but then you can't use the stride trick anymore.
This is a great point, once again, thanks Lowik!
So, here is my view of things:
You have 1 second of video at 30 fps. We now resample it to be at 13 fps. Now we have 13 frames.
Their time coordinates are given by

    torch.arange(13, dtype=torch.float32) * 30 / 13
    # tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
    #         18.4615, 20.7692, 23.0769, 25.3846, 27.6923])

Now, we want to take 10 frames out of this newly sampled video. This gives us

    tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
            18.4615, 20.7692])

Because we only do nearest interpolation on the frames, this corresponds to (by calling .floor())

    tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])

Note that the indices are actually exactly the same as what we originally compute:

    idxs = torch.arange(10, dtype=torch.float32) * (30 / 13)
    # tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
    #         18.4615, 20.7692])

So I'm inclined to say that the current behavior that we have, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.
Thoughts?
> So I'm inclined to say that the current behavior that we have, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.

I agree, although I wonder if the inconsistent gaps would allow networks to cheat in some way. I've never tried out the interpolation where the resampling was not done by an integer factor, so I suppose we'll have to try it out.
My current take on this is that we should do the most correct thing possible: resample the whole video first, and then perform the unfold.
This will be slightly less efficient, but much more predictable.
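For illustration, a minimal sketch of that resample-then-unfold ordering, assuming frames are addressed by their presentation indices (the helper name is made up; `torch.arange` and `Tensor.unfold` are the only PyTorch calls used):

```python
import torch

def clips_after_resampling(video_pts, fps, target_fps, num_frames, step):
    # Illustrative sketch: first resample the whole video to `target_fps`
    # via nearest-index interpolation, then cut the resampled timeline
    # into clips of `num_frames` frames every `step` frames.
    new_len = int(len(video_pts) * target_fps / fps)
    idxs = torch.arange(new_len, dtype=torch.float32) * fps / target_fps
    resampled = video_pts[idxs.floor().to(torch.int64)]
    return resampled.unfold(0, num_frames, step)

# e.g. 30 frames at 30 fps resampled to 13 fps, clips of 10 frames:
# clips_after_resampling(torch.arange(30), 30, 13, 10, 10)
# -> tensor([[ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20]])
```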
This PR adds functionality to simplify building video datasets.
The main addition is the `VideoClips` class. From a list of videos, `VideoClips` computes all the possible sub-videos (or clips), which are consecutive and contiguous, and makes it easy to select a particular clip. This is made possible via `unfold`, which does some stride tricks on a tensor so that we can have all possible subsequences of a list without duplicating memory, and is very efficient.
As an example of how it should be used, I've written a basic `KineticsVideo` (better names welcome) illustrating how `VideoClips` can be used. I need to add tests for `KineticsVideo` and improve the documentation overall, but this is a first version to get some feedback.
cc @bjuncek @stephenyan1231 for review
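For reference, a simplified sketch of what such a stride-trick `unfold` can look like for a 1-D tensor (a re-implementation for illustration, not necessarily the exact helper added by the PR):

```python
import torch

def unfold(tensor, size, step, dilation=1):
    # Simplified sketch: return a view containing all clips of `size` elements
    # taken every `step` elements, with `dilation` spacing inside a clip.
    # No memory is copied; torch.as_strided only reinterprets the storage.
    assert tensor.dim() == 1
    o_stride = tensor.stride(0)
    numel = tensor.numel()
    new_stride = (step * o_stride, dilation * o_stride)
    new_size = ((numel - (dilation * (size - 1) + 1)) // step + 1, size)
    if new_size[0] < 1:
        new_size = (0, size)
    return torch.as_strided(tensor, new_size, new_stride)

# unfold(torch.arange(30), size=10, step=10, dilation=3)
# -> tensor([[ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27]])
```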