
Conversation

@fmassa (Member) commented Jul 3, 2019

This PR adds functionality to simplify building video datasets.

The main addition is the VideoClips class.
From a list of videos, VideoClips computes all the possible consecutive and contiguous sub-videos (or clips), and makes it easy to select a particular clip.
This is made possible via unfold, which performs stride tricks on a tensor so that we can obtain all possible subsequences of a list without duplicating memory, which is very efficient.
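
For intuition, here is a minimal sketch of what such an unfold helper can look like, built on torch.as_strided; the exact name and signature are illustrative, not necessarily the PR's final code:

import torch

def unfold(tensor, size, step, dilation=1):
    # All 1D slices of `tensor` of length `size`, taken every `step`
    # elements, with `dilation` spacing inside each slice.
    # as_strided creates a view, so no memory is duplicated.
    assert tensor.dim() == 1
    o_stride = tensor.stride(0)
    numel = tensor.numel()
    new_stride = (step * o_stride, dilation * o_stride)
    new_size = (max((numel - (dilation * (size - 1) + 1)) // step + 1, 0), size)
    return torch.as_strided(tensor, new_size, new_stride)

# unfold(torch.arange(10), size=4, step=2) gives a view:
# tensor([[0, 1, 2, 3],
#         [2, 3, 4, 5],
#         [4, 5, 6, 7],
#         [6, 7, 8, 9]])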

As an example of how it should be used, I've written a basic KineticsVideo dataset (better names welcome) illustrating how VideoClips can be used.

I need to add tests for KineticsVideo and improve the documentation overall, but this is a first version to get some feedback.

cc @bjuncek @stephenyan1231 for review

@codecov-io commented Jul 3, 2019

Codecov Report

Merging #1077 into master will increase coverage by 1.81%.
The diff coverage is 85.27%.


@@            Coverage Diff             @@
##           master    #1077      +/-   ##
==========================================
+ Coverage   63.92%   65.74%   +1.81%     
==========================================
  Files          68       70       +2     
  Lines        5406     5512     +106     
  Branches      829      851      +22     
==========================================
+ Hits         3456     3624     +168     
+ Misses       1707     1629      -78     
- Partials      243      259      +16
Impacted Files Coverage Δ
torchvision/io/video.py 70.96% <100%> (+0.63%) ⬆️
torchvision/datasets/__init__.py 100% <100%> (ø) ⬆️
torchvision/datasets/kinetics.py 40% <40%> (ø)
torchvision/datasets/video_utils.py 93.33% <93.33%> (ø)
torchvision/models/detection/roi_heads.py 55.93% <0%> (-0.97%) ⬇️
torchvision/datasets/cifar.py 78.16% <0%> (-0.5%) ⬇️
torchvision/datasets/mnist.py 50.71% <0%> (-0.47%) ⬇️
torchvision/datasets/folder.py 82.05% <0%> (-0.45%) ⬇️
torchvision/transforms/functional.py 71.09% <0%> (-0.29%) ⬇️
torchvision/transforms/transforms.py 81.53% <0%> (-0.25%) ⬇️
... and 23 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bjuncek (Contributor) left a comment:

Looks good to me. If you can just clarify the few questions I posted inline, we are good to go.

@fmassa (Member, Author) commented Jul 9, 2019

@bjuncek For some reason I can't comment on your last message about UniformClipSampler.

What would be the interface for this function? Would it sample (as uniformly as possible) num_max_clips from each video?

@bjuncek (Contributor) commented Jul 9, 2019

Probably because I accepted the PR?
Yeah, so you can have num_max_clips equally spaced clips. Something like this (feel free to remove the np and add a torch equivalent):

import numpy as np

idxs = []
s = 0  # running offset into the flattened, across-video clip index space
for c in self.video_clips.clips:
    length = len(c)
    # pick at most max_clips_per_video clips, as evenly spaced as possible
    size = min(length, self.max_clips_per_video)
    sampled = np.round(np.linspace(s, s + length - 1, size)).astype(int)
    s += length
    idxs.append(sampled)

This would be primarily used for validation, I suspect.

@stephenyan1231 (Contributor) left a comment:

See my inline comments.

Recreating the clips for different clip lengths is fast, and can be done
with the `compute_clips` method.
"""
def __init__(self, video_paths, clip_length_in_frames=16, frames_between_clips=1):
Contributor commented:

Why not expose a dilation argument in the init function and pass it into the compute_clips method?

Member Author replied:

Good question. I didn't want to expose it because dilation was a hacky parameter to specify in a general VideoClips: it assumed that all the videos had the same original frame rate, and that the dilation was an integer value.

I've updated the PR with a (hopefully) more robust approach, where I perform an interpolation of the indices in the clip, so that we can support arbitrary conversions from input_fps to target_fps.
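
A minimal sketch of that index interpolation, assuming nearest-neighbor (floor) rounding; the helper name is illustrative, not the PR's final API:

import torch

def resampled_idxs(num_frames, original_fps, target_fps):
    # map each output frame index onto the nearest earlier input frame
    step = original_fps / target_fps
    idxs = torch.arange(num_frames, dtype=torch.float32) * step
    return idxs.floor().to(torch.int64)

# resampled_idxs(10, original_fps=30, target_fps=13)
# -> tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])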

Contributor replied:

I'd think it makes sense to expose it and set it to a default value; there is no harm in general, and there are a few applications that don't like interpolation, as it can introduce artifacts, and people tend to use it regardless of the actual fps :)

Member Author replied:

But it is conceptually wrong to just step every few frames if the user wants to reduce the fps in general, as different videos might have different frame rates.

Also note that the approach I'm following is as efficient as the dilation one if the step is an integer.

Contributor replied:

Actually, I take that back, as the new resampling does the same thing dilation does if the step is an integer.

same video is defined by `frames_between_clips`.

Creating this instance the first time is time-consuming, as it needs to
decode all the videos in `video_paths`. It is recommended that you
Contributor commented:

Since you mentioned that creating this object is time-consuming, do you want to implement methods such as save and load to support caching?

Contributor replied:

I think @fmassa was hoping to keep that separate from torchvision per se, and keep it in examples?

Regardless of where we put it, though, this is a great idea!

Member Author replied:

I was thinking that this is something that can be achieved by just saving the VideoClips object, and showcasing how to do it in the references/video_classification example that I'm planning to add.

So the user would just need to do

torch.save(video_clips, 'cache.pth')
...
video_clips = torch.load('cache.pth')

Thoughts?

Contributor replied:

I agree with that.

Contributor replied:

@fmassa that sounds good.

    dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
else:
    dilation = 1
clips = unfold(video_pts, num_frames, step, dilation)
@LowikC commented Jul 11, 2019

unfold will drop the last frames of the video if the last clip is smaller than the others.
It would be nice to have an option to pad the last clip.

Member Author replied:

This is a great point. We might add a drop_last option to VideoClips, like what we have in BatchSampler. But it will be slightly more annoying to implement, because we won't be able to use stride tricks anymore.
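
For illustration, a hypothetical sketch of what keeping the last clip could look like, using Tensor.unfold; the helper name and the pad-by-repeating-the-last-frame policy are assumptions, not the PR's code:

import torch

def unfold_with_pad(tensor, size, step):
    # Pad the tail by repeating the last element so the final, shorter
    # clip is kept instead of dropped. torch.cat copies memory, so this
    # loses the zero-copy property of the stride-trick unfold.
    n = tensor.numel()
    if n < size:
        pad = size - n
    else:
        pad = (step - (n - size) % step) % step
    if pad:
        tensor = torch.cat([tensor, tensor[-1:].expand(pad)])
    return tensor.unfold(0, size, step)

# unfold_with_pad(torch.arange(11), size=4, step=3) keeps a padded last clip:
# tensor([[ 0,  1,  2,  3],
#         [ 3,  4,  5,  6],
#         [ 6,  7,  8,  9],
#         [ 9, 10, 10, 10]])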

Member Author replied:

Created an issue to track this request in #1113

Member Author replied:

For now, I've mentioned in the documentation that the last frames of a video can potentially be dropped.

I'm going to merge this as is for now, but I'll try to address this before the release.

for video_pts, fps in zip(self.video_pts, self.video_fps):
    if frame_rate is not None:
        # divup, == int(ceil(fps / frame_rate))
        dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
@LowikC commented Jul 12, 2019

I think you should also modify step, otherwise the step is still relative to the original frame_rate.
Unit test:

video_pts = torch.arange(11)
fps = 10
frame_rate = 5
num_frames = 3
step = 3  # same as num_frames, so I expect non-overlapping clips

I expect to get [0, 2, 4], [6, 8, 10] (half the original frame rate, non-overlapping clips),
but the current code gives [0, 2, 4], [3, 5, 7], [6, 8, 10].

if frame_rate is not None:
    dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
    step = step * dilation

Member Author replied:

Good catch!

    step = int(step)
    return video[::step]
idxs = torch.arange(self.num_frames, dtype=torch.float32) * step
idxs = idxs.floor().to(torch.int64)
@LowikC commented Jul 12, 2019

I think it doesn't match the way dilation is computed on line 83.

Unit test:

video_pts = torch.arange(30)
fps = 30
frame_rate = 13.0
size = 10
dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)  # = 3
clips = unfold(video_pts, size, size, dilation)  # = tensor([[ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27]])

step = fps / frame_rate
idxs = torch.arange(size, dtype=torch.float32) * step
idxs = idxs.floor().to(torch.int64)
video_pts[idxs]  # = tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20]), whereas it should match clips[0]

I think the current code works only if fps / frame_rate is (almost) an integer.
As dilation has to be an integer, it's equivalent to rounding frame_rate down to a value such that fps / frame_rate is an integer. The same should also be done in resample_video.
The best would be to deal with a non-integer dilation, but then you can't use the stride trick anymore.

Member Author replied:

This is a great point, once again, thanks Lowik!

So, here is my view of things:

You have 1 second of video at 30 fps, and we now resample it to 13 fps, so we have 13 frames.
Their time coordinates are given by

torch.arange(13, dtype=torch.float32) * 30 / 13
# tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
#        18.4615, 20.7692, 23.0769, 25.3846, 27.6923])

Now, we want to take 10 frames out of this newly sampled video. This gives us

tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
        18.4615, 20.7692])

Because we only do nearest interpolation on the frames, this corresponds (by calling .floor()) to

tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])

Note that the indices are actually exactly the same as what we originally computed:

idxs = torch.arange(10, dtype=torch.float32) * (30 / 13)
# tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
#         18.4615, 20.7692])

So I'm inclined to say that the current behavior, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.

Thoughts?

Contributor replied:

> So I'm inclined to say that the current behavior, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.

I agree, although I wonder if the inconsistent gaps would allow networks to cheat in some way. I've never tried interpolation where the resampling was not done by an integer factor, so I suppose we'll have to try it out.

Member Author replied:

My current take on this is that we should do the most correct behavior possible: rescale the whole video first, and then perform the unfold.

This will be slightly less efficient, but much more predictable.
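
A minimal sketch of that order of operations, reusing the unfold and resampled_idxs sketches from earlier in this thread (the function name and signature are illustrative, not the PR's final code):

def compute_clips_for_video(video_pts, num_frames, step, fps, frame_rate):
    # 1) resample the whole video's frame index sequence to the target rate
    total_frames = int(video_pts.numel() * frame_rate / fps)
    resampled_pts = video_pts[resampled_idxs(total_frames, fps, frame_rate)]
    # 2) only then unfold the resampled sequence into fixed-length clips
    return unfold(resampled_pts, num_frames, step)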

@fmassa fmassa merged commit 5d1372c into pytorch:master Jul 19, 2019
@fmassa fmassa deleted the video-clips branch July 19, 2019 12:48