Add VideoClips and Kinetics dataset #1077
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #1077     +/-  ##
==========================================
+ Coverage   63.92%   65.74%   +1.81%
==========================================
  Files          68       70       +2
  Lines        5406     5512     +106
  Branches      829      851      +22
==========================================
+ Hits         3456     3624     +168
+ Misses       1707     1629      -78
- Partials      243      259      +16
Continue to review full report at Codecov.
bjuncek
left a comment
Looks good to me. If you can just clarify the few questions I posted inline and we are good to go.
@bjuncek for some reason I can't comment on your last message about "What would be the interface for this function, would it sample (as uniformly as possible)".
Probably because I accepted the PR? This would be primarily used for validation, I suspect.
stephenyan1231
left a comment
See my inline comments
torchvision/datasets/video_utils.py
    Recreating the clips for different clip lengths is fast, and can be done
    with the `compute_clips` method.
    """
    def __init__(self, video_paths, clip_length_in_frames=16, frames_between_clips=1):
Why not expose a `dilation` argument in the init function and pass it into the `compute_clips` method?
Good question. I didn't want to expose it because dilation was a hacky parameter to specify in a general VideoClips: it assumed that all the videos had the same original frame rate, and that the resulting dilation was an integer.
I've updated the PR with a (hopefully) more robust approach, where I perform an interpolation of the indices in the clip, so that we can support arbitrary conversions from input_fps to target_fps.
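For intuition, here is a minimal sketch of the index-interpolation idea (the helper name and signature are illustrative, not the exact code in this PR):

```python
import torch

def resample_clip_indices(num_frames, original_fps, target_fps):
    # Hypothetical helper: map `num_frames` frames sampled at `target_fps`
    # back to indices into a video recorded at `original_fps`, using
    # nearest (floor) interpolation of the fractional positions.
    step = original_fps / target_fps
    idxs = torch.arange(num_frames, dtype=torch.float32) * step
    return idxs.floor().to(torch.int64)

# e.g. 10 frames at 13 fps taken from a 30 fps video:
# resample_clip_indices(10, 30, 13)
# -> tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])
```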
I'd think it makes sense to expose it and set it to a default value; there is no harm in general and there are a few applications that don't like interpolation as it can introduce artifacts, and people tend to use it regardless of the actual fps :)
But it is conceptually wrong to just step every few frames if the user wants to reduce the fps in general, because different videos might have different fps.
Also note that the approach I'm following is as efficient as the dilation one when the step is an integer.
Actually, I take that back: the new resampling does the same thing dilation does when the step is an integer.
    same video is defined by `frames_between_clips`.

    Creating this instance the first time is time-consuming, as it needs to
    decode all the videos in `video_paths`. It is recommended that you
Since you mentioned that creating this object is time-consuming, do you want to implement methods such as `save` and `load` to support caching?
I think @fmassa was hoping to keep that separate from torchvision per se, and keep it in the examples?
Regardless of where we put it though, this is a great idea!
I was thinking that this is something that can be achieved by just saving the VideoClips object, and showcasing how to do it in the references/video_classification example that I'm planning to add.
So the user would just need to do

    torch.save(video_clips, 'cache.pth')
    ...
    video_clips = torch.load('cache.pth')

Thoughts?
I agree with that
@fmassa that sounds good.
torchvision/datasets/video_utils.py
                dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
            else:
                dilation = 1
            clips = unfold(video_pts, num_frames, step, dilation)
`unfold` will drop the last frames of the video if the last clip is smaller than the others.
It would be nice to have an option to pad the last clip.
This is a great point. We might add a drop_last option to VideoClips, like what we have in BatchSampler. But it will be slightly more annoying to implement, because we won't be able to use stride tricks anymore.
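As an aside, a rough sketch of what padding the last clip could look like in the non-overlapping case (illustrative only: the helper name and the repeat-last-frame policy are assumptions, it assumes the video has at least `size` frames, and it indeed has to fall back to concatenation rather than pure stride tricks):

```python
import torch

def pad_last_clip(tensor, size):
    # Illustrative sketch, not code from this PR: split a 1-D tensor into
    # non-overlapping clips of `size` elements, padding the final clip by
    # repeating its last element instead of dropping the remainder.
    clips = tensor.unfold(0, size, size)          # stride-trick clips, tail dropped
    remainder = tensor.numel() % size
    if remainder:
        tail = tensor[-remainder:]
        pad = tail[-1:].repeat(size - remainder)  # repeat the last frame index
        clips = torch.cat([clips, torch.cat([tail, pad]).unsqueeze(0)])
    return clips

# pad_last_clip(torch.arange(10), 4)
# -> tensor([[0, 1, 2, 3],
#            [4, 5, 6, 7],
#            [8, 9, 9, 9]])
```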
Created an issue to track this request in #1113
For now, I've mentioned in the documentation that the last frames of a video can potentially be dropped.
I'm going to be merging this as is for now, but I'll try to address this before the release.
torchvision/datasets/video_utils.py
        for video_pts, fps in zip(self.video_pts, self.video_fps):
            if frame_rate is not None:
                # divup, == int(ceil(fps / frame_rate))
                dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
I think you should also modify step, otherwise the step is still relative to the original frame_rate.
Unit test:

    video_pts = torch.arange(11)
    fps = 10
    frame_rate = 5
    num_frames = 3
    step = 3  # same as num_frames, so I expect non-overlapping clips

I expect to get [0, 2, 4], [6, 8, 10] (half the original frame rate, non-overlapping clips), but the current code gives [0, 2, 4], [3, 5, 7], [6, 8, 10].
Suggested fix:

    if frame_rate is not None:
        dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)
        step = step * dilation
Good catch!
            step = int(step)
            return video[::step]
        idxs = torch.arange(self.num_frames, dtype=torch.float32) * step
        idxs = idxs.floor().to(torch.int64)
I think it doesn't match the way dilation is computed on line 83.
Unit test:

    video_pts = torch.arange(30)
    fps = 30
    frame_rate = 13.0
    size = 10

    dilation = max(int((fps + frame_rate - 1) // frame_rate), 1)  # = 3
    clips = unfold(video_pts, size, size, dilation)
    # = tensor([[ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27]])

    step = fps / frame_rate
    idxs = torch.arange(size, dtype=torch.float32) * step
    idxs = idxs.floor().to(torch.int64)
    video_pts[idxs]
    # = tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20]), whereas it should match clips[0]

I think the current code works only if fps / frame_rate is (almost) an integer.
As dilation has to be an integer, it's equivalent to rounding frame_rate down to a value such that fps / frame_rate is an integer. The same should also be done in resample_video.
The best would be to deal with a non-integer dilation, but then you can't use the stride trick anymore.
This is a great point, once again, thanks Lowik!
So, here is my view of things:
You have 1 second of video at 30 fps. We now resample it to be at 13 fps. Now we have 13 frames.
Their time coordinates are given by

    torch.arange(13, dtype=torch.float32) * 30 / 13
    # tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
    #         18.4615, 20.7692, 23.0769, 25.3846, 27.6923])

Now, we want to take 10 frames out of this newly sampled video. This gives us

    tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
            18.4615, 20.7692])

Because we only do nearest interpolation on the frames, this corresponds to (by calling .floor())

    tensor([ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20])

Note that the indices are actually exactly the same as what we originally compute:

    idxs = torch.arange(10, dtype=torch.float32) * (30 / 13)
    # tensor([ 0.0000,  2.3077,  4.6154,  6.9231,  9.2308, 11.5385, 13.8462, 16.1538,
    #         18.4615, 20.7692])

So I'm inclined to say that the current behavior that we have, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.
Thoughts?
> So I'm inclined to say that the current behavior that we have, which returns the interpolation on the frames, is actually the behavior we want: interpolate to 13 fps, then return 10 consecutive frames.

I agree, although I wonder if the inconsistent gaps would allow networks to cheat in some way. I've never tried out the interpolation where the resampling was not done by an integer factor, so I suppose we'll have to try it out.
My current take on this is that we should do the most correct thing possible: resample the whole video first, and then perform the unfold.
This will be slightly less efficient, but much more predictable.
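For illustration, a minimal sketch of that resample-then-unfold ordering, assuming frames are addressed by their presentation indices (the helper name is made up; `torch.arange` and `Tensor.unfold` are the only PyTorch calls used):

```python
import torch

def clips_after_resampling(video_pts, fps, target_fps, num_frames, step):
    # Illustrative sketch: first resample the whole video to `target_fps`
    # via nearest-index interpolation, then cut the resampled timeline
    # into clips of `num_frames` frames every `step` frames.
    new_len = int(len(video_pts) * target_fps / fps)
    idxs = torch.arange(new_len, dtype=torch.float32) * fps / target_fps
    resampled = video_pts[idxs.floor().to(torch.int64)]
    return resampled.unfold(0, num_frames, step)

# e.g. 30 frames at 30 fps resampled to 13 fps, clips of 10 frames:
# clips_after_resampling(torch.arange(30), 30, 13, 10, 10)
# -> tensor([[ 0,  2,  4,  6,  9, 11, 13, 16, 18, 20]])
```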
This PR adds functionality to simplify building video datasets.
The main addition is the `VideoClips` class. From a list of videos, `VideoClips` computes all the possible sub-videos (or clips), which are consecutive and contiguous, and makes it easy to select a particular clip. This is made possible via `unfold`, which does some stride tricks on a tensor so that we can have all possible subsequences of a list without duplicating memory, and is very efficient.
As an example of how it should be used, I've written a basic `KineticsVideo` (better names welcome) illustrating how `VideoClips` can be used. I need to add tests for `KineticsVideo` and improve the documentation overall, but this is a first version to get some feedback.
cc @bjuncek @stephenyan1231 for review
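For reference, a simplified sketch of what such a stride-trick `unfold` can look like for a 1-D tensor (a re-implementation for illustration, not necessarily the exact helper added by the PR):

```python
import torch

def unfold(tensor, size, step, dilation=1):
    # Simplified sketch: return a view containing all clips of `size` elements
    # taken every `step` elements, with `dilation` spacing inside a clip.
    # No memory is copied; torch.as_strided only reinterprets the storage.
    assert tensor.dim() == 1
    o_stride = tensor.stride(0)
    numel = tensor.numel()
    new_stride = (step * o_stride, dilation * o_stride)
    new_size = ((numel - (dilation * (size - 1) + 1)) // step + 1, size)
    if new_size[0] < 1:
        new_size = (0, size)
    return torch.as_strided(tensor, new_size, new_stride)

# unfold(torch.arange(30), size=10, step=10, dilation=3)
# -> tensor([[ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27]])
```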