
Conversation

@YosuaMichael YosuaMichael commented Jul 6, 2022

During our recent work on improving TorchVision’s support for video models, we discovered a perceived accuracy gap compared to other libraries such as PyTorchVideo and SlowFast. This post summarizes the findings of our investigation.

TLDR:

The differences were originally observed on MViT and later on the ResNet-based video models (R3D, MC3 and R(2+1)D). We are happy to confirm that the accuracy difference was not due to a bug in TorchVision’s models, video APIs, transforms or datasets, but was instead the result of using different metrics (clip accuracy vs video accuracy) and of the frame-rate configuration. Our team made the necessary changes to our documentation and reference scripts to make it easier to reproduce our pre-trained models, and confirmed that there are no accuracy gaps or bugs.

Why do the accuracies differ?

Our initial analysis showed an accuracy gap of 5.3 Acc@1 points for MViT, 10.1 for R3D, 10.6 for MC3 and 10.0 for R(2+1)D. These metrics were estimated by running the same models through the reference scripts of TorchVision and SlowFast. Since the difference is very large, we decided to investigate and find the root cause.

The libraries implement similar features, but there are some differences that have a small effect on the accuracies:

  • Different video decoding backends (PyAV vs the TorchVision decoder). Not all videos can be decoded by both backends, so one needs to make sure the analysis runs on the same data (see the snippet after this list for pinning the backend).
  • Different temporal sampling strategies on how clips are sampled from videos.
  • Minor differences on the order of execution of inference transforms.
  • TorchVision’s reference scripts support only integer frame rates, while SlowFast also supports floats.
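
Regarding the decoding backends, the snippet below shows how the backend can be pinned on the TorchVision side so that both libraries are evaluated on exactly the same set of decodable videos. set_video_backend and get_video_backend are public torchvision APIs; the choice of backend here is only an example:

import torchvision

# Pin the video decoding backend so that the comparison runs on the same data.
# "pyav" uses the PyAV bindings; "video_reader" uses the native TorchVision decoder.
torchvision.set_video_backend("pyav")
# torchvision.set_video_backend("video_reader")

print(torchvision.get_video_backend())  # confirms which backend is active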

The following differences were found to have significant effects on the accuracies:

  • SlowFast supports 2 hyper-parameters to control frame sampling: sampling_rate and target_fps. TorchVision supports a single parameter called frame_rate, which therefore needs to be configured accordingly when porting pre-trained weights from SlowFast.
  • Both libraries produce multiple temporal clips from the same video during inference. Nevertheless, the way the results are combined into the final values differs significantly. TorchVision estimates the overall accuracies at the clip level, whereas SlowFast uses a multi-view ensemble, which aggregates the results of the clips at the video level (a minimal sketch of the two metrics is shown after this list). This strategy has a material effect on the final reported accuracy.
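
To make the metric difference concrete, below is a minimal sketch of clip-level accuracy versus the video-level multi-view ensemble. It is not the actual reference-script code; the function name, tensor shapes and variable names are illustrative assumptions:

import torch

def clip_and_video_accuracy(clip_logits, clip_video_idx, video_labels):
    # clip_logits:    (num_clips, num_classes) raw model outputs, one row per clip
    # clip_video_idx: (num_clips,) long tensor, index of the video each clip came from
    # video_labels:   (num_videos,) long tensor, ground-truth label of each video
    clip_labels = video_labels[clip_video_idx]

    # Clip accuracy: every clip is scored independently.
    clip_acc = (clip_logits.argmax(dim=1) == clip_labels).float().mean()

    # Video accuracy (multi-view ensemble): sum the softmax scores of all clips
    # that belong to the same video, then take the argmax once per video.
    probs = clip_logits.softmax(dim=1)
    video_scores = torch.zeros(video_labels.numel(), clip_logits.shape[1])
    video_scores.index_add_(0, clip_video_idx, probs)
    video_acc = (video_scores.argmax(dim=1) == video_labels).float().mean()

    return clip_acc.item(), video_acc.item()

With --clips-per-video 5 (as in the commands further below), this averaging over clips explains the gap between the reported Clip Acc@1 and Video Acc@1.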

How to make them equal?

To match the accuracies between TorchVision and SlowFast we took the following steps:

  • For MViT, we adjusted the hyper-parameters of the sampling configuration. SlowFast uses target_fps=30 and sampling_rate=4, and the equivalent value for TorchVision is frame_rate=7.5, rounded to 8 since we don't support fractional frame rates (see the worked-out conversion after this list).
  • We adopted SlowFast’s ensemble strategy in TorchVision’s reference scripts, aggregating accuracies at the video level.
  • We documented the changes in our documentation and meta-data so that our users can reproduce the results more easily.
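
As a sanity check, the conversion between SlowFast’s sampling hyper-parameters and TorchVision’s frame_rate can be worked out as follows (a small illustrative snippet, not code from either library):

# SlowFast reads frames at target_fps and then keeps every `sampling_rate`-th frame,
# so the effective frame rate seen by the model is their ratio.
target_fps = 30
sampling_rate = 4
effective_fps = target_fps / sampling_rate   # 7.5

# TorchVision's reference script exposes this directly as --frame-rate, which only
# accepts integers, hence the value of 8 used for MViT below.
frame_rate = round(effective_fps)            # 8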

All changes are limited to the documentation and reference scripts, so they don’t affect production users.

Reproducing the Results

We ran the validation on the Kinetics-400 dataset, with some videos excluded because they can't be decoded by either the pyav or the torchvision backend. Here is the list of the excluded videos. Note that the estimates below were obtained on 8 GPUs with batch-size > 1, so a small variation is expected due to #4559.

Here are the command and results for MViT:

python -u ~/script/run_with_submitit.py \
    --timeout 3000 --ngpus 8 --nodes 1 \
    --data-path="/datasets/clean_kinetics_400/" \
    --batch-size=16 --test-only \
    --clip-len 16 --frame-rate 8 --clips-per-video 5 \
    --cache-dataset \
    --model mvit_v1_b --weights="MViT_V1_B_Weights.DEFAULT" 
    
# result:
# Test: Total time: 1:25:33
# * Clip Acc@1 70.351 Clip Acc@5 88.288
# * Video Acc@1 78.477 Video Acc@5 93.582

And here are the command and results for the Video ResNet models:

python -u ~/script/run_with_submitit.py \
    --timeout 3000 --ngpus 8 --nodes 1 \
    --batch-size=64 --test-only \
    --data-path="/datasets/clean_kinetics_400/" \
    --clip-len 16 --frame-rate 15 --clips-per-video 5 \
    --cache-dataset \
    --model mc3_18 --weights="MC3_18_Weights.DEFAULT" 
#    --model r3d_18 --weights="R3D_18_Weights.DEFAULT" 
#    --model r2plus1d_18 --weights="R2Plus1D_18_Weights.DEFAULT" 

Results:

MC3_18:
Test: Total time: 0:11:17
 * Clip Acc@1 52.813 Clip Acc@5 75.220
 * Video Acc@1 63.960 Video Acc@5 84.130

R3D_18:
Test: Total time: 0:11:24
 * Clip Acc@1 51.795 Clip Acc@5 74.365
 * Video Acc@1 63.200 Video Acc@5 83.479

R2Plus1D_18:
Test: Total time: 0:11:26
 * Clip Acc@1 56.517 Clip Acc@5 77.739
 * Video Acc@1 67.463 Video Acc@5 86.175

@datumbox datumbox left a comment

A few early comments; I understand it's a draft, so feel free to ignore.

@YosuaMichael YosuaMichael marked this pull request as ready for review July 6, 2022 17:24

@datumbox datumbox left a comment

@YosuaMichael amazing work. I have only 1 minor suggestion, see below.

Can you please provide proof of the new accuracies for all video models (MViT + Video ResNet)? We should update their meta-data stats.

@datumbox datumbox commented Jul 6, 2022

@YosuaMichael thanks for providing the stats. My recommendation is to provide the exact commands you used to verify the models plus the raw output of accuracies; this will highlight that specific models must be run with specific configurations (for example, the frame rate on MViT). The process I describe here is identical to how we deploy models. I would also update the documentation and meta-data to showcase how the models should be verified (simple links will do).

This can happen in a follow-up, but given the impact of this update, I would advise having everything in a single PR that explains the changes in one place. Somehow GitHub didn't show me your updates on the meta-data. Good call to make them here.

@datumbox datumbox left a comment

@YosuaMichael LGTM, thanks! Just a few minor suggestions on the documentation (plus a correction of the hyper-parameter on the ResNets), but other than that it looks good.

@datumbox datumbox merged commit 8a45147 into pytorch:main Jul 7, 2022
facebook-github-bot pushed a commit that referenced this pull request Jul 21, 2022
Add ensembled video accuracy on video reference script (#6241)

Summary:
* Add ensembled video accuracy on video reference script

* Change the parser func to be similar to the classification reference

* Fix typo type->dtype

* Use custom kinetics

* Fix dataset to not getting start_pts

* Change dataset name, and put video_idx at the back

* Ufmt format

* Use functional softmax, updating meta and use it to overwrite eval param

* Fix typo

* Put the eval parameters on the docs for now

* Change meta for video resnet to use frame-rate 15, also change wording on docs

Reviewed By: jdsgomes

Differential Revision: D37993423

fbshipit-source-id: e6ad9fa13c7916d541fb7bc9582650ba9c92b8e0