
Conversation

@YosuaMichael YosuaMichael commented Jul 6, 2022

During our recent work on improving TorchVision’s support for video models, we discovered a perceived accuracy gap compared to other libraries such as PyTorchVideo and SlowFast. This post summarizes the findings of our investigation.

TLDR:

The differences were originally observed on MViT and later on the ResNet-based video models (R3D, MC3 and R(2+1)D). We are happy to confirm that the accuracy difference was not due to a bug in TorchVision’s models, video APIs, transforms or datasets, but was instead the result of using different metrics (clip accuracy vs video accuracy) and of the frame-rate configuration. Our team made the necessary changes to our documentation and reference scripts to make it easier to reproduce our pre-trained models, and confirmed that there are no accuracy gaps or bugs.

Why do the accuracies differ?

Our initial analysis showed an accuracy gap of 5.3 Acc@1 points for MViT, 10.1 for R3D, 10.6 for MC3 and 10.0 for R(2+1)D. These metrics were estimated by running the same models through the reference scripts of TorchVision and SlowFast. Since the difference is very large, we decided to investigate and find the root cause.

The libraries implement similar features, but there are some differences that have a small effect on the accuracies:

  • Different video decoding backends (PyAV vs the TorchVision decoder). Not all videos can be decoded by both backends, so one needs to make sure the analysis runs on the same data (see the snippet after this list for pinning the backend).
  • Different temporal sampling strategies on how clips are sampled from videos.
  • Minor differences on the order of execution of inference transforms.
  • TorchVision’s reference scripts support only integer frame rates, while SlowFast also supports floats.
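
Regarding the decoding backends, the snippet below shows how the backend can be pinned on the TorchVision side so that both libraries are evaluated on exactly the same set of decodable videos. set_video_backend and get_video_backend are public torchvision APIs; the choice of backend here is only an example:

import torchvision

# Pin the video decoding backend so that the comparison runs on the same data.
# "pyav" uses the PyAV bindings; "video_reader" uses the native TorchVision decoder.
torchvision.set_video_backend("pyav")
# torchvision.set_video_backend("video_reader")

print(torchvision.get_video_backend())  # confirms which backend is active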

The following differences were found to have significant effects on the accuracies:

  • SlowFast supports 2 hyper-parameters to control frame sampling: sampling_rate and target_fps. TorchVision supports a single parameter called frame_rate, which therefore needs to be configured accordingly when porting pre-trained weights from SlowFast.
  • Both libraries produce multiple temporal clips from the same video during inference. Nevertheless, the way the results are combined into the final values differs significantly. TorchVision estimates the overall accuracies at the clip level, whereas SlowFast uses a multi-view ensemble, which aggregates the results of the clips at the video level (a minimal sketch of the two metrics is shown after this list). This strategy has a material effect on the final reported accuracy.
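
To make the metric difference concrete, below is a minimal sketch of clip-level accuracy versus the video-level multi-view ensemble. It is not the actual reference-script code; the function name, tensor shapes and variable names are illustrative assumptions:

import torch

def clip_and_video_accuracy(clip_logits, clip_video_idx, video_labels):
    # clip_logits:    (num_clips, num_classes) raw model outputs, one row per clip
    # clip_video_idx: (num_clips,) long tensor, index of the video each clip came from
    # video_labels:   (num_videos,) long tensor, ground-truth label of each video
    clip_labels = video_labels[clip_video_idx]

    # Clip accuracy: every clip is scored independently.
    clip_acc = (clip_logits.argmax(dim=1) == clip_labels).float().mean()

    # Video accuracy (multi-view ensemble): sum the softmax scores of all clips
    # that belong to the same video, then take the argmax once per video.
    probs = clip_logits.softmax(dim=1)
    video_scores = torch.zeros(video_labels.numel(), clip_logits.shape[1])
    video_scores.index_add_(0, clip_video_idx, probs)
    video_acc = (video_scores.argmax(dim=1) == video_labels).float().mean()

    return clip_acc.item(), video_acc.item()

With --clips-per-video 5 (as in the commands further below), this averaging over clips explains the gap between the reported Clip Acc@1 and Video Acc@1.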

How to make them equal?

To match the accuracies between TorchVision and SlowFast we took the following steps:

  • For MViT, we adjusted the hyper-parameters of the sampling configuration. SlowFast uses target_fps=30 and sampling_rate=4, and the equivalent value for TorchVision is frame_rate=7.5, rounded to 8 since we don't support fractional frame rates (see the worked-out conversion after this list).
  • We adopted SlowFast’s ensemble strategy in TorchVision’s reference scripts, aggregating accuracies at the video level.
  • We documented the changes in our documentation and meta-data so that our users can reproduce the results more easily.
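
As a sanity check, the conversion between SlowFast’s sampling hyper-parameters and TorchVision’s frame_rate can be worked out as follows (a small illustrative snippet, not code from either library):

# SlowFast reads frames at target_fps and then keeps every `sampling_rate`-th frame,
# so the effective frame rate seen by the model is their ratio.
target_fps = 30
sampling_rate = 4
effective_fps = target_fps / sampling_rate   # 7.5

# TorchVision's reference script exposes this directly as --frame-rate, which only
# accepts integers, hence the value of 8 used for MViT below.
frame_rate = round(effective_fps)            # 8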

All changes are limited to the documentation and reference scripts, so they don’t affect production users.

Reproducing the Results

We ran the validation on the Kinetics-400 dataset, with some videos excluded because they can't be decoded by either the pyav or the torchvision backend. Here is the list of the excluded videos. Note that the estimates below were obtained on 8 GPUs with batch-size > 1, so a small variation is expected due to #4559.

Here are the command and results for MViT:

python -u ~/script/run_with_submitit.py \
    --timeout 3000 --ngpus 8 --nodes 1 \
    --data-path="/datasets/clean_kinetics_400/" \
    --batch-size=16 --test-only \
    --clip-len 16 --frame-rate 8 --clips-per-video 5 \
    --cache-dataset \
    --model mvit_v1_b --weights="MViT_V1_B_Weights.DEFAULT" 
    
# result:
# Test: Total time: 1:25:33
# * Clip Acc@1 70.351 Clip Acc@5 88.288
# * Video Acc@1 78.477 Video Acc@5 93.582

And here are the command and results for the Video ResNet models:

python -u ~/script/run_with_submitit.py \
    --timeout 3000 --ngpus 8 --nodes 1 \
    --batch-size=64 --test-only \
    --data-path="/datasets/clean_kinetics_400/" \
    --clip-len 16 --frame-rate 15 --clips-per-video 5 \
    --cache-dataset \
    --model mc3_18 --weights="MC3_18_Weights.DEFAULT" 
#    --model r3d_18 --weights="R3D_18_Weights.DEFAULT" 
#    --model r2plus1d_18 --weights="R2Plus1D_18_Weights.DEFAULT" 

Results:

MC3_18:
Test: Total time: 0:11:17
 * Clip Acc@1 52.813 Clip Acc@5 75.220
 * Video Acc@1 63.960 Video Acc@5 84.130

R3D_18:
Test: Total time: 0:11:24
 * Clip Acc@1 51.795 Clip Acc@5 74.365
 * Video Acc@1 63.200 Video Acc@5 83.479

R2Plus1D_18:
Test: Total time: 0:11:26
 * Clip Acc@1 56.517 Clip Acc@5 77.739
 * Video Acc@1 67.463 Video Acc@5 86.175

@datumbox datumbox left a comment

A few early comments; I understand it's a draft, so feel free to ignore.

@YosuaMichael YosuaMichael marked this pull request as ready for review July 6, 2022 17:24

@datumbox datumbox left a comment

@YosuaMichael amazing work. I have only 1 minor suggestion, see below.

Can you please provide proof of the new accuracies for all video models (MViT + Video ResNet)? We should update their meta-data stats.

@datumbox datumbox commented Jul 6, 2022

@YosuaMichael thanks for providing the stats. My recommendation is to provide the exact commands you used to verify the models plus the raw output of accuracies; this will highlight that specific models must be run with specific configurations (for example, the frame rate on MViT). The process I describe here is identical to how we deploy models. I would also update the documentation and meta-data to showcase how the models should be verified (simple links will do).

This can happen in a follow-up, but given the impact of this update, I would advise having everything in a single PR that explains the changes in one place. Somehow GitHub didn't show me your updates on the meta-data. Good call to make them here.

@datumbox datumbox left a comment

@YosuaMichael LGTM, thanks! Just a few minor suggestions on the documentation (plus a correction of the hyper-parameter on the ResNets), but other than that it looks good.

@datumbox datumbox merged commit 8a45147 into pytorch:main Jul 7, 2022
facebook-github-bot pushed a commit that referenced this pull request Jul 21, 2022
Add ensembled video accuracy on video reference script (#6241)

Summary:
* Add ensembled video accuracy on video reference script

* Change the parser func to be similar to the classification reference

* Fix typo type->dtype

* Use custom kinetics

* Fix dataset to not getting start_pts

* Change dataset name, and put video_idx at the back

* Ufmt format

* Use functional softmax, updating meta and use it to overwrite eval param

* Fix typo

* Put the eval parameters on the docs for now

* Change meta for video resnet to use frame-rate 15, also change wording on docs

Reviewed By: jdsgomes

Differential Revision: D37993423

fbshipit-source-id: e6ad9fa13c7916d541fb7bc9582650ba9c92b8e0