Adding video accuracy for video_classification reference script #6241
Conversation
datumbox left a comment
A few early comments; I understand it's a Draft, so feel free to ignore.
datumbox left a comment
@YosuaMichael amazing work. I have only 1 minor suggestion, see below.
Can you please provide proof of the new accuracies for all video models (MViT + Video ResNet)? We should update their meta-data stats.
@YosuaMichael thanks for providing the stats. My recommendation is to provide the exact commands you used to verify the models, plus the raw output of accuracies; this will highlight that specific models must be run with specific configurations (for example the frame rate on MViT). The process I describe here is identical to how we deploy models. I would also update the documentation + meta-data to showcase how the models should be verified (simple links will do).
datumbox left a comment
@YosuaMichael LGTM, thanks! Just a few minor suggestions on the documentation (plus a correction of the hyper-param on ResNets) but other than this it looks good.
…ipt (#6241)
Summary:
* Add ensembled video accuracy on video reference script
* Change the parser func to be similar with classification reference
* Fix typo type->dtype
* Use custom kinetics
* Fix dataset to not getting start_pts
* Change dataset name, and put video_idx at the back
* Ufmt format
* Use functional softmax, updating meta and use it to overwrite eval param
* Fix typo
* Put the eval parameters on the docs for now
* Change meta for video resnet to use frame-rate 15, also change wording on docs

Reviewed By: jdsgomes
Differential Revision: D37993423
fbshipit-source-id: e6ad9fa13c7916d541fb7bc9582650ba9c92b8e0
During our recent work on improving TorchVision’s support for video models, we discovered a perceived gap in accuracy compared to other libraries such as PyTorchVideo and SlowFast. This post summarizes the findings of our investigation.
TLDR:
The differences were originally observed on MViT and later on the ResNet-based video models (R3D, MC3 and R(2+1)D). We are happy to confirm that the accuracy difference was not due to a bug in TorchVision’s models, video APIs, transforms or datasets, but was instead the result of using different metrics (clip accuracy vs video accuracy) and of the frame-rate configuration. Our team made the necessary changes to our documentation and reference scripts to allow for easier reproduction of our pre-trained models and confirmed that there are no accuracy gaps or bugs.
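The clip-vs-video distinction can be made concrete: clip accuracy scores every clip independently, whereas video accuracy averages the softmax scores of all clips belonging to the same video and scores the ensembled prediction once per video. Below is a minimal, self-contained sketch of that aggregation, assuming a toy data layout; the function name and arguments are hypothetical and not the actual reference-script code.

```python
import torch


def video_accuracy(clip_logits, clip_video_ids, video_labels):
    """Ensemble per-clip predictions into one prediction per video.

    clip_logits:    (num_clips, num_classes) tensor of model outputs
    clip_video_ids: (num_clips,) tensor mapping each clip to its video id
    video_labels:   dict mapping video id -> ground-truth class index
    """
    video_scores = {}
    for logits, vid in zip(clip_logits, clip_video_ids.tolist()):
        # Sum softmax probabilities per video (the argmax of the sum
        # equals the argmax of the average).
        probs = torch.softmax(logits, dim=-1)
        video_scores[vid] = video_scores.get(vid, 0) + probs
    correct = sum(
        int(scores.argmax().item() == video_labels[vid])
        for vid, scores in video_scores.items()
    )
    return correct / len(video_scores)
```

In a toy example where one clip of a video is individually misclassified, clip accuracy is penalized, while averaging the softmax scores across that video's clips can still recover the correct video-level prediction.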
Why do the accuracies differ?
Our initial analysis showed an accuracy gap of 5.3 Acc@1 points for MViT, 10.1 for R3D, 10.6 for MC3 and 10.0 for R(2+1)D. These metrics were estimated by running the same models through the reference scripts of TorchVision and SlowFast. Since the difference was very large, we decided to investigate and find the root cause.
The libraries implement similar features, but there are some differences that have a small effect on the accuracies:
The following differences were found to have significant effects on the accuracies:
sampling_rate and target_fps. TorchVision supports a single parameter called frame_rate, and thus it needs to be configured when porting pre-trained weights from SlowFast.
How to make them equal?
To match the accuracies between TorchVision and SlowFast, we took the following steps:
target_fps=30 and sampling_rate=4, while the equivalent value for TorchVision is frame_rate=7.5 (rounded to 8, as we don't support fractional values).
All changes are on the documentation and reference scripts, thus they don’t affect production users.
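The conversion between the two parameterizations can be sketched as follows; the helper name is hypothetical, but the arithmetic matches the numbers above (30 / 4 = 7.5, rounded to 8):

```python
def slowfast_to_torchvision_frame_rate(target_fps, sampling_rate):
    # SlowFast samples every `sampling_rate`-th frame from video resampled
    # to `target_fps`, i.e. an effective rate of target_fps / sampling_rate.
    # TorchVision exposes a single integer `frame_rate`, so round the ratio.
    return round(target_fps / sampling_rate)
```

Note that Python's round() rounds .5 ties to the nearest even integer, so 7.5 becomes 8 here.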
Reproducing the Results
We ran the validation on the Kinetics-400 dataset, with some videos excluded because they can't be decoded with either the pyav or torchvision backend. Here is the list of the excluded videos. Note that the estimates below are on 8 GPUs with batch-size > 1, meaning that a small variation is expected due to #4559.
Here are the commands and results for MViT:
And here are the commands and results for the Video ResNet models:
Results: