From 7685e1abe7882c927672bd816ebd4aa1d7e06321 Mon Sep 17 00:00:00 2001
From: Bruno Korbar
Date: Tue, 27 Oct 2020 05:56:27 -0500
Subject: [PATCH 1/3] removing the tab?

---
 torchvision/io/__init__.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/torchvision/io/__init__.py b/torchvision/io/__init__.py
index e2840b4862b..d5dac548130 100644
--- a/torchvision/io/__init__.py
+++ b/torchvision/io/__init__.py
@@ -27,9 +27,13 @@
 
 
 if _HAS_VIDEO_OPT:
+
     def _has_video_opt():
         return True
+
+
 else:
+
     def _has_video_opt():
         return False

From 8629f79d7739cd36c58dde64b5e8012e11061372 Mon Sep 17 00:00:00 2001
From: Bruno Korbar
Date: Tue, 3 Nov 2020 13:33:48 -0600
Subject: [PATCH 2/3] initial commit

---
 references/video_classification/README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/references/video_classification/README.md b/references/video_classification/README.md
index 525cfddd414..642dee1f83c 100644
--- a/references/video_classification/README.md
+++ b/references/video_classification/README.md
@@ -1,5 +1,7 @@
 # Video Classification
 
+We present a simple training script that can be used to replicate the results of [ResNet-based video models](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf). All models are trained on the [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics), a benchmark dataset for human-action recognition. The accuracy is reported on the traditional validation split.
+
 TODO: Add some info about the context, dataset we use etc
 
 ## Data preparation
@@ -7,12 +9,12 @@ TODO: Add some info about the context, dataset we use etc
 If you already have downloaded [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics),
 please proceed directly to the next section.
 
-To download videos, one can use https://github.com/Showmax/kinetics-downloader
+To download videos, one can use https://github.com/Showmax/kinetics-downloader. Please note that the dataset can take upwards of 400GB, depending on the quality setting used during download.
 
 ## Training
 
 We assume the training and validation AVI videos are stored at `/data/kinectics400/train` and
-`/data/kinectics400/val`.
+`/data/kinectics400/val`. For training we suggest starting with the hyperparameters reported in the [paper](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf), in order to match the performance of those models. The clip sampling strategy is a particularly important training parameter, and we suggest using random temporal jittering - in other words, sampling multiple training clips from each video with random start times at every epoch. This functionality is built into our training script, and optimal hyperparameters are set by default.
 
 ### Multiple GPUs
 
@@ -21,7 +23,8 @@ Run the training on a single node with 8 GPUs:
 ```
 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=16 --cache-dataset --sync-bn --apex
 ```
-
+**Note:** All our models were trained on 8 nodes with 8 V100 GPUs each, for a total of 64 GPUs. Expected training time on 64 GPUs is about 24 hours, depending on the storage solution.
+**Note 2:** Hyperparameters for exact replication of our training can be found [here](https://github.com/pytorch/vision/blob/master/torchvision/models/video/README.md). Some hyperparameters, such as the learning rate, are scaled linearly with the number of GPUs.
 
 ### Single GPU
 
@@ -30,6 +33,4 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --data-
 
 ```bash
 python train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=8 --cache-dataset
-```
-
-
+```
\ No newline at end of file

From dbb1cc0d3b2dfb867b4a5c75a33b27e8fd431126 Mon Sep 17 00:00:00 2001
From: Bruno Korbar
Date: Wed, 4 Nov 2020 09:01:06 -0600
Subject: [PATCH 3/3] Addressing Victor's comments

---
 references/video_classification/README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/references/video_classification/README.md b/references/video_classification/README.md
index 642dee1f83c..9a201c646ca 100644
--- a/references/video_classification/README.md
+++ b/references/video_classification/README.md
@@ -2,8 +2,6 @@ # Video Classification
 
 We present a simple training script that can be used to replicate the results of [ResNet-based video models](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf). All models are trained on the [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics), a benchmark dataset for human-action recognition. The accuracy is reported on the traditional validation split.
 
-TODO: Add some info about the context, dataset we use etc
-
 ## Data preparation
 
 If you already have downloaded [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics),
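
For context on `[PATCH 1/3]`: `_has_video_opt()` reports at runtime whether torchvision was built with the accelerated `video_reader` backend; the patch itself only adds blank lines around the two definitions. Below is a minimal, hypothetical sketch of how calling code might guard on this helper. The `require_video_backend` function and its error message are illustrative assumptions, not part of torchvision or of this patch.

```python
# Hypothetical guard, not part of this patch: fail fast when torchvision was
# built without the video_reader backend that _has_video_opt() reports on.
from torchvision.io import _has_video_opt


def require_video_backend():
    """Raise a descriptive error if the accelerated video backend is missing."""
    if not _has_video_opt():
        raise RuntimeError(
            "torchvision was built without video_reader support; "
            "rebuild with video support or fall back to the pyav backend."
        )
```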
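
The training section added in `[PATCH 2/3]` recommends random temporal jittering, i.e. sampling multiple clips per video with fresh random start frames at every epoch (the README notes this is built into the training script). The snippet below is a self-contained, illustrative sketch of that idea only; it is not the reference `train.py`, and the function name and default values are assumptions.

```python
import torch


def sample_jittered_clips(frames_per_video, clip_len=16, clips_per_video=5):
    """Return (video_index, start_frame) pairs with random temporal offsets.

    Calling this once per epoch draws new start frames each time, which is the
    essence of random temporal jittering.
    """
    clips = []
    for video_idx, num_frames in enumerate(frames_per_video):
        # Latest frame at which a full clip of clip_len frames can still start.
        max_start = max(num_frames - clip_len, 0)
        starts = torch.randint(0, max_start + 1, (clips_per_video,))
        clips.extend((video_idx, int(s)) for s in starts)
    return clips


# Example: three videos with 300, 120 and 64 frames respectively.
print(sample_jittered_clips([300, 120, 64]))
```

On the note about learning-rate scaling: linear scaling simply multiplies the base learning rate by the factor by which the number of GPUs (and hence the global batch size) grows, for example an 8x larger rate when going from 8 to 64 GPUs. This is a common heuristic for large-batch distributed training rather than a rule specific to these models.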