
Commit ca2ff4b

Merge branch 'master' into refactor/loops/loops_everywhere
2 parents b090e4f + 0c958c5 commit ca2ff4b

Large commit: only part of the diff is shown below.

Showing 61 changed files with 1,408 additions and 866 deletions.

.azure-pipelines/gpu-tests.yml

Lines changed: 2 additions & 12 deletions
```diff
@@ -55,13 +55,9 @@ jobs:
     displayName: 'Image info & NVIDIA'

   - bash: |
-      export GIT_TERMINAL_PROMPT=1
-      #sudo apt-get install -y cmake
-      # python -m pip install "pip==20.1"
-      pip install --requirement requirements.txt
       python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
-      pip install --requirement ./requirements/devel.txt --upgrade-strategy only-if-needed
-      pip install fairscale>=0.3.4 --upgrade-strategy only-if-needed
+      pip install fairscale>=0.3.4
+      pip install . --requirement requirements/devel.txt
       pip list
     displayName: 'Install dependencies'
@@ -114,15 +110,9 @@ jobs:

   - script: |
       set -e
-      python setup.py install --user
-      rm -rf pytorch_lightning
-      pip list
       python -m pytest pl_examples -v --maxfail=2 --durations=0
       bash pl_examples/run_examples-args.sh --trainer.gpus 1 --trainer.max_epochs 1 --data.batch_size 64 --trainer.limit_train_batches 5 --trainer.limit_val_batches 3
       bash pl_examples/run_ddp-examples.sh --trainer.max_epochs 1 --data.batch_size 32 --trainer.limit_train_batches 2 --trainer.limit_val_batches 2
-      # cd pl_examples/basic_examples
-      # bash submit_ddp_job.sh
-      # bash submit_ddp2_job.sh
     env:
       PL_USE_MOCKED_MNIST: "1"
     displayName: 'Examples'
```

.github/workflows/ci_pkg-install.yml

Lines changed: 1 addition & 2 deletions
```diff
@@ -15,8 +15,7 @@ jobs:
       fail-fast: false
       # max-parallel: 6
       matrix:
-        # PyTorch 1.5 is failing on Win and bolts requires torchvision>=0.5
-        os: [ubuntu-20.04, macOS-10.15 , windows-2019]  #
+        os: [ubuntu-20.04, macOS-10.15, windows-2019]
         python-version: [3.6, 3.9]

     steps:
```

CHANGELOG.md

Lines changed: 35 additions & 9 deletions
```diff
@@ -9,6 +9,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Added

+- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with fsspec ([#7617](https://github.com/PyTorchLightning/pytorch-lightning/pull/7617))
+
+
 - Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow


@@ -18,7 +21,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for checkpointing based on a provided time interval during training ([#7515](https://github.com/PyTorchLightning/pytorch-lightning/pull/7515))


-- Added dataclasses for progress tracking ([#6603](https://github.com/PyTorchLightning/pytorch-lightning/pull/6603))
+- Added dataclasses for progress tracking (
+    [#6603](https://github.com/PyTorchLightning/pytorch-lightning/pull/6603),
+    [#7574](https://github.com/PyTorchLightning/pytorch-lightning/pull/7574))


 - Added argument `trainer.predict(ckpt_path)` ([#7430](https://github.com/PyTorchLightning/pytorch-lightning/pull/7430))
@@ -33,8 +38,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added correct `dataloader_idx` to batch transfer hooks ([#6241](https://github.com/PyTorchLightning/pytorch-lightning/pull/6241))


+- Added `ddp_fully_sharded` support ([#7487](https://github.com/PyTorchLightning/pytorch-lightning/pull/7487))
+
+
 ### Changed

+- Changed calling of `untoggle_optimizer(opt_idx)` out of the closure function ([#7563](https://github.com/PyTorchLightning/pytorch-lightning/pull/7563)

 - Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values ([#7539](https://github.com/PyTorchLightning/pytorch-lightning/pull/7539))

@@ -74,38 +83,55 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - MLflowLogger now uses the env variable `MLFLOW_TRACKING_URI` as default tracking uri ([#7457](https://github.com/PyTorchLightning/pytorch-lightning/pull/7457))


+- MLFlowLogger now accepts `run_name` as an constructor argument ([#7622](https://github.com/PyTorchLightning/pytorch-lightning/issues/7622))
+
+
+- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic ([#7579](https://github.com/PyTorchLightning/pytorch-lightning/pull/7579))
+
+
 ### Deprecated


 - Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` ([#7422](https://github.com/PyTorchLightning/pytorch-lightning/pull/7422))


-- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` ([7026](https://github.com/PyTorchLightning/pytorch-lightning/pull/7026))
+- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` ([#7026](https://github.com/PyTorchLightning/pytorch-lightning/pull/7026))


 ### Removed

-- Prune deprecated classif. metrics from `pytorch_lightning.metrics.functional.classification` ([7499](https://github.com/PyTorchLightning/pytorch-lightning/pull/7499))
+- Prune deprecated classif. metrics from `pytorch_lightning.metrics.functional.classification` ([#7499](https://github.com/PyTorchLightning/pytorch-lightning/pull/7499))


-- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` ([7510](https://github.com/PyTorchLightning/pytorch-lightning/pull/7510))
+- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` ([#7510](https://github.com/PyTorchLightning/pytorch-lightning/pull/7510))


-- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` ([7502](https://github.com/PyTorchLightning/pytorch-lightning/pull/7502))
+- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` ([#7502](https://github.com/PyTorchLightning/pytorch-lightning/pull/7502))


-- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` ([7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))
+- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please, open a discussion explaining your use-case if you relied on these. ([#7644](https://github.com/PyTorchLightning/pytorch-lightning/pull/7644))
+
+
+- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` ([#7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))


 - Removed deprecated trainer attributes - `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` ([#7501](https://github.com/PyTorchLightning/pytorch-lightning/pull/7501))


 ### Fixed

+- Fixed dataloaders are not reset when tuning the model ([#7566](https://github.com/PyTorchLightning/pytorch-lightning/pull/7566))
+

 - Fixed parsing of multiple training dataloaders ([#7433](https://github.com/PyTorchLightning/pytorch-lightning/pull/7433))


+- Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 ([#7592](https://github.com/PyTorchLightning/pytorch-lightning/pull/7592))
+
+
+- Fixed `ProgressBar` pickling after calling `trainer.predict` ([#7608](https://github.com/PyTorchLightning/pytorch-lightning/pull/7608))
+
+
 - Fixed recursive passing of `wrong_type` keyword argument in `pytorch_lightning.utilities.apply_to_collection` ([#7433](https://github.com/PyTorchLightning/pytorch-lightning/pull/7433))


@@ -1326,7 +1352,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed getting `experiment_id` from MLFlow only once instead of each training loop ([#3394](https://github.com/PyTorchLightning/pytorch-lightning/pull/3394))
 - Fixed `overfit_batches` which now correctly disables shuffling for the training loader. ([#3501](https://github.com/PyTorchLightning/pytorch-lightning/pull/3501))
 - Fixed gradient norm tracking for `row_log_interval > 1` ([#3489](https://github.com/PyTorchLightning/pytorch-lightning/pull/3489))
-- Fixed `ModelCheckpoint` name formatting ([3164](https://github.com/PyTorchLightning/pytorch-lightning/pull/3163))
+- Fixed `ModelCheckpoint` name formatting ([#3164](https://github.com/PyTorchLightning/pytorch-lightning/pull/3163))
 - Fixed example implementation of AutoEncoder ([#3190](https://github.com/PyTorchLightning/pytorch-lightning/pull/3190))
 - Fixed invalid paths when remote logging with TensorBoard ([#3236](https://github.com/PyTorchLightning/pytorch-lightning/pull/3236))
 - Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor ([#3252](https://github.com/PyTorchLightning/pytorch-lightning/pull/3252))
@@ -1586,8 +1612,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` ([#1908](https://github.com/PyTorchLightning/pytorch-lightning/pull/1908))
 - Early stopping checks `on_validation_end` ([#1458](https://github.com/PyTorchLightning/pytorch-lightning/pull/1458))
 - Speed up single-core TPU training by loading data using `ParallelLoader` ([#2033](https://github.com/PyTorchLightning/pytorch-lightning/pull/2033))
-- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device ([1756](https://github.com/PyTorchLightning/pytorch-lightning/pull/1756))
-- Added [black](https://black.readthedocs.io/en/stable/) formatter for the code with code-checker on pull ([1610](https://github.com/PyTorchLightning/pytorch-lightning/pull/1610))
+- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device ([#1756](https://github.com/PyTorchLightning/pytorch-lightning/pull/1756))
+- Added [black](https://black.readthedocs.io/en/stable/) formatter for the code with code-checker on pull ([#1610](https://github.com/PyTorchLightning/pytorch-lightning/pull/1610))
 - Added back the slow spawn ddp implementation as `ddp_spawn` ([#2115](https://github.com/PyTorchLightning/pytorch-lightning/pull/2115))
 - Added loading checkpoints from URLs ([#1667](https://github.com/PyTorchLightning/pytorch-lightning/pull/1667))
 - Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training ([#2134](https://github.com/PyTorchLightning/pytorch-lightning/pull/2134))
```

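For the fsspec-backed `to_torchscript` entry in the first `### Added` hunk above, a minimal sketch of the usage it enables — the bucket path and module are hypothetical, and this assumes the standard `to_torchscript(file_path=..., method=...)` signature plus an installed fsspec backend such as `s3fs`:

```python
import pytorch_lightning as pl
from torch import nn


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))


model = LitModel()
# file_path may now point at any fsspec-supported filesystem (s3://, gs://, ...)
# instead of only the local disk; the matching backend package must be installed.
model.to_torchscript(file_path="s3://my-bucket/model.ts", method="script")
```
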
docs/source/advanced/multi_gpu.rst

Lines changed: 2 additions & 31 deletions
```diff
@@ -75,39 +75,10 @@ register the tensor as a buffer in your modules's ``__init__`` method with :meth

 Remove samplers
 ^^^^^^^^^^^^^^^
-In PyTorch, you must use :class:`~torch.utils.data.distributed.DistributedSampler`
-for multi-node or TPU training. The sampler makes sure each GPU sees the appropriate part of your data.

-.. testcode::
-
-    # without lightning
-    def train_dataloader(self):
-        dataset = MNIST(...)
-        sampler = None
-
-        if self.on_tpu:
-            sampler = DistributedSampler(dataset)
-
-        return DataLoader(dataset, sampler=sampler)
-
-Lightning adds the correct samplers when needed, so no need to explicitly add samplers.
-
-.. testcode::
-
-    # with lightning
-    def train_dataloader(self):
-        dataset = MNIST(...)
-        return DataLoader(dataset)
-
-.. note::
-    By default it will add ``shuffle=True`` for train sampler and ``shuffle=False`` for val/test sampler.
-    ``drop_last`` in :class:`~torch.utils.data.distributed.DistributedSampler` will be set to its default value in PyTorch.
-    If you called :func:`~pytorch_lightning.utilities.seed.seed_everyting`, Lightning will set the same seed for the
-    sampler.
-
-.. note:: You can disable this behavior with ``Trainer(replace_sampler_ddp=False)``
+:class:`~torch.utils.data.distributed.DistributedSampler` is automatically handled by Lightning.

-.. note:: For iterable datasets, we don't do this automatically.
+See :ref:`replace-sampler-ddp` for more information.


 Synchronize validation and test logging
```

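The pattern that the removed docs example walked through now reduces to returning a plain dataloader. A short sketch of that reduced form (the `MNIST` dataset and transform are illustrative placeholders only):

```python
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor


# inside your LightningModule / LightningDataModule
def train_dataloader(self):
    dataset = MNIST("./data", train=True, download=True, transform=ToTensor())
    # no sampler argument: under DDP/TPU, Lightning wraps the loader in a
    # DistributedSampler automatically (opt out with replace_sampler_ddp=False)
    return DataLoader(dataset, batch_size=32)
```
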
docs/source/clouds/cluster.rst

Lines changed: 1 addition & 24 deletions
```diff
@@ -300,27 +300,4 @@ Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.

 .. code-block:: bash

-    python NCCL_DEBUG=INFO train.py ...
-
-
-Distributed sampler
--------------------
-
-Normally now you would need to add a
-:class:`~torch.utils.data.distributed.DistributedSampler` to your dataset, however
-Lightning automates this for you. But if you still need to set a sampler set the Trainer flag
-:paramref:`~pytorch_lightning.Trainer.replace_sampler_ddp` to ``False``.
-
-Here's an example of how to add your own sampler (again, not needed with Lightning).
-
-.. testcode::
-
-    # in your LightningModule
-    def train_dataloader(self):
-        dataset = MyDataset()
-        dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
-        dataloader = Dataloader(dataset, sampler=dist_sampler)
-        return dataloader
-
-    # in your training script
-    trainer = Trainer(replace_sampler_ddp=False)
+    NCCL_DEBUG=INFO python train.py ...
```

docs/source/common/hyperparameters.rst

Lines changed: 1 addition & 17 deletions
```diff
@@ -152,23 +152,7 @@ improve readability and reproducibility.
     model = LitMNIST.load_from_checkpoint(PATH, loss_fx=torch.nn.SomeOtherLoss, generator_network=MyGenerator())


-3. Assign to `self.hparams`. Anything assigned to `self.hparams` will also be saved automatically.
-
-.. code-block:: python
-
-    # using a argparse.Namespace
-    class LitMNIST(LightningModule):
-        def __init__(self, hparams, *args, **kwargs):
-            super().__init__()
-            self.hparams = hparams
-            self.layer_1 = nn.Linear(28 * 28, self.hparams.layer_1_dim)
-            self.layer_2 = nn.Linear(self.hparams.layer_1_dim, self.hparams.layer_2_dim)
-            self.layer_3 = nn.Linear(self.hparams.layer_2_dim, 10)
-        def train_dataloader(self):
-            return DataLoader(mnist_train, batch_size=self.hparams.batch_size)
-
-
-4. You can also save full objects such as `dict` or `Namespace` to the checkpoint.
+3. You can also save full objects such as `dict` or `Namespace` to the checkpoint.

 .. code-block:: python

```

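To make the renumbered item 3 concrete, a small sketch of saving a full `dict`/`Namespace`, assuming the standard `save_hyperparameters` API (the `conf` keys are hypothetical):

```python
from argparse import Namespace

from pytorch_lightning import LightningModule
from torch import nn


class LitMNIST(LightningModule):
    def __init__(self, conf, *args, **kwargs):
        super().__init__()
        # a dict or argparse.Namespace passed here is exposed as self.hparams
        # and written into the checkpoint alongside the weights
        self.save_hyperparameters(conf)
        self.layer_1 = nn.Linear(28 * 28, self.hparams.layer_1_dim)


model = LitMNIST({"layer_1_dim": 128})
# or equivalently
model = LitMNIST(Namespace(layer_1_dim=128))
```
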
docs/source/common/lightning_module.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -54,7 +54,7 @@ Notice a few things.
         new_x = torch.Tensor(2, 3)
         new_x = new_x.type_as(x)

-5. There are no samplers for distributed, Lightning also does this for you.
+5. Lightning by default handles the distributed sampler for you.

 |

```

docs/source/common/trainer.rst

Lines changed: 15 additions & 10 deletions
```diff
@@ -1278,6 +1278,8 @@ Set to True to reload dataloaders every epoch.
     train_loader = model.train_dataloader()
     for batch in train_loader:

+.. _replace-sampler-ddp:
+
 replace_sampler_ddp
 ^^^^^^^^^^^^^^^^^^^

@@ -1289,9 +1291,10 @@ replace_sampler_ddp

 |

-Enables auto adding of distributed sampler. By default it will add ``shuffle=True``
-for train sampler and ``shuffle=False`` for val/test sampler. If you want to customize
-it, you can set ``replace_sampler_ddp=False`` and add your own distributed sampler.
+Enables auto adding of :class:`~torch.utils.data.distributed.DistributedSampler`. In PyTorch, you must use it in
+distributed settings such as TPUs or multi-node. The sampler makes sure each GPU sees the appropriate part of your data.
+By default it will add ``shuffle=True`` for train sampler and ``shuffle=False`` for val/test sampler.
+If you want to customize it, you can set ``replace_sampler_ddp=False`` and add your own distributed sampler.
 If ``replace_sampler_ddp=True`` and a distributed sampler was already added,
 Lightning will not replace the existing one.

@@ -1304,9 +1307,15 @@ By setting to False, you have to add your own distributed sampler:

 .. code-block:: python

-    # default used by the Trainer
-    sampler = torch.utils.data.distributed.DistributedSampler(dataset, shuffle=True)
-    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
+
+    # in your LightningModule or LightningDataModule
+    def train_dataloader(self):
+        # default used by the Trainer
+        sampler = torch.utils.data.distributed.DistributedSampler(dataset, shuffle=True)
+        dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
+        return dataloader
+
+.. note:: For iterable datasets, we don't do this automatically.

 resume_from_checkpoint
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -1389,10 +1398,6 @@ as you request.

 Your effective batch size is batch_size * total tpu cores.

-.. note::
-    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
-    Lightning automatically does it for you.
-
 This parameter can be either 1 or 8.

 Example::
```

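The new docs snippet above shows the custom `train_dataloader`; the matching Trainer call is roughly the following sketch (assuming the 1.x-era `replace_sampler_ddp` and `accelerator` arguments):

```python
from pytorch_lightning import Trainer

# Lightning will not inject its own DistributedSampler; the sampler built in
# train_dataloader() is used as-is on every process.
trainer = Trainer(gpus=2, accelerator="ddp", replace_sampler_ddp=False)
```
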
docs/source/extensions/logging.rst

Lines changed: 2 additions & 2 deletions
```diff
@@ -208,8 +208,8 @@ To change this behaviour, set the `log_every_n_steps` :class:`~pytorch_lightning
 Log writing frequency
 =====================

-Writing to a logger can be expensive, so by default Lightning write logs to disc or to the given logger every 100 training steps.
-To change this behaviour, set the interval at which you wish to flush logs to the filesystem using `log_every_n_steps` :class:`~pytorch_lightning.trainer.trainer.Trainer` flag.
+Writing to a logger can be expensive, so by default Lightning writes logs to disk or to the given logger every 100 training steps.
+To change this behaviour, set the interval at which you wish to flush logs to the filesystem using the `flush_logs_every_n_steps` :class:`~pytorch_lightning.trainer.trainer.Trainer` flag.

 .. testcode::

```

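As a usage illustration of the corrected flag name (a sketch; both are Trainer arguments in this release line):

```python
from pytorch_lightning import Trainer

# log metrics every 50 steps, but only flush the accumulated logs to disk /
# the logger backend every 500 steps
trainer = Trainer(log_every_n_steps=50, flush_logs_every_n_steps=500)
```
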
pl_examples/basic_examples/README.md

Lines changed: 0 additions & 13 deletions
````diff
@@ -58,16 +58,3 @@ To run this demo do the following:
 1. Log into the jumphost node of your SLURM-managed cluster.
 2. Create a conda environment with Lightning and a GPU PyTorch version.
 3. Choose a script to submit
-
-#### DDP
-Submit this job to run with DistributedDataParallel (2 nodes, 2 gpus each)
-```bash
-sbatch submit_ddp_job.sh YourEnv
-```
-
-#### DDP2
-Submit this job to run with a different implementation of DistributedDataParallel.
-In this version, each node acts like DataParallel but syncs across nodes like DDP.
-```bash
-sbatch submit_ddp2_job.sh YourEnv
-```
````
