Description
🐛 Bug
Recent observations have made it clear that there are many problems with either the TPU implementation in Lightning or the test environment:
- Not all TPU tests written in Lightning are executed; only a hand-maintained list of tests ever runs (Fix TPU testing and collect all tests, #11098).
- Attempting to address the first point reveals that, among the tests that do run, many are decorated with the @pl_multi_process_test wrapper, which suppresses assertion errors and exceptions from broken tests (see the sketch after this list).

The result is that we have many tests that are broken but whose failures never surface in CI.
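For illustration, here is a minimal sketch of that failure mode. This is not the actual decorator; it is a hypothetical wrapper (names are mine) showing how running the test body in a child process without checking its exit code makes pytest report broken tests as passed:

```python
import functools
from multiprocessing import Process


def multi_process_test(func):
    """Hypothetical wrapper illustrating the suppression: the test body runs
    in a child process, and the parent never inspects the child's outcome."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # `fork` start method assumed (Linux CI), so `func` needs no pickling.
        proc = Process(target=func, args=args, kwargs=kwargs)
        proc.start()
        proc.join()
        # BUG: no `assert proc.exitcode == 0` here, so an AssertionError or
        # crash inside the child exits silently and pytest marks the test
        # as passed in the parent process.

    return wrapper


@multi_process_test
def test_always_broken():
    assert False  # never surfaces in CI with the wrapper applied
```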
To Reproduce
A simple way to reproduce this is to remove all the decorators, which is what I have done in #11098, and then let the tests run and fail. Attached is the full log file of such a CI run: tpu-logs-without-pl-multi.txt
In summary:
```
17 failed, 48 passed

FAILED tests/tests_pytorch/callbacks/test_device_stats_monitor.py::test_device_stats_monitor_tpu
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_index[1] - Runt...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_index[5] - Runt...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_devices_8 - tor...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[1]
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[5]
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_devices_8
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_early_stop - to...
FAILED tests/tests_pytorch/models/test_tpu.py::test_dataloaders_passed_to_fit
FAILED tests/tests_pytorch/models/test_tpu.py::test_broadcast_on_tpu - torch....
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_reduce - torch.multip...
FAILED tests/tests_pytorch/models/test_tpu.py::test_if_test_works_with_checkpoint_false
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_sync_dist - torch.mul...
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_debug_mode - torch.mu...
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_host_world_size - tor...
FAILED tests/tests_pytorch/profilers/test_xla_profiler.py::test_xla_profiler_instance
FAILED tests/tests_pytorch/trainer/properties/test_estimated_stepping_batches.py::test_num_stepping_batches_with_tpu[8-8]
ERROR tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[1]
ERROR tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[5]
```
Several test cases also fail with the infamous cryptic error message:

```
Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8
```

which hints at the possibility that we are accessing xm.xla_device() before spawning the processes (see the sketch below).
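For illustration, a minimal sketch of that suspected anti-pattern in plain torch_xla (not Lightning code; that this is the actual cause is an assumption):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

device = xm.xla_device()  # BAD: initializes XLA for a single device in the parent process


def _mp_fn(index: int) -> None:
    device = xm.xla_device()  # OK: each spawned process acquires its own device here
    print(index, device)


if __name__ == "__main__":
    # With the parent process already holding a device, an 8-process
    # replication can no longer be set up, and the spawn fails with an
    # error like the one above.
    xmp.spawn(_mp_fn, nprocs=8)
```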
Other examples:
```
self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f7bbead03d0>

    @property
    def is_distributed(self) -> bool:
        # TODO: deprecate this property
        # Used for custom plugins.
        # Custom plugins should implement is_distributed property.
        if hasattr(self.strategy, "is_distributed") and not isinstance(self.accelerator, TPUAccelerator):
            return self.strategy.is_distributed
        distributed_strategy = (
            DDP2Strategy,
            DDPStrategy,
            DDPSpawnShardedStrategy,
            DDPShardedStrategy,
            DDPFullyShardedNativeStrategy,
            DDPFullyShardedStrategy,
            DDPSpawnStrategy,
            DeepSpeedStrategy,
            TPUSpawnStrategy,
            HorovodStrategy,
            HPUParallelStrategy,
        )
        is_distributed = isinstance(self.strategy, distributed_strategy)
        if isinstance(self.accelerator, TPUAccelerator):
>           is_distributed |= self.strategy.is_distributed
E           TypeError: unsupported operand type(s) for |=: 'bool' and 'NoneType'
```
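The failing line combines a bool with a strategy flag that can be None. A minimal self-contained repro of the TypeError, plus one possible guard (an assumption on my part, not an agreed fix):

```python
is_distributed = True
strategy_is_distributed = None  # e.g. the strategy never set the attribute

try:
    is_distributed |= strategy_is_distributed
except TypeError as err:
    print(err)  # unsupported operand type(s) for |=: 'bool' and 'NoneType'

# Possible guard: coerce the flag so an unset (None) value reads as False.
is_distributed |= bool(strategy_is_distributed)
```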
```
def has_len_all_ranks(
    dataloader: DataLoader,
    training_type: "pl.Strategy",
    model: Union["pl.LightningModule", "pl.LightningDataModule"],
) -> bool:
    """Checks if a given Dataloader has ``__len__`` method implemented i.e. if it is a finite dataloader or
    infinite dataloader."""
    try:
        local_length = len(dataloader)
        total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
>       if total_length == 0:
E       RuntimeError: Not found: From /job:tpu_worker/replica:0/task:0:
E       2 root error(s) found.
E         (0) Not found: No subgraph found for uid 2894109085761937038
E            [[{{node XRTExecute}}]]
E         (1) Not found: No subgraph found for uid 2894109085761937038
E            [[{{node XRTExecute}}]]
E          [[XRTExecute_G29]]
E       0 successful operations.
E       0 derived errors ignored.
```
Furthermore, the CI sometimes stops non-deterministically in the middle of execution:

```
....
profilers/test_xla_profiler.py::test_xla_profiler_instance FAILED [ 93%]
strategies/test_tpu_spawn.py::test_model_tpu_one_core PASSED [ 95%]

Done with log retrieval attempt.
Exited with code exit status 2
CircleCI received exit code 2
```
Expected behavior
It is unclear what the intention was when the test setup was designed. The decorators were introduced way back in #2512 and have barely changed since, while the strategies and accelerators have undergone major design changes and countless refactors. I propose re-evaluating whether the pl_multi_process_test decorator is still needed and, if so, documenting why it exists, how to use it, and when to use it correctly.
Possible Action
My suggestion is to
- Remove the decorator
- Debug each test on the VM
- Run the tests that require it in standalone mode (see the sketch below)
- Reduce the verbosity of the CI output, which currently prints thousands of mind-boggling nonsense lines
- Upgrade to the latest XLA and PyTorch version
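
For the standalone-mode point, a sketch of how such a gate could look, loosely modeled on the repository's RunIf helper (the environment variable name and the decorated test are assumptions for illustration):

```python
import os

import pytest

# Assumed opt-in gate: the test is skipped unless the standalone runner
# sets this environment variable and executes the test in its own process.
standalone = pytest.mark.skipif(
    os.getenv("PL_RUN_STANDALONE_TESTS", "0") != "1",
    reason="needs its own interpreter/TPU context; run via the standalone suite",
)


@standalone
def test_broadcast_on_tpu():
    ...
```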