cj401-amd commented Sep 19, 2025

It builds locally:

[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 40s local
[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 100s local
[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 160s local
INFO: Found 1 target...
Target //tensorflow/tools/pip_package:wheel up-to-date:
  bazel-bin/tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl
INFO: Elapsed time: 1476.100s, Critical Path: 627.42s
INFO: 47468 processes: 19009 internal, 28459 local.
INFO: Build completed successfully, 47468 total actions
Processing ./bazel-bin/tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl

cj401-amd commented Sep 22, 2025

Hi @i-chaochen and @hsharsha, it seems to work after backporting those commits:

See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1758549018.545024  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 194280 MB memory:  -> device: 0, name: AMD Instinct MI300X, pci bus id: 0000:05:00.0
I0000 00:00:1758549018.581677  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 194280 MB memory:  -> device: 1, name: AMD Instinct MI300X, pci bus id: 0000:26:00.0
I0000 00:00:1758549018.617811  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 194280 MB memory:  -> device: 2, name: AMD Instinct MI300X, pci bus id: 0000:46:00.0
I0000 00:00:1758549018.652770  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 194280 MB memory:  -> device: 3, name: AMD Instinct MI300X, pci bus id: 0000:65:00.0
I0000 00:00:1758549018.687675  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 194280 MB memory:  -> device: 4, name: AMD Instinct MI300X, pci bus id: 0000:85:00.0
I0000 00:00:1758549018.720616  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 194280 MB memory:  -> device: 5, name: AMD Instinct MI300X, pci bus id: 0000:a6:00.0
I0000 00:00:1758549018.753206  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 194280 MB memory:  -> device: 6, name: AMD Instinct MI300X, pci bus id: 0000:c6:00.0
I0000 00:00:1758549018.785401  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 194280 MB memory:  -> device: 7, name: AMD Instinct MI300X, pci bus id: 0000:e5:00.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0922 13:50:18.840622 140661085996864 mirrored_strategy.py:423] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0922 13:50:27.709542 140661085996864 train_utils.py:274] Running default trainer.
I0922 13:50:27.710028 140661085996864 vit.py:348] ViT specs: mlp_dim=768, num_heads=3, num_layers=12,patch_size=16, hidden_size=192, representation_size=768.
I0922 13:50:31.999432 140661085996864 legacy_adamw.py:56] AdamWeightDecay gradient_clip_norm=0.000000
I0922 13:50:32.464362 140661085996864 controller.py:463] restoring or initializing model...
INFO:tensorflow:Customized initialization is done through the passed `init_fn`.
I0922 13:50:32.464454 140661085996864 checkpoint_management.py:894] Customized initialization is done through the passed `init_fn`.
I0922 13:50:32.464510 140661085996864 train_lib.py:266] Starts to execute mode: eval
I0922 13:50:32.465570 140661085996864 controller.py:321]  eval | step:      0 | running 626 steps of evaluation...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0922 13:50:32.479153 140661085996864 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0922 13:50:32.489452 140661085996864 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2025-09-22 13:50:38.886799: I tensorflow/core/kernels/data/tf_record_dataset_op.cc:390] TFRecordDataset `buffer_size` is unspecified, default to 262144

cj401-amd commented Sep 22, 2025

It works now, @i-chaochen @hsharsha. Previously the TFRecords were corrupted, as Harsha pointed out.

This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
I0922 15:54:15.728991 139653650179904 train_lib.py:298] FLOPs (multi-adds) in model: 1.264925 Billions.
I0922 15:54:15.730370 139653650179904 train_utils.py:409] Saving gin configurations to vit_model_dir/vit-ti16-i224/operative_config.eval.gin
restoring or initializing model...
 eval | step:      0 | running 626 steps of evaluation...
 eval | step:      0 | steps/sec:   13.5 | eval time:   46.3 sec | output:
    {'accuracy': np.float32(0.0),
     'steps_per_second': 13.507119845468551,
     'top_5_accuracy': np.float32(1.0),
     'validation_loss': np.float32(6.9032164)}
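
On the corrupted TFRecords mentioned above: the thread does not say how the corruption was spotted, so the following is only a sketch of one way to check a file before an eval run. It walks the TFRecord framing (8-byte little-endian length, masked CRC32C of the length, payload, masked CRC32C of the payload) in pure Python. The names `count_records` and `encode_record` are hypothetical helpers made up for this illustration, not TensorFlow APIs.

```python
import struct

_CRC32C_POLY = 0x82F63B78   # reflected Castagnoli polynomial
_MASK_DELTA = 0xA282EAD8    # constant used by the TFRecord CRC masking


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain (unmasked) CRC32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """CRC32C rotated right by 15 bits plus a constant, as stored in TFRecords."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + _MASK_DELTA) & 0xFFFFFFFF


def encode_record(payload: bytes) -> bytes:
    """Frame one payload as a TFRecord (hypothetical helper, used for the demo)."""
    header = struct.pack("<Q", len(payload))
    return (header + struct.pack("<I", masked_crc32c(header))
            + payload + struct.pack("<I", masked_crc32c(payload)))


def count_records(path: str) -> int:
    """Return the number of records, raising ValueError on the first bad one."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(f"record {count}: truncated header")
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc32c(header[:8]) != length_crc:
                raise ValueError(f"record {count}: length CRC mismatch")
            payload = f.read(length)
            footer = f.read(4)
            if len(payload) < length or len(footer) < 4:
                raise ValueError(f"record {count}: truncated payload")
            if masked_crc32c(payload) != struct.unpack("<I", footer)[0]:
                raise ValueError(f"record {count}: payload CRC mismatch")
            count += 1
```

Calling `count_records(path)` on each shard either returns a record count or raises on the first damaged record. The byte-at-a-time CRC loop is slow on large shards; iterating the same files through `tf.data.TFRecordDataset` and catching `tf.errors.DataLossError` is probably the more practical route when TensorFlow is available.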

hsharsha commented Sep 22, 2025

Try setting TF_FORCE_GPU_ALLOW_GROWTH=true in the dlm infer.sh and/or train.sh scripts.
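
For context, TF_FORCE_GPU_ALLOW_GROWTH=true makes TensorFlow grow GPU memory on demand instead of reserving most of each device up front, and it has to be in the environment before TensorFlow initializes the GPUs. If editing the scripts is inconvenient, a rough sketch of launching one with the variable set might look like this. `env_with_gpu_growth` and `run_script` are hypothetical helper names, and the script path is an assumption since the thread does not say where infer.sh or train.sh live.

```python
import os
import subprocess


def env_with_gpu_growth(base_env=None):
    """Return a copy of the environment with on-demand GPU memory growth enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def run_script(script_path):
    """Run a shell script (e.g. train.sh; the path is an assumption) with the flag set."""
    return subprocess.run(
        ["bash", script_path],
        env=env_with_gpu_growth(),
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
```

Usage would be something like `run_script("train.sh")` from the directory that holds the script. Adding `export TF_FORCE_GPU_ALLOW_GROWTH=true` near the top of the script itself should have the same effect.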
