backporting from 2.19 #3105

cj401-amd · 2025-09-19T17:56:30Z

Backported from https://github.com/orgs/ROCm/projects/14/views/6?pane=issue&itemId=127914294&issue=ROCm%7Cframeworks-internal%7C13610

It builds locally:

[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 40s local
[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 100s local
[47,467 / 47,468] Action tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl; 160s local
INFO: Found 1 target...
Target //tensorflow/tools/pip_package:wheel up-to-date:
  bazel-bin/tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl
INFO: Elapsed time: 1476.100s, Critical Path: 627.42s
INFO: 47468 processes: 19009 internal, 28459 local.
INFO: Build completed successfully, 47468 total actions
Processing ./bazel-bin/tensorflow/tools/pip_package/wheel_house/tf_nightly_rocm-2.20.0.dev20250919-cp310-cp310-linux_x86_64.whl

cj401-amd · 2025-09-22T13:52:57Z

Hi @i-chaochen and @hsharsha, it seems to work after backporting those commits:

See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1758549018.545024  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 194280 MB memory:  -> device: 0, name: AMD Instinct MI300X, pci bus id: 0000:05:00.0
I0000 00:00:1758549018.581677  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 194280 MB memory:  -> device: 1, name: AMD Instinct MI300X, pci bus id: 0000:26:00.0
I0000 00:00:1758549018.617811  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 194280 MB memory:  -> device: 2, name: AMD Instinct MI300X, pci bus id: 0000:46:00.0
I0000 00:00:1758549018.652770  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 194280 MB memory:  -> device: 3, name: AMD Instinct MI300X, pci bus id: 0000:65:00.0
I0000 00:00:1758549018.687675  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 194280 MB memory:  -> device: 4, name: AMD Instinct MI300X, pci bus id: 0000:85:00.0
I0000 00:00:1758549018.720616  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 194280 MB memory:  -> device: 5, name: AMD Instinct MI300X, pci bus id: 0000:a6:00.0
I0000 00:00:1758549018.753206  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 194280 MB memory:  -> device: 6, name: AMD Instinct MI300X, pci bus id: 0000:c6:00.0
I0000 00:00:1758549018.785401  401855 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 194280 MB memory:  -> device: 7, name: AMD Instinct MI300X, pci bus id: 0000:e5:00.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0922 13:50:18.840622 140661085996864 mirrored_strategy.py:423] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0922 13:50:27.709542 140661085996864 train_utils.py:274] Running default trainer.
I0922 13:50:27.710028 140661085996864 vit.py:348] ViT specs: mlp_dim=768, num_heads=3, num_layers=12,patch_size=16, hidden_size=192, representation_size=768.
I0922 13:50:31.999432 140661085996864 legacy_adamw.py:56] AdamWeightDecay gradient_clip_norm=0.000000
I0922 13:50:32.464362 140661085996864 controller.py:463] restoring or initializing model...
INFO:tensorflow:Customized initialization is done through the passed `init_fn`.
I0922 13:50:32.464454 140661085996864 checkpoint_management.py:894] Customized initialization is done through the passed `init_fn`.
I0922 13:50:32.464510 140661085996864 train_lib.py:266] Starts to execute mode: eval
I0922 13:50:32.465570 140661085996864 controller.py:321]  eval | step:      0 | running 626 steps of evaluation...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0922 13:50:32.479153 140661085996864 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0922 13:50:32.489452 140661085996864 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2025-09-22 13:50:38.886799: I tensorflow/core/kernels/data/tf_record_dataset_op.cc:390] TFRecordDataset `buffer_size` is unspecified, default to 262144

cj401-amd · 2025-09-22T13:55:34Z

It works now, @i-chaochen @hsharsha. Previously the TFRecord files were corrupted, as Harsha pointed out.

This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
I0922 15:54:15.728991 139653650179904 train_lib.py:298] FLOPs (multi-adds) in model: 1.264925 Billions.
I0922 15:54:15.730370 139653650179904 train_utils.py:409] Saving gin configurations to vit_model_dir/vit-ti16-i224/operative_config.eval.gin
restoring or initializing model...
 eval | step:      0 | running 626 steps of evaluation...
 eval | step:      0 | steps/sec:   13.5 | eval time:   46.3 sec | output:
    {'accuracy': np.float32(0.0),
     'steps_per_second': 13.507119845468551,
     'top_5_accuracy': np.float32(1.0),
     'validation_loss': np.float32(6.9032164)}
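
Since the earlier failure came from corrupted TFRecord files, a quick integrity pass over each file before launching eval can help rule that out. The following is a minimal sketch, assuming the standard TFRecord framing (little-endian uint64 length, masked CRC32C of the length, payload, masked CRC32C of the payload). The names `count_valid_records` and `encode_record` are hypothetical helpers written for this illustration, not part of TensorFlow or of the dlm scripts.

```python
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial
_MASK_DELTA = 0xA282EAD8   # constant used by the TFRecord CRC masking step


def _make_table():
    """Build the 256-entry lookup table for byte-wise CRC32C."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ _CRC32C_POLY if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain (unmasked) CRC32C of data."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """CRC32C rotated right by 15 bits plus a constant, as stored in TFRecord files."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def encode_record(data: bytes) -> bytes:
    """Hypothetical helper: frame one payload the way a TFRecord file stores it."""
    length_bytes = struct.pack("<Q", len(data))
    return (length_bytes + struct.pack("<I", masked_crc(length_bytes))
            + data + struct.pack("<I", masked_crc(data)))


def count_valid_records(path: str) -> int:
    """Walk a TFRecord file, raising ValueError at the first bad record."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated header after record {count}")
            length_bytes = header[:8]
            (length_crc,) = struct.unpack("<I", header[8:])
            if masked_crc(length_bytes) != length_crc:
                raise ValueError(f"bad length CRC at record {count}")
            (length,) = struct.unpack("<Q", length_bytes)
            payload = f.read(length + 4)
            if len(payload) < length + 4:
                raise ValueError(f"truncated payload at record {count}")
            (data_crc,) = struct.unpack("<I", payload[length:])
            if masked_crc(payload[:length]) != data_crc:
                raise ValueError(f"bad data CRC at record {count}")
            count += 1
```

Calling `count_valid_records(path)` on each shard either returns the number of records or raises at the first damaged one. The pure-Python CRC is slow on large shards, so this is meant as a spot check rather than something to run on every job.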

hsharsha · 2025-09-22T14:03:38Z

Try setting TF_FORCE_GPU_ALLOW_GROWTH=true in the dlm infer.sh and/or train.sh script.
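
One way to apply this without editing the scripts themselves is to launch them from a small wrapper that injects the variable into the child environment. This is only a sketch: `env_with_gpu_growth` and `run_script` are hypothetical helper names, and the script path passed in is whatever the dlm checkout actually uses. The variable has to be in the environment before the TensorFlow process initializes its GPU devices, which is why it is set on the child process rather than after import.

```python
import os
import subprocess


def env_with_gpu_growth(base=None):
    """Return a copy of the environment with GPU memory growth forced on.

    TF_FORCE_GPU_ALLOW_GROWTH=true asks TensorFlow to grow GPU memory on
    demand instead of reserving nearly all of it at startup.
    """
    env = dict(os.environ if base is None else base)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def run_script(script_path):
    """Run a shell script (for example infer.sh) with the variable set.

    The script path is an assumption supplied by the caller; this helper
    does not know where the dlm scripts live.
    """
    result = subprocess.run(["bash", script_path], env=env_with_gpu_growth())
    return result.returncode
```

The same effect can be had by exporting the variable at the top of the shell script; the wrapper is just convenient when the scripts are shared and should stay untouched.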

zoranjovanovic-ns and others added 3 commits September 19, 2025 14:03

Remove usage of hipGetLastError from GpuLaunchKernel to avoid false (#3090) failures.

6fe260c

cherry pick from Merge pull request #3064 from ROCm/r2.19-rocm-enhanced-devicelibs [ROCm] Use bundled bitcode files

91a8cb8

picked from 2.19-enhanced

f746b2d

cj401-amd requested review from i-chaochen, hsharsha, draganmladjenovic and zoranjovanovic-ns September 19, 2025 17:56

zoranjovanovic-ns and others added 4 commits September 22, 2025 13:23

Remove usage of hipGetLastError from GpuLaunchKernel to avoid false (#3090) failures.

c3ae0ee

cherry pick from Merge pull request #3064 from ROCm/r2.19-rocm-enhanced-devicelibs [ROCm] Use bundled bitcode files

0b88cea

picked from 2.19-enhanced

0f3d0f8

Merge branch 'ci_cj-bp-fix-r2.20-rocm-enhanced' of github.com:ROCm/tensorflow-upstream into ci_cj-bp-fix-r2.20-rocm-enhanced

7ab9b1a
