@nrspruit
Contributor

@nrspruit nrspruit commented Nov 3, 2025

No description provided.

@nrspruit nrspruit changed the title from "[test] Update to v1.25.2 L0" to "[UR][CI] Update to v1.25.2 L0 with fixed L0 api symbols" Nov 3, 2025
- Due to symbol conflicts in the L0 gpu driver, zeCommandListAppendLaunchKernelWithArguments
  can only be used from the spec definition and the driver exp cannot be
  used as of the update to v1.14 supported symbols.

Signed-off-by: Neil R. Spruit <[email protected]>
@nrspruit nrspruit marked this pull request as ready for review November 3, 2025 21:33
@nrspruit nrspruit requested review from a team as code owners November 3, 2025 21:33
@PatKamin PatKamin removed their assignment Nov 4, 2025
@lukaszstolarczuk
Contributor

@intel/llvm-gatekeepers , please consider merging - we need this fix for benchmarks to fully work

@steffenlarsen steffenlarsen merged commit a0ba6ec into intel:sycl Nov 5, 2025
76 of 79 checks passed
@aelovikov-intel
Contributor

Some jobs in our pre-commit CI started to fail like this yesterday:

  <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
  <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
  <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
  <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments

which seems related to this PR. The jobs in question are "Dev IGC" and "ABI Compatibility". The latter, in particular, uses an older Docker container image and test binaries built with an official release. Then we copy in the newly built toolchain (think install/lib) but still use the old driver.

@nrspruit , @ldorau , @pbalcer , @lukaszstolarczuk ,

Can you confirm this PR is the reason for these failures? What can be done here?
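
To see at a glance which adapters fail and on which symbol, the loader lines above can be summarized with a small script. This is only a sketch: the regular expression is inferred from the four log lines quoted in this comment, and `missing_symbols` is a hypothetical helper name, not part of UR or the CI scripts.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines quoted above (an assumption, not a
# documented loader log format).
FAILED_LOAD = re.compile(
    r"failed to load adapter '(?P<adapter>[^']+)' with error: "
    r"(?P<path>\S+): undefined symbol: (?P<symbol>\S+)"
)


def missing_symbols(log_text):
    """Map each missing symbol to the set of adapter file names that need it."""
    result = defaultdict(set)
    for match in FAILED_LOAD.finditer(log_text):
        # 'libfoo.so.0' and '/abs/path/libfoo.so.0' refer to the same adapter,
        # so keep only the file name.
        name = match.group("adapter").rsplit("/", 1)[-1]
        result[match.group("symbol")].add(name)
    return dict(result)
```

Run on the log above, this should report a single missing symbol shared by both the `level_zero` and `level_zero_v2` adapters, which points at the loader the adapters resolve against rather than at either adapter individually.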

@ldorau
Contributor

ldorau commented Nov 6, 2025

> Some jobs in our pre-commit CI started to fail like this yesterday:
>
>   <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>
> which seems related to this PR. The jobs in question are "Dev IGC" and "ABI Compatibility". The latter, in particular, uses an older Docker container image and test binaries built with an official release. Then we copy in the newly built toolchain (think install/lib) but still use the old driver.
>
> @nrspruit , @ldorau , @pbalcer , @lukaszstolarczuk ,
>
> Can you confirm this PR is the reason for these failures? What can be done here?

Adding @PatKamin

@nrspruit
Contributor Author

nrspruit commented Nov 6, 2025

> Some jobs in our pre-commit CI started to fail like this yesterday:
>
>   <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>   <LOADER>[INFO]: failed to load adapter '/__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0' with error: /__w/llvm/llvm/toolchain/lib/libur_adapter_level_zero_v2.so.0: undefined symbol: zeCommandListAppendLaunchKernelWithArguments
>
> which seems related to this PR. The jobs in question are "Dev IGC" and "ABI Compatibility". The latter, in particular, uses an older Docker container image and test binaries built with an official release. Then we copy in the newly built toolchain (think install/lib) but still use the old driver.
>
> @nrspruit , @ldorau , @pbalcer , @lukaszstolarczuk ,
>
> Can you confirm this PR is the reason for these failures? What can be done here?

The only way that would fail is if the UR adapter was NOT built statically linking the L0 static loader v1.25.x+.

What is different between the workflows for Dev IGC and ABI compatibility? Why are they allowing the build of the adapters without statically linking the L0 Loader? All the CI runs in the PR passed because the loader is statically linked.
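
One way to check the static-linking question on a built adapter is to look at its undefined dynamic symbols: if the loader really was linked in statically, the symbol should not have to be resolved at load time. The sketch below assumes GNU `nm` is on `PATH`; `parse_nm` and `needs_at_runtime` are hypothetical helper names used only for illustration.

```python
import subprocess


def parse_nm(nm_output):
    """Split `nm` text output into (defined, undefined) sets of symbol names."""
    defined, undefined = set(), set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries have no address column: "<type> <name>",
        # where the type is U (or w/v for unresolved weak symbols).
        if len(fields) == 2 and fields[0] in {"U", "w", "v"}:
            undefined.add(fields[1].split("@", 1)[0])  # drop "@VERSION"
        # Defined entries look like "<address> <type> <name>".
        elif len(fields) == 3:
            defined.add(fields[2].split("@", 1)[0])
    return defined, undefined


def needs_at_runtime(library_path, symbol):
    """True if the shared library expects `symbol` from another library."""
    out = subprocess.run(
        ["nm", "-D", "--undefined-only", library_path],
        check=True, capture_output=True, text=True,
    ).stdout
    _, undefined = parse_nm(out)
    return symbol in undefined
```

If `needs_at_runtime` returned `True` for `libur_adapter_level_zero.so.0` and `zeCommandListAppendLaunchKernelWithArguments`, that would suggest the adapter in the failing jobs was linked against a dynamic loader after all.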

@aelovikov-intel
Contributor

> Why are they allowing the build of the adapters without statically linking the L0 Loader

I don't know anything about that. The Docker image for the "ABI Compatibility" tasks is built using the following workflow: https://github.com/intel/llvm/blob/sycl/.github/workflows/sycl-prebuilt-e2e-container.yml

And the "current" toolchain is built the same way as for any other E2E jobs in the pre-commit.

@aelovikov-intel
Contributor

Pre-commit CI is still broken, should we revert this PR due to no response from the author?

@nrspruit
Contributor Author

nrspruit commented Nov 7, 2025 via email

@sarnex
Contributor

sarnex commented Nov 7, 2025

I can help with any of the CI build questions, but I am incredibly busy, so I would need someone familiar with the UR/L0 code to tell me exactly what I need to be looking for. Some of the other devops people may have more free time.

@pbalcer
Contributor

pbalcer commented Nov 7, 2025

Here's my guess what's happening. In regular CI jobs, the level-zero loader is fetched dynamically from GitHub, and the loader is linked statically. This is independent of the adapter. See here:

It's possible that the jobs that are failing use a preinstalled dynamic loader. And then, at runtime, on the actual system where the job is executing, the symbol loading fails.

Long-term, the loader should ship and export two separate targets, static and dynamic, so that UR can always choose the static variant when linking with the adapter, regardless of how the loader is sourced.

@sarnex
Contributor

sarnex commented Nov 7, 2025

The docker images definitely have a preinstalled L0, the change in this PR updated the version of it, so it should be v1.25.2 now.

@pbalcer
Contributor

pbalcer commented Nov 7, 2025

I just built v1.25.2, and the symbol that's failing to load is exported:

pbalcer@gkdse-pre-dnp-02:~/level-zero/build/lib$ nm libze_loader.so | grep zeCommandListAppendLaunchKernelWithArguments
00000000003ec455 T zeCommandListAppendLaunchKernelWithArguments

Other explanations for the behavior we are seeing in CI are a) LD_LIBRARY_PATH being set to a location where a different, older, version of the loader is installed, b) rpath, c) ... ?
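
To probe explanation a), one can list every loader copy that sits on the search path, in the order the directories appear. This is a deliberately simplified sketch: it only walks a colon-separated path such as `LD_LIBRARY_PATH` and ignores rpath/runpath, the `ld.so` cache, and the default system directories, so it does not reproduce the dynamic linker's full lookup. `loader_candidates` is a hypothetical helper name.

```python
import os
from pathlib import Path


def loader_candidates(search_path, prefix="libze_loader.so"):
    """Return loader files found along a search path, in directory order."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            # Simplification: empty entries are skipped here.
            continue
        d = Path(directory)
        if d.is_dir():
            # Sort within a directory so the output is stable.
            hits.extend(sorted(p for p in d.iterdir() if p.name.startswith(prefix)))
    return hits


if __name__ == "__main__":
    for path in loader_candidates(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

If more than one directory shows up, running `nm` on each copy, as done above, would show whether an earlier directory holds an older loader that lacks the symbol.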

> Pre-commit CI is still broken, should we revert this PR due to no response from the author?

If we revert this, other parts of CI will stop working.

@sarnex
Contributor

sarnex commented Nov 7, 2025

Sorry, I think the only remaining failures are the ABI compatibility jobs and those do not have an updated L0, by design. I fixed IGC-Dev just by rebuilding the container.

Not sure what we should do for the ABI compatibility jobs. If we could make the compiler use the just-built L0 instead of the system one that should work I guess, not sure if that's a good solution though.

@pbalcer
Contributor

pbalcer commented Nov 7, 2025

> If we could make the compiler use the just-built L0 instead of the system one that should work I guess, not sure if that's a good solution though.

I considered that. We could change the CMake files to do that, but that would also affect other builds, and those may not be able to fetch stuff dynamically from GitHub.

@sarnex
Contributor

sarnex commented Nov 7, 2025

Could we set some CMake environment variable, or an actual CMake variable passed in at build time, so that it won't find the system one or won't even try? Not sure if there is a variable like that.

In the general case we should use the system one IMO.

@sarnex
Contributor

sarnex commented Nov 7, 2025

@aelovikov-intel
Contributor

Extra note: we're in the ABI-breaking window, so we'll have to stop testing for ABI compatibility soon-ish and might as well do it now, but I want to make sure that the issue that happened here won't happen in the future, so we do need to root-cause it.

@pbalcer
Contributor

pbalcer commented Nov 7, 2025

> https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_IGNORE_PATH.html ? No idea

No idea either, I experimented with a few things, but no luck. I think it may be best to add a cmake option that would force it to go to this path:

  if(NOT LEVEL_ZERO_LIB_NAME AND NOT LEVEL_ZERO_LIBRARY)

and then use it for the ABI compatibility jobs.

@sarnex
Contributor

sarnex commented Nov 7, 2025

Sounds good to me, the question is who has the bandwidth to implement it :P

@pbalcer
Contributor

pbalcer commented Nov 7, 2025

> Sounds good to me, the question is who has the bandwidth to implement it :P

It's 6pm Friday for me, so... xd Given what @aelovikov-intel said, and with IGC-Dev jobs fixed, this no longer seems that urgent. I'll ask someone in my team for help once we get back from holidays (Wednesday).

@sarnex
Contributor

sarnex commented Nov 7, 2025

I can just disable the ABI jobs until we can implement the fix. Is that okay @aelovikov-intel?

@aelovikov-intel
Contributor

> I can just disable the ABI jobs until we can implement the fix. Is that okay @aelovikov-intel?

#20597, waiting for the CI result before merging (not 100% sure it'll work like that).


8 participants