docs/adding_software/debugging_failed_builds.md (58 additions & 0 deletions)
@@ -238,6 +238,64 @@ After some time, this build fails while trying to build `Plumed`, and we can acc
!!! Note
    While this might be faster than the easystack-based approach, this is _not_ how the bot builds. So while it _may_ reproduce the failure the bot encounters, it may not reproduce the bug _at all_ (no failure) or run into _different_ bugs. If you want to be sure, use the easystack-based approach.
## Rebuilding software
[Rebuilding software](opening_pr.md#rebuilding_software) requires an additional step at the beginning: the software first needs to be removed. We assume you have already [checked out the feature branch](#fetching-the-feature-branch). Then start the container with the additional `--fakeroot` argument; otherwise you will not be able to remove files from the `/cvmfs` prefix. Make sure to also include the `--save` argument, as we will need the tarball later on. E.g.
Finally, run the `EESSI-remove-software.sh` script:

```
./EESSI-remove-software.sh
```
This should remove any software specified in a [rebuild easystack](opening_pr.md#rebuilding_software) that got added in your current feature branch.
Now, exit the container, paying attention to the instructions that are printed to resume later, e.g.:
```
Saved contents of tmp directory '/tmp/eessi.WZxeFUemH2' to tarball '/home/myuser/pr507/EESSI-1711538681.tgz' (to resume session add '--resume /home/myuser/pr507/EESSI-1711538681.tgz')
```
Now, continue with the original instructions to start the container (i.e. either [here](#starting-a-shell-in-the-eessi-container) or [with this alternate approach](#more-efficient-approach-for-multiplecontinued-debugging-sessions)) and make sure to add the `--resume` flag. This way, you resume from the tarball (i.e. with the software that has to be rebuilt already removed), but in a new container in which you have regular (i.e. non-root) permissions.
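To avoid copying the tarball path by hand, the `--resume` hint can be pulled out of the message printed on exit. The following is a minimal sketch, assuming the message has the same shape as the example above; the function name `resume_flag_from_message` is hypothetical and not part of the EESSI tooling.

```bash
# Hypothetical helper (not part of EESSI): extract the "--resume <tarball>" hint
# from the message that the container script prints on exit.
resume_flag_from_message() {
    # sed keeps only the text between the quotes after "to resume session add";
    # the final "grep ." makes the function fail if nothing was extracted.
    printf '%s\n' "$1" |
        sed -n "s/.*to resume session add '\(--resume [^']*\)'.*/\1/p" |
        grep .
}

# Example message, copied from the output shown above
msg="Saved contents of tmp directory '/tmp/eessi.WZxeFUemH2' to tarball '/home/myuser/pr507/EESSI-1711538681.tgz' (to resume session add '--resume /home/myuser/pr507/EESSI-1711538681.tgz')"
resume_flag_from_message "$msg"
```

For the example message this prints `--resume /home/myuser/pr507/EESSI-1711538681.tgz`, which can then be appended to the command used to start the container.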
## Running the test step
If you are still in the prefix layer (i.e. after previously building something), exit it first:
```
Environment set up to use EESSI (2023.06), have fun!
{EESSI 2023.06} Apptainer>
```
!!! Note
    If you are in a SLURM environment, make sure to run `for i in $(env | grep SLURM); do unset "${i%=*}"; done` to unset any SLURM environment variables. Failing to do so will cause `mpirun` to pick up on these and e.g. infer how many slots are available. If you run into errors of the form "There are not enough slots available in the system to satisfy the X slots that were requested by the application:", you probably forgot this step.
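The one-liner in the note splits the output of `env` on whitespace, so a value that contains spaces or an extra `=` can produce arguments that `unset` rejects. A slightly more defensive variant is sketched below. It assumes that only variables whose *name* starts with `SLURM` need to be removed, and the function name `unset_slurm_vars` is hypothetical.

```bash
# Hypothetical helper: unset every environment variable whose name starts with SLURM.
# It must be defined and called in the current shell (not run as a separate script),
# otherwise the variables are only unset in a child process.
unset_slurm_vars() {
    # sed prints only the variable name (the part before the first '='), and only
    # for lines that start with SLURM, so values never reach the loop.
    for name in $(env | sed -n 's/^\(SLURM[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$name"
    done
}

unset_slurm_vars
```

Afterwards, `env | grep '^SLURM'` should print nothing.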
Then, execute the `run_tests.sh` script. We are assuming you are still in the root of the `software-layer` repository that you cloned earlier:
```
./run_tests.sh
```
If all goes well, you should see (part of) the EESSI test suite being run by ReFrame, finishing with something like:
```
[ PASSED ] Ran X/Y test case(s) from Z check(s) (0 failure(s), 0 skipped, 0 aborted)
```
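When the test run is scripted, the summary line can be checked automatically. The sketch below assumes the summary has the same shape as the example above (the exact spacing and wording may differ between ReFrame versions); the function name `reframe_summary_ok` is hypothetical.

```bash
# Hypothetical helper: succeed only if the given output contains a PASSED
# summary line that reports 0 failures.
reframe_summary_ok() {
    printf '%s\n' "$1" |
        grep -Eq '^\[ +PASSED +\] Ran [0-9]+/[0-9]+ test case\(s\) from [0-9]+ check\(s\) \(0 failure\(s\)'
}

# Possible usage (assumes run_tests.sh prints the summary to stdout or stderr):
#   reframe_summary_ok "$(./run_tests.sh 2>&1)" && echo "tests passed"
```

The check is intentionally strict: any summary that does not start with a `PASSED` marker, or that reports a non-zero failure count, makes the function return a non-zero exit status.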
!!! Note
    If you are running on a system with hyperthreading enabled, you may still run into the "There are not enough slots available in the system to satisfy the X slots that were requested by the application:" error from `mpirun`, because Open MPI's `mpirun` does not consider hardware threads to be slots by default. In this case, run with `OMPI_MCA_hwloc_base_use_hwthreads_as_cpus=1 ./run_tests.sh` (for Open MPI 4.x) or `PRTE_MCA_rmaps_default_mapping_policy=:hwtcpus ./run_tests.sh` (for Open MPI 5.x).
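The choice between the two settings in the note can be sketched as a small helper that maps an Open MPI version to the matching environment variable assignment. The function name `hwthread_env_for_openmpi` is hypothetical, and the version string is assumed to be supplied by hand (e.g. taken from the name of the loaded module).

```bash
# Hypothetical helper: print the environment variable assignment from the note
# above that matches the given Open MPI version (e.g. "4.1.5" or "5.0.3").
hwthread_env_for_openmpi() {
    # "${1%%.*}" keeps only the major version (everything before the first dot)
    case "${1%%.*}" in
        4) echo "OMPI_MCA_hwloc_base_use_hwthreads_as_cpus=1" ;;
        5) echo "PRTE_MCA_rmaps_default_mapping_policy=:hwtcpus" ;;
        *) echo "no known setting for Open MPI version '$1'" >&2; return 1 ;;
    esac
}
```

It could be combined with the test script as `env "$(hwthread_env_for_openmpi 4.1.5)" ./run_tests.sh`, which is equivalent to typing the assignment in front of the command.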
## Known causes of issues in EESSI
### The custom system prefix of the compatibility layer
Note that the naming scheme is standardized and should be `eessi-<eessi_version>-eb-<eb_version>-<toolchain_version>.yml`. See the [official EasyBuild documentation on easystack files](https://docs.easybuild.io/easystack-files/) for more information on the syntax.
4) Stage and commit the changes into your branch with a sensible message
@@ -95,3 +96,26 @@ git push koala example_branch
If all goes well, one or more bots :robot: should almost instantly create a comment in your pull request
with an overview of how it is configured - you will need this information when providing build instructions.
### Rebuilding software
We typically do not rebuild software, since (strictly speaking) this breaks reproducibility for anyone using the software. However, there are certain situations in which it is difficult or impossible to avoid.
To do a rebuild, add the software you want to rebuild to a dedicated easystack file in the `rebuilds` directory. Use the following naming convention: `YYYYMMDD-eb-<EB_VERSION>-<APPLICATION_NAME>-<APPLICATION_VERSION>-<SHORT_DESCRIPTION>.yml`, where `YYYYMMDD` is the opening date of your PR. E.g. `20240506-eb-4.9.1-CUDA-12.1.1-ship-full-runtime.yml` was added in a PR on the 6th of May 2024 and was used to rebuild CUDA-12.1.1 using EasyBuild 4.9.1 to resolve an issue with some runtime libraries missing from the initial CUDA 12.1.1 installation.

At the top of your easystack file, please use comments to include a short description, and make sure to include any relevant links to related issues (e.g. from the GitHub repositories of EESSI, EasyBuild, or the software you are rebuilding).

As an example, consider the full easystack file (`20240506-eb-4.9.1-CUDA-12.1.1-ship-full-runtime.yml`) used for the aforementioned CUDA rebuild:
```yaml
# 2024.05.06
# Original matching of files we could ship was not done correctly. We were
# matching the basename for files (e.g., libcudart.so from libcudart.so.12)
# rather than the name stub (libcudart)
# See https://github.com/EESSI/software-layer/pull/559
easyconfigs:
  - CUDA-12.1.1.eb:
      options:
        accept-eula-for: CUDA
```
By separating rebuilds into dedicated files, we still maintain a complete software bill of materials: it is transparent what got rebuilt, for which reason, and when.
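The naming convention for rebuild easystack files can be sketched as a small shell helper. This is only an illustration: the function name `rebuild_easystack_name` is hypothetical and not part of the EESSI tooling, and it assumes the date is written as eight digits, following the `YYYYMMDD` pattern literally.

```bash
# Hypothetical helper: compose a rebuild easystack file name from its parts.
#   $1 opening date of the PR (YYYYMMDD)   $2 EasyBuild version
#   $3 application name                    $4 application version
#   $5 short description
rebuild_easystack_name() {
    # Assumption: the date consists of exactly eight digits
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "date must be given as YYYYMMDD" >&2; return 1 ;;
    esac
    printf '%s-eb-%s-%s-%s-%s.yml\n' "$1" "$2" "$3" "$4" "$5"
}

rebuild_easystack_name 20240506 4.9.1 CUDA 12.1.1 ship-full-runtime
```

For the CUDA rebuild described above, this prints `20240506-eb-4.9.1-CUDA-12.1.1-ship-full-runtime.yml`.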