Skip to content

Conversation

@rhc54
Copy link
Contributor

@rhc54 rhc54 commented Dec 12, 2017

If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.

Signed-off-by: Ralph Castain [email protected]

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 12, 2017

Errrr...the compiler crashed!

bot:ibm:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Dec 12, 2017
@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/9b9a01ace59e2c90133ff1ceae45391f

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 12, 2017

@karasevb @artpol84 @jjhursey It looks like there is something in the latest gds_dstore that the PGI compiler doesn't like. It consistently aborts with the same stacktrace.

Can you folks decipher the problem?

@artpol84
Copy link
Contributor

@rhc54 its hard to say what has happened. It looks to me that LLVM compiler itself failed with SEGFAULT:

pgcc-Fatal-/home/mpiczar/local/pgi-17.4/linuxpower/17.4/share/llvm/bin/opt TERMINATED by signal 11

@jjhursey can you help to read this error message.

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/763bea641c8e9426774f16dd4d1c2e3b

@gpaulsen
Copy link
Member

:retest:

@gpaulsen
Copy link
Member

retest

@gpaulsen
Copy link
Member

bot:retest

@gpaulsen
Copy link
Member

I've updated our CI nodes to use PGI 17.10 and I'm hopeful it will not segv.

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/aa81ca940c777214010112f95bdce4ad

@artpol84
Copy link
Contributor

Doesn't help.
Can you try to manually run the failing command and see if you can get the same behavior?

@jjhursey
Copy link
Member

It didn't pick up the new compiler install - Geoff is working on it.

@gpaulsen
Copy link
Member

bot:retest

@gpaulsen
Copy link
Member

3rd times the charm!

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/bfa3fd68985416c1bb8542ca222e9999

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 13, 2017

Before you burn a ton more time on this, let me try just doing the PMIx update separately and see if it is something simply borked in this PR.

@gpaulsen
Copy link
Member

I was able to cd into the build directory and do a simple pgcc -c -fpic gds_dstore_component.c -I. which didn't segv even at -O2 or -O3 (using pgi 17.10)

Let me know what I can try next.

@artpol84
Copy link
Contributor

Sounds like it's a compiler issue then. I don't think I can help much with it.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

hard to see how it can be a compiler problem since it works with everything else, but let's try it another way. I'm suspicious of some of the compiler directives in the latest dstore additions - may be something PGI doesn't like, though it begs the question of how it got thru the pmix testing.

@artpol84
Copy link
Contributor

Well, @gpaulsen just said that he was able to compile exactly this source with this compiler. I think that if compiler works well it should be deterministic and have same results all the time.
Maybe something in the environment breaks it, however again compiler shouldn't segfault but instead provide the error pointing to unknown directive.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Okay, so this PR minus the new dstore code builds just fine - see #4622. Now will try a totally fresh PR with the dstore code, just to see if something in this PR got borked.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

FWIW: I believe the problem is the __extension__ attribute used in the new code. This is a GCC extension and is not portable. We'll see what the other tests reveal.

@artpol84
Copy link
Contributor

@rhc54, thanks for checking that. I'd be curious to see if this is an issue.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Sure enough - confirmed that the dstore code is the problem with #4623. Will try removing those attributes.

Just FYI: if you look elsewhere in the code, we always use something like __pmix_extension__ and then test for the __extension__ attribute so we can blank it out if it isn't supported.

@artpol84
Copy link
Contributor

@artpol84
Copy link
Contributor

We have it currently in OMPI master

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Hmmm...interesting. Well, let's see what changed here.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Huh - well, it seems to me that those macro definitions are no longer of any value. They are just used in a simple set of functions. Let's try replacing them and see what happens.

@karasevb
Copy link
Member

I'm going to do this to avoid using macros in dstore

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

ok - let me know when you have it done and I'll update #4623 so we can retry

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

FWIW: it looks like the compiler is segfaulting due to return macro(macro) that invokes even more macros. We have seen this before where some compilers object to too much depth in the "return" statement. Best to avoid it.

@artpol84
Copy link
Contributor

I think it's worth to make a note to PGI developers.
@rhc54 @gpaulsen Do we know somebody to mention here?

@artpol84
Copy link
Contributor

If we going to - let's keep this PR as is so they can debug

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Not worth it, IMO. As I said, we've encountered such things before, and nearly every compiler has some level of confusion that causes the optimizer to barf. It looks like @karasevb hit a sensitive point on PGI here, but it could as easily have bothered some other compiler at some particular version.

Like I said, we have learned to keep the "return" calls simple. No reason to drive them into deep recursive macro calls.

I'll keep this PR because I want to use it to test @karasevb's changes. Once we get the code cleaned up so it passes, I'll close this and create a proper update.

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8140ba90cde48dc6e2af657ebfd15d81

@gpaulsen
Copy link
Member

I will work to get this reported to PGI. Please keep this PR for them until they can clone it.

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

I've reached the conclusion that this isn't a compiler bug - it now appears that the problem can be directly traced to use of "inlined" functions in a function pointer structure. Removing those directives allows things to build.

Frankly, I'm a tad surprised this didn't cause problems elsewhere - I'm not sure what an "inlined" function pointer actually means in that context. Talking to my compiler guru now - will pass along more later.

@artpol84
Copy link
Contributor

artpol84 commented Dec 14, 2017

Good catch @rhc54!
However when some time ago I was checking what will happen if you use inline function as a pointer the answer was - there will be a regular instance of a function that will have a pointer and it shouldn't be an issue. In the end inline is a recommendation, not a strict rule.
So if inline func is causing a segfault then it's a compiler bug for sure IMHO.

@artpol84
Copy link
Contributor

In any case - segfaulting compiler is a bug :)

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

It doesn't look like the compiler is segfaulting any more - it appears to me that it is simply reporting an error and aborting. I'm still investigating

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

Well, here is what my compiler guru saids - which I'll preface that any mistakes are introduced by me. Basically, inline is just a hint, and so implementation is really flexible and varies a great deal across compilers and even release versions within a compiler. Using an inline function in a function pointer is generally supported, BUT may require that you have an "extern inline" declaration somewhere so it can be found. Again, implementations vary, and some may simply reject it.

In this case, the compiler actually was reporting the error. It clearly states that the problem was too deep a recursion in the pmix_sm_store function, which is where the function pointers were being accessed and had recursive calls into inline functions. So it likely was a combination of an inlined function pointer plus multiple recursion that was being flagged.

The other issue he cited is that, due to all the flexibility, some compilers also take the hint to mean "keep the assembly for this block of code as a unit in the icache". It is the equivalent of the "register" directive for a variable. So if you flag large blocks of code, or have conditional jumps in them, it causes optimization issues due to cache thrashing, as @hjelmn noted.

It doesn't look like my last change fixed the problem, so there is still something wrong in that code. Sigh.

@artpol84
Copy link
Contributor

@rhc54 I see - I only remember the old errors. Curious to see the error output.

As I mentioned before and @gpaulsen confirmed the intention - it would be good to keep a segfaulting instance for PGI guys so they can fix their compiler. Do you think you can re-create it in a separate branch and let it to hang until they will try?

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 14, 2017

bot:ibm:retest

@jjhursey
Copy link
Member

My 2 cents - we should merge in the repaired PR so that if someone pick up OMPI and happens to have an impacted compiler then they don't see a build failure. Odds are they won't be in a position to upgrade their compiler.

But I'd like to make sure we preserve the version of the branch that shows the compiler crash so @gpaulsen or someone can take it back to PGI for inspection. Ideally we would have a small reproducer, but I don't know if we (1) know enough about why this is happening or (2) have the time right now to do so.

Ralph Castain and others added 2 commits December 18, 2017 06:53
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.

Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone

Signed-off-by: Ralph Castain <[email protected]>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 4810127)
@rhc54
Copy link
Contributor Author

rhc54 commented Dec 18, 2017

We clearly need to re-look at our infrastructure as this is getting to be far too common from multiple testers:

07:57:02 [htmlpublisher] Archiving HTML reports...
07:57:02 [htmlpublisher] Archiving at BUILD level /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build to /var/lib/jenkins/jobs/gh-ompi-master-pr/builds/6954/htmlreports/Coverity_Report
07:57:03 ERROR: Build step failed with exception
07:57:03 java.io.IOException: No space left on device
07:57:03 	at sun.nio.ch.FileDispatcherImpl.pwrite0(Native Method)
07:57:03 	at sun.nio.ch.FileDispatcherImpl.pwrite(FileDispatcherImpl.java:66)
07:57:03 	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
07:57:03 	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
07:57:03 	at sun.nio.ch.FileChannelImpl.writeInternal(FileChannelImpl.java:777)
07:57:03 	at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:763)
07:57:03 	at sun.nio.ch.FileChannelImpl.transferFromFileChannel(FileChannelImpl.java:635)
07:57:03 	at sun.nio.ch.FileChannelImpl.transferFrom(FileChannelImpl.java:707)
07:57:03 	at org.apache.tools.ant.util.ResourceUtils.copyResource(ResourceUtils.java:532)
07:57:03 	at org.apache.tools.ant.util.FileUtils.copyFile(FileUtils.java:559)
07:57:03 	at org.apache.tools.ant.taskdefs.Copy.doFileOperations(Copy.java:899)

@artpol84
Copy link
Contributor

@rhc54 the issue with 07:57:03 java.io.IOException: No space left on device is fixed.
I'll appreciate if you can shoot me an email if something is not working. It is likely that I can miss a comment on the github especially if this wasn't PR I was working on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants