-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[GR-39018] Debug id cleanup #4627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Fix for issue #4626 |
zakkak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than some minor comments.
Thank you @adinn for this refactoring/cleanup!
| } else { | ||
| return kind.getJavaName(); | ||
| } | ||
| return valueType().toJavaName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueType() has a check for type != null which means that it could potentially return null and that null value will not be handled here resulting in a crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. I have modified it to return an empty string if valueType() returns null.
| } | ||
|
|
||
| NativeImageDebugLocalValueInfo(String name, JavaValue value, int framesize, JavaKind kind, ResolvedJavaType type, int slot, int line) { | ||
| NativeImageDebugLocalValueInfo(String name, JavaValue value, int framesize, JavaKind kind, ResolvedJavaType t, int slot, int line) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| NativeImageDebugLocalValueInfo(String name, JavaValue value, int framesize, JavaKind kind, ResolvedJavaType t, int slot, int line) { | |
| NativeImageDebugLocalValueInfo(String name, JavaValue value, int framesize, JavaKind kind, ResolvedJavaType type, int slot, int line) { |
This should better be kept named as type for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes but I got a style check error because it shadows field type. I have renamed it to resolvedType.
| // } else if (targetMethod instanceof CustomSubstitutionMethod) { | ||
| // targetMethod = ((CustomSubstitutionMethod) targetMethod).getOriginal(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these commented out by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was deliberate and the comments have now been removed.
| */ | ||
| if (superClass != null) { | ||
| return getDeclaringClass(superClass, true).toJavaName(); | ||
| return getDeclaringClass(superClass, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return getDeclaringClass(superClass, true); | |
| return getOriginal(superClass); |
for consistency and to avoid an indirection
| return m; | ||
| } | ||
|
|
||
| private ResolvedJavaMethod originalMethod() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a duplicate of private static ResolvedJavaMethod getOriginal(HostedMethod hostedMethod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not exactly the same and the difference is very important.
This method is used to compute the /identity/ for a method presented anywhere in the DebugInfoProvider API. It unwraps from the Hosted domain to the JVMCI domain so that we ensure we get the same method whether the NativeImage or CodeCache presents it as a Hosted value or an original JVMCI value.
In the one special case of a SubstitutionMethod we traverse to the /original/ method. That ensures we consistently identify a substitution Bar.bar <--@Sub-- Foo.foo with method foo, not method bar. We need to do that so that in the type info model we only have the one method foo declared in the one class Foo. The debugger has to see a single, consistent view of the class and method. We don't want to see a method called bar attached to class Foo in one context and a method called foo attached to Foo in another context (and we certainly don't want to see a method called bar attached to class Bar). In fact we don't want to see class Bar referenced at all in the debug info type model (although we do want to see the file and source code for Bar.bar when debugging).
The only other case where we don't unwrap to a JVMCI method is when we hit some other form of substitution. In that case we use the unwrapped method is our id method. We don't ever traverse to the original here even when, in some cases, we can do so. That's for two reasons. Firstly, for custom substitutions the original is not normally of any relevance. The method code is synthesized, which means the original's file and bytecode is irrelevant. Secondly, the original may actually exist as a method in its own right, listed as belonging to the declaring class (this happens with custom substitution cases for foreign stub methods and reflective methods). If we used the original to identify the substitution we would end up conflating two or more independent methods (the debug model type scan detects this, generating an error because the same method is notified twice).
Method getOriginal(HostedMethod) (now renamed as getAnnotatedOrOriginal(HostedMethod)) is only used when determining a method's file or permissions. It also unwraps from the Hosted domain to the JVMCI domain. The difference is that in the special case of a SubstitutionMethod it traverses to the /annotated/ method. That's needed for file lookup because the correct file needs to be derived from the declaring class of the annotated method. It's also necessary as far as permissions are concerned because the compiler compiles the bytecode from the /annotated/ method so generates code whose permissions are determined by that bytecode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification Andrew. I missed the difference in my first pass.
.../src/com.oracle.svm.hosted/src/com/oracle/svm/hosted/image/NativeImageDebugInfoProvider.java
Outdated
Show resolved
Hide resolved
| String paramName = paramInfo.name(); | ||
| String paramTypeName = paramInfo.typeName(); | ||
| TypeEntry paramType = lookupType(paramInfo.valueType()); | ||
| String paramTypeName = paramType.getTypeName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Since paramTypeName is only used in the log calls I would inline it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I also did the same with 2 occurrences of local var valueTypeName and one of local var returnTypeName that were only used in log statements.
| /* Add details of fields and field types */ | ||
| debugArrayTypeInfo.fieldInfoProvider().forEach(debugFieldInfo -> this.processField(debugFieldInfo, debugInfoBase, debugContext)); | ||
| debugContext.log("typename %s element type %s base size %d length offset %d\n", typeName, elementTypeName, baseSize, lengthOffset); | ||
| debugContext.log("typename %s element type %s base size %d length offset %d\n", typeName, this.elementType.getTypeName(), baseSize, lengthOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that in other log messages the canonicalized name is used instead. I would suggest either using the canonicalized name everywhere or nowhere to keep things consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was ensuring that spaces in names were replaced with underscores. We no longer need to do this name correction so I have removed all calls.
|
Please pretty up the commit messages. Maybe fold the last 4 style related commits into one. |
.../src/com.oracle.svm.hosted/src/com/oracle/svm/hosted/image/NativeImageDebugInfoProvider.java
Outdated
Show resolved
Hide resolved
.../src/com.oracle.svm.hosted/src/com/oracle/svm/hosted/image/NativeImageDebugInfoProvider.java
Show resolved
Hide resolved
|
@olpaw @zakkak I have just force pushed an update that addresses all the review comments and squashes the commits down to something minimal and coherent. The push pre-squash passed all the gate tests so I think this is good to go for internal gate testing -- assuming you are happy with the corrections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Running in our CI now ...
|
@adinn, with micronaut and spring image builds I see stacktraces with your changes. They all look the same: If needed, I can provide a reproducer for you tomorrow. |
Yes, that would be very helpful. The NPE follows an assert that the relevant value is null i.e. a presumed invariant is not being met. I'm not clear how that is happening (since it is not supposed to happen). A reproducer will make tracking it down much easier |
Ok. I will provide that early next week. |
|
@adinn I have a reproducer. Unfortunately it's only failing sometimes (~ 1 out of 5) This will fail with To increase the chance to observe the image build error you can use the following hack diff --git a/vm/mx.vm/mx_vm_benchmark.py b/vm/mx.vm/mx_vm_benchmark.py
index 3ea672e0010..017bc3aba03 100644
--- a/vm/mx.vm/mx_vm_benchmark.py
+++ b/vm/mx.vm/mx_vm_benchmark.py
@@ -795,6 +795,13 @@ class NativeImageVM(GraalVm):
final_image_command = config.base_image_build_args + executable_name_args + (pgo_args if instrumented_iterations > 0 or (self.hotspot_pgo and os.path.exists(config.latest_profile_path)) else []) + ml_args
with stages.set_command(final_image_command) as s:
s.execute_command()
+ s.execute_command()
+ s.execute_command()
+ s.execute_command()
+ s.execute_command()
+ s.execute_command()
+ s.execute_command()
+ s.execute_command()
def run_stage_run(self, config, stages, out):
image_path = os.path.join(config.output_dir, config.final_image_name) |
|
@olpaw Thanks for providing the reproducer, Paul. I tried building repeatedly yesterday but have not yet been able to get the error to occur on my branch (never mind do so when I have enabled debugging or log trace). I'll keep trying and also see if rebasing to master makes a difference. |
@adinn, rebasing should not be necessary. I reproduced with the graalvm repo commit id e16413a I suspect A good idea might be to add a check to com.oracle.objectfile.debugentry.Range#setLocalValueInfo that fails immediately if localInfos[i] and localValueInfos[i] kinds ever mismatch. |
|
@olpaw Hi Paul, I have now seen a failure on one run but not when I had the debugger attached.
That may well be the case. I have generated full log trace for a run where there was no NPE and in the case of the method you trapped in the debugger it is clear that it has no line or locals info. So, the only local var info that gets included in the range hierarchy appears to be synthetically generated details of the method arguments for a top level range that covers the method prologue range. That is interesting because the range that shows up in your debugger screenshot includes local info for subrange |
|
@olpaw I have managed to catch the problem in the debugger and understand something of what is going on. Checking the hosted universe reveals that there are indeed two At the point where the error occurs the debug model contains a single
The compiled method being processed at the point where the error occurs has a primary range The primary contains 3 subranges:
Each of the subranges is a leaf range. The surprising thing about this is that these ranges are not all associated with method 1. The generator expects all top level leaf frames presented in the I have a suspicion why this is happening but before coming to that I'll explain how this broken invariant that leads to the NPE. The way it works is as follows: Each A subrange records value info for params/locals to the subrange using a Later on, when writing loc info for a top level method [aside: Actually the same thing happens recursively for each inlined call node where, once again the direct child nodes of a call node ought all to refer to the same identity method. However, this problem only appears to be happening at the top level] A related oddity may shed light on what is happening. I found is that the two methods are modeled in the debug info with two distinct Now, it's not the presence of two entries that is odd here. That makes sense given the current scheme of using a I managed to execute the relevant lookup code in the debugger and work out why this is happening. The value type recorded in the So, in the first place this accounts for why we don't see this NPE prior to the identity cleanup patch. In the previous implementation identification of types and methods was performed using the fully Secondly, I wonder if this false equivalence between the two methods might account for why the compiler is assembling a compiled method that includes frame info for method 4 as though it is derived from method 1. I would have expected the compiler to recognise that it has more than one method at play here but perhaps some resolution step is being fooled by the reported signature equivalence into thinking that there is only one source of method bytecode at play here? Anyway, I think the problem that needs fixing here is in the compiler. The thing that maes it difficutl to analyze what is needed is that in most runs compilation of |
|
@olpaw I have investigated further and traced the problem further back into the compilation stage. I trapped the problem case in the debugger at the point where subranges are notified in Resulting Range TreeThis breaks twice for the primary has the subrange at the first break has the subrange at the second break is Underlying CompilationResultThe first oddity appears when looking into what method is reported as having been compiled. This primary method comes from a By contrast, when we look at the Associated Infopoint/SourceMappingThe All of them report their owner method to be ConclusionSo, the problem here is that the primary range for the whole method and the synthetic range added to cover the method prologue are created using It seems there are two possible fixes:
I suspect 1. it will probably be relatively easy to implement. However, I am not sure whether it is correct. Perhaps there is some good reason why the code cache is choosing to report the compilation result as belonging to a different method than the one actually compiled? |
I agree, the issue is on the compiler side and not a bug in debuginfo generation. If we ever observe a @christianwimmer who would be the right person to look into this? |
Agreed. n.b. the details that really stand out here are:
That first item makes it seem as if the compiler has actually consumed the 'wrong' method wholesale rather than just inlined some of its bytecode. The second item make me suspicious that a conflation of the right and wrong methods occurred at some point where the analysis/compiler was indexing/looking up the method based on signature. |
Or perhaps more likely based on the symbol name derived from the signature? |
|
@olpaw Hi Paul. I have managed to debug the problem further and found the point where the compilation results get mixed up. I tweaked class I found that the problem occurs when this As evidence of that I evaluated the following code at entry to and obtained the following result By contrast, I evaluated the following code at the point where the and obtained the following result As you can see the two entries have been merged into one where the method with return type I am not exactly sure what is happening here but I suspect entries in the I am also not clear why this is only blowing up intermittently. However, I think detection of the mismatch may depend on the order in which the results get inserted into the tree. |
@adinn Congratulations you have found a bug that goes back a looong time. I suspect |
Well, eyeballing the code says yes but I checked anyway. I evaluated this at entry to The resulting list prints as: So, it's a yes :-) |
We need to fix this. |
Well, that's true. Distinct methods or types should be distinguished and I will raise a bug to that effect. However, I'm also concerned that by reporting the wrong signature the resolution failure is causing these two distinct methods to look as if they are actually the same, which is why they get conflated in the first place. Once the compareTo method gets fixed then we will end up in the situation where the debugger is reporting that class Most of all the presence of two identical methods in the type listing is going to look mighty strange to end users. Is there not also a need to fix this resolution problem so we don't end up with lookalike methods? |
Two methods with the same fully qualified name can occur for many reasons, so we need to be prepared for it and support it correctly. In the worst case if a tool really has a problem with it, then you can a synthetic suffix to the name in the dwarf information. But I hope we are not going to need that. |
|
And some background information on what could be the reason for a signature conflict: When a signature references a type that has linking errors, we downgrade that type to Lines 51 to 59 in 36618f0
|
Yes, I understand that. I mentioned it as the root cause for the conflation of the two methods in my diagnosis of the problem above. What concerned me was that this possibility of two (or more) methods collapsing to generate the same FQN might lead to other cases were methods are wrongly assimilated. Can we be sure the only problem here in the collation performed when creating the list of methods passed to the code cache. Or are there other parts of the SVM code where a FQN (equivalently, a method symbol), is used to establish identity. For example, could two methods with the same FQN both need to be visible via linker symbols? Also, I am left wondering why the analysis does not proceed far enough in this case to be able to resolve the return type of the method that gets misreported. It seems rather odd to me that the analysis fails to report an |
…ss rather than via class names.
…hod rather than their symbol name. style fixes
|
It has passed all the gates and was also merged already....thanks again for contributing, @adinn! |
|
@fniephaus Hurrah! Thanks for your help. |

This PR is the promised cleanup to the debug info generator that ensures the model of the class base used to drive debug info generation is built by identifying TypeEntry and MethodEntry records using ResolvedJavaClass and ResolvedJavaMethod instances rather than their associated names. This means the model can cope with multiple types that have the same name and cope with multiple methods attached to some given class that have the same name and signature.
The cleanup also modifies the Dwarf Properties model that associates Dwarf section offsets and other information with a TypeEntry, MethodEntry or Range. Properties records are still stored in hash tables. However, the hash tables are now keyed using the Type/MethodEntry objects rather than the type/method names. Also method Properties are stored in one top level hash table rather than being nested in hash tables embedded in the type Properties.
n.b. the properties model really needs a further cleanup to allow the properties to be hung directly off the TypeEntry, MethodEntry or Range rather than indexed using an auxiliary hash table. That will be addressed in a follow-up PR.