Eagerly loading/initializing pthread library to prevent a race that may crash JVM #9780
🎯 Code Coverage 🔗 Commit SHA: c96d579
Benchmarks
Startup
Parameters
See matching parameters
Summary
Found 0 performance improvements and 1 performance regressions! Performance is the same for 59 metrics, 5 unstable metrics.
Startup time reports for petclinic
gantt
title petclinic - global startup overhead: candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.02 s) : 0, 1019973
Total [baseline] (10.719 s) : 0, 10718636
Agent [candidate] (1.017 s) : 0, 1017226
Total [candidate] (10.786 s) : 0, 10786442
section appsec
Agent [baseline] (1.194 s) : 0, 1194234
Total [baseline] (10.788 s) : 0, 10787823
Agent [candidate] (1.198 s) : 0, 1197621
Total [candidate] (11.062 s) : 0, 11062415
section iast
Agent [baseline] (1.156 s) : 0, 1156250
Total [baseline] (11.16 s) : 0, 11159547
Agent [candidate] (1.149 s) : 0, 1149055
Total [candidate] (11.029 s) : 0, 11028752
section profiling
Agent [baseline] (1.163 s) : 0, 1163376
Total [baseline] (10.841 s) : 0, 10840856
Agent [candidate] (1.16 s) : 0, 1160109
Total [candidate] (11.06 s) : 0, 11060161
gantt
title petclinic - break down per module: candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.454 ms) : 0, 1454
crashtracking [candidate] (1.46 ms) : 0, 1460
BytebuddyAgent [baseline] (695.632 ms) : 0, 695632
BytebuddyAgent [candidate] (692.101 ms) : 0, 692101
GlobalTracer [baseline] (243.241 ms) : 0, 243241
GlobalTracer [candidate] (242.145 ms) : 0, 242145
AppSec [baseline] (32.176 ms) : 0, 32176
AppSec [candidate] (32.585 ms) : 0, 32585
Debugger [baseline] (6.266 ms) : 0, 6266
Debugger [candidate] (6.493 ms) : 0, 6493
Remote Config [baseline] (676.361 µs) : 0, 676
Remote Config [candidate] (694.949 µs) : 0, 695
Telemetry [baseline] (9.227 ms) : 0, 9227
Telemetry [candidate] (9.361 ms) : 0, 9361
Flare Poller [baseline] (10.124 ms) : 0, 10124
Flare Poller [candidate] (11.25 ms) : 0, 11250
section appsec
crashtracking [baseline] (1.48 ms) : 0, 1480
crashtracking [candidate] (1.468 ms) : 0, 1468
BytebuddyAgent [baseline] (718.384 ms) : 0, 718384
BytebuddyAgent [candidate] (720.37 ms) : 0, 720370
GlobalTracer [baseline] (234.681 ms) : 0, 234681
GlobalTracer [candidate] (235.35 ms) : 0, 235350
AppSec [baseline] (174.715 ms) : 0, 174715
AppSec [candidate] (175.259 ms) : 0, 175259
Debugger [baseline] (6.096 ms) : 0, 6096
Debugger [candidate] (6.191 ms) : 0, 6191
Remote Config [baseline] (647.368 µs) : 0, 647
Remote Config [candidate] (628.305 µs) : 0, 628
Telemetry [baseline] (8.565 ms) : 0, 8565
Telemetry [candidate] (8.428 ms) : 0, 8428
Flare Poller [baseline] (3.847 ms) : 0, 3847
Flare Poller [candidate] (3.801 ms) : 0, 3801
IAST [baseline] (24.669 ms) : 0, 24669
IAST [candidate] (24.997 ms) : 0, 24997
section iast
crashtracking [baseline] (1.464 ms) : 0, 1464
crashtracking [candidate] (1.469 ms) : 0, 1469
BytebuddyAgent [baseline] (818.594 ms) : 0, 818594
BytebuddyAgent [candidate] (814.028 ms) : 0, 814028
GlobalTracer [baseline] (232.157 ms) : 0, 232157
GlobalTracer [candidate] (231.073 ms) : 0, 231073
AppSec [baseline] (35.468 ms) : 0, 35468
AppSec [candidate] (35.161 ms) : 0, 35161
Debugger [baseline] (6.21 ms) : 0, 6210
Debugger [candidate] (6.139 ms) : 0, 6139
Remote Config [baseline] (616.19 µs) : 0, 616
Remote Config [candidate] (604.001 µs) : 0, 604
Telemetry [baseline] (8.858 ms) : 0, 8858
Telemetry [candidate] (8.63 ms) : 0, 8630
Flare Poller [baseline] (4.258 ms) : 0, 4258
Flare Poller [candidate] (4.199 ms) : 0, 4199
IAST [baseline] (27.058 ms) : 0, 27058
IAST [candidate] (26.249 ms) : 0, 26249
section profiling
ProfilingAgent [baseline] (109.194 ms) : 0, 109194
ProfilingAgent [candidate] (107.757 ms) : 0, 107757
crashtracking [baseline] (1.468 ms) : 0, 1468
crashtracking [candidate] (1.426 ms) : 0, 1426
BytebuddyAgent [baseline] (720.536 ms) : 0, 720536
BytebuddyAgent [candidate] (720.324 ms) : 0, 720324
GlobalTracer [baseline] (218.745 ms) : 0, 218745
GlobalTracer [candidate] (217.916 ms) : 0, 217916
AppSec [baseline] (32.224 ms) : 0, 32224
AppSec [candidate] (32.354 ms) : 0, 32354
Debugger [baseline] (6.675 ms) : 0, 6675
Debugger [candidate] (6.529 ms) : 0, 6529
Remote Config [baseline] (723.004 µs) : 0, 723
Remote Config [candidate] (770.011 µs) : 0, 770
Telemetry [baseline] (15.225 ms) : 0, 15225
Telemetry [candidate] (15.845 ms) : 0, 15845
Flare Poller [baseline] (4.912 ms) : 0, 4912
Flare Poller [candidate] (4.089 ms) : 0, 4089
Profiling [baseline] (109.812 ms) : 0, 109812
Profiling [candidate] (108.887 ms) : 0, 108887
Startup time reports for insecure-bank
gantt
title insecure-bank - global startup overhead: candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.027 s) : 0, 1026684
Total [baseline] (8.694 s) : 0, 8694468
Agent [candidate] (1.016 s) : 0, 1015546
Total [candidate] (8.652 s) : 0, 8652309
section iast
Agent [baseline] (1.154 s) : 0, 1153575
Total [baseline] (9.289 s) : 0, 9289323
Agent [candidate] (1.149 s) : 0, 1149308
Total [candidate] (9.286 s) : 0, 9286406
gantt
title insecure-bank - break down per module: candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.461 ms) : 0, 1461
crashtracking [candidate] (1.452 ms) : 0, 1452
BytebuddyAgent [baseline] (699.42 ms) : 0, 699420
BytebuddyAgent [candidate] (691.358 ms) : 0, 691358
GlobalTracer [baseline] (244.594 ms) : 0, 244594
GlobalTracer [candidate] (241.444 ms) : 0, 241444
AppSec [baseline] (32.649 ms) : 0, 32649
AppSec [candidate] (32.298 ms) : 0, 32298
Debugger [baseline] (6.377 ms) : 0, 6377
Debugger [candidate] (6.419 ms) : 0, 6419
Remote Config [baseline] (685.501 µs) : 0, 686
Remote Config [candidate] (693.666 µs) : 0, 694
Telemetry [baseline] (9.313 ms) : 0, 9313
Telemetry [candidate] (9.249 ms) : 0, 9249
Flare Poller [baseline] (10.93 ms) : 0, 10930
Flare Poller [candidate] (11.556 ms) : 0, 11556
section iast
crashtracking [baseline] (1.506 ms) : 0, 1506
crashtracking [candidate] (1.476 ms) : 0, 1476
BytebuddyAgent [baseline] (816.873 ms) : 0, 816873
BytebuddyAgent [candidate] (814.244 ms) : 0, 814244
GlobalTracer [baseline] (231.904 ms) : 0, 231904
GlobalTracer [candidate] (231.07 ms) : 0, 231070
AppSec [baseline] (35.417 ms) : 0, 35417
AppSec [candidate] (35.019 ms) : 0, 35019
Debugger [baseline] (6.087 ms) : 0, 6087
Debugger [candidate] (6.108 ms) : 0, 6108
Remote Config [baseline] (602.293 µs) : 0, 602
Remote Config [candidate] (607.498 µs) : 0, 607
Telemetry [baseline] (8.691 ms) : 0, 8691
Telemetry [candidate] (8.696 ms) : 0, 8696
Flare Poller [baseline] (4.306 ms) : 0, 4306
Flare Poller [candidate] (4.228 ms) : 0, 4228
IAST [baseline] (26.752 ms) : 0, 26752
IAST [candidate] (26.379 ms) : 0, 26379
Load
Parameters
See matching parameters
Summary
Found 0 performance improvements and 2 performance regressions! Performance is the same for 10 metrics, 12 unstable metrics.
Request duration reports for insecure-bank
gantt
title insecure-bank - request duration [CI 0.99] : candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section baseline
no_agent (4.352 ms) : 4302, 4402
. : milestone, 4352,
iast (9.639 ms) : 9468, 9810
. : milestone, 9639,
iast_FULL (14.25 ms) : 13961, 14540
. : milestone, 14250,
iast_GLOBAL (10.53 ms) : 10345, 10715
. : milestone, 10530,
profiling (8.859 ms) : 8723, 8994
. : milestone, 8859,
tracing (8.09 ms) : 7967, 8212
. : milestone, 8090,
section candidate
no_agent (4.403 ms) : 4353, 4453
. : milestone, 4403,
iast (9.374 ms) : 9221, 9527
. : milestone, 9374,
iast_FULL (14.952 ms) : 14651, 15254
. : milestone, 14952,
iast_GLOBAL (10.804 ms) : 10608, 10999
. : milestone, 10804,
profiling (8.676 ms) : 8532, 8820
. : milestone, 8676,
tracing (7.857 ms) : 7743, 7972
. : milestone, 7857,
Request duration reports for petclinic
gantt
title petclinic - request duration [CI 0.99] : candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section baseline
no_agent (37.95 ms) : 37640, 38261
. : milestone, 37950,
appsec (49.439 ms) : 49009, 49869
. : milestone, 49439,
code_origins (44.505 ms) : 44120, 44890
. : milestone, 44505,
iast (45.252 ms) : 44849, 45655
. : milestone, 45252,
profiling (48.791 ms) : 48364, 49218
. : milestone, 48791,
tracing (44.867 ms) : 44487, 45248
. : milestone, 44867,
section candidate
no_agent (37.238 ms) : 36946, 37529
. : milestone, 37238,
appsec (49.068 ms) : 48640, 49497
. : milestone, 49068,
code_origins (44.008 ms) : 43622, 44395
. : milestone, 44008,
iast (45.095 ms) : 44699, 45491
. : milestone, 45095,
profiling (48.625 ms) : 48172, 49078
. : milestone, 48625,
tracing (46.271 ms) : 45873, 46670
. : milestone, 46271,
Dacapo
Parameters
See matching parameters
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics.
Execution time for tomcat
gantt
title tomcat - execution time [CI 0.99] : candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section baseline
no_agent (1.472 ms) : 1461, 1484
. : milestone, 1472,
appsec (2.455 ms) : 2403, 2506
. : milestone, 2455,
iast (2.197 ms) : 2133, 2260
. : milestone, 2197,
iast_GLOBAL (2.243 ms) : 2179, 2306
. : milestone, 2243,
profiling (2.05 ms) : 1999, 2102
. : milestone, 2050,
tracing (2.023 ms) : 1974, 2072
. : milestone, 2023,
section candidate
no_agent (1.471 ms) : 1460, 1483
. : milestone, 1471,
appsec (2.447 ms) : 2396, 2497
. : milestone, 2447,
iast (2.197 ms) : 2134, 2260
. : milestone, 2197,
iast_GLOBAL (2.242 ms) : 2178, 2305
. : milestone, 2242,
profiling (2.046 ms) : 1995, 2097
. : milestone, 2046,
tracing (2.013 ms) : 1964, 2062
. : milestone, 2013,
Execution time for biojava
gantt
title biojava - execution time [CI 0.99] : candidate=1.55.0-SNAPSHOT~c96d5798e2, baseline=1.55.0-SNAPSHOT~92a857db10
dateFormat X
axisFormat %s
section baseline
no_agent (15.592 s) : 15592000, 15592000
. : milestone, 15592000,
appsec (15.029 s) : 15029000, 15029000
. : milestone, 15029000,
iast (18.59 s) : 18590000, 18590000
. : milestone, 18590000,
iast_GLOBAL (17.799 s) : 17799000, 17799000
. : milestone, 17799000,
profiling (15.093 s) : 15093000, 15093000
. : milestone, 15093000,
tracing (14.867 s) : 14867000, 14867000
. : milestone, 14867000,
section candidate
no_agent (15.625 s) : 15625000, 15625000
. : milestone, 15625000,
appsec (15.049 s) : 15049000, 15049000
. : milestone, 15049000,
iast (18.786 s) : 18786000, 18786000
. : milestone, 18786000,
iast_GLOBAL (18.168 s) : 18168000, 18168000
. : milestone, 18168000,
profiling (15.139 s) : 15139000, 15139000
. : milestone, 15139000,
tracing (15.212 s) : 15212000, 15212000
. : milestone, 15212000,
dd-java-agent/agent-bootstrap/src/main/java/datadog/trace/bootstrap/Agent.java
As @amarziali pointed out, I'd much prefer for crash-tracking to use a simpler java.io approach than have to apply hacks to forcibly load java.nio native code earlier than would normally be necessary.
Please see my comment: #9780 (comment) |
I saw that, I still believe this is the wrong fix - pre-loading Fixing crash-tracking to use |
|
@mcculls My worry is that we might unknowingly add some other code that loads the pthread library under the hood; if that happens off the main thread, we would end up with the same intermittent (although pretty rare) crash. I don't have a great solution for this, though :(
|
I would much prefer to address the known situation first, which should be a straightforward fix of just switching to use |
|
@jbachorik @zhengyu123 for example, calling We already avoid touching https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystems.html#getDefault-- |
|
@mcculls Fair point. Premain is tricky. I will try to think about how we could have a more systematic workaround, but OK, let's start by not using nio in the crashtracking initialization.
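To make the "classic java.io" direction above concrete, here is a minimal sketch. It assumes the initialization only needs to write and read back small text files. The class `ClassicIoScriptWriter` and its methods are hypothetical and not taken from the agent code base. The point is only that `java.io.File` and the stream classes cover this case without calling `java.nio.file.Files`, `Paths`, or `File.toPath()`.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper: file handling with java.io only (no java.nio.file usage).
public final class ClassicIoScriptWriter {
  private ClassicIoScriptWriter() {}

  /** Writes content to target and marks it executable for the owner; returns whether the flag was set. */
  public static boolean writeExecutable(File target, String content) throws IOException {
    File parent = target.getParentFile();
    if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
      throw new IOException("Cannot create directory " + parent);
    }
    // The charset is passed by name so no java.nio.* type appears in this class.
    try (Writer writer = new OutputStreamWriter(new FileOutputStream(target), "UTF-8")) {
      writer.write(content);
    }
    return target.setExecutable(true, true);
  }

  /** Reads the whole file back as a UTF-8 string. */
  public static String read(File source) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (Reader reader = new InputStreamReader(new FileInputStream(source), "UTF-8")) {
      char[] buffer = new char[4096];
      int n;
      while ((n = reader.read(buffer)) != -1) {
        sb.append(buffer, 0, n);
      }
    }
    return sb.toString();
  }
}
```

Whether this covers everything crash-tracking needs is not established in this thread; the sketch only illustrates the style of replacement being proposed.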
@mcculls Thanks for the insight. I did not realize the side-effect.
Could you articulate what the "very similar reasons" are for not touching I suspect that checking Thanks
Some frameworks set system properties to select a different JUL implementation, or a custom JMX builder. They may set these system properties on the command-line, but some webapp servers set them after Using JUL / JMX can also cause log-spam and startup delays when the chosen implementation class is not available (for example if the web-app expects to set up a context class-loader to load the implementation before JUL / JMX is used.) We've learnt the hard way that you have to be very careful about what is loaded during
We have a known data point - crash-tracking's use of With that in place we can then monitor the situation - I suspect that this will address the situation, i.e. no further action is required. Meanwhile the underlying JDK issue is being fixed and backported as we speak. |
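The ordering hazard described above can be illustrated with a small, self-contained sketch. `LazyFacility` is a hypothetical stand-in for any facility that resolves its implementation from a system property on first use and then caches it. The property name `demo.facility.impl` is invented for the example. It is not the real JUL or JMX lookup code.

```java
// Hypothetical demo: touching a lazily configured facility too early
// freezes its configuration before the application has set its properties.
public final class PremainOrderingDemo {
  private PremainOrderingDemo() {}

  /** Stand-in for a facility that picks its implementation once, on first use. */
  static final class LazyFacility {
    private static String chosen; // resolved once, then cached

    static synchronized String implementation() {
      if (chosen == null) {
        chosen = System.getProperty("demo.facility.impl", "default");
      }
      return chosen;
    }

    static synchronized void reset() {
      chosen = null;
    }
  }

  /** Simulates startup; returns the implementation the application ends up with. */
  static String run(boolean agentTouchesFacilityEarly) {
    System.clearProperty("demo.facility.impl");
    LazyFacility.reset();
    if (agentTouchesFacilityEarly) {
      LazyFacility.implementation(); // the "premain" code uses the facility first
    }
    System.setProperty("demo.facility.impl", "custom"); // the application configures it later
    return LazyFacility.implementation();
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // prints "custom"
    System.out.println(run(true)); // prints "default"
  }
}
```

In the second run the early call has already fixed the choice, so the application's later setting is silently ignored. This is the kind of side effect the comment warns about.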
I see. It was not my intention to introduce a side effect. I will try to find a replacement without side effects.
Yes, we have a known case, but we cannot guarantee that it covers all cases. We just happened to get the artifacts of this crash because it came from one of our internal services, and we were lucky that it crashed at a spot where we could pinpoint the defect. I have been seeing some very strange crashes, e.g. a TLS value suddenly disappearing, which may or may not relate to this defect, but it would be good to be sure. I believe the crash might be just one of the symptoms of this unsynchronized initialization. Another symptom that I can think of is that one thread sees partially initialized or overwrites the
Yes, we can backport the fix, but can we force our customers to upgrade? BTW, I am open to refactoring crashtracking, if using |
|
@zhengyu123 btw, if there's a feature in crash-tracking that cannot be reimplemented with along with a comment explaining why it was added and referencing the JDK bug. This should address the current issue while minimizing the overall impact - and avoids introducing a pre-load call for all users during |
|
I refactored crashtracking to only use classic The following are the threads running at the time of loading So, both assumptions:
are wrong, so I am closing this PR.
What Does This Do
Eagerly initializes java.nio on the main thread to avoid a race that may crash the JVM.
Motivation
Improve stability.
Additional Notes
This is a workaround for an upstream JDK bug: https://bugs.openjdk.org/browse/JDK-8345810
Contributor Checklist
type: and (comp: or inst:) labels in addition to any useful labels
close, fix or any linking keywords when referencing an issue. Use solves instead, and assign the PR milestone to the issue
Jira ticket: [PROJ-IDENT]
https://datadoghq.atlassian.net/browse/PROF-12749