Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support #412
Conversation
@kimoonkim is there an informative failure that can be detected, so that DNS lookup happens automatically if using the short name fails? Sort of like a fallback. Or is that failure scenario not possible to identify automatically?
@erikerlandson Good question. It's hard to automatically detect specifically whether the DNS lookup is needed. But we can automatically detect whether the HDFS namenode is configured and use that as an indication that HDFS support is needed, and flag-guard the entire HDFS support. Curious what others think about this approach.
I think it's a good idea to be able to enable/disable HDFS in general, especially if it has non-trivial performance implications |
@erikerlandson It just occurred to me that we can take an adaptive approach. If DNS returns hostnames that are the same as the cluster node names for a few hosts, then it is very likely that the cluster node names are full hostnames. Then we don't need to issue more DNS requests. Thoughts?
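The adaptive check suggested above could be sketched roughly as follows. This is an illustrative Python sketch, not the PR's Scala code; the function name and the injectable `resolve` parameter are hypothetical, with `socket.getfqdn` standing in for the DNS lookup.

```python
import socket

def node_names_are_fqdns(node_names, sample_size=3, resolve=socket.getfqdn):
    # Heuristic: resolve a few sampled node names. If DNS returns the same
    # strings back, the cluster is very likely already reporting full
    # hostnames, so the remaining nodes need no further DNS requests.
    sample = node_names[:sample_size]
    return all(resolve(name) == name for name in sample)
```

Injecting `resolve` keeps the sketch testable without touching a real DNS server.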
Yes. Unlike Spark on YARN, HDFS is an optional component for Spark on K8s. I don't think we want to force everyone to pay the performance cost. Should I incorporate the broader flag in this PR or can it be a separate PR?
it seems fine to me to incorporate the flag here, that ought to be a modest amount of code |
@erikerlandson @hustcat said in the issue that they use remote HDFS. (comment) So it may not work to automatically disable HDFS locality support based on whether the HDFS namenode is configured. (I'll make this change anyway because it would benefit others who don't use HDFS.) I think we still want to disable the DNS lookup by default. The bottom line is that the DNS lookup turned out to be too expensive to do inside this critical section of the code. We'll have to redesign this part while disabling it for now. Perhaps we can issue DNS lookups asynchronously and pick up the results at a later point, when they are available.
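The asynchronous idea above might look roughly like this. This is an illustrative Python sketch, not the PR's implementation; the class name, the `resolve` hook, and the fall-back-to-short-name behavior are all assumptions.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class AsyncHostnameCache:
    """Issue DNS lookups in the background and pick up results later,
    so the hot code path never blocks on DNS.

    Not thread-safe as written; a real implementation would need locking.
    """

    def __init__(self, resolve=socket.getfqdn, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._resolve = resolve
        self._pending = {}   # node name -> in-flight Future
        self._resolved = {}  # node name -> full hostname

    def full_name(self, node_name):
        # Return the cached full hostname if available; otherwise kick off
        # a background lookup and fall back to the short name for now.
        if node_name in self._resolved:
            return self._resolved[node_name]
        future = self._pending.get(node_name)
        if future is None:
            self._pending[node_name] = self._pool.submit(self._resolve, node_name)
        elif future.done():
            self._resolved[node_name] = future.result()
            del self._pending[node_name]
            return self._resolved[node_name]
        return node_name
```

The first call returns the short name immediately; once the background lookup completes, later calls see the full hostname.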
@kimoonkim that sounds like a good way to proceed |
@erikerlandson I was thinking about checking whether the HDFS namenode is configured or not. But on further thought, this approach seems error-prone. Sorry about misleading the discussion. I think disabling the DNS lookup code is the best fix we can do for now, while we redesign the approach in the background.
ash211 left a comment:
This seems like the right move until we can find a way to infer the flag.
Thanks folks for debugging!
rerun integration tests please
      " The driver can slow down and fail to respond to executor heartbeats in time." +
      " If enabling this flag, make sure your DNS server has enough capacity" +
      " for the workload.")
    .internal()
What is the purpose of the internal()?
This makes the flag non-public. From the code comment:
    @param isPublic if this configuration is public to the user. If it's false, this
                    configuration is only used internally and we should not expose it to users.
I think we want to redesign the DNS lookup code path so that this flag is not needed in the future. So keeping the flag internal allows us to remove it later without worrying whether any users depend on it.
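The effect of the internal flag on the lookup path can be sketched like this. This is a hedged Python illustration; the flag constant and function name are hypothetical stand-ins for the PR's Scala config entry.

```python
import socket

# Hypothetical stand-in for the internal config flag; the expensive DNS
# lookup path is off by default.
DNS_LOOKUP_ENABLED_DEFAULT = False

def cluster_node_full_name(node_name, resolve=socket.getfqdn,
                           dns_enabled=DNS_LOOKUP_ENABLED_DEFAULT):
    # Pay the DNS round-trip cost only when the flag is explicitly enabled;
    # otherwise trust the node name as reported by the cluster.
    return resolve(node_name) if dns_enabled else node_name
```

With the flag off, the short node name passes through untouched, so the critical section never blocks on DNS.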
Hmm, the Jenkins integration test is unhappy. I think I am seeing the same error that #389 reported about resource staging server timeout.
rerun integration tests please
rerun integration tests please
integration testing is probably going to have trouble passing until we fix #389, which should be soon |
All tests passed. Can someone merge this to unblock #416? Thanks!
Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support (apache-spark-on-k8s#412)
* Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support
* Clean up a bit
* Improve unit tests
@hustcat @duyanghao @foxish
What changes were proposed in this pull request?
Fixes #405 by flag-guarding potentially heavy DNS lookup RPCs.
How was this patch tested?
Modified existing unit tests to cover both flag-on and off cases.