
Conversation

@kimoonkim (Member)

@hustcat @duyanghao @foxish

What changes were proposed in this pull request?

Fixes #405 by flag-guarding potentially heavy DNS lookup RPCs.

How was this patch tested?

Modified existing unit tests to cover both the flag-on and flag-off cases.

@erikerlandson (Member)

@kimoonkim is there an informative failure that can be detected, so that DNS lookup happens automatically if using the short name fails? Sort of like a fallback. Or is that failure scenario not possible to identify automatically?

@kimoonkim (Member Author)

@erikerlandson Good question. It's hard to automatically detect whether a DNS lookup is specifically needed.

But we can automatically detect whether an HDFS namenode is configured, use that as an indication that HDFS support is needed, and flag-guard the entire HDFS support. Curious what others think about this approach.

@erikerlandson (Member)

I think it's a good idea to be able to enable/disable HDFS in general, especially if it has non-trivial performance implications.

@kimoonkim (Member Author)

@erikerlandson It just occurred to me that we can take an adaptive approach. If DNS returns hostnames that are the same as the cluster node names for a few sampled hosts, then it is very likely that the cluster node names are already full hostnames, and we don't need to issue more DNS requests. Thoughts?
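
A minimal sketch of that adaptive probe (the helper name and sample size are illustrative, not code from this PR), assuming the node names come from the cluster's node list:

```scala
import java.net.InetAddress
import scala.util.Try

// Hypothetical helper: resolve a few node names and check whether the
// canonical hostnames match the names we already have. If they do, the
// cluster is reporting full hostnames and per-node DNS lookups can be skipped.
def nodeNamesLookLikeFullHostnames(nodeNames: Seq[String], sampleSize: Int = 3): Boolean =
  nodeNames.take(sampleSize).forall { name =>
    Try(InetAddress.getByName(name).getCanonicalHostName).toOption.contains(name)
  }
```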

@kimoonkim (Member Author)

Yes. Unlike Spark on YARN, HDFS is an optional component for Spark on K8s. I don't think we want to force everyone to pay the performance cost.

Should I incorporate the broader flag in this PR or can it be a separate PR?

@erikerlandson (Member)

it seems fine to me to incorporate the flag here; that ought to be a modest amount of code

@kimoonkim (Member Author) commented Aug 3, 2017

@erikerlandson @hustcat said in the issue that they use a remote HDFS. (comment) So it may not work to automatically disable HDFS locality support based on whether an HDFS namenode is configured or not. (I'll make this change anyway because it would benefit others who don't use HDFS.)

I think we still want to disable the DNS lookup by default. The bottom line is that the DNS lookup turned out to be too expensive to do inside this critical section of the code. We'll have to redesign this part while disabling it for now. Perhaps we can issue DNS lookups asynchronously and pick up the results later, once they are available.
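
A rough sketch of that asynchronous idea (all names here are illustrative, not code from this PR), assuming a small dedicated thread pool and a simple cache:

```scala
import java.net.InetAddress
import java.util.concurrent.{ConcurrentHashMap, Executors}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

// Hypothetical cache: short node name -> resolved full hostname.
val resolved = new ConcurrentHashMap[String, String]()

// Keep lookups off the critical path on a dedicated pool.
implicit val dnsPool: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Fire-and-forget lookup; the result shows up in the cache when ready.
def resolveAsync(nodeName: String): Future[Unit] =
  Future {
    Try(InetAddress.getByName(nodeName).getCanonicalHostName)
      .foreach(full => resolved.put(nodeName, full))
  }

// The hot path never blocks: use the cached result if present,
// otherwise fall back to the short name for now.
def fullHostname(nodeName: String): String =
  Option(resolved.get(nodeName)).getOrElse(nodeName)
```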

@erikerlandson (Member)

@kimoonkim that sounds like a good way to proceed

@kimoonkim (Member Author)

@erikerlandson I was thinking about checking whether the HDFS namenode is configured in the SparkContext.hadoopConfiguration, i.e. checking whether the config key fs.defaultFS appears as hdfs://HOST:PORT. For non-HDFS users, the value would be one for the local file system, such as file:///, or one for S3, such as s3a:///, etc.
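
As a rough sketch (the helper name is made up for illustration), that check might have looked like:

```scala
import java.net.URI
import org.apache.spark.SparkContext

// Hypothetical helper: true when the default filesystem is HDFS.
def defaultFsIsHdfs(sc: SparkContext): Boolean = {
  val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS", "file:///")
  URI.create(defaultFs).getScheme == "hdfs"
}
```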

But on further thought, it seems this approach is error-prone. As the name defaultFS indicates, it covers only the default filesystem. Job users can override it by putting a namenode address directly in any file path for a job's input/output data, e.g. hdfs://HOST:PORT/DIR1/FILE1. In that case, the auto-detection logic would disable the HDFS support code when it is actually needed.

Sorry about misleading the discussion. I think disabling the DNS lookup code is the best fix we can do for now, while redesigning the approach in the background.

@ash211 left a comment


This seems like the right move until we can find a way to infer the flag.

Thanks folks for debugging!

@foxish mentioned this pull request Aug 3, 2017
@kimoonkim (Member Author)

rerun integration tests please

" The driver can slow down and fail to respond to executor heartbeats in time." +
" If enabling this flag, make sure your DNS server has enough capacity" +
" for the workload.")
.internal()
Member

what is the purpose of the internal()?

@kimoonkim (Member Author)

This makes the flag non-public. From the code comment:

@param isPublic if this configuration is public to the user. If it's false, this
configuration is only used internally and we should not expose it to users.

I think we want to redesign the DNS lookup code path so that this flag is not needed in the future. Keeping the flag internal allows us to remove it later without worrying whether some users depend on it.
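
For context, here is a minimal sketch of how an internal config entry is declared with Spark's ConfigBuilder; the key name and doc text below are illustrative (reusing the fragment quoted above), not necessarily the exact ones in this PR:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical key name; the flag defaults to off, per the discussion above.
private[spark] val CLUSTER_NODE_DNS_LOOKUP_ENABLED =
  ConfigBuilder("spark.kubernetes.hdfs.clusterNodeNameDNSLookup.enabled")
    .doc("Whether to look up DNS for the full hostnames of cluster nodes." +
      " The driver can slow down and fail to respond to executor heartbeats in time." +
      " If enabling this flag, make sure your DNS server has enough capacity" +
      " for the workload.")
    .internal() // non-public: can be removed later without breaking users
    .booleanConf
    .createWithDefault(false)
```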

@kimoonkim (Member Author)

Hmm, the Jenkins integration test is unhappy. I think I am seeing the same error that #389 reported about resource staging server timeout.

@foxish mentioned this pull request Aug 3, 2017
@kimoonkim (Member Author)

rerun integration tests please

@liyinan926 (Member)

rerun integration tests please

@erikerlandson (Member)

integration testing is probably going to have trouble passing until we fix #389, which should be soon

@liyinan926 (Member)

All tests passed. Can someone merge this to unblock #416? Thanks!

@erikerlandson merged commit e3cfaa4 into apache-spark-on-k8s:branch-2.2-kubernetes Aug 8, 2017
erikerlandson pushed a commit that referenced this pull request Aug 8, 2017
…DFS locality support (#412) (#421)

* Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support

* Clean up a bit

* Improve unit tests
@kimoonkim deleted the flag-guard-hdfs-dns-lookup branch August 9, 2017 22:38
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…DFS locality support (apache-spark-on-k8s#412)

* Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support

* Clean up a bit

* Improve unit tests