[SPARK-29189][SQL] Add an option to ignore block locations when listing file #25869
Conversation
(cherry picked from commit cdef51c166fbbb1321231bbfd6a7359ccbb3109c)
ok to test
Thank you for making a PR, @wangshisan.
Test build #111071 has finished for PR 25869 at commit
Could you describe your "pretty common" environment more? If it is pretty common, it doesn't sound private. Otherwise, you could test this PR in some similar public environment that everyone can access.
In our PROD env we have a pure Spark cluster where computation is separated from the storage layer; I think this is also pretty common.
Sorry, I didn't make myself clear.
Yes, I see. A new API call was introduced in #24175, and it does improve things a lot. However, the new API still fetches all the block location information, and in our benchmark it could take tens of seconds to fetch all of it for a huge table even with the new API.
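For context, a rough sketch of the two Hadoop FileSystem listing paths being discussed; the directory path here is illustrative, not from the PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val dir = new Path("/warehouse/huge_table") // illustrative path

// Pre-#24175 pattern: list plain statuses, then issue one extra call
// per file to fetch its block locations.
val statuses = fs.listStatus(dir)
statuses.foreach { status =>
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
}

// #24175 pattern: a single listing call whose LocatedFileStatus results
// already embed block locations. Fewer round trips, but the block
// location information is still fetched for every file.
val iter = fs.listLocatedStatus(dir)
while (iter.hasNext) {
  val located = iter.next()
  val blocks = located.getBlockLocations
}
```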
@wangshisan, we need new UTs for this new feature. Could you add some?
After adding UTs, please update the PR description's test section, too. If you add UTs, you can say that
In general,
New UTs are added.
Test build #111136 has finished for PR 25869 at commit
cc @squito
Test build #111147 has finished for PR 25869 at commit
Test build #111194 has finished for PR 25869 at commit
retest this please
Test build #111215 has finished for PR 25869 at commit
@HyukjinKwon After PR #24672 was merged.
Test build #111319 has finished for PR 25869 at commit
For the proposed case, this PR looks correct to me.
cc @gatorsmile, @cloud-fan, @JoshRosen. Could you review this new option?
Test build #111429 has finished for PR 25869 at commit
Makes sense to me. I do think it would be nice to still have a way to get locality preferences, depending on the filesystem. I see "semi-disagg" setups where the compute cluster still has HDFS; it's just small and only meant for temporary data. But I don't know how common that is, and this seems like a worthwhile improvement in any case.
Test build #111552 has finished for PR 25869 at commit
Jenkins, retest this please
Test build #111621 has finished for PR 25869 at commit
Just a minor comment on the doc text; otherwise LGTM.
Hi, @wangshisan. Could you address @squito's comment?
LGTM (except for @squito's comment).
Done. Please help review, @dongjoon-hyun.
Test build #111818 has finished for PR 25869 at commit
Merged to master. Thanks, @wangshisan!
@wangshisan I assigned the JIRA issue to the same user ID that reported it; I assumed that was you. If not, please let me know and I can fix it.
Thank you, @wangshisan and @squito!
.createWithDefault(10000)
val IGNORE_DATA_LOCALITY =
  buildConf("spark.sql.sources.ignore.datalocality")
The conf naming looks a little weird... Compared with the other SQL confs, this should be renamed to spark.sql.sources.ignoreDataLocality.enabled. cc @cloud-fan
Yea, please be careful about the namespaces created by new config names. "ignore" is definitely not a good namespace.
@wangshisan Could you submit a follow-up PR to rename it?
Sorry, you are right; I should have paid more attention to this. I have opened a PR to fix the naming: #26056
Late LGTM too, except for the naming.
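For reference, a sketch of what the renamed entry might look like, following the existing SQLConf conventions; the doc text and default value here are assumptions, not the code from #26056:

```scala
// Hypothetical SQLConf entry under the renamed key suggested above.
val IGNORE_DATA_LOCALITY =
  buildConf("spark.sql.sources.ignoreDataLocality.enabled")
    .doc("If true, Spark does not fetch block location information while " +
      "listing files, trading locality-aware scheduling for faster listing.")
    .booleanConf
    .createWithDefault(false)
```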
What changes were proposed in this pull request?
In our PROD env we have a pure Spark cluster, where computation is separated from the storage layer; I think this is also a pretty common setup. In such a deploy mode, data locality is never achievable.
There are configurations in the Spark scheduler to reduce the waiting time for data locality (e.g. "spark.locality.wait"). The problem is that, in the file-listing phase, the location information for all the files, with all the blocks inside each file, is fetched from the distributed file system. In a PROD environment, a table can be so huge that even fetching all this location information takes tens of seconds.
To improve this scenario, Spark should provide an option to ignore data locality entirely: all we need in the file-listing phase are the file locations, without any block location information.
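As a minimal usage sketch, one might enable the option like this (the key is the one introduced by this PR, later renamed in #26056; the application name and table path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("disaggregated-storage-job") // illustrative name
  // Skip fetching per-block location information during file listing.
  .config("spark.sql.sources.ignore.datalocality", "true")
  .getOrCreate()

// File listing for this read no longer pulls block locations for each file.
spark.read.parquet("/warehouse/huge_table").count() // illustrative path
```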
Why are the changes needed?
We ran a benchmark in our PROD env; after ignoring the block locations, we saw a substantial improvement.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Via unit tests added in FileIndexSuite.