[SPARK-29189][SQL] Add an option to ignore block locations when listing file #25869
Conversation
(cherry picked from commit cdef51c166fbbb1321231bbfd6a7359ccbb3109c)
ok to test
Thank you for making a PR, @wangshisan.
Test build #111071 has finished for PR 25869 at commit
Could you describe your "pretty common" environment more? If it is pretty common, it doesn't sound private. Otherwise, you could test this PR in some similar public environment that everyone can access.
In our PROD env we have a pure Spark cluster where computation is separated from the storage layer; I think this is also pretty common.
Sorry, I didn't make myself clear.
Yes, I see. A new API call was introduced in #24175, and it does improve things a lot. However, the new API still fetches all the block location information, and in our benchmark it could take tens of seconds to fetch all of it for a huge table even with the new API.
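For context, a rough sketch of the two Hadoop FileSystem listing paths being discussed; the directory path here is illustrative, not from the PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val dir = new Path("/warehouse/huge_table") // illustrative path

// Pre-#24175 pattern: list plain statuses, then issue one extra call
// per file to fetch its block locations.
val statuses = fs.listStatus(dir)
statuses.foreach { status =>
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
}

// #24175 pattern: a single listing call whose LocatedFileStatus results
// already embed block locations. Fewer round trips, but the block
// location information is still fetched for every file.
val iter = fs.listLocatedStatus(dir)
while (iter.hasNext) {
  val located = iter.next()
  val blocks = located.getBlockLocations
}
```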
@wangshisan, we need new UTs for this new feature. Could you add some?
After adding UTs, please update the PR description's test section, too. If you add UTs, you can say that
In general,
New UTs are added.
Test build #111136 has finished for PR 25869 at commit
cc @squito
Test build #111147 has finished for PR 25869 at commit
Test build #111194 has finished for PR 25869 at commit
retest this please
Test build #111215 has finished for PR 25869 at commit
@HyukjinKwon After PR #24672 was merged.
Test build #111319 has finished for PR 25869 at commit
For the proposed case, this PR looks correct to me.
cc @gatorsmile, @cloud-fan, @JoshRosen. Could you review this new option?
Test build #111429 has finished for PR 25869 at commit
Makes sense to me. I do think it would be nice to still have a way to get locality preferences, depending on the filesystem. I see "semi-disagg" setups where the compute cluster still has HDFS; it's just small and only meant for temporary data. But I don't know how common that is, and this seems like a worthwhile improvement in any case.
Test build #111552 has finished for PR 25869 at commit
Jenkins, retest this please
Test build #111621 has finished for PR 25869 at commit
Just a minor comment on the doc text; otherwise LGTM.
Hi, @wangshisan. Could you address @squito's comment?
LGTM (except for @squito's comment).
Done. Please help review, @dongjoon-hyun.
Test build #111818 has finished for PR 25869 at commit
Merged to master. Thanks, @wangshisan!
@wangshisan I assigned the JIRA issue to the same user ID that reported it; I assumed that was you. If not, please let me know and I can fix it.
Thank you, @wangshisan and @squito!
.createWithDefault(10000)
val IGNORE_DATA_LOCALITY =
  buildConf("spark.sql.sources.ignore.datalocality")
The conf naming looks a little weird... Compared with the other SQL confs, this should be renamed to spark.sql.sources.ignoreDataLocality.enabled. cc @cloud-fan
Yea, please be careful about the namespaces created by new config names. "ignore" is definitely not a good namespace.
@wangshisan Could you submit a follow-up PR to rename it?
Sorry, you are right; I should have paid more attention to this. I have opened a PR to fix the naming: #26056
Late LGTM too, except for the naming.
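For reference, a sketch of what the renamed entry might look like, following the existing SQLConf conventions; the doc text and default value here are assumptions, not the code from #26056:

```scala
// Hypothetical SQLConf entry under the renamed key suggested above.
val IGNORE_DATA_LOCALITY =
  buildConf("spark.sql.sources.ignoreDataLocality.enabled")
    .doc("If true, Spark does not fetch block location information while " +
      "listing files, trading locality-aware scheduling for faster listing.")
    .booleanConf
    .createWithDefault(false)
```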
What changes were proposed in this pull request?
In our PROD env we have a pure Spark cluster, where computation is separated from the storage layer; I think this is also a pretty common setup. In such a deploy mode, data locality is never achievable.
There are configurations in the Spark scheduler to reduce the waiting time for data locality (e.g. "spark.locality.wait"). The problem is that, in the file-listing phase, the location information for all the files, with all the blocks inside each file, is fetched from the distributed file system. In a PROD environment, a table can be so huge that even fetching all this location information takes tens of seconds.
To improve this scenario, Spark should provide an option to ignore data locality entirely: all we need in the file-listing phase are the file locations, without any block location information.
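As a minimal usage sketch, one might enable the option like this (the key is the one introduced by this PR, later renamed in #26056; the application name and table path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("disaggregated-storage-job") // illustrative name
  // Skip fetching per-block location information during file listing.
  .config("spark.sql.sources.ignore.datalocality", "true")
  .getOrCreate()

// File listing for this read no longer pulls block locations for each file.
spark.read.parquet("/warehouse/huge_table").count() // illustrative path
```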
Why are the changes needed?
We ran a benchmark in our PROD env; after ignoring the block locations, we saw a substantial improvement.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Via unit tests added in FileIndexSuite.