
Conversation

@wangshisan
Contributor

@wangshisan wangshisan commented Sep 20, 2019

What changes were proposed in this pull request?

In our PROD environment we run a pure Spark cluster where computation is separated from the storage layer; I think this is also pretty common. In such a deployment, data locality is never achievable.
Spark's scheduler already has configurations to reduce the wait for data locality (e.g. "spark.locality.wait"). The problem is that, during the file listing phase, the location information of every file, including every block inside each file, is still fetched from the distributed file system. In a PROD environment a table can be so huge that fetching all of this location information alone takes tens of seconds.
To improve this scenario, Spark should provide an option to ignore data locality entirely: in the file listing phase we then only need the file paths, without any block location information.
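
As a rough sketch (not a definitive usage guide), this is how such an option could be enabled from an application, e.g. in spark-shell. It assumes the config key as it appears in this PR's diff ("spark.sql.sources.ignore.datalocality"); the key is renamed in a later follow-up, and the table path is hypothetical.

    import org.apache.spark.sql.SparkSession

    // Skip block-location lookups while listing files for scans.
    val spark = SparkSession.builder()
      .appName("listing-without-block-locations")
      .config("spark.sql.sources.ignore.datalocality", "true")
      .getOrCreate()

    // Listing the files of this (hypothetical) table now only needs file paths,
    // not the hosts of every block.
    val df = spark.read.parquet("hdfs:///warehouse/huge_table")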

Why are the changes needed?

We ran a benchmark in our PROD environment; after ignoring block locations, we saw a substantial improvement.

| Table Size | Total File Number | Total Block Number | List File Duration (With Block Location) | List File Duration (Without Block Location) |
|-----------|-------------------|--------------------|------------------------------------------|---------------------------------------------|
| 22.6 T    | 30000             | 120000             | 16.841s                                  | 1.730s                                      |
| 28.8 T    | 42001             | 148964             | 10.099s                                  | 2.858s                                      |
| 3.4 T     | 20000             | 20000              | 5.833s                                   | 4.881s                                      |

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Via unit tests; new UTs were added.

(cherry picked from commit cdef51c166fbbb1321231bbfd6a7359ccbb3109c)
@wangyum
Member

wangyum commented Sep 20, 2019

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29189] Add an option to ignore block locations when listing file [SPARK-29189][SQL] Add an option to ignore block locations when listing file Sep 20, 2019
@dongjoon-hyun
Member

Thank you for making a PR, @wangshisan.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111071 has finished for PR 25869 at commit e07c230.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you describe your "pretty common" environment more? If it really is pretty common, it doesn't sound private; otherwise, you could test this PR in a similar public environment everyone can access.

In our PROD environment we run a pure Spark cluster where computation is separated from the storage layer; I think this is also pretty common.

@wangshisan
Contributor Author

wangshisan commented Sep 21, 2019

Sorry, I didn't make myself clear.
Our Spark cluster is deployed separately from the HDFS cluster: all the data is stored in another HDFS cluster, and the two clusters share no physical nodes.
What I mean is that this deployment mode, separating the Spark cluster from the storage cluster (HDFS or some other distributed file system), is pretty common. In such a Spark cluster, data locality is meaningless because it is never achievable.

Member

@HyukjinKwon HyukjinKwon left a comment

A similar fix was raised at #24175. That was closed in favour of #24672.

Do you still hit the issue with that fix?

@wangshisan
Contributor Author

Yes, I see. A new API call was introduced in #24175, and it does improve things a lot. However, the new API still fetches all the block location information, and in our benchmark it can still take tens of seconds to fetch all of it for a huge table.
In my opinion, if a Spark cluster is deployed completely physically separated from the HDFS cluster, we do not need any of this block location information. That is what this PR is for; the sketch below illustrates the difference between the two listing calls.
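
As a rough illustration (not the exact code path inside Spark, just the underlying Hadoop FileSystem calls the discussion is about): listStatus returns plain file metadata only, while listLocatedStatus also resolves the host list of every block, which is the extra cost we want to avoid. The table path is hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    object ListingSketch {
      def main(args: Array[String]): Unit = {
        val dir = new Path("hdfs:///warehouse/huge_table") // hypothetical table location
        val fs  = dir.getFileSystem(new Configuration())

        // Plain listing: file paths and sizes only, no block-location lookups.
        fs.listStatus(dir).foreach { s =>
          println(s"${s.getPath} len=${s.getLen}")
        }

        // Located listing: each entry also carries the hosts of every block,
        // which is what becomes expensive for tables with many blocks.
        val it = fs.listLocatedStatus(dir)
        while (it.hasNext) {
          val s = it.next()
          println(s"${s.getPath} blocks=${s.getBlockLocations.length}")
        }
      }
    }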

@dongjoon-hyun
Member

@wangshisan, we need new UTs for this new feature. Could you add some?

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 21, 2019

After adding UTs, please update the test section of the PR description, too. If you add UTs, you can say that new UTs are added.

How was this patch tested?

In our PROD environment.

In general, "In our PROD environment" usually ends up with -1. 😄

@wangshisan
Contributor Author

wangshisan commented Sep 22, 2019

New UTs are added.

@SparkQA

SparkQA commented Sep 22, 2019

Test build #111136 has finished for PR 25869 at commit b7f9f03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

cc @squito

@SparkQA

SparkQA commented Sep 22, 2019

Test build #111147 has finished for PR 25869 at commit 0da2903.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2019

Test build #111194 has finished for PR 25869 at commit a659e67.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Sep 23, 2019

retest this please

@SparkQA

SparkQA commented Sep 23, 2019

Test build #111215 has finished for PR 25869 at commit a659e67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun self-requested a review September 23, 2019 23:07
@wangshisan
Contributor Author

@HyukjinKwon After PR #24672 was merged.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111319 has finished for PR 25869 at commit eb1a802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun self-requested a review September 25, 2019 12:58
Member

@dongjoon-hyun dongjoon-hyun left a comment

For the proposed case, this PR looks correct to me.

cc @gatorsmile, @cloud-fan, @JoshRosen. Could you review this new option?

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111429 has finished for PR 25869 at commit 475abba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Sep 27, 2019

Makes sense to me.

I do think it would be nice to still have a way to get locality preferences, depending on the filesystem. I see "semi-disagg" setups where the compute cluster still has HDFS; it's just small and only meant for temporary data. But I don't know how common that is, and this seems like a worthwhile improvement in any case.

@SparkQA

SparkQA commented Sep 29, 2019

Test build #111552 has finished for PR 25869 at commit e500bcd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Sep 30, 2019

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 30, 2019

Test build #111621 has finished for PR 25869 at commit e500bcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@squito squito left a comment

just a minor comment on the doc text, otherwise lgtm

@dongjoon-hyun
Member

Hi, @wangshisan. Could you address @squito's comment?

Member

@dongjoon-hyun dongjoon-hyun left a comment

LGTM (except for @squito's comment).

@wangshisan
Contributor Author

Done. Please review, @dongjoon-hyun.

@SparkQA

SparkQA commented Oct 6, 2019

Test build #111818 has finished for PR 25869 at commit 00ad219.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Oct 7, 2019

Merged to master. Thanks, @wangshisan!

@asfgit asfgit closed this in 64fe82b Oct 7, 2019
@squito
Contributor

squito commented Oct 7, 2019

@wangshisan I assigned the issue in JIRA to the same user ID that reported it; I assumed that was you. If not, please let me know and I can fix it.

@dongjoon-hyun
Member

Thank you, @wangshisan and @squito!

  .createWithDefault(10000)

val IGNORE_DATA_LOCALITY =
  buildConf("spark.sql.sources.ignore.datalocality")
Member

The conf naming looks a little weird. Compared with the other SQL confs, this should be renamed to spark.sql.sources.ignoreDataLocality.enabled. cc @cloud-fan
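
For illustration, the suggested convention would look roughly like this in SQLConf; this is only a sketch of the naming style (the doc text and default here are assumptions, and the name finally adopted is settled in the follow-up mentioned below), not the exact definition.

    val IGNORE_DATA_LOCALITY =
      buildConf("spark.sql.sources.ignoreDataLocality.enabled")
        .doc("When true, Spark does not fetch block locations while listing files, " +
          "trading data locality for a faster listing phase on disaggregated clusters.")
        .booleanConf
        .createWithDefault(false)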

Contributor

Yea, please be careful about the namespaces created by new config names. "ignore" is definitely not a good namespace.

Member

@wangshisan Could you submit a follow-up PR to rename it?

Contributor

Sorry, you are right; I should have paid more attention to this. I have opened a PR to fix the naming: #26056

@wangshisan wangshisan deleted the SPARK-29189 branch October 8, 2019 03:22
Member

@HyukjinKwon HyukjinKwon left a comment

Late LGTM too, except for the naming.
