
Conversation

@CodingCat
Contributor

What changes were proposed in this pull request?

As per the discussion in #19864 (comment):

The current HadoopFsRelation bases its size estimate purely on the underlying file size, which is not accurate and makes execution vulnerable to errors such as OOM.

Users can enable CBO with the functionality in #19864 to avoid this issue.

This JIRA proposes adding a configurable factor to the sizeInBytes method of HadoopFsRelation so that users can mitigate the problem without CBO.
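
For illustration, a minimal sketch of the idea (not the exact patch; the accessor name below is an assumption, and the config name changed several times during review):

// Sketch only: scale the on-disk size reported by the file index by a
// user-configurable factor before exposing it to the optimizer.
override def sizeInBytes: Long = {
  val compressionFactor = sqlContext.conf.fileCompressionFactor  // assumed accessor name
  (location.sizeInBytes * compressionFactor).toLong
}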

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Dec 25, 2017

Test build #85363 has finished for PR 20072 at commit e6065c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

checkValue > 0.0

Member

Overflow?

Contributor Author

This should be handled by the size > Long.MaxValue check; the double value only overflows to Double.PositiveInfinity, which is capped at Long.MaxValue.

Member

nvm. hadoopFSSizeFactor is a double

@CodingCat
Contributor Author

@gatorsmile thanks for the review, Happy Christmas!

@SparkQA

SparkQA commented Dec 25, 2017

Test build #85383 has finished for PR 20072 at commit ec275a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

@gatorsmile more comments?

Member

nit: always put the space at the end of the line, for readability/consistency.
We end up with two spaces here:

"...In the case where the " +
" the in-disk and in-..."

Member

and a doubled "the"

Contributor Author

done, thanks

@SparkQA

SparkQA commented Dec 30, 2017

Test build #85548 has finished for PR 20072 at commit 2a33b88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Contributor

How about DISK_TO_MEMORY_SIZE_FACTOR? IMHO the current name doesn't describe the purpose clearly.

Contributor

Is this config for all data sources or only hadoopFS-related data sources?

Contributor Author

this is only for HadoopFSRelation

Contributor

shall we move it into the method sizeInBytes since it's only used there?

Contributor

I think this branch can be removed? Long.MaxValue is returned when converting a double value larger than Long.MaxValue.

@CodingCat
Contributor Author

@wzhfy thanks for the review, please take a look

Contributor

...sizeFactor is too vague, how about fileDataSizeFactor?

Contributor Author

done

Contributor

we should add a safe check for overflow.

Contributor Author

Before the latest commit there was a safe check: e6065c7#diff-fcb68cd3c7630f337ce9a3b479b6d0c4R88

However, since sizeFactor is a double, any overflow with positive double values is capped at Double.PositiveInfinity, and, as @wzhfy indicated, any double larger than Long.MaxValue returns Long.MaxValue from its toLong method,

so it should be safe here.
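
A quick check of the capping behavior referred to above (standard Scala double-to-long conversion):

// A Double larger than Long.MaxValue (including PositiveInfinity) converts
// to Long.MaxValue rather than wrapping around.
Double.PositiveInfinity.toLong == Long.MaxValue            // true
(Long.MaxValue.toDouble * 2.0).toLong == Long.MaxValue     // true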

Contributor

ah, good to know

@SparkQA

SparkQA commented Jan 2, 2018

Test build #85586 has finished for PR 20072 at commit e97f419.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Similar to spark.sql.sources.parallelPartitionDiscovery.parallelism, how about spark.sql.sources.fileDataSizeFactor?

Contributor

shouldn't we call this something like compressionFactor?

Contributor

ah compressionFactor sounds better.

@cloud-fan
Contributor

LGTM, we should also add a test

@SparkQA

SparkQA commented Jan 3, 2018

Test build #85638 has finished for PR 20072 at commit a0f3462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

@cloud-fan @rxin @wzhfy @felixcheung @gatorsmile thanks for the review; the new name of the parameter and the test have been added

@SparkQA

SparkQA commented Jan 6, 2018

Test build #85757 has finished for PR 20072 at commit 2f6e3c9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2018

Test build #85760 has finished for PR 20072 at commit 291ce3a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 6, 2018

Test build #85758 has finished for PR 20072 at commit 670a6c0.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2018

Test build #85762 has finished for PR 20072 at commit 291ce3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"in-disk and in-memory size of data is significantly different, users can adjust this " +
"factor for a better choice of the execution plan. The default value is 1.0.")
.doubleConf
.checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
Contributor

maybe >= 1.0? it's weird to see a compression factor less than 1.

Contributor

BTW fileDataSizeFactor -> compressionFactor

Contributor Author

It's not necessarily the case that the Parquet size is always smaller than the in-memory size; e.g. for some simple datasets (like the one used in the test), Parquet's overhead makes the on-disk size larger than the in-memory size.

But with the TPC-DS dataset, I observed that the Parquet size is much smaller than the in-memory size.

.createWithDefault(false)

val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
"spark.sql.sources.compressionFactor")
Contributor

merge this with the previous line

Contributor

BTW, how about fileCompressionFactor, since it works only for file-based data sources?

.booleanConf
.createWithDefault(false)

val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
Contributor

Rename this too, FILE_COMRESSION_FACTOR

"spark.sql.sources.compressionFactor")
.internal()
.doc("The result of multiplying this factor with the size of data source files is propagated " +
"to serve as the stats to choose the best execution plan. In the case where the " +
Contributor

When estimating the output data size of a table scan, multiply the file size by this factor as the estimated data size, in case the data is compressed in the file and would lead to a heavily underestimated result.
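
Taking the suggestions in this review thread together, a hedged sketch of roughly where the config definition (in SQLConf) is heading; the names and wording are pieced together from the comments above as assumptions, not copied from the merged patch:

val FILE_COMPRESSION_FACTOR = buildConf("spark.sql.sources.fileCompressionFactor")
  .internal()
  .doc("When estimating the output data size of a table scan, multiply the file size by " +
    "this factor as the estimated data size, in case the data is compressed in the file " +
    "and would lead to a heavily underestimated result.")
  .doubleConf
  .checkValue(_ > 0, "the value of fileCompressionFactor must be larger than 0")
  .createWithDefault(1.0)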


override def sizeInBytes: Long = location.sizeInBytes
override def sizeInBytes: Long = {
val sizeFactor = sqlContext.conf.diskToMemorySizeFactor
Contributor

compressionFactor
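
For reference, a hypothetical usage sketch, assuming the config key the review converged on (spark.sql.sources.fileCompressionFactor) and a Parquet table that decompresses to roughly 3x its on-disk size; `spark` is an existing SparkSession and the path is made up:

// Assumed config key; set before reading the file-based table.
spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")
val df = spark.read.parquet("/path/to/table")   // hypothetical path
// The optimizer's size estimate for this scan now includes the factor.
df.queryExecution.optimizedPlan.stats.sizeInBytes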

@gatorsmile
Member

cc @CodingCat

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85949 has finished for PR 20072 at commit 5230081.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85970 has finished for PR 20072 at commit 6fe8589.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #85988 has finished for PR 20072 at commit c584c61.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86045 has finished for PR 20072 at commit c584c61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jan 13, 2018
…tion's size

## What changes were proposed in this pull request?

As per the discussion in #19864 (comment):

The current HadoopFsRelation bases its size estimate purely on the underlying file size, which is not accurate and makes execution vulnerable to errors such as OOM.

Users can enable CBO with the functionality in #19864 to avoid this issue.

This JIRA proposes adding a configurable factor to the sizeInBytes method of HadoopFsRelation so that users can mitigate the problem without CBO.

## How was this patch tested?

Existing tests

Author: CodingCat <[email protected]>
Author: Nan Zhu <[email protected]>

Closes #20072 from CodingCat/SPARK-22790.

(cherry picked from commit ba891ec)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in ba891ec Jan 13, 2018