Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

If we create an InMemoryFileIndex with empty rootPaths while PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it throws an exception:

Positive number of slices required
java.lang.IllegalArgumentException: Positive number of slices required
        at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119)
        at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
        at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357)
        at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:50)
        at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186)
        at org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105)
        at org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33)
        at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185)
        at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
        at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
        at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
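
A minimal reproduction sketch, assuming the InMemoryFileIndex constructor shape used by the test suite in this PR (it is an internal class, so the exact signature may differ across Spark versions; `spark` here is an existing SparkSession):

```scala
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import org.apache.spark.sql.internal.SQLConf

// Force the parallel listing branch even though there are no root paths.
spark.conf.set(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key, "0")

// With zero root paths, the parallel branch tries to parallelize an empty
// collection into zero slices and fails with the exception above.
new InMemoryFileIndex(spark, rootPaths = Nil, parameters = Map.empty, partitionSchema = None)
```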

How was this patch tested?

unit test added

…en set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero failed
@windpiger
Contributor Author

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73552 has finished for PR 17093 at commit 96898a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD is 0") {
Member

After this fix, when users set it to -1, we still face the same strange error. We need a complete fix.

@windpiger (Contributor Author) Feb 28, 2017

I think users should not deliberately set it to a negative number; if they do, throwing an exception is reasonable.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73568 has finished for PR 17093 at commit a3ac29b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73583 has finished for PR 17093 at commit ec0afac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {

// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size < sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
Contributor

Why don't we just make sure parallelPartitionDiscoveryThreshold is greater than 0? We can add a condition (via checkValue) to SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.

Contributor Author

Sorry, I didn't notice there is a checkValue function; let me fix it. Thanks!
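
For context, the root cause can be shown in isolation: parallelizing a collection into zero slices is exactly what ParallelCollectionRDD rejects. A standalone sketch (assuming, as the stack trace suggests, that the parallel listing branch derives its slice count from the number of paths):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("zero-slices-demo").getOrCreate()

// An empty path list makes the derived slice count zero;
// ParallelCollectionRDD.slice rejects non-positive slice counts with
// "Positive number of slices required" when the job runs.
val paths = Seq.empty[String]
spark.sparkContext.parallelize(paths, numSlices = paths.size).collect()
```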


// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size < sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
Contributor Author

Adding the equals sign makes the conf parallelPartitionDiscoveryThreshold easier to understand: when paths.size equals the threshold, listing stays serial.

}

test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD" +
"is not positive number") {
Contributor

is negative

Contributor Author

The test also covers the zero value, so I used "not positive"...

Member

not positive number -> a nonpositive number

"files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and " +
"LibSVM data sources.")
.intConf
.checkValue(parallel => parallel >= 0, "The maximum number of files allowed for listing " +
Contributor

Actually, should it be "The maximum number of root paths"?
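
For reference, the guarded conf entry assembled from the fragments in this diff (a sketch; the builder name and the default value follow Spark's SQLConf conventions and are assumptions here):

```scala
val PARALLEL_PARTITION_DISCOVERY_THRESHOLD =
  buildConf("spark.sql.sources.parallelPartitionDiscovery.threshold")
    .doc("The maximum number of paths allowed for listing files at driver side. If the " +
      "number of detected paths exceeds this value during partition discovery, it tries " +
      "to list the files with another Spark distributed job. This applies to Parquet, " +
      "ORC, CSV, JSON and LibSVM data sources.")
    .intConf
    .checkValue(parallel => parallel >= 0, "The maximum number of paths allowed for listing " +
      "files at driver side must not be negative")
    .createWithDefault(32)
```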

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73633 has finished for PR 17093 at commit 13b70f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73650 has finished for PR 17093 at commit 0d2334d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM except #17093 (comment)

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73665 has finished for PR 17093 at commit 74d08a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}.getMessage
assert(e.contains("The maximum number of paths allowed for listing files at " +
"driver side must not be negative"))
Member

Nit: indent. : )

@windpiger (Contributor Author) Mar 1, 2017

oh...thanks~
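
Putting the review hunks together, the added test looks roughly like this (a sketch; `spark` is the suite's SparkSession, and the test title uses the "nonpositive" wording suggested above):

```scala
test("InMemoryFileIndex with empty rootPaths when PARALLEL_PARTITION_DISCOVERY_THRESHOLD" +
    " is a nonpositive number") {
  // Zero is a legal threshold now: the <= guard keeps the empty-paths case on
  // the serial listing branch instead of launching a zero-slice Spark job.
  withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key -> "0") {
    new InMemoryFileIndex(spark, Nil, Map.empty, None)
  }

  // Negative values are rejected up front by checkValue with a clear message.
  val e = intercept[IllegalArgumentException] {
    withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key -> "-1") {
      new InMemoryFileIndex(spark, Nil, Map.empty, None)
    }
  }.getMessage
  assert(e.contains("The maximum number of paths allowed for listing files at " +
    "driver side must not be negative"))
}
```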

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73672 has finished for PR 17093 at commit e1a9072.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73673 has finished for PR 17093 at commit 3c079ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 8aa560b Mar 1, 2017
asfgit pushed a commit that referenced this pull request Mar 3, 2017
…ed to listFiles twice

## What changes were proposed in this pull request?

Currently, when we call `resolveRelation` for a `FileFormat DataSource` without providing a user schema, it executes `listFiles` twice in `InMemoryFileIndex`.

This PR adds a `FileStatusCache` for DataSource, which avoids calling `listFiles` twice.

But there is a bug in `InMemoryFileIndex`; see
 [SPARK-19748](#17079) and
 [SPARK-19761](#17093),
so this PR should land after SPARK-19748 and SPARK-19761.

## How was this patch tested?
unit test added

Author: windpiger <[email protected]>

Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
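
A sketch of the caching idea described in that commit (the `FileStatusCache.getOrCreate` factory exists in Spark's datasources package; the variable names and the exact constructor argument order here are illustrative):

```scala
import org.apache.spark.sql.execution.datasources.{FileStatusCache, InMemoryFileIndex}

// Reuse one session-scoped cache of leaf-file listings so that resolving the
// relation a second time does not re-list the same root paths.
val cache = FileStatusCache.getOrCreate(sparkSession)
val index = new InMemoryFileIndex(sparkSession, rootPaths, options, userSpecifiedSchema, cache)
```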