
Conversation

@Ngone51
Member

@Ngone51 Ngone51 commented Oct 21, 2019

What changes were proposed in this pull request?

When a user defines a base path that is not an ancestor directory of all the input paths,
throw an exception immediately.

Why are the changes needed?

Assume we have a DataFrame[c1, c2] written out in Parquet and partitioned by c1.

When using `spark.read.parquet("/path/to/data/c1=1")` to read the data, we get a DataFrame with column c2 only.

But if we use `spark.read.option("basePath", "/path/from").parquet("/path/to/data/c1=1")` to
read the data, we get a DataFrame with both columns c1 and c2.

This happens because a wrong base path has no effect in `parsePartition()`, so partition parsing continues until it reaches a directory whose name contains no "=".

The result of the second read doesn't make sense.
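
For illustration, a minimal sketch of the two reads (hypothetical paths, assuming an active SparkSession in spark-shell):

import spark.implicits._

// Write a small partitioned table: /tmp/data/c1=1/..., /tmp/data/c1=2/...
Seq((1, "a"), (2, "b")).toDF("c1", "c2")
  .write.partitionBy("c1").parquet("/tmp/data")

// Reading the leaf directory directly: schema is [c2] only, as expected.
spark.read.parquet("/tmp/data/c1=1").printSchema()

// With a wrong basePath, partition discovery still ran before this change,
// so the schema unexpectedly becomes [c2, c1].
spark.read.option("basePath", "/tmp/from").parquet("/tmp/data/c1=1").printSchema()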

Does this PR introduce any user-facing change?

Yes. With this change, a user will hit an `IllegalArgumentException` when given a wrong base path, whereas previously the read silently succeeded.

How was this patch tested?

Added UT.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112393 has finished for PR 26195 at commit 746a97e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112403 has finished for PR 26195 at commit f7ecb15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Oct 22, 2019

Failed test:

org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.multiple joins

Error Message

org.scalatest.exceptions.TestFailedException: 2 did not equal 1
Stacktrace

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did not equal 1
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
	at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.checkNumLocalShuffleReaders(AdaptiveQueryExecSuite.scala:83)
	at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$new$10(AdaptiveQueryExecSuite.scala:167)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:47)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:31)
	at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(AdaptiveQueryExecSuite.scala:27)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:231)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:229)
	at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.withSQLConf(AdaptiveQueryExecSuite.scala:27)
	at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$new$9(AdaptiveQueryExecSuite.scala:151)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
	at org.scalatest.Suite.run(Suite.scala:1124)
	at org.scalatest.Suite.run$(Suite.scala:1106)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

It's weird. I can't get it to pass even on the master branch, though only locally.

@HeartSaVioR
Contributor

I've already filed https://issues.apache.org/jira/browse/SPARK-29538 for that issue and left a comment in #26157. Let's see how it goes.

@Ngone51
Member Author

Ngone51 commented Oct 28, 2019

retest this please.

@SparkQA

SparkQA commented Oct 28, 2019

Test build #112776 has started for PR 26195 at commit f7ecb15.

@Ngone51
Member Author

Ngone51 commented Nov 27, 2019

cc @cloud-fan Please take a look, thanks.

@cloud-fan
Contributor

When did we add basePath? I have no idea what it is...

@SparkQA

SparkQA commented Nov 27, 2019

Test build #114520 has finished for PR 26195 at commit 617ab8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Nov 27, 2019

@cloud-fan

When did we add basePath? I have no idea what it is...

After tracing the code history, I think it was introduced in #9651, starting from Spark 1.6.

And here's the documentation for basePath (ref: http://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery):

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered as a partitioning column. If users need to specify the base path that partition discovery should start with, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.
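
A minimal sketch of the documented behavior (using the hypothetical paths from the docs, assuming an active SparkSession):

// Without basePath, gender is not treated as a partitioning column.
spark.read.parquet("path/to/table/gender=male")

// With basePath set to the table root, partition discovery starts there,
// so gender becomes a partitioning column again.
spark.read.option("basePath", "path/to/table").parquet("path/to/table/gender=male")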

"driver side must not be negative"))
}

test ("SPARK-29537: throw exception when user defined a wrong base path") {
Contributor

let's also add an end-to-end test with DataFrameReader

Member Author

Added 261b9ad
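
For reference, a sketch of what such an end-to-end test could look like (hypothetical; the actual test added in 261b9ad may differ in names and details):

test("SPARK-29537: wrong basePath fails fast via DataFrameReader") {
  // Assumes a suite mixing in SharedSparkSession / SQLTestUtils.
  import testImplicits._
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq((1, "a")).toDF("c1", "c2").write.partitionBy("c1").parquet(path)
    // The exception is expected at read time, when partition discovery runs.
    val e = intercept[IllegalArgumentException] {
      spark.read.option("basePath", "/another/path").parquet(s"$path/c1=1")
    }
    assert(e.getMessage.contains("Wrong basePath"))
  }
}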

def qualifiedPath(path: Path): Path = path.makeQualified(fs.getUri, fs.getWorkingDirectory)

val qualifiedBasePath = qualifiedPath(userDefinedBasePath)
rootPaths.find(p => !qualifiedPath(p).toString.
Contributor

Is there a way to check the sub-path relationship using some FS APIs instead of relying on the path string?

Member Author

I didn't find one in either Path or FileSystem.
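
For reference, one string-free alternative would be to walk up getParent; a minimal sketch (not what this PR does), assuming both paths are already fully qualified:

import org.apache.hadoop.fs.Path

// Returns true if `base` is an ancestor of (or equal to) `child`,
// walking the parent chain until the root (getParent returns null).
def isAncestor(base: Path, child: Path): Boolean = {
  var current = child
  while (current != null) {
    if (current == base) return true
    current = current.getParent
  }
  false
}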


val qualifiedBasePath = qualifiedPath(userDefinedBasePath)
rootPaths.find(p => !qualifiedPath(p).toString.
startsWith(qualifiedBasePath.toString)) match {
Member

The indent is off here. But can you just use .find(...).foreach(rp => ...?
Or require(rootPaths.forall(p => qualifiedPath(p)...), "error message")
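
A minimal sketch of the require variant, using the names from the diff above (the message text is illustrative):

require(
  rootPaths.forall(p => qualifiedPath(p).toString.startsWith(qualifiedBasePath.toString)),
  s"Wrong basePath $userDefinedBasePath: it must be an ancestor of all root paths")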

@SparkQA

SparkQA commented Nov 28, 2019

Test build #114560 has finished for PR 26195 at commit 66f0bd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
def qualifiedPath(path: Path): Path = path.makeQualified(fs.getUri, fs.getWorkingDirectory)

val qualifiedBasePath = qualifiedPath(userDefinedBasePath)
Contributor

Let's call toString here, to avoid calling toString many times later.

Contributor

We can even call toString inside qualifiedPath and remove the need to call .toString later altogether.
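
A sketch of that variant, with qualifiedPath returning the string directly:

def qualifiedPath(path: Path): String =
  path.makeQualified(fs.getUri, fs.getWorkingDirectory).toString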

throw new IllegalArgumentException(
  s"Wrong basePath $userDefinedBasePath for the root path: $rp")
}
Set(fs.makeQualified(userDefinedBasePath))
Contributor

This can simply be Set(qualifiedBasePath), since we've already calculated it above; if we want qualifiedPath() to return a String instead, then Set(new Path(qualifiedBasePath)).

Member Author

Good idea!

throw new IllegalArgumentException(
  s"Wrong basePath $userDefinedBasePath for the root path: $rp")
}
Set(new Path(qualifiedBasePath))
Contributor

We should reduce overhead as much as we can:

val qualifiedBasePath = fs.makeQualified(userDefinedBasePath)
val qualifiedBasePathStr = qualifiedBasePath.toString
rootPaths.find...
Set(qualifiedBasePath)

Member Author

Ok, I see.

val qualifiedBasePath = fs.makeQualified(userDefinedBasePath)
val qualifiedBasePathStr = qualifiedBasePath.toString
rootPaths
  .find(!fs.makeQualified(_).toString.startsWith(qualifiedBasePathStr))
Member Author

Review note: I've inlined the qualifiedPath() helper into the find() clause.
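
Putting the pieces together, the final shape of the check, as a sketch reconstructed from the diff hunks in this thread (the merged commit may differ in detail):

val qualifiedBasePath = fs.makeQualified(userDefinedBasePath)
val qualifiedBasePathStr = qualifiedBasePath.toString
rootPaths
  .find(!fs.makeQualified(_).toString.startsWith(qualifiedBasePathStr))
  .foreach { rp =>
    throw new IllegalArgumentException(
      s"Wrong basePath $userDefinedBasePath for the root path: $rp")
  }
Set(qualifiedBasePath)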

@SparkQA

SparkQA commented Dec 2, 2019

Test build #114729 has finished for PR 26195 at commit e270fea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2019

Test build #114732 has finished for PR 26195 at commit e889cda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 075ae1e Dec 3, 2019
@cloud-fan
Contributor

thanks, merging to master!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019

Closes apache#26195 from Ngone51/dev-wrong-basePath.

Lead-authored-by: wuyi <[email protected]>
Co-authored-by: wuyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>