[SPARK-26990][SQL]FileIndex: use user specified field names if possible #23894

gengliangwang · 2019-02-26T07:11:54Z

What changes were proposed in this pull request?

With the following file structure:

/tmp/data
└── a=5

In the previous release:

scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema     
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)

While in current code:

scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema     
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)

The partition column name a differs from the name A that the user specified. This PR fixes that case so the user-specified field name is used when possible, which is more user-friendly.
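
As a rough illustration of the intended behavior (not the code in PartitioningUtils), the name chosen for a partition column can be modeled as a lookup against the user-specified field names. Everything in this sketch is hypothetical: the object `PartitionNameSketch`, the function `resolvePartitionName`, and its parameters are invented for illustration, and it assumes that in case-sensitive mode the name from the directory path is kept unchanged, as reported later in this thread.

```scala
// Hypothetical sketch; names and structure are assumptions, not Spark's implementation.
object PartitionNameSketch {

  /** Picks the name to expose for a partition column found in a directory path such as `a=5`. */
  def resolvePartitionName(
      pathName: String,
      userFieldNames: Seq[String],
      caseSensitive: Boolean): String = {
    if (caseSensitive) {
      // Case-sensitive analysis: keep the name exactly as it appears in the path.
      pathName
    } else {
      // Case-insensitive analysis: prefer the user's spelling when one matches,
      // otherwise fall back to the name from the path.
      userFieldNames.find(_.equalsIgnoreCase(pathName)).getOrElse(pathName)
    }
  }
}
```

With the layout above, `PartitionNameSketch.resolvePartitionName("a", Seq("A", "ID"), caseSensitive = false)` yields `A`, while the same call with `caseSensitive = true` yields `a`.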

How was this patch tested?

Unit test

gengliangwang · 2019-02-26T07:12:18Z

@bersprockets @cloud-fan

SparkQA · 2019-02-26T08:05:01Z

Test build #102783 has finished for PR 23894 at commit 18c3dd4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-02-26T08:25:05Z

retest this please.

mgaido91

LGTM, just a nit

mgaido91 · 2019-02-26T08:49:52Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala

      validatePartitionColumns: Boolean,
      timeZone: TimeZone): PartitionSpec = {
-    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
+    val (userSpecifiedDataTypes, userSpecifiedNames) = if (userSpecifiedSchema.isDefined) {


just a nit: we can maybe separate the userSpecifiedNames map creation in order to improve code readability. Moreover we need it only if !caseSensitive. So what about:

val userSpecifiedNames = userSpecifiedSchema.filterNot(_ => caseSensitive).map { schema => CaseInsensitiveMap(...) }.getOrElse(Map.empty[String, String])

Thanks! I have revised the code.
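
The shape of the suggestion above, building the name map only when analysis is case-insensitive, can be sketched in plain Scala. This is an approximation under stated assumptions: `SchemaSketch` is a hypothetical stand-in for the user-specified schema, and a map keyed by lowercased names stands in for Spark's `CaseInsensitiveMap`, whose arguments are elided in the review comment and are not reconstructed here.

```scala
import java.util.Locale

// Hypothetical sketch of the Option.filterNot pattern from the review comment.
object UserSpecifiedNamesSketch {

  // Stand-in for a user-specified schema: only the field names matter here.
  final case class SchemaSketch(fieldNames: Seq[String])

  def userSpecifiedNames(
      userSpecifiedSchema: Option[SchemaSketch],
      caseSensitive: Boolean): Map[String, String] = {
    userSpecifiedSchema
      // Discard the schema when names must match exactly; no lookup map is needed then.
      .filterNot(_ => caseSensitive)
      .map { schema =>
        // Lowercased name -> user's spelling (assumed substitute for CaseInsensitiveMap).
        schema.fieldNames.map(name => name.toLowerCase(Locale.ROOT) -> name).toMap
      }
      .getOrElse(Map.empty[String, String])
  }
}
```

Keeping the map creation in its own expression means the case-sensitive path never builds the map, and the data-type lookup can stay a separate value.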

SparkQA · 2019-02-26T12:34:38Z

Test build #102787 has finished for PR 23894 at commit 18c3dd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-26T12:56:27Z

Test build #102789 has finished for PR 23894 at commit 17f7852.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2019-02-26T13:09:21Z

retest this please

srowen · 2019-02-26T15:47:50Z

This is a dumb question, but in general are column names case insensitive like this? I'd kind of expect this is an error but I don't know this part well.

cloud-fan · 2019-02-26T15:54:51Z

@srowen I think both behaviors make sense, but it's better to keep the behavior the same as in previous releases.

srowen · 2019-02-26T16:24:30Z

Fine with that. Is 'previous release' 2.4.0 here? then I agree for sure. If the current behavior has already been 'released' that's trickier.

bersprockets · 2019-02-26T16:35:23Z

@srowen The change that created the current behavior happened after v2.4.0.

bersprockets · 2019-02-26T19:04:03Z

lgtm

bersprockets · 2019-02-26T19:23:02Z

and.. why is no test running?

SparkQA · 2019-02-26T19:35:23Z

Test build #4576 has started for PR 23894 at commit 17f7852.

gatorsmile · 2019-02-26T19:44:20Z

This is a behavior change we should avoid. We should backport it to 2.4 to keep the previous behaviors. cc @dbtsai

gatorsmile · 2019-02-26T19:45:33Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

+      stringToFile(file, "text")
+      val path = new Path(dir.getCanonicalPath)
+      val schema = StructType(Seq(StructField("A", StringType, false)))
+      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {


Also test the behavior when the conf is on.

When the conf is on, the partition column in the schema will be a instead of A. Both 2.4 and the current code give the same result.
I'm not sure whether we should throw an exception in this case.

dongjoon-hyun · 2019-02-26T19:52:40Z

~~cc @dbtsai since he is a release manager for 2.4.1.~~ Oops, my bad. I missed the earlier ping.

SparkQA · 2019-02-27T00:44:50Z

Test build #102792 has finished for PR 23894 at commit 17f7852.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-02-27T06:41:47Z

thanks, merging to master!

With the following file structure:
```
/tmp/data
└── a=5
```
In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```
While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```
We can see that the partition column name `a` is different from `A` as user specified. This PR is to fix the case and make it more user-friendly.

Unit test

Closes apache#23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

bersprockets · 2019-02-27T18:35:09Z

I don't see a pending back-port to 2.4, so I will start one.

…names if possible

## What changes were proposed in this pull request?

Back-port of #23894 to branch-2.4.

With the following file structure:
```
/tmp/data
└── a=5
```
In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```
While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```
We can see that the partition column name `a` is different from `A` as user specified. This PR is to fix the case and make it more user-friendly.

Closes #23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes #23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

dbtsai · 2019-03-01T19:56:38Z

@gatorsmile I'll cut a new RC. Thanks!

SPARK-26990: use user specified field names if possible

18c3dd4

gengliangwang mentioned this pull request Feb 26, 2019

[SPARK-26188][SQL] FileIndex: don't infer data types of partition columns if user specifies schema #23165

Closed

cloud-fan approved these changes Feb 26, 2019

mgaido91 reviewed Feb 26, 2019

revise code

17f7852

gatorsmile reviewed Feb 26, 2019

cloud-fan closed this in 95e5572 Feb 27, 2019

bersprockets mentioned this pull request Feb 27, 2019

[SPARK-26990][SQL][BACKPORT-2.4] FileIndex: use user specified field names if possible #23909

Closed
