[SPARK-26990][SQL]FileIndex: use user specified field names if possible #23894
Test build #102783 has finished for PR 23894 at commit
retest this please.
mgaido91
left a comment
LGTM, just a nit
validatePartitionColumns: Boolean,
timeZone: TimeZone): PartitionSpec = {
val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
val (userSpecifiedDataTypes, userSpecifiedNames) = if (userSpecifiedSchema.isDefined) {
just a nit: we can maybe separate the userSpecifiedNames map creation in order to improve code readability. Moreover we need it only if !caseSensitive. So what about:
val userSpecifiedNames = userSpecifiedSchema.filterNot(_ => caseSensitive).map { schema => CaseInsensitiveMap(...) }.getOrElse(Map.empty[String, String])
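The shape of this suggestion can be illustrated without Spark internals. The following is only a minimal sketch under stated assumptions: `Schema` is a hypothetical stand-in for the user-specified schema (just a list of field names), and a plain `Map` keyed by lower-cased names stands in for Spark's `CaseInsensitiveMap`. It is not the code that was merged.

```scala
import java.util.Locale

object UserSpecifiedNamesSketch {
  // Hypothetical stand-in for a user-specified schema: only the field names.
  final case class Schema(fieldNames: Seq[String])

  // Build a lookup from lower-cased field name to the user's own spelling.
  // Option.filterNot discards the schema when caseSensitive is true, so the
  // map is only built for case-insensitive resolution and is empty otherwise.
  def userSpecifiedNames(
      userSpecifiedSchema: Option[Schema],
      caseSensitive: Boolean): Map[String, String] =
    userSpecifiedSchema
      .filterNot(_ => caseSensitive)
      .map(schema => schema.fieldNames.map(n => n.toLowerCase(Locale.ROOT) -> n).toMap)
      .getOrElse(Map.empty[String, String])
}
```

Keeping the map construction in its own expression means the case-sensitive path never pays for it, which is the readability point raised in the comment.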
Thanks! I have revised the code.
Test build #102787 has finished for PR 23894 at commit
Test build #102789 has finished for PR 23894 at commit
retest this please
This is a dumb question, but in general are column names case insensitive like this? I'd kind of expect this to be an error, but I don't know this part well.
@srowen I think both behaviors make sense, but it's better to keep the behavior the same as in previous releases.
Fine with that. Is 'previous release' 2.4.0 here? Then I agree for sure. If the current behavior has already been 'released', that's trickier.
@srowen The change that created the current behavior happened after v2.4.0.
lgtm
and.. why is no test running?
Test build #4576 has started for PR 23894 at commit
This is a behavior change we should avoid. We should backport it to 2.4 to keep the previous behavior. cc @dbtsai
stringToFile(file, "text")
val path = new Path(dir.getCanonicalPath)
val schema = StructType(Seq(StructField("A", StringType, false)))
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
Also test the behavior when the conf is on.
When the conf is on, the schema will be `a` instead of `A`. Both 2.4 and the current code have the same result.
Not sure if we should throw an exception in this case.
Test build #102792 has finished for PR 23894 at commit
thanks, merging to master!
With the following file structure:
```
/tmp/data
└── a=5
```
In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
|-- ID: long (nullable = true)
|-- A: integer (nullable = true)
```
While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
|-- ID: long (nullable = true)
|-- a: integer (nullable = true)
```
We can see that the partition column name `a` differs from the `A` that the user specified. This PR fixes that case so the user-specified name is used, which is more user-friendly.
Unit test
Closes apache#23894 from gengliangwang/fileIndexSchema.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
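The naming rule described in the commit message above can be sketched outside Spark. This is an illustrative sketch only, not the `FileIndex` or partition-inference implementation: `columnNameOf` and `resolveName` are hypothetical helpers, and the user schema is reduced to a list of field names. It mirrors the behavior discussed in the conversation: with case-insensitive resolution the user's spelling wins, and with case-sensitive resolution the name from the path is kept.

```scala
object PartitionNameSketch {
  // Extract the column name from a partition directory name such as "a=5".
  def columnNameOf(dirName: String): Option[String] = {
    val idx = dirName.indexOf('=')
    if (idx > 0) Some(dirName.substring(0, idx)) else None
  }

  // Choose the name to report for a partition column found in a path.
  // Case-insensitive: prefer the spelling from the user-specified fields.
  // Case-sensitive: keep the name exactly as it appears in the path.
  def resolveName(
      nameInPath: String,
      userFieldNames: Seq[String],
      caseSensitive: Boolean): String =
    if (caseSensitive) nameInPath
    else userFieldNames.find(_.equalsIgnoreCase(nameInPath)).getOrElse(nameInPath)
}
```

For the `/tmp/data/a=5` layout with user fields `A` and `ID`, the sketch resolves the partition column to `A` when resolution is case-insensitive and to `a` when it is case-sensitive.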
I don't see a pending back-port to 2.4, so I will start one.
…names if possible

## What changes were proposed in this pull request?

Back-port of #23894 to branch-2.4.

With the following file structure:
```
/tmp/data
└── a=5
```
In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
|-- ID: long (nullable = true)
|-- A: integer (nullable = true)
```
While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
|-- ID: long (nullable = true)
|-- a: integer (nullable = true)
```
We can see that the partition column name `a` differs from the `A` that the user specified. This PR fixes that case so the user-specified name is used, which is more user-friendly.

Closes #23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes #23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@gatorsmile I'll cut a new RC. Thanks!