Skip to content

Conversation

@gengliangwang
Copy link
Member

What changes were proposed in this pull request?

WIth the following file structure:

/tmp/data
└── a=5

In the previous release:

scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema     
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)

While in current code:

scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema     
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)

We can see that the partition column name a is different from A as user specifed. This PR is to fix the case and make it more user-friendly.

How was this patch tested?

Unit test

@gengliangwang
Copy link
Member Author

@bersprockets @cloud-fan

@SparkQA
Copy link

SparkQA commented Feb 26, 2019

Test build #102783 has finished for PR 23894 at commit 18c3dd4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a nit

validatePartitionColumns: Boolean,
timeZone: TimeZone): PartitionSpec = {
val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
val (userSpecifiedDataTypes, userSpecifiedNames) = if (userSpecifiedSchema.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit: we can maybe separate the userSpecifiedNames map creation in order to improve code readability. Moreover we need it only if !caseSensitive. So what about:

val userSpecifiedNames = userSpecifiedSchema.filterNot(_ => caseSensitive).map { schema => CaseInsensitiveMap(...) }.getOrElse(Map.empty[String, String])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I have revised the code.

@SparkQA
Copy link

SparkQA commented Feb 26, 2019

Test build #102787 has finished for PR 23894 at commit 18c3dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 26, 2019

Test build #102789 has finished for PR 23894 at commit 17f7852.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor

retest this please

@srowen
Copy link
Member

srowen commented Feb 26, 2019

This is a dumb question, but in general are column names case insensitive like this? I'd kind of expect this is an error but I don't know this part well.

@cloud-fan
Copy link
Contributor

@srowen I think both behaviors make sense but it's better to keep behavior same as the previous releases.

@srowen
Copy link
Member

srowen commented Feb 26, 2019

Fine with that. Is 'previous release' 2.4.0 here? then I agree for sure. If the current behavior has already been 'released' that's trickier.

@bersprockets
Copy link
Contributor

@srowen The change that created the current behavior happened after v2.4.0.

@bersprockets
Copy link
Contributor

lgtm

@bersprockets
Copy link
Contributor

and.. why is no test running?

@SparkQA
Copy link

SparkQA commented Feb 26, 2019

Test build #4576 has started for PR 23894 at commit 17f7852.

@gatorsmile
Copy link
Member

This is a behavior change we should avoid. We should backport it to 2.4 to keep the previous behaviors. cc @dbtsai

stringToFile(file, "text")
val path = new Path(dir.getCanonicalPath)
val schema = StructType(Seq(StructField("A", StringType, false)))
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test the behavior when the conf is on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the conf is on, the schema will be a instead of A. Both 2.4 and the current code has the same result.
Not sure if we should throw exceptions on this.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 26, 2019

cc @dbtsai since he is a release manager for 2.4.1. Oops My bad. I missed the previous pinging.

@SparkQA
Copy link

SparkQA commented Feb 27, 2019

Test build #102792 has finished for PR 23894 at commit 17f7852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 95e5572 Feb 27, 2019
bersprockets pushed a commit to bersprockets/spark that referenced this pull request Feb 27, 2019
WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

Unit test

Closes apache#23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@bersprockets
Copy link
Contributor

I don't see an pending back-port to 2.4, so I will start one.

cloud-fan pushed a commit that referenced this pull request Feb 28, 2019
…names if possible

## What changes were proposed in this pull request?

Back-port of #23894 to branch-2.4.

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

Closes #23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes #23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dbtsai
Copy link
Member

dbtsai commented Mar 1, 2019

@gatorsmile I'll cut a new RC. Thanks!

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
…names if possible

## What changes were proposed in this pull request?

Back-port of apache#23894 to branch-2.4.

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

Closes apache#23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes apache#23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
…names if possible

## What changes were proposed in this pull request?

Back-port of apache#23894 to branch-2.4.

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

Closes apache#23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes apache#23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
…names if possible

## What changes were proposed in this pull request?

Back-port of apache#23894 to branch-2.4.

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

Closes apache#23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

## How was this patch tested?

Unit test

Closes apache#23909 from bersprockets/backport-SPARK-26990.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants