
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This is a regression introduced by #14207. Since Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapped columns between data and partition. This case breaks the assumptions about the table schema, namely that there is no overlap between the data and partition schemas and that partition columns come at the end. The result is that, in Spark 2.1, the table scan uses an incorrect schema that puts the partition columns at the end. In Spark 2.2, we added a check in CatalogTable to validate the table schema, which fails in this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapped columns are detected, i.e. store an empty schema in the metastore and infer it at read time.
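For context, a minimal sketch of the overlap scenario (a hypothetical reproduction: the path, table name, and data are illustrative, and it assumes a Spark shell where `spark` and `sql` are in scope):

```
import java.io.File
import spark.implicits._

// Write Parquet files that themselves contain a column `p`, under a directory
// layout that also encodes `p` as a partition column.
val dir = new File("/tmp/overlap-demo")
Seq((1, 1), (2, 2)).toDF("i", "p")
  .write.mode("overwrite")
  .parquet(new File(dir, "p=1").getCanonicalPath)

// `p` now appears in both the data schema and the inferred partition schema.
// With this patch, such a table stores an empty schema in the metastore, and
// the schema is inferred again at read time.
sql(s"CREATE TABLE overlap_demo USING parquet OPTIONS (path '${dir.getCanonicalPath}')")
```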

How was this patch tested?

A new regression test.

@cloud-fan
Contributor Author

cc @gatorsmile

@SparkQA

SparkQA commented Oct 26, 2017

Test build #83073 has finished for PR 19579 at commit 18907cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
// empty schema in metastore and infer it at runtime. Note that this also means the new
// scalable partitioning handling feature (introduced in Spark 2.1) is disabled in this case.
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```
Member


Log a warning message here? When data columns and partition columns have the same names, the values could be inconsistent. Thus, we do not suggest users create such a table, and it might not perform well because we infer the schema at runtime.
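Purely as illustration, a hedged sketch of what such a warning could look like at this match arm (it assumes a `logWarning` helper from Spark's `Logging` trait is in scope; the message wording is illustrative, not the merged text):

```
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  // Warn that overlapped columns force a fallback to runtime schema inference.
  logWarning("Found duplicate column(s) in the data schema and the partition schema: " +
    r.overlappedPartCols.keys.mkString("`", "`, `", "`") + ". Falling back to storing " +
    "an empty schema in the metastore and inferring the schema at runtime.")
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```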

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83094 has finished for PR 19579 at commit 2fe7ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

gatorsmile commented Oct 27, 2017

Thanks! Merged to master/2.2

@asfgit asfgit closed this in 9b262f6 Oct 27, 2017
asfgit pushed a commit that referenced this pull request Oct 27, 2017
…s between data and partition schema

Author: Wenchen Fan <[email protected]>

Closes #19579 from cloud-fan/bug2.
asfgit pushed a commit that referenced this pull request Feb 15, 2018
## What changes were proposed in this pull request?
#19579 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?
Also updated the HiveExternalCatalogVersionsSuite to verify it.

Author: gatorsmile <[email protected]>

Closes #20606 from gatorsmile/addMigrationGuide.

(cherry picked from commit a77ebb0)
Signed-off-by: gatorsmile <[email protected]>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…s between data and partition schema

Author: Wenchen Fan <[email protected]>

Closes apache#19579 from cloud-fan/bug2.
cloud-fan pushed a commit that referenced this pull request Nov 4, 2020
…using a hard-coded location to store localized Spark binaries

### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to, by default, use a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new System property, `spark.test.cache-dir`, which can be used to define a static location into which the Spark binary will be localized to allow for sharing between test executions. If the System property is used, the downloaded binaries won't be deleted after the test runs.
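A minimal sketch of the directory resolution this describes (the property name is taken from the PR description; the exact implementation in the suite may differ):

```
import java.io.File
import org.apache.spark.util.Utils

// If `spark.test.cache-dir` is set, localize Spark binaries into that static
// directory and keep them across runs; otherwise use a standard temporary
// directory that is cleaned up after the test.
private val sparkTestingDir: File =
  sys.props.get("spark.test.cache-dir")
    .map(new File(_))
    .getOrElse(Utils.createTempDir())
```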

### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
  // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
  // avoid downloading Spark of different versions in each run.
  private val sparkTestingDir = new File("/tmp/test-spark")
```
However, this doesn't work, since the directory is deleted after every run:
```
  override def afterAll(): Unit = {
    try {
      Utils.deleteRecursively(wareHousePath)
      Utils.deleteRecursively(tmpDataDir)
      Utils.deleteRecursively(sparkTestingDir)
    } finally {
      super.afterAll()
    }
  }
```

It's also bad to hard-code a `/tmp` directory, as in some cases this is not the proper place to store temporary files, and we currently get no benefit from it since the directory is deleted after every run.

### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.

### How was this patch tested?
The test continues to execute as expected.

Closes #30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>