
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

How was this patch tested?

Regression test.

@cloud-fan
Contributor Author

cc @yhuai @ericl @gatorsmile

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68697 has finished for PR 15900 at commit a75cf30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68717 has finished for PR 15900 at commit 4094a72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DataType.fromJson(schema.get).asInstanceOf[StructType]
} else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
// If there is no schema information in table properties, it means the schema of this table
// was empty when it was saved into the metastore, which is possible in older versions of Spark. We
Contributor


nit: Please mention version number

DataSource(
sparkSession,
userSpecifiedSchema = Some(table.schema),
// In older versions of Spark, the table schema can be empty and should be inferred at
Contributor


nit: Please mention version number

checkAnswer(spark.table("old"), Row(1, "a"))
}
}
}
Contributor


It would be good to create a set of compatibility tests to make sure a new version of Spark can access table metadata created by an older version (starting from Spark 1.3) without problems. Let's create a follow-up JIRA for this task and do it during the QA period of Spark 2.1.

Contributor Author


// If there is no schema information in table properties, it means the schema of this table
// was empty when it was saved into the metastore, which is possible in older versions of Spark. We
// should respect it.
new StructType()
Contributor

@yhuai yhuai Nov 16, 2016


btw, a clarification question. This function is only needed for data source tables, right?

Contributor Author


No. Since we also store the schema for Hive tables, Hive tables will also call this function. But a Hive table will never go into this branch, as it always has a schema. (The removal of runtime schema inference happened before we started storing the schema of Hive tables.)
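The fallback discussed in this thread can be sketched as a small, self-contained Scala model. The property-key names mirror `HiveExternalCatalog`'s `DATASOURCE_SCHEMA*` constants, but the part-based reassembly and the `DataType.fromJson` parsing are simplified here, so treat this as an illustration rather than the actual implementation:

```scala
// Simplified model of the schema-recovery logic discussed above.
object SchemaRecovery {
  // Key names assumed to match HiveExternalCatalog's constants.
  val SchemaKey    = "spark.sql.sources.schema"
  val SchemaPrefix = "spark.sql.sources.schema."
  val PartPrefix   = SchemaPrefix + "part."

  sealed trait TableSchema
  case class KnownSchema(json: String) extends TableSchema
  // Empty schema: a pre-2.1 table, must be inferred at runtime.
  case object EmptySchema extends TableSchema

  def schemaFromProps(props: Map[String, String]): TableSchema =
    props.get(SchemaKey) match {
      case Some(json) =>
        KnownSchema(json) // whole schema stored as one JSON blob
      case None if props.keysIterator.forall(!_.startsWith(SchemaPrefix)) =>
        // No schema info at all: respect it and fall back to runtime
        // inference instead of failing on the old table.
        EmptySchema
      case None =>
        // Schema split across numbered parts; reassemble in numeric order.
        val parts = props.collect {
          case (k, v) if k.startsWith(PartPrefix) =>
            (k.stripPrefix(PartPrefix).toInt, v)
        }
        KnownSchema(parts.toSeq.sortBy(_._1).map(_._2).mkString)
    }
}
```

An `EmptySchema` result is what lets the `DataSource(...)` call shown above receive an empty `userSpecifiedSchema` and trigger inference at read time.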

properties = Map(
HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
hiveClient.createTable(tableDesc, ignoreIfExists = false)
checkAnswer(spark.table("old"), Row(1, "a"))
Contributor


Can we also test `DESCRIBE TABLE` and make sure it provides correct column info?

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68745 has finished for PR 15900 at commit 847dada.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 17, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in 07b3f04 Nov 17, 2016
asfgit pushed a commit that referenced this pull request Nov 17, 2016
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes #15900 from cloud-fan/hive-catalog.

(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <[email protected]>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#15900 from cloud-fan/hive-catalog.
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 15, 2017
…hema in table properties

## What changes were proposed in this pull request?

This is a follow-up of apache#15900, to fix one more bug:
When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred; otherwise the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`.

The fix is to introduce `UnresolvedCatalogRelation` as a placeholder, then replace it with `LogicalRelation` or `HiveTableRelation` during analysis, which guarantees that parent plans are not resolved until the schema has been inferred.
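The placeholder mechanism described above can be illustrated with a toy plan tree. This is a deliberately simplified model; the real `UnresolvedCatalogRelation` and analysis rules live in Spark's Catalyst and Hive modules:

```scala
// Toy model of the placeholder approach: a plan node stays unresolved until
// an analysis step swaps the placeholder for a relation with a real schema.
sealed trait Plan { def resolved: Boolean }

// Placeholder created at parse time; carries no schema yet.
case class UnresolvedCatalogRelation(table: String) extends Plan {
  def resolved = false
}
// Relation produced during analysis, after the schema is known (inferred if needed).
case class Relation(table: String, schema: Seq[String]) extends Plan {
  def resolved = true
}
// A parent like `select *` resolves only once its child has a schema.
case class Project(star: Boolean, child: Plan) extends Plan {
  def resolved = child.resolved
}

object Analyzer {
  // Stand-in for runtime schema inference against the files on disk.
  def inferSchema(table: String): Seq[String] = Seq("i", "j")

  // Replace placeholders bottom-up, so parents never see an empty schema.
  def resolve(plan: Plan): Plan = plan match {
    case UnresolvedCatalogRelation(t) => Relation(t, inferSchema(t))
    case Project(star, child)         => Project(star, resolve(child))
    case other                        => other
  }
}
```

The key property is that `Project(star = true, ...)` reports `resolved = false` until the placeholder below it is replaced, which models why `select *` cannot be expanded against an empty schema.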

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#18907 from cloud-fan/bug.
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Aug 16, 2017
…hema in table properties

This is a follow-up of apache#15900, to fix one more bug:
When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred; otherwise the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`.

The fix is to introduce `UnresolvedCatalogRelation` as a placeholder, then replace it with `LogicalRelation` or `HiveTableRelation` during analysis, which guarantees that parent plans are not resolved until the schema has been inferred.

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#18907 from cloud-fan/bug.