
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

How was this patch tested?

Regression test.

@cloud-fan
Contributor Author

cc @yhuai @ericl @gatorsmile

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68697 has finished for PR 15900 at commit a75cf30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68717 has finished for PR 15900 at commit 4094a72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DataType.fromJson(schema.get).asInstanceOf[StructType]
} else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
// If there is no schema information in table properties, it means the schema of this table
// was empty when it was saved into the metastore, which is possible in older versions of Spark. We
Contributor


nit: Please mention version number

DataSource(
sparkSession,
userSpecifiedSchema = Some(table.schema),
// In older versions of Spark, the table schema can be empty and should be inferred at
Contributor


nit: Please mention version number

checkAnswer(spark.table("old"), Row(1, "a"))
}
}
}
Contributor


It would be good to create a set of compatibility tests to make sure a new version of Spark can access table metadata created by an older version (starting from Spark 1.3) without problems. Let's create a follow-up JIRA for this task and do it during the QA period of Spark 2.1.

Contributor Author


// If there is no schema information in table properties, it means the schema of this table
// was empty when it was saved into the metastore, which is possible in older versions of Spark. We
// should respect it.
new StructType()
Contributor

@yhuai yhuai Nov 16, 2016


btw, a clarification question. This function is only needed for data source tables, right?

Contributor Author


No. Since we also store the schema for Hive tables, Hive tables will also call this function. But a Hive table will never go into this branch, as it always has a schema. (The removal of runtime schema inference happened before we started storing the schema of Hive tables.)
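The fallback discussed in this thread can be sketched as a small, self-contained Scala model. The property-key names mirror `HiveExternalCatalog`'s `DATASOURCE_SCHEMA*` constants, but the part-based reassembly and the `DataType.fromJson` parsing are simplified here, so treat this as an illustration rather than the actual implementation:

```scala
// Simplified model of the schema-recovery logic discussed above.
object SchemaRecovery {
  // Key names assumed to match HiveExternalCatalog's constants.
  val SchemaKey    = "spark.sql.sources.schema"
  val SchemaPrefix = "spark.sql.sources.schema."
  val PartPrefix   = SchemaPrefix + "part."

  sealed trait TableSchema
  case class KnownSchema(json: String) extends TableSchema
  // Empty schema: a pre-2.1 table, must be inferred at runtime.
  case object EmptySchema extends TableSchema

  def schemaFromProps(props: Map[String, String]): TableSchema =
    props.get(SchemaKey) match {
      case Some(json) =>
        KnownSchema(json) // whole schema stored as one JSON blob
      case None if props.keysIterator.forall(!_.startsWith(SchemaPrefix)) =>
        // No schema info at all: respect it and fall back to runtime
        // inference instead of failing on the old table.
        EmptySchema
      case None =>
        // Schema split across numbered parts; reassemble in numeric order.
        val parts = props.collect {
          case (k, v) if k.startsWith(PartPrefix) =>
            (k.stripPrefix(PartPrefix).toInt, v)
        }
        KnownSchema(parts.toSeq.sortBy(_._1).map(_._2).mkString)
    }
}
```

An `EmptySchema` result is what lets the `DataSource(...)` call shown above receive an empty `userSpecifiedSchema` and trigger inference at read time.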

properties = Map(
HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
hiveClient.createTable(tableDesc, ignoreIfExists = false)
checkAnswer(spark.table("old"), Row(1, "a"))
Contributor


Can we also test `DESCRIBE TABLE` and make sure it provides correct column info?

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68745 has finished for PR 15900 at commit 847dada.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 17, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in 07b3f04 Nov 17, 2016
asfgit pushed a commit that referenced this pull request Nov 17, 2016
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes #15900 from cloud-fan/hive-catalog.

(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <[email protected]>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…tastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a schema, and Spark would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.

This is a good improvement, but we should still respect and support old tables that don't store the table schema in the metastore.

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#15900 from cloud-fan/hive-catalog.
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 15, 2017
…hema in table properties

## What changes were proposed in this pull request?

This is a follow-up of apache#15900, to fix one more bug:
When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred; otherwise the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`.

The fix is to introduce `UnresolvedCatalogRelation` as a placeholder, then replace it with `LogicalRelation` or `HiveTableRelation` during analysis, which guarantees that parent plans are not resolved until the schema has been inferred.
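The placeholder mechanism described above can be illustrated with a toy plan tree. This is a deliberately simplified model; the real `UnresolvedCatalogRelation` and analysis rules live in Spark's Catalyst and Hive modules:

```scala
// Toy model of the placeholder approach: a plan node stays unresolved until
// an analysis step swaps the placeholder for a relation with a real schema.
sealed trait Plan { def resolved: Boolean }

// Placeholder created at parse time; carries no schema yet.
case class UnresolvedCatalogRelation(table: String) extends Plan {
  def resolved = false
}
// Relation produced during analysis, after the schema is known (inferred if needed).
case class Relation(table: String, schema: Seq[String]) extends Plan {
  def resolved = true
}
// A parent like `select *` resolves only once its child has a schema.
case class Project(star: Boolean, child: Plan) extends Plan {
  def resolved = child.resolved
}

object Analyzer {
  // Stand-in for runtime schema inference against the files on disk.
  def inferSchema(table: String): Seq[String] = Seq("i", "j")

  // Replace placeholders bottom-up, so parents never see an empty schema.
  def resolve(plan: Plan): Plan = plan match {
    case UnresolvedCatalogRelation(t) => Relation(t, inferSchema(t))
    case Project(star, child)         => Project(star, resolve(child))
    case other                        => other
  }
}
```

The key property is that `Project(star = true, ...)` reports `resolved = false` until the placeholder below it is replaced, which models why `select *` cannot be expanded against an empty schema.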

## How was this patch tested?

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#18907 from cloud-fan/bug.
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Aug 16, 2017
…hema in table properties

This is a follow-up of apache#15900, to fix one more bug:
When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred; otherwise the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`.

The fix is to introduce `UnresolvedCatalogRelation` as a placeholder, then replace it with `LogicalRelation` or `HiveTableRelation` during analysis, which guarantees that parent plans are not resolved until the schema has been inferred.

Regression test.

Author: Wenchen Fan <[email protected]>

Closes apache#18907 from cloud-fan/bug.