[SPARK-17492] [SQL] Fix Reading Cataloged Data Sources without Extending SchemaRelationProvider #15046
Conversation
Test build #65213 has finished for PR 15046 at commit
cc @yhuai @cloud-fan
ah good catch! But adding a new flag looks a little tricky; let me think about whether there is a better way to fix it.
@cloud-fan JDBC is also affected by this bug. Do you have any better idea about this issue? Thanks!
if (isSchemaFromUsers) {
  throw new AnalysisException(s"$className does not allow user-specified schemas.")
} else {
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
If the RelationProvider doesn't allow user-specified schema, can we assume it's cheap to infer schema for it? Then we can simply check if the given schema matches the schema of relation returned by createRelation
Yeah, this is a pretty good idea. Let me try it. Thanks!
Test build #65616 has finished for PR 15046 at commit
withTempView("t1", "t2") {
  sql(
    """
      |CREATE TEMPORARY TABLE t1
let's use CREATE TEMPORARY VIEW
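For example, the temporary-view form would look something like the snippet below. This is a minimal sketch; the `USING` class and the From/To options are assumptions based on the simple test source used elsewhere in these tests, not taken from this exact diff.

```scala
// CREATE TEMPORARY TABLE ... USING was deprecated in Spark 2.0;
// the equivalent temporary-view form is:
sql(
  """
    |CREATE TEMPORARY VIEW t1
    |USING org.apache.spark.sql.sources.SimpleScanSource
    |OPTIONS (
    |  From '1',
    |  To '10'
    |)
  """.stripMargin)
```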
    (1 to 10).map(Row(_)).toSeq)
}

test("create a temp table that does not have a path in the option") {
temp view
withTable(tableName) {
  sql(
    s"""
       |CREATE $tableType $tableName
what does this test?
My original thought was to provide comprehensive test coverage for data source table creation/insertion with a path. This is not related to this PR. Let me get rid of it.
// when users specify the schema
val inputSchema = new StructType().add("s", IntegerType, nullable = false)
val e = intercept[AnalysisException] { dfReader.schema(inputSchema).load() }
was there no test for this case before?
For DataFrameReader APIs, we do not have such a test case
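A minimal sketch of what such a DataFrameReader test could look like. The source class and options are assumptions for illustration; the asserted message reuses the error text shown earlier in this diff.

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{IntegerType, StructType}

// Inside a shared-SparkSession test suite: a source that only implements
// RelationProvider must reject a user-specified schema.
test("user-specified schema is rejected for a RelationProvider-only source") {
  val dfReader = spark.read
    .format("org.apache.spark.sql.sources.SimpleScanSource")
    .option("From", "1")
    .option("To", "10")

  // reading without a user-specified schema works
  assert(dfReader.load().schema.nonEmpty)

  // when users specify the schema, analysis should fail
  val inputSchema = new StructType().add("s", IntegerType, nullable = false)
  val e = intercept[AnalysisException] { dfReader.schema(inputSchema).load() }
  assert(e.getMessage.contains("does not allow user-specified schemas"))
}
```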
case (dataSource: RelationProvider, Some(schema)) =>
  val baseRelation =
    dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  if (baseRelation.schema != schema) {
cc @yhuai @liancheng to confirm, is it safe?
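For reference, a sketch of how the completed branch in the data source resolution logic might read, reusing the error message shown earlier in this diff. This is an approximation, not necessarily the exact merged code; `sparkSession`, `caseInsensitiveOptions`, and `className` come from the enclosing scope.

```scala
// Fragment from the provider/schema match during data source resolution:
// when a schema is supplied (e.g. read back from the metastore catalog)
// for a plain RelationProvider, build the relation and verify the schemas agree.
case (dataSource: RelationProvider, Some(schema)) =>
  val baseRelation =
    dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  if (baseRelation.schema != schema) {
    throw new AnalysisException(
      s"$className does not allow user-specified schemas.")
  }
  baseRelation
```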
Test build #65684 has finished for PR 15046 at commit
  )
}

test("insert into a temp view that does not point to an insertable data source") {
hmm, is this test related to this PR?
Not related. This is also to improve test coverage. Feel free to let me know if you want me to remove it from this PR. Thanks!
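For context, a minimal sketch of what that added test might look like. The view name, the `USING` class, and the plain `AnalysisException` check are assumptions for illustration, not taken from the PR.

```scala
import org.apache.spark.sql.AnalysisException

test("insert into a temp view that does not point to an insertable data source") {
  withTempView("oneToTen") {
    sql(
      """
        |CREATE TEMPORARY VIEW oneToTen
        |USING org.apache.spark.sql.sources.SimpleScanSource
        |OPTIONS (From '1', To '10')
      """.stripMargin)
    // the underlying relation does not implement InsertableRelation,
    // so the insert should fail during analysis
    intercept[AnalysisException] {
      sql("INSERT INTO TABLE oneToTen SELECT 1")
    }
  }
}
```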
        | To '10'
        |)
      """.stripMargin)
Seq("TEMPORARY VIEW", "TABLE").foreach { tableType =>
these changes are also not related to this PR, right? We are improving the test coverage here.
Yeah
is this a problem in 2.0?
This is a new issue in Spark 2.1, introduced after we started physically storing the inferred schema in the metastore. BTW, I also ran the test cases in Spark 2.0; they work well.
thanks, merging to master!
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description <!-- - Describe what this PR changes. - Describe why we need the change. If this PR resolves an issue be sure to include "Resolves #XXX" to correctly link and close the issue upon merge. --> User-specified schema may come from the catalog if the Delta table is stored in an external catalog that syncs the table schema with the Delta log. We should allow it if it's the same as the real Delta table schema. This is already the case for batch read, see apache/spark#15046 This PR changes the Delta streaming read to allow it as well. Note: since Delta uses DS v2 (`TableProvider`) and explicitly claims that user-specified schema is not supported (`TableProvider#supportsExternalMetadata` returns false by default), end users still can't specify schema in `spark.read/readStream.schema`. This change is only for advanced Spark plugins that can construct logical plans to triggers Delta v1 source stream scan. ## How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to test the changes thoroughly including negative and positive cases if possible. If the changes were tested in any way other than unit tests, please clarify how you tested step by step (ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future). If the changes were not tested, please explain why. --> a new test ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. --> No
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description <!-- - Describe what this PR changes. - Describe why we need the change. If this PR resolves an issue be sure to include "Resolves #XXX" to correctly link and close the issue upon merge. --> User-specified schema may come from the catalog if the Delta table is stored in an external catalog that syncs the table schema with the Delta log. We should allow it if it's the same as the real Delta table schema. This is already the case for batch read, see apache/spark#15046 This PR changes the Delta streaming read to allow it as well. Note: since Delta uses DS v2 (`TableProvider`) and explicitly claims that user-specified schema is not supported (`TableProvider#supportsExternalMetadata` returns false by default), end users still can't specify schema in `spark.read/readStream.schema`. This change is only for advanced Spark plugins that can construct logical plans to triggers Delta v1 source stream scan. ## How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to test the changes thoroughly including negative and positive cases if possible. If the changes were tested in any way other than unit tests, please clarify how you tested step by step (ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future). If the changes were not tested, please explain why. --> a new test ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. --> No
…4125) backport #3929 to 3.3 (same description as the Delta pull request above).
What changes were proposed in this pull request?
For data sources that do not extend `SchemaRelationProvider`, we expect users not to specify schemas when they create tables. If a schema is supplied by the user, an exception is issued. Since Spark 2.1, for any data source, to avoid inferring the schema every time, we store the schema in the metastore catalog. Thus, when reading a cataloged data source table, the schema can be read from the metastore catalog. In this case, we also get an exception, for example an `AnalysisException` stating that the data source does not allow user-specified schemas.
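For context, here is a minimal sketch of the two provider interfaces involved (paraphrased from `org.apache.spark.sql.sources`, signatures abbreviated): only `SchemaRelationProvider` accepts a schema supplied from the outside, which is why a schema read back from the metastore trips the user-specified-schema check for plain `RelationProvider` sources.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// A source that owns/infers its schema; callers cannot pass one in.
trait RelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation
}

// A source that accepts a schema supplied by the caller (user or catalog).
trait SchemaRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation
}
```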
This PR is to fix the above issue. When building a data source, we introduce a flag `isSchemaFromUsers` to indicate whether the schema is really input from users. If true, we issue an exception. Otherwise, we will call the `createRelation` of `RelationProvider` to generate the `BaseRelation`, which contains the actual schema.
How was this patch tested?
Added a few test cases.