[SPARK-15515] [SQL] Error Handling in Running SQL Directly On Files #13283
Conversation
```scala
val className = error.getMessage
if (spark2RemovedClasses.contains(className)) {
  throw new ClassNotFoundException(s"$className is removed in Spark 2.0. " +
// error.getMessage is the class name of provider2. Instead, we use provider here.
```
On second thought, I don't think we need this if branch. Could you just remove it?
This is for linkage issues. But those would surface as a NoClassDefFoundError instead.
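For context, a minimal sketch (a hypothetical helper, not code from this PR) of the distinction being made here: a missing provider class surfaces as `ClassNotFoundException`, while a provider class whose own dependencies are missing fails at link time with `NoClassDefFoundError`.

```scala
// Hypothetical sketch: probe a provider class and distinguish the two failure modes.
def probeProvider(loader: ClassLoader, className: String): Option[Class[_]] =
  try {
    Some(loader.loadClass(className))
  } catch {
    case _: ClassNotFoundException => None // the class itself is absent
    case _: NoClassDefFoundError   => None // the class was found, but linking failed
  }
```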
Sure, will do it. Thanks!
It sounds like we need to verify all the possible source types we can support. Let me add them. Thanks!

Test build #59296 has finished for PR 13283 at commit

Update: The latest code changes contain the items described in the updated PR description.

Test build #59332 has finished for PR 13283 at commit

Test build #59346 has finished for PR 13283 at commit

@zsxwing Now, the code is ready for review. Thanks!
```scala
 * Data source formats that were not supported in direct query on file
 */
private final val unsupportedFileQuerySource = Set(
  "org.apache.spark.sql.jdbc",
```
Not sure about this. It looks hard to maintain; e.g., you forgot to add `JdbcRelationProvider`. @rxin What do you think?
Can we just check to see if it extends FileFormat?
@zsxwing @marmbrus Agree. It is hard to maintain the list. First, I will add the missing class names to the list.

To implement what @marmbrus suggested, I think we need to do the following (see the sketch below this comment):

- Users' input might not be full class names. To dynamically load the class, we still need a list like the `backwardCompatibilityMap` in `DataSource.scala` (lines 80 to 108 in 361ebc2):

```scala
/** A map to maintain backward compatibility in case we move data sources around. */
private val backwardCompatibilityMap: Map[String, String] = {
  val jdbc = classOf[JdbcRelationProvider].getCanonicalName
  val json = classOf[JsonFileFormat].getCanonicalName
  val parquet = classOf[ParquetFileFormat].getCanonicalName
  val csv = classOf[CSVFileFormat].getCanonicalName
  val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
  val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
  Map(
    "org.apache.spark.sql.jdbc" -> jdbc,
    "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
    "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
    "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
    "org.apache.spark.sql.json" -> json,
    "org.apache.spark.sql.json.DefaultSource" -> json,
    "org.apache.spark.sql.execution.datasources.json" -> json,
    "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
    "org.apache.spark.sql.parquet" -> parquet,
    "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
    "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
    "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
    "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
    "org.apache.spark.sql.hive.orc" -> orc,
    "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
    "org.apache.spark.ml.source.libsvm" -> libsvm,
    "com.databricks.spark.csv" -> csv
  )
}
```

  To avoid duplicating the code, we need to move this list to a place both `Analyzer` and `DataSource` can access. Any suggestion here?

- Dynamically load the class and check whether it extends `FileFormat`. We need to do something similar to `DataSource.scala` (lines 128 to 132 in 361ebc2):

```scala
try {
  Try(loader.loadClass(provider)).orElse(Try(loader.loadClass(provider2))) match {
    case Success(dataSource) =>
      // Found the data source using fully qualified path
      dataSource
```

- To avoid duplicating the code, when we are unable to load the class, we still follow the existing way of handling it. That is, return the `UnresolvedRelation` `u`.

Is my understanding right?

Thank you for your suggestions!
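Under those assumptions, a rough sketch of the first two steps combined might look like the following. All names here are hypothetical; this is not the code that was eventually merged.

```scala
import scala.util.Try

// Hypothetical sketch: map a user-facing provider name through the compatibility map,
// load the resulting class, and test whether it extends the given FileFormat trait.
// Returns None when the class cannot be loaded, so the caller can fall back to the
// existing behavior and keep the UnresolvedRelation unchanged.
def supportsDirectFileQuery(
    loader: ClassLoader,
    provider: String,
    compatibilityMap: Map[String, String],
    fileFormatClass: Class[_]): Option[Boolean] = {
  val className = compatibilityMap.getOrElse(provider, provider)
  Try(loader.loadClass(className)).toOption
    .map(cls => fileFormatClass.isAssignableFrom(cls))
}
```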
Couldn't the logic go in ResolveDataSource?
: ) Yeah. Let me try it. Thank you!
Test build #59554 has finished for PR 13283 at commit
```scala
  sparkSession,
  paths = u.tableIdentifier.table :: Nil,
  className = u.tableIdentifier.database.get)
if (dataSource.isFileFormat() == Option(false)) {
```
`isFileFormat` is not necessary. You can use `providingClass` like this:
```scala
val notSupportDirectQuery = try {
  !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
} catch {
  case NonFatal(e) => false
}
if (notSupportDirectQuery) {
  throw new AnalysisException("Unsupported data source type for direct query on files: " +
    s"${u.tableIdentifier.database.get}")
}
```
Thank you very much! Will use your version! It is much better. Thanks again!
Test build #59699 has finished for PR 13283 at commit
LGTM. Merging to master and 2.0. Thanks!
#### What changes were proposed in this pull request?

This PR is to address the following issues:

- **ISSUE 1:** For the ORC source format, we report a strange error message when Hive support is not enabled:

```SQL
SQL Example:
  select id from `org.apache.spark.sql.hive.orc`.`file_path`
Error Message:
  Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`
```

Instead, we should issue an error message like:

```
Expected Error Message:
  The ORC data source must be used with Hive support enabled
```

- **ISSUE 2:** For the Avro format, we report a strange error message. The example query is like:

```SQL
SQL Example:
  select id from `avro`.`file_path`
  select id from `com.databricks.spark.avro`.`file_path`
Error Message:
  Table or view not found: `com.databricks.spark.avro`.`file_path`
```

The desired message should be like:

```
Expected Error Message:
  Failed to find data source: avro. Please use Spark package
  http://spark-packages.org/package/databricks/spark-avro
```

- ~~**ISSUE 3:** Unable to detect incompatible libraries for Spark 2.0 in Data Source Resolution. We report a strange error message.~~

**Update**: The latest code changes contain:
- For the JDBC format, we added an extra check in the rule `ResolveRelations` of `Analyzer`. Without the PR, Spark returns an error message like `Option 'url' not specified`. Now, we report `Unsupported data source type for direct query on files: jdbc`
- Make data source format names case insensitive so that error handling behaves consistently with the normal cases.
- Added test cases for all the supported formats.

#### How was this patch tested?

Added test cases to cover all the above issues.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #13283 from gatorsmile/runSQLAgainstFile.

(cherry picked from commit 9aff6f3)
Signed-off-by: Shixiong Zhu <[email protected]>
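As a rough illustration of how the new behavior might be exercised in a ScalaTest-style suite (the `spark` session, `intercept`, and the file path are assumed test-suite context, not code from this PR; the message text is taken from the description above):

```scala
import org.apache.spark.sql.AnalysisException

// Hedged sketch: a direct query on files with the jdbc source should now fail
// with the clearer message introduced by this PR.
val e = intercept[AnalysisException] {
  spark.sql("select * from `jdbc`.`file_path`")
}
assert(e.getMessage.contains(
  "Unsupported data source type for direct query on files: jdbc"))
```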
```scala
val notSupportDirectQuery = try {
  !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
} catch {
  case NonFatal(e) => false
```
When would this happen? Should true be returned here?
@tedyu If people use `select * from db_name.table_name`, it will throw an exception. We still need to continue resolution for such cases.
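A small sketch of that reasoning (a hypothetical helper, not the merged code): when the "database" part is a plain database name rather than a data source class, resolving the providing class throws, and returning `false` lets normal table resolution take over.

```scala
import scala.util.control.NonFatal

// Hypothetical helper: returns true only when the provider resolves to a class
// that is NOT a FileFormat; any resolution failure (e.g. `db_name` is not a class)
// yields false so the analyzer falls through to ordinary table lookup.
def isUnsupportedDirectFileQuery(
    resolveProvidingClass: () => Class[_],
    fileFormatClass: Class[_]): Boolean =
  try {
    !fileFormatClass.isAssignableFrom(resolveProvidingClass())
  } catch {
    case NonFatal(_) => false
  }
```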
Thanks Ryan.