[SPARK-15515] [SQL] Error Handling in Running SQL Directly On Files #13283
Conversation
```scala
val className = error.getMessage
if (spark2RemovedClasses.contains(className)) {
  throw new ClassNotFoundException(s"$className is removed in Spark 2.0. " +
// error.getMessage is the class name of provider2. Instead, we use provider here.
```
On second thought, I don't think we need this if branch. Could you just remove it?
This is for linkage issues. But those would surface as a NoClassDefFoundError instead.
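For context, a minimal sketch (a hypothetical helper, not code from this PR) of the distinction being made here: a missing provider class surfaces as `ClassNotFoundException`, while a provider class whose own dependencies are missing fails at link time with `NoClassDefFoundError`.

```scala
// Hypothetical sketch: probe a provider class and distinguish the two failure modes.
def probeProvider(loader: ClassLoader, className: String): Option[Class[_]] =
  try {
    Some(loader.loadClass(className))
  } catch {
    case _: ClassNotFoundException => None // the class itself is absent
    case _: NoClassDefFoundError   => None // the class was found, but linking failed
  }
```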
Sure, will do it. Thanks!
It sounds like we need to verify all the possible source types we can support. Let me add them. Thanks!

Test build #59296 has finished for PR 13283 at commit

Update: The latest code changes contain the items described in the updated PR description.

Test build #59332 has finished for PR 13283 at commit

Test build #59346 has finished for PR 13283 at commit

@zsxwing Now, the code is ready for review. Thanks!
```scala
 * Data source formats that were not supported in direct query on file
 */
private final val unsupportedFileQuerySource = Set(
  "org.apache.spark.sql.jdbc",
```
Not sure about this. It looks hard to maintain; e.g., you forgot to add `JdbcRelationProvider`. @rxin What do you think?
Can we just check to see if it extends FileFormat?
@zsxwing @marmbrus Agree. It is hard to maintain the list. First, I will add the missing class names to the list.

To implement what @marmbrus suggested, I think we need to do the following (see the sketch below this comment):

- Users' input might not be full class names. To dynamically load the class, we still need a list like the `backwardCompatibilityMap` in `DataSource.scala` (lines 80 to 108 in 361ebc2):

```scala
/** A map to maintain backward compatibility in case we move data sources around. */
private val backwardCompatibilityMap: Map[String, String] = {
  val jdbc = classOf[JdbcRelationProvider].getCanonicalName
  val json = classOf[JsonFileFormat].getCanonicalName
  val parquet = classOf[ParquetFileFormat].getCanonicalName
  val csv = classOf[CSVFileFormat].getCanonicalName
  val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
  val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
  Map(
    "org.apache.spark.sql.jdbc" -> jdbc,
    "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
    "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
    "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
    "org.apache.spark.sql.json" -> json,
    "org.apache.spark.sql.json.DefaultSource" -> json,
    "org.apache.spark.sql.execution.datasources.json" -> json,
    "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
    "org.apache.spark.sql.parquet" -> parquet,
    "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
    "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
    "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
    "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
    "org.apache.spark.sql.hive.orc" -> orc,
    "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
    "org.apache.spark.ml.source.libsvm" -> libsvm,
    "com.databricks.spark.csv" -> csv
  )
}
```

  To avoid duplicating the code, we need to move this list to a place both `Analyzer` and `DataSource` can access. Any suggestion here?

- Dynamically load the class and check whether it extends `FileFormat`. We need to do something similar to `DataSource.scala` (lines 128 to 132 in 361ebc2):

```scala
try {
  Try(loader.loadClass(provider)).orElse(Try(loader.loadClass(provider2))) match {
    case Success(dataSource) =>
      // Found the data source using fully qualified path
      dataSource
```

- To avoid duplicating the code, when we are unable to load the class, we still follow the existing way of handling it. That is, return the `UnresolvedRelation` `u`.

Is my understanding right?

Thank you for your suggestions!
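Under those assumptions, a rough sketch of the first two steps combined might look like the following. All names here are hypothetical; this is not the code that was eventually merged.

```scala
import scala.util.Try

// Hypothetical sketch: map a user-facing provider name through the compatibility map,
// load the resulting class, and test whether it extends the given FileFormat trait.
// Returns None when the class cannot be loaded, so the caller can fall back to the
// existing behavior and keep the UnresolvedRelation unchanged.
def supportsDirectFileQuery(
    loader: ClassLoader,
    provider: String,
    compatibilityMap: Map[String, String],
    fileFormatClass: Class[_]): Option[Boolean] = {
  val className = compatibilityMap.getOrElse(provider, provider)
  Try(loader.loadClass(className)).toOption
    .map(cls => fileFormatClass.isAssignableFrom(cls))
}
```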
Couldn't the logic go in ResolveDataSource?
: ) Yeah. Let me try it. Thank you!
Test build #59554 has finished for PR 13283 at commit
```scala
  sparkSession,
  paths = u.tableIdentifier.table :: Nil,
  className = u.tableIdentifier.database.get)
if (dataSource.isFileFormat() == Option(false)) {
```
`isFileFormat` is not necessary. You can use `providingClass` like this:
```scala
val notSupportDirectQuery = try {
  !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
} catch {
  case NonFatal(e) => false
}
if (notSupportDirectQuery) {
  throw new AnalysisException("Unsupported data source type for direct query on files: " +
    s"${u.tableIdentifier.database.get}")
}
```
Thank you very much! Will use your version! It is much better. Thanks again!
Test build #59699 has finished for PR 13283 at commit
LGTM. Merging to master and 2.0. Thanks!
#### What changes were proposed in this pull request?

This PR is to address the following issues:

- **ISSUE 1:** For the ORC source format, we report a strange error message when Hive support is not enabled:

```SQL
SQL Example:
  select id from `org.apache.spark.sql.hive.orc`.`file_path`
Error Message:
  Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`
```

Instead, we should issue an error message like:

```
Expected Error Message:
  The ORC data source must be used with Hive support enabled
```

- **ISSUE 2:** For the Avro format, we report a strange error message. The example query is like:

```SQL
SQL Example:
  select id from `avro`.`file_path`
  select id from `com.databricks.spark.avro`.`file_path`
Error Message:
  Table or view not found: `com.databricks.spark.avro`.`file_path`
```

The desired message should be like:

```
Expected Error Message:
  Failed to find data source: avro. Please use Spark package
  http://spark-packages.org/package/databricks/spark-avro
```

- ~~**ISSUE 3:** Unable to detect incompatible libraries for Spark 2.0 in Data Source Resolution. We report a strange error message.~~

**Update**: The latest code changes contain:
- For the JDBC format, we added an extra check in the rule `ResolveRelations` of `Analyzer`. Without the PR, Spark returns an error message like `Option 'url' not specified`. Now, we report `Unsupported data source type for direct query on files: jdbc`
- Make data source format names case insensitive so that error handling behaves consistently with the normal cases.
- Added test cases for all the supported formats.

#### How was this patch tested?

Added test cases to cover all the above issues.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #13283 from gatorsmile/runSQLAgainstFile.

(cherry picked from commit 9aff6f3)
Signed-off-by: Shixiong Zhu <[email protected]>
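As a rough illustration of how the new behavior might be exercised in a ScalaTest-style suite (the `spark` session, `intercept`, and the file path are assumed test-suite context, not code from this PR; the message text is taken from the description above):

```scala
import org.apache.spark.sql.AnalysisException

// Hedged sketch: a direct query on files with the jdbc source should now fail
// with the clearer message introduced by this PR.
val e = intercept[AnalysisException] {
  spark.sql("select * from `jdbc`.`file_path`")
}
assert(e.getMessage.contains(
  "Unsupported data source type for direct query on files: jdbc"))
```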
```scala
val notSupportDirectQuery = try {
  !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
} catch {
  case NonFatal(e) => false
```
When would this happen? Should true be returned here?
@tedyu If people use `select * from db_name.table_name`, it will throw an exception. We still need to continue resolution for such cases.
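A small sketch of that reasoning (a hypothetical helper, not the merged code): when the "database" part is a plain database name rather than a data source class, resolving the providing class throws, and returning `false` lets normal table resolution take over.

```scala
import scala.util.control.NonFatal

// Hypothetical helper: returns true only when the provider resolves to a class
// that is NOT a FileFormat; any resolution failure (e.g. `db_name` is not a class)
// yields false so the analyzer falls through to ordinary table lookup.
def isUnsupportedDirectFileQuery(
    resolveProvidingClass: () => Class[_],
    fileFormatClass: Class[_]): Boolean =
  try {
    !fileFormatClass.isAssignableFrom(resolveProvidingClass())
  } catch {
    case NonFatal(_) => false
  }
```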
Thanks Ryan.