
Conversation

gatorsmile (Member) commented May 24, 2016

What changes were proposed in this pull request?

This PR is to address the following issues:

  • ISSUE 1: For the ORC source format, we report a confusing error message when Hive support is not enabled:
SQL Example: 
  select id from `org.apache.spark.sql.hive.orc`.`file_path`
Error Message:
  Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`

Instead, we should issue an error message like:

Expected Error Message:
   The ORC data source must be used with Hive support enabled
  • ISSUE 2: For the Avro format, we report a confusing error message:

SQL Example: 
  select id from `avro`.`file_path`
  select id from `com.databricks.spark.avro`.`file_path`
Error Message:
  Table or view not found: `com.databricks.spark.avro`.`file_path`

The desired message should be:

Expected Error Message:
  Failed to find data source: avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro
  • ISSUE 3 (later withdrawn): Unable to detect incompatible libraries for Spark 2.0 during data source resolution. We report a confusing error message:

Update: The latest code changes contain:

  • For the JDBC format, we added an extra check in the rule ResolveRelations of Analyzer. Without this PR, Spark returned a misleading error message: Option 'url' not specified. Now we report Unsupported data source type for direct query on files: jdbc (see the sketch after this list).
  • Made the data source format name lookup case-insensitive so that error handling behaves consistently with the normal cases.
  • Added test cases for all the supported formats.
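A minimal sketch of the ResolveRelations check described above, adapted from the merged code quoted later in this thread (imports added for context; u is the UnresolvedRelation being resolved and dataSource is the DataSource built from it):

    import scala.util.control.NonFatal
    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.execution.datasources.FileFormat

    // A source that does not extend FileFormat (e.g. JDBC) cannot back a
    // direct query on files, so fail with a clear message instead of the
    // confusing "Option 'url' not specified".
    val notSupportDirectQuery = try {
      !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
    } catch {
      // e.g. `select * from db_name.table_name`: the "database" is not a
      // data source class, so fall through to normal table resolution.
      case NonFatal(e) => false
    }
    if (notSupportDirectQuery) {
      throw new AnalysisException("Unsupported data source type for direct " +
        s"query on files: ${u.tableIdentifier.database.get}")
    }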

How was this patch tested?

Added test cases to cover all of the above issues.
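A sketch of what one of these tests might look like (hypothetical, not the PR's exact test code; it assumes a ScalaTest-based Spark SQL suite with a spark session in scope):

    import org.apache.spark.sql.AnalysisException

    // A direct query on files against a non-file source should now fail
    // fast with the clearer message.
    val e = intercept[AnalysisException] {
      spark.sql("select id from `org.apache.spark.sql.jdbc`.`file_path`")
    }
    assert(e.getMessage.contains(
      "Unsupported data source type for direct query on files: jdbc"))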

gatorsmile and others added 30 commits November 13, 2015 14:50
val className = error.getMessage
if (spark2RemovedClasses.contains(className)) {
  throw new ClassNotFoundException(s"$className is removed in Spark 2.0. " +
    // error.getMessage is the class name of provider2. Instead, we use provider here.

Member:

On second thought, I don't think we need this if branch. Could you just remove it?

Member:

This is for linkage issues. But in that case it would be a NoClassDefFoundError.

Member Author:

Sure, will do it. Thanks!

gatorsmile changed the title from [SPARK-15515] [SPARK-15514] [SQL] Error Handling in Running SQL Directly On Files to [SPARK-15515] [SQL] Error Handling in Running SQL Directly On Files on May 25, 2016

gatorsmile (Member Author):

It sounds like we need to verify all the possible source types we can support. Let me add them. Thanks!

SparkQA commented May 25, 2016

Test build #59296 has finished for PR 13283 at commit 3ac2b93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member Author):

Update: The latest code changes contain:

  • For the JDBC format, we added an extra check in the rule ResolveRelations of Analyzer. Without this PR, Spark returned a misleading error message: Option 'url' not specified. Now we report Unsupported data source type for direct query on files: jdbc.
  • Made the data source format name lookup case-insensitive so that error handling behaves consistently with the normal cases.
  • Added test cases for all the supported formats.

SparkQA commented May 26, 2016

Test build #59332 has finished for PR 13283 at commit 76f4f80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 26, 2016

Test build #59346 has finished for PR 13283 at commit b9e12f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member Author):

@zsxwing The code is now ready for review. Thanks!

 * Data source formats that are not supported in a direct query on files.
 */
private final val unsupportedFileQuerySource = Set(
  "org.apache.spark.sql.jdbc",

Member:

Not sure about this. It looks hard to maintain. E.g., it is easy to forget to add JdbcRelationProvider. @rxin What do you think?

Contributor:

Can we just check to see if it extends FileFormat?

gatorsmile (Member Author) commented May 26, 2016:

@zsxwing @marmbrus Agreed. It is hard to maintain the list. First, I will add the missing class names to the list.

To implement what @marmbrus suggested, I think we need to do the following:

  • Users' input might not be fully qualified class names. To dynamically load the class, we still need a map like:
    /** A map to maintain backward compatibility in case we move data sources around. */
    private val backwardCompatibilityMap: Map[String, String] = {
      val jdbc = classOf[JdbcRelationProvider].getCanonicalName
      val json = classOf[JsonFileFormat].getCanonicalName
      val parquet = classOf[ParquetFileFormat].getCanonicalName
      val csv = classOf[CSVFileFormat].getCanonicalName
      val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
      val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
      Map(
        "org.apache.spark.sql.jdbc" -> jdbc,
        "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
        "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
        "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
        "org.apache.spark.sql.json" -> json,
        "org.apache.spark.sql.json.DefaultSource" -> json,
        "org.apache.spark.sql.execution.datasources.json" -> json,
        "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
        "org.apache.spark.sql.parquet" -> parquet,
        "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
        "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
        "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
        "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
        "org.apache.spark.sql.hive.orc" -> orc,
        "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
        "org.apache.spark.ml.source.libsvm" -> libsvm,
        "com.databricks.spark.csv" -> csv
      )
    }

    To avoid duplicating the code, we need to move this map to a place that both Analyzer and DataSource can access. Any suggestions here?
  • Dynamically load the class and check whether it extends FileFormat. We need to do something like:
    try {
      Try(loader.loadClass(provider)).orElse(Try(loader.loadClass(provider2))) match {
        case Success(dataSource) =>
          // Found the data source using fully qualified path
          dataSource
  • To avoid duplicating the code, when we are unable to load the class, we still follow the existing handling; that is, return the UnresolvedRelation u.

Is my understanding right?

Thank you for your suggestions!
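For illustration only, a minimal sketch of the lookup flow described above (isFileFormatSource is a hypothetical helper, and compatMap stands in for the backwardCompatibilityMap quoted above):

    import scala.util.{Success, Try}
    import org.apache.spark.sql.execution.datasources.FileFormat

    // Hypothetical helper: normalize the user-supplied name through the
    // compatibility map, load the class, and test whether it is a FileFormat.
    def isFileFormatSource(
        name: String,
        loader: ClassLoader,
        compatMap: Map[String, String]): Boolean = {
      val provider = compatMap.getOrElse(name, name)
      val provider2 = s"$provider.DefaultSource"
      Try(loader.loadClass(provider)).orElse(Try(loader.loadClass(provider2))) match {
        case Success(clazz) => classOf[FileFormat].isAssignableFrom(clazz)
        case _ => false // cannot load the class: keep the UnresolvedRelation
      }
    }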

Contributor:

Couldn't the logic go in ResolveDataSource?

Member Author:

: ) Yeah. Let me try it. Thank you!

SparkQA commented May 28, 2016

Test build #59554 has finished for PR 13283 at commit e5c08f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member Author):

@zsxwing @marmbrus How about the latest code changes? Let me know if there is anything I can further improve. Thanks!

      sparkSession,
      paths = u.tableIdentifier.table :: Nil,
      className = u.tableIdentifier.database.get)
    if (dataSource.isFileFormat() == Option(false)) {

Member:

isFileFormat is not necessary. You can use providingClass like this:

        val notSupportDirectQuery = try {
          !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
        } catch {
          case NonFatal(e) => false
        }
        if (notSupportDirectQuery) {
          throw new AnalysisException("Unsupported data source type for direct query on files: " +
            s"${u.tableIdentifier.database.get}")
        }

Member Author:

Thank you very much! Will use your version! It is much better. Thanks again!

SparkQA commented Jun 1, 2016

Test build #59699 has finished for PR 13283 at commit 9fae469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member Author):

@zsxwing @marmbrus Any remaining issues? Thanks!

zsxwing (Member) commented Jun 2, 2016

LGTM. Merging to master and 2.0. Thanks!

asfgit closed this in 9aff6f3 on Jun 2, 2016
asfgit pushed a commit that referenced this pull request Jun 2, 2016

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #13283 from gatorsmile/runSQLAgainstFile.

(cherry picked from commit 9aff6f3)
Signed-off-by: Shixiong Zhu <[email protected]>
val notSupportDirectQuery = try {
  !classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
} catch {
  case NonFatal(e) => false

Contributor:

When would this happen?

Should true be returned here?

Member:

@tedyu If people use select * from db_name.table_name, providingClass will throw an exception, since db_name is not a data source class. We still need to continue resolution in such cases.
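A hypothetical illustration of that fall-through case (the table and database names are made up, and a spark session is assumed):

    // "db_name" is not a data source class, so dataSource.providingClass
    // throws inside the try block above; returning false there means the
    // rule leaves this query alone and the analyzer resolves
    // db_name.table_name against the catalog as usual.
    spark.sql("select * from db_name.table_name")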

Contributor:

Thanks Ryan.

