[Spark-25993][SQL][TEST]Add test cases for CREATE EXTERNAL TABLE with subdirectories #23108
Conversation
ok to test
docs/sql-migration-guide-upgrade.md (Outdated)
- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
- In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, it will read the data if you specify the wild card, but will not if you specify the parent directory. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
Could you make `but will not if you specify the parent directory` clearer, with examples like the other sentence?
sure.
| test("SPARK-25993 Add test cases for resolution of ORC table location") { |
HiveOrcSourceSuite.scala would be a better place. And we had better have the following, to cover both behaviors, `true` and `false`:

```scala
Seq(true, false).foreach { convertMetastore =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
```
ok, I will move the test case there. Thanks.
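For reference, a minimal sketch of that suggested matrix, assuming the test lands in HiveOrcSourceSuite (the table name, schema, location, and assertion below are illustrative, not from the PR):

```scala
Seq(true, false).foreach { convertMetastore =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
    withTempDir { dir =>
      withTable("orc_tbl") { // hypothetical table name
        sql(
          s"""
             |CREATE EXTERNAL TABLE orc_tbl(c1 INT, c2 INT, c3 STRING)
             |STORED AS orc
             |LOCATION '${dir.toURI}'
           """.stripMargin)
        // The expected answer would differ per convertMetastore value.
        sql("SELECT * FROM orc_tbl").collect()
      }
    }
  }
}
```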
Test build #99179 has finished for PR 23108 at commit

Test build #99235 has finished for PR 23108 at commit

Retest this please.

Test build #99243 has finished for PR 23108 at commit

I fixed a typo in the test case; retest please. Thanks.

Test build #99251 has finished for PR 23108 at commit
docs/sql-migration-guide-upgrade.md (Outdated)
- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
- In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, if you specify the wild card, it will try to read the matching data from the current directory and sub-directory, if you specify a directory which does not contain the matching data, native ORC reader will not be able to read it, even if the data is in the sub-directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from the sub-directory into the table. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
`spark.sql.hive.converMetastoreOrc` -> `spark.sql.hive.convertMetastoreOrc`
I've read this again. In fact, this is not a new behavior for Spark users, because Apache Spark has used Parquet as the default format since 2.0 and the default behavior of STORED AS PARQUET works like this.
In order to give richer context to the users and to avoid irrelevant confusion, we had better merge this part into the above line (line 112). For example, I'd like to update line 112 like the following:

applied. In addition, this makes Spark's Hive table read behavior more consistent over different formats and with the behavior of `spark.read.load`. For example, for both ORC/Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table reading its direct sub-directories like '/table/dir'. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
Thanks, I will make changes.
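As a rough illustration of that proposed wording, assuming ORC data files live one level down in `/table/dir/` (paths and table names here are made up for the example):

```scala
// With spark.sql.hive.convertMetastoreOrc=true (the Spark 2.4 default),
// the wildcard form is required to read direct sub-directories.
sql("CREATE EXTERNAL TABLE t_wild(id INT) STORED AS ORC LOCATION '/table/*'")
sql("SELECT * FROM t_wild")  // sees the files under /table/dir

sql("CREATE EXTERNAL TABLE t_dir(id INT) STORED AS ORC LOCATION '/table/'")
sql("SELECT * FROM t_dir")   // sees only files directly under /table
```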
```scala
| c2 int,
| c3 string)
|STORED AS orc
|LOCATION '$wildCardDir'""".stripMargin
```
@kevinyu98. This works, but there is a side effect: it creates an additional directory whose name is literally '*', and newly inserted data goes into that directory.
@dongjoon-hyun Hello Dongjoon, yes, you are right. It will create a directory named '*', and that is the same behavior as prior to Spark 2.4. I was just following the examples from the JIRA. Do you have any suggestions here? Thanks.
I have two suggestions.
- Is this PR aiming at only one-level subdirectories? Could you check the behavior on one-, two-, and three-level subdirectories in Parquet Hive tables first?
- Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here.
@dongjoon-hyun Thanks for the suggestions. I tried with three-level subdirectories for Parquet/ORC. Here is the result:

`sql("set spark.sql.hive.convertMetastoreOrc=true")`, three-level directories:

ORC:
- "/*" can read subdirectory data, but not three-level subdirectories
- "/" can only read the current directory

Parquet:
- "/*" can read subdirectory data, but not three-level subdirectories
- "/" can only read the current directory

`sql("set spark.sql.hive.convertMetastoreOrc=false")`:

ORC:
- "/" can read three-level subdirectories
- "/*" can't read any data

Parquet:
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories.

With `sql("set spark.sql.hive.convertMetastoreOrc=true")`, the ORC and Parquet behavior is consistent.
- I think this PR is aiming at only one-level subdirectories.
- Sure, I will add one more for Parquet.
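For reference, a sketch of the kind of directory layout behind these observations, reusing the `path`/`someDF1` naming from the test code in this PR (the exact test differs):

```scala
withTempPath { path =>
  // Requires the suite's testImplicits for toDF.
  val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2"))
    .toDF("c1", "c2", "c3").repartition(1)
  // Write one ORC file three levels below the intended table root:
  // <path>/l3/l2/l1/part-....orc
  someDF1.write.orc(s"${path.getCanonicalPath}/l3/l2/l1")
  // With convertMetastoreOrc=true, a table at LOCATION '<path>/l3/l2/*'
  // sees this data, while LOCATION '<path>/' does not.
}
```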
Thank you for investigating. I agree with you on (1). For the test case, please add three-level subdirectories; that will help us improve Spark later. You may file another JIRA issue for that as a new-feature JIRA.
@dongjoon-hyun Sorry for the delay; I had some issues with my IntelliJ environment. Sure, I will add three-level subdirectories for this PR. FYI, I also tried `convertMetastoreParquet` for Parquet; the behavior is consistent.

`sql("set spark.sql.hive.convertMetastoreParquet = true")`, three-level:

Parquet:
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories.

`sql("set spark.sql.hive.convertMetastoreParquet = false")`:
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories.
@dongjoon-hyun I forgot to add three-level subdirectories in the last commit; I am adding them now and will submit soon.
Thanks, @kevinyu98. Also, please update the title and description of this PR and the SPARK-25993 JIRA issue.

```diff
- [Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location
+ [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERNAL TABLE with subdirectories
```
done
Test build #99622 has finished for PR 23108 at commit

Test build #99688 has finished for PR 23108 at commit
| test("SPARK-25993 Add test cases for resolution of Parquet table location") { |
Maybe move this to HiveParquetSourceSuite? That's the counterpart of HiveOrcSourceSuite.
Also, let's replace the test case name with `CREATE EXTERNAL TABLE with subdirectories`.
Also, for full test coverage, can we have the following combination like ORC, too?

```scala
Seq(true, false).foreach { convertMetastore =>
```
sure, changed.
| test("SPARK-25993 Add test cases for resolution of ORC table location") { |
Please change this to `CREATE EXTERNAL TABLE with subdirectories`, too.
Changed.
```scala
test("SPARK-25993 Add test cases for resolution of Parquet table location") {
withTempPath { path =>
val someDF1 = Seq((1, 1, "parq1"), (2, 2, "parq2")).toDF("c1", "c2", "c3").repartition(1)
```
Indentation.
fixed.
```scala
withTempPath { path =>
val someDF1 = Seq((1, 1, "parq1"), (2, 2, "parq2")).toDF("c1", "c2", "c3").repartition(1)
withTable("tbl1", "tbl2", "tbl3") {
val dataDir = s"${path.getCanonicalPath}/l3/l2/l1/"
```
indentation?
fixed
```scala
(1 to 2).map(i => Row(i, i, s"parq$i")))

val wildcardL3Statement =
s"""
```
indentation?
fixed
`protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {`
Since this test helper function is only used in HiveOrcSourceSuite, can we move it into HiveOrcSourceSuite?
ok, I moved.
Test build #99746 has finished for PR 23108 at commit
```scala
|OPTIONS (
|  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
|  PATH '${new File(orcTableAsDir.getAbsolutePath
|  ).toURI}'
```
The above change in lines 76 ~ 80 looks strange and irrelevant. Let's revert this change.
Good catch; I didn't notice my changes affected the formatting in the file. I have reverted the change. Thanks.
```scala
} catch {
  case e: IOException =>
    assert(e.getMessage().contains("java.io.IOException: Not a file"))
}
```
@kevinyu98. Is this testing exceptions for all three SQL statements above? We use `intercept[IOException]` to test expected exceptions.
For now, this does not look like a robust test case, because there is no `assert(false)` after `sql(sqlStmt)`. We need to check each individual query's failure and success exactly for the specific configuration.
you are right, I will make changes. Thanks.
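For reference, a minimal sketch of the `intercept`-based form suggested above; `sqlStmt` stands in for one of the statements expected to fail:

```scala
// intercept[IOException] fails the test if no IOException is thrown,
// so no assert(false) fallback is needed after sql(sqlStmt).
val e = intercept[IOException] {
  sql(sqlStmt).collect()
}
assert(e.getMessage.contains("Not a file"))
```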
```scala
    checkAnswer(sql(wildcardL3SqlStatement), Nil)
  }
} finally {
  hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
```
It seems that we need to clean up tbl1 ~ tbl4, too.
@dongjoon-hyun At line 221, I put tbl1 ~ tbl4 inside `withTable`, so I think they will get dropped. I ran it a couple of times in IntelliJ and it seems to work fine. What do you think?
Got it. I missed that.
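For context, `withTable` from `SQLTestUtils` drops the listed tables in a `finally` block, roughly like this:

```scala
// Approximate shape of SQLTestUtils.withTable: run the body, then
// drop the named tables even if the body throws.
protected def withTable(tableNames: String*)(f: => Unit): Unit = {
  try f finally {
    tableNames.foreach { name =>
      spark.sql(s"DROP TABLE IF EXISTS $name")
    }
  }
}
```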
```scala
class HiveParquetSourceSuite extends ParquetPartitioningTest {
  import testImplicits._
  import spark._
  import java.io.IOException
```
This had better go to line 20.
ok
Test build #99799 has finished for PR 23108 at commit

retest please

retest this please

Test build #99804 has finished for PR 23108 at commit
- It seems that we are testing only data at depth 3:
  1) tbl1 LOCATION '${path}/l3/l2/'
  2) tbl2 LOCATION '${path}/l3/l2/*'
  3) tbl3 LOCATION '${path}/l3/*'

  Like the following, the test coverage should include the depth-1 case, too (as I requested before):
  4) tbl4 LOCATION '${path}/*'
  5) tbl5 LOCATION '${path}'
- Let's have multiple data directories. The current test only puts the data into a single directory; multiple data directories will better represent real use cases.
- Could you make these Parquet/ORC test cases a little more similar? We had better have the same style for the same test coverage; currently, they look different. The best way is having a helper function (one possible shape is sketched after this list), but it seems that it was lost during previous commits.
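One possible shape for that shared helper (all names, parameters, and comments below are illustrative, not from the PR):

```scala
// Hypothetical helper giving both formats the same coverage:
// depth-1 and depth-3 data, with and without metastore conversion.
def testExternalTableWithSubdirectories(format: String, convertKey: String): Unit = {
  Seq(true, false).foreach { convert =>
    withSQLConf(convertKey -> s"$convert") {
      withTempPath { path =>
        // Write data under <path>/dir and <path>/l3/l2/l1, then create
        // tables with LOCATION '<path>', '<path>/*', '<path>/l3/l2/*',
        // and assert which locations see the data for this `convert` value.
      }
    }
  }
}
```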
| test("SPARK-25993 CREATE EXTERNAL TABLE with subdirectories") { |
Please fix this first for the first and second review comments.
@dongjoon-hyun Thanks for the comments. I have made the changes for the first and second review comments: I changed both suites to look similar and added more test cases. For the third comment, I haven't found a common place for both suites; when you say the helper function went missing in a previous commit, can you point out what kind of helper function I missed? Thanks.
Test build #100042 has finished for PR 23108 at commit

retest this please

Test build #100049 has finished for PR 23108 at commit

Retest please. I looked at the test build and saw build #10055 had a failure, but when I run the failing test case locally, it works.

retest this please

Test build #100072 has finished for PR 23108 at commit
retest this please

Test build #100111 has finished for PR 23108 at commit

Can one of the admins verify this patch?

We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!

I did the rebase on master and opened a new PR, #27130. There is no
…th subdirectories

### What changes were proposed in this pull request?

This PR aims to add test cases for resolution of ORC table location reported by [SPARK-25993](https://issues.apache.org/jira/browse/SPARK-25993), and also adds corresponding test cases for Parquet tables.

### Why are the changes needed?

The current behavior is complex; these test suites are designed to prevent accidental behavior changes. This PR is rebased on master; the original PR is [23108](#23108).

### Does this PR introduce any user-facing change?

No. This adds test cases only.

### How was this patch tested?

This is a new test case.

Closes #27130 from kevinyu98/spark-25993-2.

Authored-by: Kevin Yu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?

Add test cases for resolution of ORC table location reported by SPARK-25993, and also add corresponding test cases for Parquet tables. Update the doc.

How was this patch tested?

This is a new test case.