
Conversation

@kevinyu98 (Contributor) commented Nov 21, 2018

What changes were proposed in this pull request?

Add test cases for the resolution of ORC table location reported by SPARK-25993, and corresponding test cases for Parquet tables.

Update the sql-migration-guide-upgrade doc.

How was this patch tested?

These are new test cases.

@dongjoon-hyun (Member):

ok to test


- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- In version 2.3 and earlier, the default of `spark.sql.hive.converMetastoreOrc` is `false`. If you specify a directory in the `LOCATION` clause of a `CREATE EXTERNAL TABLE ... STORED AS ORC LOCATION` SQL statement, Spark uses the Hive ORC reader and reads the data into the table if the directory or a sub-directory contains matching data. If you specify the wildcard (`*`), the Hive ORC reader cannot read the data, because it treats the wildcard as a directory. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, while `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, the default of `spark.sql.hive.convertMetastoreOrc` is `true` and Spark uses the native ORC reader: it will read the data if you specify the wildcard, but will not if you specify the parent directory. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
Member:

Could you phrase `but will not if you specify the parent directory` more clearly, with examples like the other sentence?

@kevinyu98 (Contributor, Author):

sure.

}
}

test("SPARK-25993 Add test cases for resolution of ORC table location") {
Member:

HiveOrcSourceSuite.scala would be a better place. Also, we had better use the following and cover both behaviors, `true` and `false`:

    Seq(true, false).foreach { convertMetastore =>
      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
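
For reference, a fuller version of that suggestion might look like the following sketch. It assumes a suite that mixes in the usual Spark test helpers (withSQLConf, withTempDir, withTable); the table and column names are made up for illustration.

    import org.apache.spark.sql.hive.HiveUtils

    Seq(true, false).foreach { convertMetastore =>
      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
        withTempDir { dir =>
          withTable("tab1") {
            // Create an external table at a temporary location and assert the
            // expected read behavior under each configuration.
            sql(
              s"""CREATE EXTERNAL TABLE tab1(c1 INT)
                 |STORED AS orc
                 |LOCATION '${dir.toURI}'""".stripMargin)
          }
        }
      }
    }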

@kevinyu98 (Contributor, Author):

ok, I will move the test case there. Thanks.

@SparkQA commented Nov 22, 2018

Test build #99179 has finished for PR 23108 at commit e238764.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 25, 2018

Test build #99235 has finished for PR 23108 at commit e731746.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Nov 25, 2018

Test build #99243 has finished for PR 23108 at commit e731746.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 (Contributor, Author):

I fixed a typo in the test case; retest this please. Thanks.

@SparkQA commented Nov 26, 2018

Test build #99251 has finished for PR 23108 at commit d6e582b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- In version 2.3 and earlier, the default of `spark.sql.hive.converMetastoreOrc` is `false`. If you specify a directory in the `LOCATION` clause of a `CREATE EXTERNAL TABLE ... STORED AS ORC LOCATION` SQL statement, Spark uses the Hive ORC reader and reads the data into the table if the directory or a sub-directory contains matching data. If you specify the wildcard (`*`), the Hive ORC reader cannot read the data, because it treats the wildcard as a directory. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, while `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, the default of `spark.sql.hive.convertMetastoreOrc` is `true` and Spark uses the native ORC reader: if you specify the wildcard, it will try to read the matching data from the current directory and its sub-directories; if you specify a directory that does not itself contain matching data, the native ORC reader cannot read it, even when the data is in a sub-directory. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from the sub-directory into the table. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
Member:

`spark.sql.hive.converMetastoreOrc` -> `spark.sql.hive.convertMetastoreOrc`

@dongjoon-hyun (Member), Nov 27, 2018:

I've read this again. In fact, this is not a new behavior for Spark users, because Apache Spark has used Parquet as the default format since 2.0 and the default behavior of `STORED AS PARQUET` works like this.

In order to give richer context to the users and to avoid irrelevant confusion, we had better merge this part into the line above (line 112). For example, I'd like to update line 112 like the following:

applied. In addition, this makes Spark's Hive table read behavior more consistent over different formats and with the behavior of `spark.read.load`. For example, for both ORC/Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table reading its direct sub-directories like `/table/dir`. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
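
Concretely, the suggested wording describes a difference like the following sketch (made-up paths, not code from the PR):

    // Data lives in /table/dir/. With the Spark 2.4 defaults (native readers),
    // the wildcard location reads it but the parent directory location does not.
    sql("CREATE EXTERNAL TABLE t1(id INT) STORED AS ORC LOCATION '/table/*'")  // reads /table/dir
    sql("CREATE EXTERNAL TABLE t2(id INT) STORED AS ORC LOCATION '/table/'")   // does not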

@kevinyu98 (Contributor, Author):

Thanks, I will make changes.

| c2 int,
| c3 string)
|STORED AS orc
|LOCATION '$wildCardDir'""".stripMargin
@dongjoon-hyun (Member), Nov 27, 2018:

@kevinyu98. This works, but it has a side effect: it creates an additional directory whose name is literally `*`, and newly inserted data goes into that directory.
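
To make the side effect concrete, here is a hypothetical sequence (made-up paths and table name, not code from the PR):

    // The wildcard is kept literally in the table location, so a later
    // INSERT writes into a directory actually named "*".
    sql("CREATE EXTERNAL TABLE tab(c1 INT) STORED AS orc LOCATION '/tmp/orctab1/*'")
    sql("INSERT INTO tab VALUES (1)")
    // New files land under /tmp/orctab1/*/ -- a directory literally named "*".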

@kevinyu98 (Contributor, Author):

@dongjoon-hyun Hello Dongjoon, yes, you are right. It will create a directory named `*`, which is the same behavior as prior to Spark 2.4. I was just following the examples from the JIRA. Do you have any suggestions here? Thanks.

Member:

I have two suggestions.

  1. Is this PR aiming at only one-level subdirectories? Could you check the behavior on one-, two-, and three-level subdirectories in Parquet Hive tables first?
  2. Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here.

@kevinyu98 (Contributor, Author):

@dongjoon-hyun Thanks for the suggestions. I tried with three-level subdirectories for Parquet/ORC. Here is the result:

sql("set spark.sql.hive.convertMetastoreOrc=true")

Three-level directories:

ORC:

  • "/*" can read sub directory data, but not three level subdirectories
  • "/" can only read current directory

Parquet:

  • "/*" can read sub directory data, but not three level subdirectories
  • "/" can only read current directory

sql("set spark.sql.hive.convertMetastoreOrc=false")

ORC:

  • "/" can read three level subdirectories
  • "/*" can't read any data

Parquet:

  • "/" can only read current directory
  • "/*" can read sub directory data, but not three level subdirectories.

With sql("set spark.sql.hive.convertMetastoreOrc=true"), the ORC and Parquet behavior is consistent.

  1. I think this PR is aiming at only one-level subdirectories.
  2. Sure, I will add one more for Parquet.
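
The investigation above can be reproduced with a sketch along these lines (directory layout, table, and column names are hypothetical; it assumes the usual Spark test helpers and testImplicits are in scope):

    withTempPath { path =>
      // Write ORC data three levels deep: <path>/l3/l2/l1/
      val df = Seq((1, "a"), (2, "b")).toDF("c1", "c2").repartition(1)
      df.write.orc(s"${path.getCanonicalPath}/l3/l2/l1")
      Seq(true, false).foreach { convertMetastore =>
        withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
          withTable("t") {
            sql(
              s"""CREATE EXTERNAL TABLE t(c1 INT, c2 STRING)
                 |STORED AS orc
                 |LOCATION '${path.getCanonicalPath}/l3/*'""".stripMargin)
            // Compare the result under each configuration and location pattern.
            sql("SELECT * FROM t").show()
          }
        }
      }
    }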

Member:

Thank you for investigating. I agree with you on (1). For the test case, please add three-level subdirectories; that will help us improve Spark later. You may file another JIRA issue for that as a new feature.

@kevinyu98 (Contributor, Author):

@dongjoon-hyun Sorry for the delay; I had some issues with my IntelliJ environment. Sure, I will add three-level subdirectories for this PR. FYI, I also tried convertMetastoreParquet for Parquet; the behavior is consistent.
sql("set spark.sql.hive.convertMetastoreParquet = true")

Three-level directories:

Parquet:

-- "/" can only read current directory
-- "/*" can read sub directory data, but not three level subdirectories.

sql("set spark.sql.hive.convertMetastoreParquet = false")

-- "/" can only read current directory
-- "/*" can read sub directory data, but not three level subdirectories.

@kevinyu98 (Contributor, Author), Dec 3, 2018:

@dongjoon-hyun I forgot to add three-level subdirectories in the last commit; I am adding them now and will submit soon.

@dongjoon-hyun (Member), Dec 4, 2018:

Thanks, @kevinyu98. Also, please update the title and description of this PR and the SPARK-25993 JIRA issue:

- [Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location 
+ [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERNAL TABLE with subdirectories

@kevinyu98 (Contributor, Author):

done

@SparkQA commented Dec 3, 2018

Test build #99622 has finished for PR 23108 at commit d75b923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 kevinyu98 changed the title [Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location [Spark-25993][SQL][TEST]Add test cases for CREATE EXTERNAL TABLE with subdirectories Dec 5, 2018
@SparkQA commented Dec 5, 2018

Test build #99688 has finished for PR 23108 at commit fe472c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("SPARK-25993 Add test cases for resolution of Parquet table location") {
@dongjoon-hyun (Member), Dec 5, 2018:

Maybe move this to HiveParquetSourceSuite? That's the counterpart of HiveOrcSourceSuite.

Member:

Also, let's replace the test case name with `CREATE EXTERNAL TABLE with subdirectories`.

Member:

Also, for full test coverage, can we have the following combination like ORC, too?

Seq(true, false).foreach { convertMetastore =>

@kevinyu98 (Contributor, Author):

sure, changed.

}
}

test("SPARK-25993 Add test cases for resolution of ORC table location") {
Member:

Please change this to `CREATE EXTERNAL TABLE with subdirectories`, too.

@kevinyu98 (Contributor, Author):

Changed.


test("SPARK-25993 Add test cases for resolution of Parquet table location") {
withTempPath { path =>
val someDF1 = Seq((1, 1, "parq1"), (2, 2, "parq2")).toDF("c1", "c2", "c3").repartition(1)
Member:

Indentation.

@kevinyu98 (Contributor, Author):

fixed.

withTempPath { path =>
val someDF1 = Seq((1, 1, "parq1"), (2, 2, "parq2")).toDF("c1", "c2", "c3").repartition(1)
withTable("tbl1", "tbl2", "tbl3") {
val dataDir = s"${path.getCanonicalPath}/l3/l2/l1/"
Member:

indentation?

@kevinyu98 (Contributor, Author):

fixed

(1 to 2).map(i => Row(i, i, s"parq$i")))

val wildcardL3Statement =
s"""
Member:

indentation?

@kevinyu98 (Contributor, Author):

fixed

}
}

protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
Member:

Since this test helper function is only used in HiveOrcSourceSuite, can we move this into HiveOrcSourceSuite?

@kevinyu98 (Contributor, Author):

ok, I moved it.

@SparkQA commented Dec 6, 2018

Test build #99746 has finished for PR 23108 at commit 51d1d78.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|OPTIONS (
| PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
| PATH '${new File(orcTableAsDir.getAbsolutePath
).toURI}'
Member:

The change above, in lines 76~80, looks strange and irrelevant. Let's revert it.

@kevinyu98 (Contributor, Author):

Good catch; I didn't notice my changes affected the formatting in the file. I have reverted the change. Thanks.

} catch {
case e: IOException =>
assert(e.getMessage().contains("java.io.IOException: Not a file"))
}
@dongjoon-hyun (Member), Dec 6, 2018:

@kevinyu98. Is this testing exceptions for all three SQL statements above? We use `intercept[IOException]` to test expected exceptions.

For now, this does not look like a robust test case, because there is no `assert(false)` after `sql(sqlStmt)`. We need to check each query's failure and success exactly for the specific configuration.
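
For reference, the ScalaTest idiom mentioned above looks roughly like this (a sketch; `sqlStmt` stands for whichever statement is expected to fail):

    import java.io.IOException

    // intercept fails the test when no IOException is thrown, so no extra
    // assert(false) is needed after the sql(...) call.
    val e = intercept[IOException] {
      sql(sqlStmt).collect()
    }
    assert(e.getMessage.contains("Not a file"))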

@kevinyu98 (Contributor, Author):

You are right; I will make changes. Thanks.

checkAnswer(sql(wildcardL3SqlStatement), Nil)
}
} finally {
hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
Member:

It seems that we need to clean up tbl1 ~ tbl4, too.

@kevinyu98 (Contributor, Author):

@dongjoon-hyun At line 221, I wrapped tbl1 ~ tbl4 in withTable, so I think they will get dropped. I ran it a couple of times in IntelliJ and it seems to work fine. What do you think?
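
For context, withTable from Spark's SQLTestUtils drops the tables in a finally block, so they are cleaned up even when the test body throws. Simplified, it is shaped like this (a sketch for illustration only):

    def withTable(tableNames: String*)(f: => Unit): Unit = {
      try f finally {
        // Drop every named table even if the body failed partway through.
        tableNames.foreach { name =>
          spark.sql(s"DROP TABLE IF EXISTS $name")
        }
      }
    }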

Member:

Got it. I missed that.

class HiveParquetSourceSuite extends ParquetPartitioningTest {
import testImplicits._
import spark._
import java.io.IOException
@dongjoon-hyun (Member), Dec 6, 2018:

This import had better go to line 20.

@kevinyu98 (Contributor, Author):

ok

@SparkQA commented Dec 7, 2018

Test build #99799 has finished for PR 23108 at commit d851169.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 (Contributor, Author):

retest please

@dilipbiswal (Contributor):

retest this please

@SparkQA commented Dec 7, 2018

Test build #99804 has finished for PR 23108 at commit d851169.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

  1. It seems that we are testing only data at depth 3:

1) tbl1 LOCATION '${path}/l3/l2/'
2) tbl2 LOCATION '${path}/l3/l2/*'
3) tbl3 LOCATION '${path}/l3/*'

Like the following, the test coverage should have the depth-1 case, too (as I requested before):

4) tbl4 LOCATION '${path}/*'
5) tbl5 LOCATION '${path}'

  2. Let's have multiple data directories. The current test only puts the data into a single directory; multiple data directories will represent real use cases.

  3. Could you make these Parquet/ORC test cases a little more similar? We had better have the same style for the same test coverage. Currently, they look different. The best way is having a helper function, but it seems it was lost during previous commits.
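
One possible shape for such a shared helper (a sketch only; names, layout, and assertions are hypothetical, not the PR's code):

    def testExternalTableWithSubdirectories(format: String): Unit = {
      // Hypothetical helper covering both formats and the five LOCATION
      // variants listed above; assumes the usual Spark test helpers in scope.
      val key = if (format == "orc") HiveUtils.CONVERT_METASTORE_ORC.key
                else HiveUtils.CONVERT_METASTORE_PARQUET.key
      Seq(true, false).foreach { convertMetastore =>
        withSQLConf(key -> s"$convertMetastore") {
          withTempPath { path =>
            val base = path.getCanonicalPath
            // Test data would be written three levels deep, e.g. $base/l3/l2/l1/.
            val locations = Seq(
              s"$base/l3/l2/",   // tbl1
              s"$base/l3/l2/*",  // tbl2
              s"$base/l3/*",     // tbl3
              s"$base/*",        // tbl4
              base)              // tbl5
            locations.zipWithIndex.foreach { case (location, i) =>
              withTable(s"tbl${i + 1}") {
                sql(
                  s"""CREATE EXTERNAL TABLE tbl${i + 1}(c1 INT)
                     |STORED AS $format
                     |LOCATION '$location'""".stripMargin)
                // Assert the expected rows (or expected failure) per configuration.
              }
            }
          }
        }
      }
    }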

}
}

test("SPARK-25993 CREATE EXTERNAL TABLE with subdirectories") {
Member:

Please fix this first for the first and second review comments.

@kevinyu98 (Contributor, Author):

@dongjoon-hyun Thanks for the comments. I have made the changes for the first and second review comments: I changed both suites to look similar and added more test cases. For the third comment, I haven't found a place common to both suites; when you say the helper function went missing in a previous commit, can you point out which helper function you mean? Thanks.

@SparkQA commented Dec 12, 2018

Test build #100042 has finished for PR 23108 at commit fef8c68.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor):

retest this please

@SparkQA commented Dec 13, 2018

Test build #100049 has finished for PR 23108 at commit fef8c68.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kevinyu98 (Contributor, Author):

Retest please. I looked at the test build and saw builder #100055 had a failure, but when I run the failing test case locally, it passes:
SparkPullRequestBuilder #100055
build/sbt "test-only org.apache.spark.sql.hive.client.HiveClientSuites"
[info] Run completed in 15 minutes, 21 seconds.
[info] Total number of tests run: 189
[info] Suites: completed 18, aborted 0
[info] Tests: succeeded 189, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 189, Failed 0, Errors 0, Passed 189

@dilipbiswal (Contributor):

retest this please

@SparkQA commented Dec 13, 2018

Test build #100072 has finished for PR 23108 at commit fef8c68.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor):

retest this please

@SparkQA commented Dec 14, 2018

Test build #100111 has finished for PR 23108 at commit fef8c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

cc @dilipbiswal @kevinyu98

@AmplabJenkins:

Can one of the admins verify this patch?

@github-actions bot commented Jan 4, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@kevinyu98 (Contributor, Author) commented Jan 8, 2020

I rebased on master and opened a new PR, #27130. There is no sql-migration-guide-upgrade.md on master; it is on branch-2.4. Should I update the doc on branch-2.4? Thanks.

dongjoon-hyun pushed a commit that referenced this pull request on Jan 18, 2020:
[SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERNAL TABLE with subdirectories

### What changes were proposed in this pull request?

This PR aims to add test cases for the resolution of ORC table location reported by [SPARK-25993](https://issues.apache.org/jira/browse/SPARK-25993), and corresponding test cases for Parquet tables.

### Why are the changes needed?

The current behavior is complex, and these test suites are designed to prevent accidental behavior changes. This PR is rebased on master; the original PR is [23108](#23108).

### Does this PR introduce any user-facing change?

No. This adds test cases only.

### How was this patch tested?

This is a new test case.

Closes #27130 from kevinyu98/spark-25993-2.

Authored-by: Kevin Yu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>