Conversation

@wangyum (Member) commented Apr 17, 2019

What changes were proposed in this pull request?

This PR fixes hadoop-3.2 test issues (except for the hive-thriftserver module):

  1. Add hive.metastore.schema.verification and datanucleus.schema.autoCreateAll to HiveConf.
  2. Make the hadoop-3.2 profile support accessing Hive metastore versions 0.12 to 2.2 (see the sketch after this list).
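
For illustration, a minimal sketch of what accessing an older metastore looks like from the user side, assuming the standard spark.sql.hive.metastore.* configuration keys (this snippet is not part of the PR):

  import org.apache.spark.sql.SparkSession

  // Execution uses the built-in Hive client, while metadata calls go to an
  // older metastore whose client jars are resolved from Maven.
  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.sql.hive.metastore.version", "1.2.1")
    .config("spark.sql.hive.metastore.jars", "maven")
    .getOrCreate()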

After SPARK-27176 and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues:

  • HIVE-6727: Table level stats for external tables are set incorrectly.
  • HIVE-15653: Some ALTER TABLE commands drop table stats.
  • SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
  • SPARK-25193: insert overwrite doesn't throw exception when drop old data fails.
  • SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
  • SPARK-26332: Spark sql write orc table on viewFS throws exception.
  • SPARK-26437: Decimal data becomes bigint to query, unable to query.

How was this patch tested?

This PR tests Spark's Hadoop 3.2 profile on Jenkins, and #24591 tests Spark's Hadoop 2.7 profile on Jenkins.

This PR closes #24591.

for dep in dependencies:
    dep.dependent_modules.add(self)
all_modules.append(self)
if name == "hive-thriftserver" and hadoop_version == "hadoop3.2":
Member Author:

Used to skip the hive-thriftserver module for hadoop-3.2. Will revert this change once it can be merged.

Member:

Maybe leave a TODO here just to try to make sure that doesn't get lost


SparkQA commented Apr 17, 2019

Test build #104646 has finished for PR 24391 at commit 9c0046a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Apr 17, 2019

retest this please

protected def isBarrierClass(name: String): Boolean =
  name.startsWith(classOf[HiveClientImpl].getName) ||
  name.startsWith(classOf[Shim].getName) ||
  name.startsWith(classOf[ShimLoader].getName) ||
Member Author:

Add org.apache.hadoop.hive.shims.ShimLoader as a barrier class; otherwise the hadoop-3.2 profile can't access Hive metastore versions 0.12 to 2.2:

build/sbt "hive/testOnly *.VersionsSuite" -Phadoop-3.2 -Phive
...
[info] - 0.12: create client *** FAILED *** (36 seconds, 207 milliseconds)
[info]   java.lang.reflect.InvocationTargetException:
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info]   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:295)
[info]   at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
[info]   at org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$7(VersionsSuite.scala:130)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105)
[info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite.run(Suite.scala:1147)
[info]   at org.scalatest.Suite.run$(Suite.scala:1129)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
[info]   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:54)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:54)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info]   Cause: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.2.0
[info]   at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:286)
[info]   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:187)
[info]   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:119)


SparkQA commented Apr 17, 2019

Test build #104649 has finished for PR 24391 at commit 9c0046a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 17, 2019

Test build #104647 has finished for PR 24391 at commit 9c0046a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 17, 2019

Test build #104657 has finished for PR 24391 at commit 490a751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 19, 2019

Test build #104750 has finished for PR 24391 at commit 70b4b0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Apr 20, 2019

retest this please


SparkQA commented Apr 20, 2019

Test build #104768 has finished for PR 24391 at commit 70b4b0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-27402][SQL][test-hadoop3.2] Fix hadoop-3.2 test issue(except the hive-thriftserver module) [SPARK-27402][SQL][test-hadoop2.7] Fix hadoop-3.2 test issue(except the hive-thriftserver module) Apr 22, 2019

wangyum commented Apr 22, 2019

retest this please


SparkQA commented Apr 22, 2019

Test build #104805 has finished for PR 24391 at commit 0893172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 22, 2019

Test build #104809 has finished for PR 24391 at commit 0893172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-27402][SQL][test-hadoop2.7] Fix hadoop-3.2 test issue(except the hive-thriftserver module) [SPARK-27402][SQL][test-hadoop3.2] Fix hadoop-3.2 test issue(except the hive-thriftserver module) Apr 23, 2019

wangyum commented Apr 23, 2019

retest this please


SparkQA commented Apr 23, 2019

Test build #104838 has finished for PR 24391 at commit 0893172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 27, 2019

Test build #104963 has finished for PR 24391 at commit 262be84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Apr 27, 2019

retest this please


SparkQA commented Apr 27, 2019

Test build #104964 has finished for PR 24391 at commit 262be84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Apr 27, 2019

retest this please


if (analyzedBySpark) {
  if (HiveUtils.isHive23) {
    assert(numRows.isDefined && numRows.get == 500)
Member:

Finally, Hive fixed this annoying bug. The statistics are very important for us too. Please add this to the migration guide and document the behavior difference when users use different Hadoop profiles.
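
For context, a hedged way to observe these table-level statistics from Spark (standard SQL commands; the table name t is made up):

  // Recompute stats, then inspect the Statistics line Spark reports for the table.
  spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")
  spark.sql("DESCRIBE EXTENDED t").show(numRows = 100, truncate = false)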

assert(properties("totalSize").toLong <= 0, "external table totalSize must be <= 0")
assert(properties("rawDataSize").toLong <= 0, "external table rawDataSize must be <= 0")
if (HiveUtils.isHive23) {
  assert(properties("totalSize").toLong > 0, "external table totalSize must be > 0")
Member:

What is the number? How does Hive collect this info?

@gatorsmile

My review is not finished; I will review it this weekend. If I do not leave comments, please ping me via IM.


SparkQA commented May 3, 2019

Test build #105106 has finished for PR 24391 at commit e321f61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    hadoop_version = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "hadoop2.7")
else:
    hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7")

@gatorsmile (Member), May 6, 2019:

We should have a log message to show which profile we are using; otherwise, it is hard for us to know which profile is activated.

Member:

What I mean is, we should have a log message to show which profile is actually being used.

Just relying on the command parameter list does not sound very reliable.

Member Author:

OK. Updated to:
[screenshot]


## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, we upgraded the built-in Hive to 2.3 for hadoop-3.x. This upgrade fixes some Hive issues:
@gatorsmile (Member), May 6, 2019:

"for hadoop-3.x" -> "when enabling the Hadoop 3.2+ profile."


SparkQA commented May 7, 2019

Test build #105177 has finished for PR 24391 at commit aaf2952.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes some Hive issues:
@gatorsmile (Member), May 10, 2019:

some Hive issues -> the following issues:


SparkQA commented May 10, 2019

Test build #105302 has finished for PR 24391 at commit 994b5a7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// assert(storageFormat.properties.get("path") === expected)
// }
assert(storageFormat.locationUri === Some(expected))
assert(Some(storageFormat.locationUri.get.getPath) === Some(expected.getPath))
Member:

Do we need to add Some?

What is the reason we need to change this test case?

Member Author:

[screenshot]
Add Some to support both Hive 1.2 and 2.3.

Member:

Then use storageFormat.locationUri.map(_.getPath).
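
A minimal sketch of the suggestion (assuming expected is the URI already used in this test):

  // Compare Options directly instead of calling .get on the Option.
  assert(storageFormat.locationUri.map(_.getPath) === Some(expected.getPath))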

// We hard-code these configurations here to allow bin/spark-shell, bin/spark-sql
// and sbin/start-thriftserver.sh to automatically create the Derby metastore.
hiveConf.setBoolean("hive.metastore.schema.verification", false)
hiveConf.setBoolean("datanucleus.schema.autoCreateAll", true)
Member:

Could you add a test case to ensure that we still respect the values set by users?

For example, if users set a different value, we do not reset it to our values.

Member Author:

I updated it to:

  def conf: HiveConf = {
    val hiveConf = state.getConf
    // Hive changed the default of datanucleus.schema.autoCreateAll from true to false
    // and hive.metastore.schema.verification from false to true since Hive 2.0.
    // For details, see HIVE-6113, HIVE-12463 and HIVE-1841.
    // In a production environment, isDefaultMSUri and isDerbyMS should not both be true.
    // We hard-code hive.metastore.schema.verification and datanucleus.schema.autoCreateAll
    // to allow bin/spark-shell, bin/spark-sql and sbin/start-thriftserver.sh to automatically
    // create the Derby metastore when running Spark in a non-production environment.
    val isDefaultMSUri = hiveConf.getVar(METASTOREURIS).equals(METASTOREURIS.defaultStrVal)
    val isDerbyMS = hiveConf.getVar(METASTORECONNECTURLKEY).startsWith("jdbc:derby")
    if (isDefaultMSUri && isDerbyMS) {
      hiveConf.setBoolean("hive.metastore.schema.verification", false)
      hiveConf.setBoolean("datanucleus.schema.autoCreateAll", true)
    }
    hiveConf
  }

Because we can't tell whether the value is the default or was set by the user, we only set it for non-production environments.
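
A hedged sketch of the kind of check the reviewer asked for (the connection URL is made up): with a user-configured, non-Derby metastore, the guard never fires, so the user's own settings survive.

  import org.apache.hadoop.hive.conf.HiveConf

  val hiveConf = new HiveConf()
  // The user points at a non-Derby metastore and opts into schema verification.
  hiveConf.set("javax.jdo.option.ConnectionURL", "jdbc:mysql://example:3306/metastore")
  hiveConf.setBoolean("hive.metastore.schema.verification", true)
  // isDerbyMS is false for this URL, so conf leaves both values untouched.
  assert(hiveConf.getBoolean("hive.metastore.schema.verification", false))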

@gatorsmile

My review is finished. Overall, it looks good to me. After we ensure that Maven can pass all the tests, we also need to rerun SBT.

@gatorsmile

@wangyum Thank you for your hard work!!!


## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues:
Member:

Does this need to go in the migration guide? It's helpful info, but does it help a user know what they need to do differently?

Member Author:

Add it to the release notes?

Member:

The JIRA list can be moved to the PR description, but we should document the actual behavior difference in the migration guide.

Member Author:

OK. Created SPARK-27686 to update the migration guide.



SparkQA commented May 10, 2019

Test build #105319 has finished for PR 24391 at commit 4372759.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 11, 2019

Test build #105326 has finished for PR 24391 at commit 8520124.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** Returns the configuration for the current session. */
def conf: HiveConf = state.getConf
def conf: HiveConf = {
Member:

To avoid a regression, could you change it to:

if (!isHive23) {
  state.getConf
} else {
  ...
}
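
A hedged sketch of how that guard could combine with the Derby check shown above (not the exact committed code):

  def conf: HiveConf = if (!isHive23) {
    state.getConf
  } else {
    val hiveConf = state.getConf
    val isDefaultMSUri = hiveConf.getVar(METASTOREURIS).equals(METASTOREURIS.defaultStrVal)
    val isDerbyMS = hiveConf.getVar(METASTORECONNECTURLKEY).startsWith("jdbc:derby")
    if (isDefaultMSUri && isDerbyMS) {
      // Only relax these for the default Derby (non-production) setup.
      hiveConf.setBoolean("hive.metastore.schema.verification", false)
      hiveConf.setBoolean("datanucleus.schema.autoCreateAll", true)
    }
    hiveConf
  }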


SparkQA commented May 13, 2019

Test build #105344 has finished for PR 24391 at commit 27e5b6e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented May 13, 2019

retest this please


SparkQA commented May 13, 2019

Test build #105346 has finished for PR 24391 at commit 27e5b6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

LGTM

Thanks! Merged to master.

import re
import os

if os.environ.get("AMPLAB_JENKINS"):
Member:

@wangyum, this will show the info every time this module is imported. Why did we do this here?

Member:

Okay, it's a temp fix so I'm fine. I will make a follow-up to handle https://github.com/apache/spark/pull/24639/files

Member Author:

Skip the hive-thriftserver module when running the hadoop-3.2 tests. It will be removed in another PR: https://github.com/apache/spark/pull/24628/files

Member Author:

Thank you @HyukjinKwon
