
Conversation

@yaooqinn (Member) commented Jul 1, 2020

What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-29283, we only show the error message of the root cause to end-users through the JDBC client. In some cases, this erases the straightforward messages that we intentionally craft to help them understand better.

The root cause is often obscure for JDBC end-users who only write SQL queries.

e.g.,

```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
```

is better than just

```
Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

We should do what Hive does in https://issues.apache.org/jira/browse/HIVE-14368.

In general, this PR partially reverts SPARK-29283, ports HIVE-14368, and improves test coverage.

Why are the changes needed?

  1. Do the same as Hive 2.3 and later for getting the error message in ThriftCLIService.GetOperationStatus.
  2. The root cause is often obscure for JDBC end-users who only write SQL queries.
  3. Consistency with the spark-sql script.

Does this PR introduce any user-facing change?

Yes. When an error occurs while running queries through the thrift server, you will get the full stack trace instead of only the root cause's message.

How was this patch tested?

Added a unit test.
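To make the before/after concrete, here is a minimal sketch, not the literal patch: `ErrorTextSketch` and its method names are illustrative, while `Utils.exceptionString` is Spark's existing helper that renders a Throwable with its message, stack trace, and full cause chain.

```scala
import org.apache.spark.util.{Utils => SparkUtils}

object ErrorTextSketch {
  // SPARK-29283 behavior: walk to the bottom of the cause chain and deliver
  // only that exception's message, losing e.g. AnalysisException on the way.
  def rootCauseMessage(e: Throwable): String = {
    var root: Throwable = e
    while (root.getCause != null) root = root.getCause
    String.valueOf(root.getMessage) // getMessage may be null
  }

  // This PR's behavior: deliver the whole chain, i.e. message, stack trace,
  // and every "Caused by" section, as Hive does.
  def fullExceptionText(e: Throwable): String =
    SparkUtils.exceptionString(e)
}
```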


@SparkQA commented Jul 1, 2020

Test build #124751 has finished for PR 28963 at commit 7cb0ae8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin (Contributor):
Two questions:

  1. Do the changes work as expected when AQE is enabled, i.e., the root cause won't be hidden?
  2. Do we need a configuration to control this behavior?

@yaooqinn (Member, Author) commented Jul 1, 2020

> Do the changes work as expected when AQE is enabled, i.e., the root cause won't be hidden?

I could not reproduce the case you provided in that PR; the query didn't fail. But to this question, the answer is obviously yes.

> Do we need a configuration to control this behavior?

Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.

@SparkQA commented Jul 1, 2020

Test build #124755 has finished for PR 28963 at commit 48c2862.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 1, 2020

Test build #124758 has finished for PR 28963 at commit ba0c44d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski (Contributor):
@LantaoJin changed it the other way around in #25960, to display the actual error under AQE. I think there are more cases where the more useful internal error gets obscured: e.g. when a task fails, you would get the actual reason for the failure from the root cause, as opposed to just a "Task failed 4 times" RuntimeException.
We have also recently been investigating another issue, where a SparkUpgradeException would not be displayed by the thrift server because the code went for the root cause.

Maybe we should do it in general, but then whitelist a few cases that should go for the root cause (a sketch follows the list)?

  • AQE exception
  • Job/stage failed
  • ...
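A hypothetical shape for that whitelist idea — none of these names are from the PR; it only illustrates unwrapping wrapper exceptions known to hide a more useful cause:

```scala
import scala.annotation.tailrec
import org.apache.spark.SparkException

object UnwrapSketch {
  // Unwrap only known wrappers (e.g. the "Job aborted" / task-failure
  // SparkException) and stop at anything else, so exceptions such as
  // AnalysisException or SparkUpgradeException stay visible at the top.
  @tailrec
  def pickException(e: Throwable): Throwable = e match {
    case se: SparkException if se.getCause != null => pickException(se.getCause)
    case other => other
  }
}
```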

@juliuszsompolski (Contributor):
cc @alismess-db

@yaooqinn (Member, Author) commented Jul 1, 2020

I guess the current approach can uncover the real reason for task failures by printing all suppressed exceptions, if any. Either before or after #25960, too much important information is missing when an error occurs.

@LantaoJin (Contributor) commented Jul 1, 2020

> Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.

I still think we should add a configuration if we do this, since many Spark users still use Hive 1.x. Hive changed its behavior, but Spark users should be able to choose the result they want. In my company, most users access the Spark thrift server through JDBC, and most of them don't want a long stack trace instead of the root cause. So providing a configuration to show either the full stack or only the root cause may be friendlier.

@LantaoJin (Contributor) commented Jul 1, 2020

> I could not reproduce the case you provided in that PR; the query didn't fail. But to this question, the answer is obviously yes.

You'd better test it before drawing a conclusion. In the description of #25960 I gave a way to reproduce it; I don't know why it doesn't work for you.

@yaooqinn (Member, Author) commented Jul 2, 2020

> > Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.
>
> I still think we should add a configuration if we do this, since many Spark users still use Hive 1.x. Hive changed its behavior, but Spark users should be able to choose the result they want. In my company, most users access the Spark thrift server through JDBC, and most of them don't want a long stack trace instead of the root cause. So providing a configuration to show either the full stack or only the root cause may be friendlier.

Thanks for your explanation. SGTM.

@cloud-fan (Contributor):
The AQE exception should have nothing special now; see 45864fa.

@cloud-fan (Contributor):
> So providing a configuration to show either the full stack or only the root cause may be friendlier.

What's the behavior before and after this PR? I thought this PR only picks the actual exception, and does not add/remove stack traces.

@yaooqinn (Member, Author) commented Jul 2, 2020

> What's the behavior before and after this PR? I thought this PR only picks the actual exception, and does not add/remove stack traces.

Yes, we do not change the exception itself; we only pick what to deliver to the JDBC client side through the error responses, which eventually ends up in the reason field of java.sql.SQLException on the client side.

Before this PR (i.e., with SPARK-29283), only the error message of the root cause is delivered. In that case, many intentionally raised exceptions are omitted, e.g. SparkUpgradeException and AnalysisException; thus, the root cause is not always the actual error.
After this PR, we fill the error response with the full stack trace, like Hive does.
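For reference, the HIVE-14368 change being ported has roughly this shape — a sketch from memory of Hive's ThriftCLIService.GetOperationStatus, not the exact Spark diff; `StringUtils.stringifyException` is Hadoop's standard helper for rendering a Throwable with its full cause chain:

```scala
import org.apache.hadoop.util.StringUtils
import org.apache.hive.service.cli.HiveSQLException
import org.apache.hive.service.rpc.thrift.TGetOperationStatusResp

// Sketch: fill the Thrift error response with the full exception text, which
// ends up in the reason field of java.sql.SQLException on the client side.
def fillErrorResponse(resp: TGetOperationStatusResp, opException: HiveSQLException): Unit = {
  resp.setSqlState(opException.getSQLState)
  resp.setErrorCode(opException.getErrorCode)
  // before HIVE-14368: resp.setErrorMessage(opException.getMessage)  // message only
  resp.setErrorMessage(StringUtils.stringifyException(opException))  // full trace
}
```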

WDYT about adding a configuration to control this behavior, @cloud-fan? Personally, I'm -1 on it, because the change is trivial, not really a behavior change, and consistent with current Hive.

@LantaoJin (Contributor) commented Jul 2, 2020

Our users (data analysts and scientists) on JDBC/ODBC don't want to debug anything from the JDBC error message (we can still get the full error stack from the driver log, right?). Like other databases, the endpoint error message should be straightforward. In most cases, the root-cause message is sufficient for users. But I am open to the configuration; we will add it in our internal Spark. Anyway, adding a configuration and keeping the default to show either the full stack or only the root cause is not hard.

@cloud-fan (Contributor):
> only the error message of the root cause is delivered.

Can you be more specific? If AnalysisException is skipped, that's a bug to me, as AnalysisException is the exception we expect end-users to see, not the root cause.

I'm not a fan of hiding disagreements behind a config. Let's reach a consensus about what error we should deliver to end-users. IMO the thrift server, the spark-sql shell, and Spark applications should have consistent error messages.

@yaooqinn (Member, Author) commented Jul 2, 2020

Here is an example.

Without the PR:

```
 kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200620  bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: Error running query: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2 (state=,code=0)
```

With the PR:

```
 kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200630  bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:322)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.$anonfun$run$1(SparkExecuteStatementOperation.scala:222)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:222)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:217)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:233)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1097)

....

Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
	at org.apache.spark.unsafe.types.UTF8String.toIntExact(UTF8String.java:1335)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2$adapted(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$1(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:824)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:475)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1094)
	... 96 more (state=,code=0)
```

You can see in the logs above that without this PR the AnalysisException is missing.

@cloud-fan (Contributor):
But we do add a lot more stack trace. Is it possible to hide it? BTW, have we been hiding the stack trace since 3.0?

@juliuszsompolski (Contributor) commented Jul 2, 2020

For debugging purposes, I would be happy to get more stack trace rather than less, and to see all the causes.
If what @yaooqinn now proposes displays more of the stack trace and more causes, I would be +1 for that.
I thought what @LantaoJin did in #25960 was needed because the thrift server displayed just the AQE exception, hiding the actual trace and cause; at least that's what the examples in that PR suggest.

Did something else change about these full stack traces being displayed?

@cloud-fan (Contributor):
If we were hiding the stack trace by mistake, I'm +1 on this PR.

@yaooqinn (Member, Author) commented Jul 3, 2020

> But we do add a lot more stack trace. Is it possible to hide it?

Compared to digging through the whole driver log of a long-running thrift server to find an error trace that may be missing here, I'd say it is much easier to print a little more here.

> BTW, have we been hiding the stack trace since 3.0?

Not really. The only difference between 2.x and 3.0 is whether the error message is taken from the top of the exception chain or from the bottom. This PR introduces the stack trace to match the behavior of the spark-sql shell.

@yaooqinn (Member, Author) commented Jul 3, 2020

> Did something else change about these full stack traces being displayed?

Nope.

A review thread on this diff hunk (excerpted as shown in the review):

```scala
    statementId, e.getMessage, SparkUtils.exceptionString(e))
e match {
  case hiveException: HiveSQLException =>
    HiveThriftServer2.eventManager.onStatementError(
```
Review comment (Contributor):
now the onStatementError is never called?


@cloud-fan (Contributor):
Personally, I'd like to see stack traces, as I'm a developer. But I agree with @LantaoJin that they may not be friendly to end-users. If the thrift server has never shown stack traces so far, I think we should at least have a dedicated PR to discuss it. Can we focus on getting the actual exception in this PR?

@yaooqinn (Member, Author) commented Jul 3, 2020

The thing is that the actual exception is hard to define here, as we capture all Throwables.
Alternatively, we can add a configuration that does not decide which exception to deal with, but simply whether to include the exception's stack trace.
Without the stack trace, we would print just the message parts of the exception tree, something like:

```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
  - Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

Otherwise, you get the full stack trace as in #28963 (comment).
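A minimal sketch of that messages-only mode (`messagesOnly` is a hypothetical helper, not code from the PR): walk the cause chain and keep one line per exception.

```scala
// Render only the message part of each exception in the chain, producing
// output like the example above.
def messagesOnly(e: Throwable): String = {
  val sb = new StringBuilder(s"Error running query: $e")
  var cause = e.getCause
  while (cause != null) {
    sb.append(s"\n  - Caused by: $cause")
    cause = cause.getCause
  }
  sb.toString
}
```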

@cloud-fan (Contributor):
SGTM.

@juliuszsompolski (Contributor):
I would be +1 for printing as much of the exception as possible.
We often get support tickets from end users without any indication of when the error happened; sometimes it is from a week or more ago, so anything the end user can copy-paste into their bug report is helpful.
I wouldn't worry about the exception being unfriendly to the end user. I think getting the error in the first place is even more unfriendly, and if the error stack can be used to resolve the error more efficiently, then that is friendly :-).

@cloud-fan (Contributor):
There are user-facing errors and unexpected internal errors. For example, if a user makes a mistake in a SQL query, we should tell them what the error is (like a table not found), and a stack trace is not useful in that case. But we do need the stack trace for unexpected internal errors, to help us debug.

That said, adding a config is not the best solution. We need an error reporting framework that hides the stack trace for user-facing errors and keeps it for the others.

How about we keep this PR as it is, so that the thrift server is consistent with the spark-sql shell and Spark applications, and work on the error reporting framework later? cc @gatorsmile
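One way to picture the split such a framework would make (purely illustrative; Spark has no such API at this point):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.util.{Utils => SparkUtils}

// User mistakes (e.g. AnalysisException for a bad query) get the message only;
// unexpected internal errors keep the full stack trace for debugging.
def renderForUser(e: Throwable): String = e match {
  case ae: AnalysisException => ae.getMessage
  case other => SparkUtils.exceptionString(other)
}
```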

@yaooqinn changed the title from "[SPARK-32145][SQL] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" to "[SPARK-32145][SQL][test-hive1.2] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" on Jul 3, 2020.
@yaooqinn (Member, Author) commented Jul 3, 2020

> There are user-facing errors and unexpected internal errors. For example, if a user makes a mistake in a SQL query, we should tell them what the error is (like a table not found), and a stack trace is not useful in that case.

Yes. In such a case the stack trace is useless, but it will also be very short. So +1 for keeping this PR as it is.

@yaooqinn (Member, Author) commented Jul 3, 2020

retest this please

@SparkQA commented Jul 5, 2020

Test build #124938 has finished for PR 28963 at commit d72074a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn changed the title from "[SPARK-32145][SQL][test-hive1.2] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" to "[SPARK-32145][SQL][test-hive1.2][test-hadoop2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" on Jul 6, 2020.
@yaooqinn (Member, Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125008 has finished for PR 28963 at commit d72074a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member, Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125043 has finished for PR 28963 at commit d72074a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member, Author) commented Jul 6, 2020

Kindly ping @juliuszsompolski @cloud-fan, thanks.

@cloud-fan (Contributor):
thanks, merging to master!

@cloud-fan closed this in 59a7087 on Jul 6, 2020.

A review thread on this diff hunk in the SparkOperation trait:

```scala
protected def onError(): PartialFunction[Throwable, Unit] = {
  case e: Throwable =>
    logError(s"Error executing get catalogs operation with $statementId", e)
```
Review comment (Contributor):

This error message still refers to "get catalogs" but is logged for every type of operation.

Reply (Member, Author):

Nice catch! Thanks!
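One plausible shape for the fix (the actual patch landed in #29140 and may word it differently): derive the operation name instead of hard-coding "get catalogs".

```scala
import org.apache.spark.internal.Logging

// Sketch only: use the concrete operation class name in the log message.
trait SparkOperationErrorSketch extends Logging {
  def statementId: String

  protected def onError(): PartialFunction[Throwable, Unit] = {
    case e: Throwable =>
      logError(s"Error executing ${getClass.getSimpleName} with $statementId", e)
  }
}
```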

cloud-fan pushed a commit that referenced this pull request on Jul 17, 2020:
### What changes were proposed in this pull request?

Fix a typo in the error log of the SparkOperation trait, reported in #28963 (comment).

### Why are the changes needed?

Fix an incorrect message in the thrift server driver log.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Passing GitHub Actions.

Closes #29140 from yaooqinn/SPARK-32145-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>