
Conversation

@yaooqinn (Member) commented Jul 1, 2020

What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-29283, we only show the error message of the root cause to end-users through the JDBC client. In some cases, this erases the straightforward messages that we intentionally craft to help them understand better.

The root cause is often obscure for JDBC end-users who only write SQL queries.

e.g.,

```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
```

is better than just

```
Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

We should do what Hive does in https://issues.apache.org/jira/browse/HIVE-14368.

In general, this PR partially reverts SPARK-29283, ports HIVE-14368, and improves test coverage.

Why are the changes needed?

  1. Do the same as Hive 2.3 and later for getting the error message in ThriftCLIService.GetOperationStatus.
  2. The root cause is often obscure for JDBC end-users who only write SQL queries.
  3. Consistency with the spark-sql script.

Does this PR introduce any user-facing change?

Yes. When an error occurs while running queries through the thrift server, you will get the full stack trace instead of only the root cause's message.

How was this patch tested?

Added a unit test.
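To make the before/after concrete, here is a minimal sketch, not the literal patch: `ErrorTextSketch` and its method names are illustrative, while `Utils.exceptionString` is Spark's existing helper that renders a Throwable with its message, stack trace, and full cause chain.

```scala
import org.apache.spark.util.{Utils => SparkUtils}

object ErrorTextSketch {
  // SPARK-29283 behavior: walk to the bottom of the cause chain and deliver
  // only that exception's message, losing e.g. AnalysisException on the way.
  def rootCauseMessage(e: Throwable): String = {
    var root: Throwable = e
    while (root.getCause != null) root = root.getCause
    String.valueOf(root.getMessage) // getMessage may be null
  }

  // This PR's behavior: deliver the whole chain, i.e. message, stack trace,
  // and every "Caused by" section, as Hive does.
  def fullExceptionText(e: Throwable): String =
    SparkUtils.exceptionString(e)
}
```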


@SparkQA commented Jul 1, 2020

Test build #124751 has finished for PR 28963 at commit 7cb0ae8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin (Contributor):
Two questions:

  1. Do the changes work as expected when AQE is enabled, i.e., the root cause won't be hidden?
  2. Do we need a configuration to control this behavior?

@yaooqinn (Member, Author) commented Jul 1, 2020

> Do the changes work as expected when AQE is enabled, i.e., the root cause won't be hidden?

I could not reproduce the case you provided in that PR; the query didn't fail. But to this question, the answer is obviously yes.

> Do we need a configuration to control this behavior?

Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.

@SparkQA commented Jul 1, 2020

Test build #124755 has finished for PR 28963 at commit 48c2862.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 1, 2020

Test build #124758 has finished for PR 28963 at commit ba0c44d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski (Contributor):
@LantaoJin changed it the other way around in #25960, to display the actual error under AQE. I think there are more cases where the more useful internal error gets obscured: e.g. when a task fails, you would get the actual reason for the failure from the root cause, as opposed to just a "Task failed 4 times" RuntimeException.
We have also recently been investigating another issue, where a SparkUpgradeException would not be displayed by the thrift server because the code went for the root cause.

Maybe we should do it in general, but then whitelist a few cases that should go for the root cause (a sketch follows the list)?

  • AQE exception
  • Job/stage failed
  • ...
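A hypothetical shape for that whitelist idea — none of these names are from the PR; it only illustrates unwrapping wrapper exceptions known to hide a more useful cause:

```scala
import scala.annotation.tailrec
import org.apache.spark.SparkException

object UnwrapSketch {
  // Unwrap only known wrappers (e.g. the "Job aborted" / task-failure
  // SparkException) and stop at anything else, so exceptions such as
  // AnalysisException or SparkUpgradeException stay visible at the top.
  @tailrec
  def pickException(e: Throwable): Throwable = e match {
    case se: SparkException if se.getCause != null => pickException(se.getCause)
    case other => other
  }
}
```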

@juliuszsompolski (Contributor):
cc @alismess-db

@yaooqinn (Member, Author) commented Jul 1, 2020

I guess the current approach can uncover the real reason for task failures by printing all suppressed exceptions, if any. Either before or after #25960, too much important information is missing when an error occurs.

@LantaoJin (Contributor) commented Jul 1, 2020

> Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.

I still think we should add a configuration if we do this, since many Spark users still use Hive 1.x. Hive changed its behavior, but Spark users should be able to choose the result they want. In my company, most users access the Spark thrift server through JDBC, and most of them don't want a long stack trace instead of the root cause. So providing a configuration to show either the full stack or only the root cause may be friendlier.

@LantaoJin (Contributor) commented Jul 1, 2020

> I could not reproduce the case you provided in that PR; the query didn't fail. But to this question, the answer is obviously yes.

You'd better test it before drawing a conclusion. In the description of #25960 I gave a way to reproduce it; I don't know why it doesn't work for you.

@yaooqinn (Member, Author) commented Jul 2, 2020

> > Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server, and the behavior is consistent with the default Hive 2.3.7 we use.
>
> I still think we should add a configuration if we do this, since many Spark users still use Hive 1.x. Hive changed its behavior, but Spark users should be able to choose the result they want. In my company, most users access the Spark thrift server through JDBC, and most of them don't want a long stack trace instead of the root cause. So providing a configuration to show either the full stack or only the root cause may be friendlier.

Thanks for your explanation. SGTM.

@cloud-fan (Contributor):
The AQE exception should have nothing special now; see 45864fa.

@cloud-fan (Contributor):
> So providing a configuration to show either the full stack or only the root cause may be friendlier.

What's the behavior before and after this PR? I thought this PR only picks the actual exception, and does not add/remove stack traces.

@yaooqinn (Member, Author) commented Jul 2, 2020

> What's the behavior before and after this PR? I thought this PR only picks the actual exception, and does not add/remove stack traces.

Yes, we do not change the exception itself; we only pick what to deliver to the JDBC client side through the error responses, which eventually ends up in the reason field of java.sql.SQLException on the client side.

Before this PR (i.e., with SPARK-29283), only the error message of the root cause is delivered. In that case, many intentionally raised exceptions are omitted, e.g. SparkUpgradeException and AnalysisException; thus, the root cause is not always the actual error.
After this PR, we fill the error response with the full stack trace, like Hive does.
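For reference, the HIVE-14368 change being ported has roughly this shape — a sketch from memory of Hive's ThriftCLIService.GetOperationStatus, not the exact Spark diff; `StringUtils.stringifyException` is Hadoop's standard helper for rendering a Throwable with its full cause chain:

```scala
import org.apache.hadoop.util.StringUtils
import org.apache.hive.service.cli.HiveSQLException
import org.apache.hive.service.rpc.thrift.TGetOperationStatusResp

// Sketch: fill the Thrift error response with the full exception text, which
// ends up in the reason field of java.sql.SQLException on the client side.
def fillErrorResponse(resp: TGetOperationStatusResp, opException: HiveSQLException): Unit = {
  resp.setSqlState(opException.getSQLState)
  resp.setErrorCode(opException.getErrorCode)
  // before HIVE-14368: resp.setErrorMessage(opException.getMessage)  // message only
  resp.setErrorMessage(StringUtils.stringifyException(opException))  // full trace
}
```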

WDYT about adding a configuration to control this behavior, @cloud-fan? Personally, I'm -1 on it, because the change is trivial, not really a behavior change, and consistent with current Hive.

@LantaoJin (Contributor) commented Jul 2, 2020

Our users (data analysts and scientists) on JDBC/ODBC don't want to debug anything from the JDBC error message (we can still get the full error stack from the driver log, right?). Like other databases, the endpoint error message should be straightforward. In most cases, the root-cause message is sufficient for users. But I am open to the configuration; we will add it in our internal Spark. Anyway, adding a configuration and keeping the default to show either the full stack or only the root cause is not hard.

@cloud-fan (Contributor):
> only the error message of the root cause is delivered.

Can you be more specific? If AnalysisException is skipped, that's a bug to me, as AnalysisException is the exception we expect end-users to see, not the root cause.

I'm not a fan of hiding disagreements behind a config. Let's reach a consensus about what error we should deliver to end-users. IMO the thrift server, the spark-sql shell, and Spark applications should have consistent error messages.

@yaooqinn (Member, Author) commented Jul 2, 2020

Here is an example.

Without the PR:

```
 kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200620  bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: Error running query: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2 (state=,code=0)
```

With the PR:

```
 kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200630  bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:322)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.$anonfun$run$1(SparkExecuteStatementOperation.scala:222)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:222)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:217)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:233)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1097)

....

Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
	at org.apache.spark.unsafe.types.UTF8String.toIntExact(UTF8String.java:1335)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2$adapted(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$1(Cast.scala:515)
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:824)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:475)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1094)
	... 96 more (state=,code=0)
```

You can see in the logs above that without this PR the AnalysisException is missing.

@cloud-fan (Contributor):
But we do add a lot more stack trace. Is it possible to hide it? BTW, have we been hiding the stack trace since 3.0?

@juliuszsompolski (Contributor) commented Jul 2, 2020

For debugging purposes, I would be happy to get more stack trace rather than less, and to see all the causes.
If what @yaooqinn now proposes displays more of the stack trace and more causes, I would be +1 for that.
I thought what @LantaoJin did in #25960 was needed because the thrift server displayed just the AQE exception, hiding the actual trace and cause; at least that's what the examples in that PR suggest.

Did something else change about these full stack traces being displayed?

@cloud-fan (Contributor):
If we were hiding the stack trace by mistake, I'm +1 on this PR.

@yaooqinn (Member, Author) commented Jul 3, 2020

> But we do add a lot more stack trace. Is it possible to hide it?

Compared to digging through the whole driver log of a long-running thrift server to find an error trace that may be missing here, I'd say it is much easier to print a little more here.

> BTW, have we been hiding the stack trace since 3.0?

Not really. The only difference between 2.x and 3.0 is whether the error message is taken from the top of the exception chain or from the bottom. This PR introduces the stack trace to match the behavior of the spark-sql shell.

@yaooqinn (Member, Author) commented Jul 3, 2020

> Did something else change about these full stack traces being displayed?

Nope.

A review thread on this diff hunk (excerpted as shown in the review):

```scala
    statementId, e.getMessage, SparkUtils.exceptionString(e))
e match {
  case hiveException: HiveSQLException =>
    HiveThriftServer2.eventManager.onStatementError(
```
Review comment (Contributor):
now the onStatementError is never called?


@cloud-fan (Contributor):
Personally, I'd like to see stack traces, as I'm a developer. But I agree with @LantaoJin that they may not be friendly to end-users. If the thrift server has never shown stack traces so far, I think we should at least have a dedicated PR to discuss it. Can we focus on getting the actual exception in this PR?

@yaooqinn (Member, Author) commented Jul 3, 2020

The thing is that the actual exception is hard to define here, as we capture all Throwables.
Alternatively, we can add a configuration that does not decide which exception to deal with, but simply whether to include the exception's stack trace.
Without the stack trace, we would print just the message parts of the exception tree, something like:

```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
  - Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

Otherwise, you get the full stack trace as in #28963 (comment).
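A minimal sketch of that messages-only mode (`messagesOnly` is a hypothetical helper, not code from the PR): walk the cause chain and keep one line per exception.

```scala
// Render only the message part of each exception in the chain, producing
// output like the example above.
def messagesOnly(e: Throwable): String = {
  val sb = new StringBuilder(s"Error running query: $e")
  var cause = e.getCause
  while (cause != null) {
    sb.append(s"\n  - Caused by: $cause")
    cause = cause.getCause
  }
  sb.toString
}
```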

@cloud-fan (Contributor):
SGTM.

@juliuszsompolski (Contributor):
I would be +1 for printing as much of the exception as possible.
We often get support tickets from end users without any indication of when the error happened; sometimes it is from a week or more ago, so anything the end user can copy-paste into their bug report is helpful.
I wouldn't worry about the exception being unfriendly to the end user. I think getting the error in the first place is even more unfriendly, and if the error stack can be used to resolve the error more efficiently, then that is friendly :-).

@cloud-fan (Contributor):
There are user-facing errors and unexpected internal errors. For example, if a user makes a mistake in a SQL query, we should tell them what the error is (like a table not found), and a stack trace is not useful in that case. But we do need the stack trace for unexpected internal errors, to help us debug.

That said, adding a config is not the best solution. We need an error reporting framework that hides the stack trace for user-facing errors and keeps it for the others.

How about we keep this PR as it is, so that the thrift server is consistent with the spark-sql shell and Spark applications, and work on the error reporting framework later? cc @gatorsmile
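One way to picture the split such a framework would make (purely illustrative; Spark has no such API at this point):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.util.{Utils => SparkUtils}

// User mistakes (e.g. AnalysisException for a bad query) get the message only;
// unexpected internal errors keep the full stack trace for debugging.
def renderForUser(e: Throwable): String = e match {
  case ae: AnalysisException => ae.getMessage
  case other => SparkUtils.exceptionString(other)
}
```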

@yaooqinn changed the title from "[SPARK-32145][SQL] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" to "[SPARK-32145][SQL][test-hive1.2] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" on Jul 3, 2020.
@yaooqinn (Member, Author) commented Jul 3, 2020

> There are user-facing errors and unexpected internal errors. For example, if a user makes a mistake in a SQL query, we should tell them what the error is (like a table not found), and a stack trace is not useful in that case.

Yes. In such a case the stack trace is useless, but it will also be very short. So +1 for keeping this PR as it is.

@yaooqinn (Member, Author) commented Jul 3, 2020

retest this please

@SparkQA commented Jul 5, 2020

Test build #124938 has finished for PR 28963 at commit d72074a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn changed the title from "[SPARK-32145][SQL][test-hive1.2] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" to "[SPARK-32145][SQL][test-hive1.2][test-hadoop2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message" on Jul 6, 2020.
@yaooqinn (Member, Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125008 has finished for PR 28963 at commit d72074a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member, Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125043 has finished for PR 28963 at commit d72074a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member, Author) commented Jul 6, 2020

Kindly ping @juliuszsompolski @cloud-fan, thanks.

@cloud-fan (Contributor):
thanks, merging to master!

@cloud-fan closed this in 59a7087 on Jul 6, 2020.

A review thread on this diff hunk in the SparkOperation trait:

```scala
protected def onError(): PartialFunction[Throwable, Unit] = {
  case e: Throwable =>
    logError(s"Error executing get catalogs operation with $statementId", e)
```
Review comment (Contributor):

This error message still refers to "get catalogs" but is logged for every type of operation.

Reply (Member, Author):

Nice catch! Thanks!
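One plausible shape for the fix (the actual patch landed in #29140 and may word it differently): derive the operation name instead of hard-coding "get catalogs".

```scala
import org.apache.spark.internal.Logging

// Sketch only: use the concrete operation class name in the log message.
trait SparkOperationErrorSketch extends Logging {
  def statementId: String

  protected def onError(): PartialFunction[Throwable, Unit] = {
    case e: Throwable =>
      logError(s"Error executing ${getClass.getSimpleName} with $statementId", e)
  }
}
```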

cloud-fan pushed a commit that referenced this pull request on Jul 17, 2020:
### What changes were proposed in this pull request?

Fix a typo in the error log of the SparkOperation trait, reported in #28963 (comment).

### Why are the changes needed?

Fix an incorrect message in the thrift server driver log.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Passing GitHub Actions.

Closes #29140 from yaooqinn/SPARK-32145-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>