[SPARK-32145][SQL][test-hive1.2][test-hadoop2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message #28963
Conversation
cc @juliuszsompolski @cloud-fan @maropu @wangyum @LantaoJin thanks
Test build #124751 has finished for PR 28963 at commit
Two things to know:
1. I could not reproduce the case you provided in that PR; the query didn't fail. But to this question, the answer is obviously yes.
2. Personally, I don't think we need an extra configuration for this, because we use Hive JDBC to talk to the Spark thrift server. For the default Hive 2.3.7 we use, the behavior is consistent.
Test build #124755 has finished for PR 28963 at commit
Test build #124758 has finished for PR 28963 at commit
@LantaoJin changed it the other way around in #25960, to display the actual error in AQE. I think there are more cases where the more useful internal error gets obscured, e.g. when a task fails you would get the actual reason for failure from the root cause, as opposed to just a "Task failed 4 times" RuntimeException. Maybe we should do it in general, but then whitelist a few cases that should use the root cause.
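For illustration, here is a minimal sketch (not the PR's code) of the two extraction strategies being debated, using the Commons Lang helper that resolves the bottom of a cause chain:

```scala
import org.apache.commons.lang3.exception.ExceptionUtils

object ErrorPickDemo extends App {
  // A wrapped failure, similar to what the scheduler produces on task failure.
  val root = new ArithmeticException("/ by zero")
  val wrapped = new RuntimeException("Task failed 4 times; aborting job", root)

  // Top-level exception: the useful root cause is buried.
  println(wrapped.getMessage)
  // Root cause only (the #25960 direction): the wrapper's context is lost instead.
  println(ExceptionUtils.getRootCause(wrapped).getMessage)
}
```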
cc @alismess-db
I guess the current approach is able to uncover the real reason for task failures by printing all suppressed exceptions, if any. Too much important information is missing, either before or after #25960, when an error occurs.
I still think we should add a configuration if we need to do this, since many Spark users still use Hive 1.x. Actually, Hive changed its behavior, but Spark users should be able to choose the results they want. In my company, most users access the Spark thrift server from JDBC, and most of them don't want a long stack trace instead of the root cause. So providing a configuration to show either the full stack or only the root cause may be more friendly.
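A sketch of the kind of flag being proposed, written with Spark's internal ConfigBuilder API; the config name and default here are hypothetical and not part of this PR:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry: let users choose between the full stack trace
// and only the root-cause message in JDBC error responses.
val THRIFT_SERVER_FULL_STACKTRACE =
  ConfigBuilder("spark.sql.thriftServer.jdbc.fullStacktrace") // hypothetical name
    .doc("When true, deliver the full exception stack trace to JDBC clients; " +
      "when false, deliver only the root cause message.")
    .booleanConf
    .createWithDefault(true)
```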
You'd better test it before drawing a conclusion. In the description of #25960, I gave a reproduction approach; I don't know why it doesn't work for you.
Thanks for your explanation, sgtm
AQE exceptions should have nothing special now; see 45864fa
What's the behavior before and after this PR? I thought this PR was only to pick the actual exception, not to add or remove stack traces.
Yes, we do not change the exception itself; we only pick what to deliver to the JDBC-client side through error responses. Before this PR and w/ SPARK-29283, only the error message of the root cause is delivered, so many intentionally made exceptions are omitted (e.g. the AnalysisException in the example below). WDYT about adding a configuration to control this behavior, @cloud-fan? Personally, I'm -1 on it, because the change is trivial, not really a behavior change, and it is consistent with Hive currently.
Our users (data analysts and scientists) on JDBC/ODBC don't want to debug anything from the JDBC error message (we can still get the full error stack from the driver log, right?). Like other DBs, the endpoint error message should be straightforward. In most cases the root-cause message is sufficient for users, but I am open to the configuration; we will add it in our internal Spark. Anyway, adding a configuration and keeping the default value to show either the full stack or only the root cause is not hard.
Can you be more specific? I'm not a fan of hiding disagreements behind a config. Let's reach a consensus about what error we should deliver to end-users of the thriftserver.
here is an example

w/o PR:

```
kentyao@hulk ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200620 bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: Error running query: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2 (state=,code=0)
```

w/ PR:

```
kentyao@hulk ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200630 bin/beeline -u 'jdbc:hive2://localhost:10000/default;a=bc;'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connecting to jdbc:hive2://localhost:10000/default;a=bc;
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.7 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select date_sub(date'2011-11-11', '1.2');
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:322)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.$anonfun$run$1(SparkExecuteStatementOperation.scala:222)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:46)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:222)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:233)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1097)
....
Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
at org.apache.spark.unsafe.types.UTF8String.toIntExact(UTF8String.java:1335)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2(Cast.scala:515)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$2$adapted(Cast.scala:515)
at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$1(Cast.scala:515)
at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:824)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:475)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringLiteralCoercion$$anonfun$coerceTypes$14.applyOrElse(TypeCoercion.scala:1094)
... 96 more (state=,code=0)
```

You can see in the logs above that w/o this PR, the intentionally made AnalysisException is hidden and the JDBC client only gets the obscure root-cause message.
But we do add a lot more stack trace. Is it possible to hide the stack trace? BTW, have we been hiding stack traces since 3.0?
For debugging purposes, I would be happy to get more stack trace rather than less, and to get all the causes. Did something else change about how these full stack traces are displayed?
If we hid the stack trace by mistake, I'm +1 on this PR.
Compared to digging through the whole driver log of a long-running thrift server to find an error trace that may be missing here, I'd say it is much easier to print a little bit more here.
Not really; the difference between 2.x and 3.0 is only whether the error message part of the exception is taken from the top or the bottom of the cause chain. This PR introduces the stack trace to match the behavior of Hive.
Nope
```scala
    statementId, e.getMessage, SparkUtils.exceptionString(e))
e match {
  case hiveException: HiveSQLException =>
    HiveThriftServer2.eventManager.onStatementError(
```
now the onStatementError is never called?
Personally, I'd like to see stack traces, as I'm a developer. But I agree with @LantaoJin that they may not be friendly to end-users. If the thriftserver has never shown stack traces so far, I think we should at least have a dedicated PR to discuss it. Can we focus on getting the actual exception in this PR?
The thing is that the actual exception here is hard to define, as we capture all the causes. For this case we could deliver only the messages of the cause chain, e.g.

```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

otherwise, you get a full stack trace like #28963 (comment)
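For reference, a minimal sketch (not this PR's code) of how rendering only the messages of the cause chain could be implemented:

```scala
// Sketch only: walk the cause chain and keep each exception's class and
// message, producing the short two-line form quoted above.
// Note: does not guard against cyclic cause chains.
def causeChainMessages(t: Throwable): String = {
  val sb = new StringBuilder(s"${t.getClass.getName}: ${t.getMessage}")
  var cause = t.getCause
  while (cause != null) {
    sb.append(s"\nCaused by: ${cause.getClass.getName}: ${cause.getMessage}")
    cause = cause.getCause
  }
  sb.toString
}
```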
SGTM.
I would be +1 for printing as much of the exception as possible.
There are user-facing errors and unexpected internal errors. For example, if a user makes a mistake in the SQL query, we should tell them what the error is (like a table not found), and a stack trace is not useful in this case. But we do need the stack trace for unexpected internal errors, to help us debug. That said, adding a config is not the best solution. We need an error reporting framework that hides the stack trace for user-facing errors and keeps it for others. How about we keep this PR as it is, so that the thriftserver is consistent with the sql-shell and Spark applications, and work on the error reporting framework later? cc @gatorsmile
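A rough sketch of the distinction described here; the classification below is illustrative only, not an actual framework:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.util.{Utils => SparkUtils}

// Illustrative sketch: route user-facing errors to a plain message and
// unexpected internal errors to a full stack trace.
def renderForClient(e: Throwable): String = e match {
  case ae: AnalysisException =>
    // A user mistake (e.g. table not found): the message alone is actionable.
    ae.getMessage
  case other =>
    // An unexpected internal error: keep the full trace for debugging.
    SparkUtils.exceptionString(other)
}
```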
Yes, for such a case the stack trace is useless, but it will also be very short. So +1 for keeping this PR as it is.
retest this please |
Test build #124938 has finished for PR 28963 at commit
retest this please |
Test build #125008 has finished for PR 28963 at commit
retest this please |
Test build #125043 has finished for PR 28963 at commit
kindly ping @juliuszsompolski @cloud-fan thanks
thanks, merging to master!
```scala
protected def onError(): PartialFunction[Throwable, Unit] = {
  case e: Throwable =>
    logError(s"Error executing get catalogs operation with $statementId", e)
```
This error message still refers to "get catalogs" but is logged for every type of operation.
nice catch! thanks!
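One possible shape of the fix (the actual follow-up is #29140; `getType` here assumes the Hive Operation.getType accessor is available on the class):

```scala
// Sketch: make the log line reflect the actual operation type instead of
// hard-coding "get catalogs".
protected def onError(): PartialFunction[Throwable, Unit] = {
  case e: Throwable =>
    logError(s"Error operating $getType with $statementId", e)
}
```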
### What changes were proposed in this pull request?
Fix a typo in the error log of the SparkOperation trait, reported by #28963 (comment).

### Why are the changes needed?
Fix an error in the thrift server driver log.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passing GitHub Actions.

Closes #29140 from yaooqinn/SPARK-32145-F.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/SPARK-29283, we only show the error message of the root cause to end-users through the JDBC client. In some cases, this erases the straightforward messages that we intentionally craft to help them understand better.
The root cause is somehow obscure for JDBC end-users who only write SQL queries.
e.g. "Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;" together with its cause is better than just "Error running query: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2".
We should do as Hive does in https://issues.apache.org/jira/browse/HIVE-14368
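The gist of the ported change, sketched in Scala for readability (ThriftCLIService itself is Java); the names follow the Hive service code, but treat this as an approximation rather than the exact diff:

```scala
import org.apache.hadoop.util.{StringUtils => HadoopStringUtils}

// Approximation of the HIVE-14368 change in ThriftCLIService.GetOperationStatus:
// stringify the whole exception (message, stack trace, and causes) into the
// Thrift response, instead of only opException.getMessage.
val opException = operationStatus.getOperationException
if (opException != null) {
  resp.setSqlState(opException.getSQLState)
  resp.setErrorCode(opException.getErrorCode)
  resp.setErrorMessage(HadoopStringUtils.stringifyException(opException))
}
```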
In general, this PR partially reverts SPARK-29283, ports HIVE-14368, and improves test coverage
Why are the changes needed?
To make the thrift server's error reporting consistent with the spark-sql script.
Does this PR introduce any user-facing change?
Yes. When running queries using the thrift server and an error occurs, you will get the full stack trace instead of only the message of the root cause.
How was this patch tested?
Add unit tests.
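A hedged sketch of what such a test can assert; `withJdbcStatement` is assumed from the thriftserver test harness, and the exact test added by this PR may differ:

```scala
// Sketch: the JDBC error message should now carry the intentionally made
// AnalysisException, not only the root-cause NumberFormatException message.
test("SPARK-32145: JDBC error messages keep the full cause chain") {
  withJdbcStatement() { statement =>
    val e = intercept[java.sql.SQLException] {
      statement.executeQuery("select date_sub(date'2011-11-11', '1.2')")
    }
    assert(e.getMessage.contains("AnalysisException"))
    assert(e.getMessage.contains("java.lang.NumberFormatException"))
  }
}
```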