[SPARK-20213][SQL][UI] Fix DataFrameWriter operations in SQL UI tab. #17540
SQLExecution.scala

@@ -21,11 +21,11 @@ import java.util.concurrent.ConcurrentHashMap
 import java.util.concurrent.atomic.AtomicLong

 import org.apache.spark.SparkContext
+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.SparkSession
-import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd,
-  SparkListenerSQLExecutionStart}
+import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}

-object SQLExecution {
+object SQLExecution extends Logging {

   val EXECUTION_ID_KEY = "spark.sql.execution.id"

@@ -39,6 +39,32 @@ object SQLExecution {
     executionIdToQueryExecution.get(executionId)
   }

+  private val testing = sys.props.contains("spark.testing")
+
+  private[sql] def checkSQLExecutionId(sparkSession: SparkSession): Unit = {
+    // only throw an exception during tests. a missing execution ID should not fail a job.
+    if (testing && sparkSession.sparkContext.getLocalProperty(EXECUTION_ID_KEY) == null) {
+      // Attention testers: when a test fails with this exception, it means that the action that
+      // started execution of a query didn't call withNewExecutionId. The execution ID should be
+      // set by calling withNewExecutionId in the action that begins execution, like
+      // Dataset.collect or DataFrameWriter.insertInto.
+      throw new IllegalStateException("Execution ID should be set")
+    }
+  }
+
+  private val ALLOW_NESTED_EXECUTION = "spark.sql.execution.nested"
+
+  private[sql] def nested[T](sparkSession: SparkSession)(body: => T): T = {
+    val sc = sparkSession.sparkContext
+    val allowNestedPreviousValue = sc.getLocalProperty(SQLExecution.ALLOW_NESTED_EXECUTION)
+    try {
+      sc.setLocalProperty(SQLExecution.ALLOW_NESTED_EXECUTION, "true")
+      body
+    } finally {
+      sc.setLocalProperty(SQLExecution.ALLOW_NESTED_EXECUTION, allowNestedPreviousValue)
+    }
+  }
+
   /**
    * Wrap an action that will execute "queryExecution" to track all Spark jobs in the body so that
    * we can connect them with an execution.

@@ -73,21 +99,35 @@
       }
       r
     } else {
-      // Don't support nested `withNewExecutionId`. This is an example of the nested
-      // `withNewExecutionId`:
+      // Nesting `withNewExecutionId` may be incorrect; log a warning.
+      //
+      // This is an example of the nested `withNewExecutionId`:
       //
       // class DataFrame {
+      //   // Note: `collect` will call withNewExecutionId
       //   def foo: T = withNewExecutionId { something.createNewDataFrame().collect() }
       // }
       //
-      // Note: `collect` will call withNewExecutionId
       // In this case, only the "executedPlan" for "collect" will be executed. The "executedPlan"
-      // for the outer DataFrame won't be executed. So it's meaningless to create a new Execution
-      // for the outer DataFrame. Even if we track it, since its "executedPlan" doesn't run,
+      // for the outer Dataset won't be executed. So it's meaningless to create a new Execution
+      // for the outer Dataset. Even if we track it, since its "executedPlan" doesn't run,
       // all accumulator metrics will be 0. It will confuse people if we show them in Web UI.
       //
-      // A real case is the `DataFrame.count` method.
-      throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set")
+      // Some operations will start nested executions. For example, CacheTableCommand will use
+      // Dataset#count to materialize cached records when caching is not lazy. Because there are
+      // legitimate reasons to nest executions in withNewExecutionId, this logs a warning but does
+      // not throw an exception to avoid failing at runtime. Exceptions will be thrown for tests
+      // to ensure that nesting is avoided.
+      //
+      // To avoid this warning, use nested { ... }
+      if (!Option(sc.getLocalProperty(ALLOW_NESTED_EXECUTION)).exists(_.toBoolean)) {
+        if (testing) {
+          throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set: $oldExecutionId")
+        } else {
+          logWarning(s"$EXECUTION_ID_KEY is already set")
+        }
+      }
+      body
     }
   }
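The nested { ... } helper added above is intended for operations that deliberately start a second query from inside an action that already holds an execution ID (the CacheTableCommand case mentioned in the comment). Below is a minimal sketch of how a caller might use it; the object and method names are made up for illustration, and the code is placed in the org.apache.spark.sql.execution package only so the private[sql] method is visible.

```scala
package org.apache.spark.sql.execution

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical command helper: eagerly materialize a cached Dataset. The action that
// invoked this code is assumed to have already set spark.sql.execution.id, so the
// inner count is marked as an intentionally nested execution instead of tripping the
// "already set" warning (or, under spark.testing, the exception).
object EagerCacheExample {
  def materialize(spark: SparkSession, df: Dataset[_]): Long = {
    SQLExecution.nested(spark) {
      df.count()
    }
  }
}
```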
FileFormatWriter.scala

@@ -161,50 +161,51 @@ object FileFormatWriter extends Logging {
       }
     }

-    SQLExecution.withNewExecutionId(sparkSession, queryExecution) {
+    // During tests, make sure there is an execution ID.
+    SQLExecution.checkSQLExecutionId(sparkSession)
+
       // This call shouldn't be put into the `try` block below because it only initializes and
       // prepares the job, any exception thrown from here shouldn't cause abortJob() to be called.
       committer.setupJob(job)

       try {
         val rdd = if (orderingMatched) {
           queryExecution.toRdd
         } else {
           SortExec(
             requiredOrdering.map(SortOrder(_, Ascending)),
             global = false,
             child = queryExecution.executedPlan).execute()
         }
         val ret = new Array[WriteTaskResult](rdd.partitions.length)
         sparkSession.sparkContext.runJob(
           rdd,
           (taskContext: TaskContext, iter: Iterator[InternalRow]) => {
             executeTask(
               description = description,
               sparkStageId = taskContext.stageId(),
               sparkPartitionId = taskContext.partitionId(),
               sparkAttemptNumber = taskContext.attemptNumber(),
               committer,
               iterator = iter)
           },
           0 until rdd.partitions.length,
           (index, res: WriteTaskResult) => {
             committer.onTaskCommit(res.commitMsg)
             ret(index) = res
           })

         val commitMsgs = ret.map(_.commitMsg)
         val updatedPartitions = ret.flatMap(_.updatedPartitions)
           .distinct.map(PartitioningUtils.parsePathFragment)

         committer.commitJob(job, commitMsgs)
         logInfo(s"Job ${job.getJobID} committed.")
         refreshFunction(updatedPartitions)
       } catch { case cause: Throwable =>
         logError(s"Aborting job ${job.getJobID}.", cause)
         committer.abortJob(job)
         throw new SparkException("Job aborted.", cause)
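checkSQLExecutionId only verifies that someone upstream already set the execution ID; the substance of the PR is that the action triggering the write (for example DataFrameWriter.insertInto) wraps the work in withNewExecutionId. A rough sketch of that calling pattern follows; the helper name runWriteAction is made up and only stands in for the real DataFrameWriter change.

```scala
package org.apache.spark.sql.execution

import org.apache.spark.sql.SparkSession

object RunWriteActionExample {
  // Illustrative action wrapper: establish the execution ID before any write work starts,
  // so checkSQLExecutionId passes and every job launched by the write (including the
  // FileFormatWriter job above) is attributed to a single execution in the SQL UI.
  def runWriteAction[T](spark: SparkSession, qe: QueryExecution)(writeBody: => T): T = {
    SQLExecution.withNewExecutionId(spark, qe) {
      writeBody
    }
  }
}
```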
|
The review comments below were left on this changed line:

    queryExecution = Dataset.ofRows(sparkSession, query).queryExecution,
To make SQL metrics work, we should always wrap the correct QueryExecution with SparkListenerSQLExecutionStart.
@zsxwing, that and similar cases are what I was talking about earlier when I said there are two physical plans. The inner Dataset.ofRows ends up creating a completely separate plan.
Are you saying that adding SparkListenerSQLExecutionStart (and also end) events will fix the metrics problem? I think it would at least require the metrics work-around I added to SQLListener, since metrics are filtered out if they aren't reported by the physical plan.
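For reference, withNewExecutionId is the piece that posts those events: a SparkListenerSQLExecutionStart carrying the physical plan whose metrics the UI will display, and a matching SparkListenerSQLExecutionEnd when the body finishes. The sketch below is a simplified, non-authoritative rendering of that shape; the description and details values are placeholders, and the real implementation lives in SQLExecution.scala.

```scala
package org.apache.spark.sql.execution

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}

object ExecutionEventsSketch {
  // Post start/end events around a body, the way withNewExecutionId does. The start event
  // carries the plan whose SQLMetrics the UI will show, which is why the QueryExecution
  // passed here must be the one that actually runs.
  def track[T](spark: SparkSession, executionId: Long, qe: QueryExecution)(body: => T): T = {
    val bus = spark.sparkContext.listenerBus
    bus.post(SparkListenerSQLExecutionStart(
      executionId,
      qe.logical.toString,               // description (placeholder; the real code uses the call site)
      qe.toString,                       // details (placeholder)
      qe.executedPlan.toString,          // physical plan description
      SparkPlanInfo.fromSparkPlan(qe.executedPlan),
      System.currentTimeMillis()))
    try {
      body
    } finally {
      bus.post(SparkListenerSQLExecutionEnd(executionId, System.currentTimeMillis()))
    }
  }
}
```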
How about LocalRelation(c.output, withAction("collect", queryExecution)(_.executeCollect()))?

Actually, do we need to do this? Most Commands are just local operations (talking with the metastore).
Yeah, the check I added to ensure we get the same results in the SQL tab has several hundred failures that go through this. Looks like the path is almost always spark.sql when the SQL statement is a command like CTAS. I like your version and will update.
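The LocalRelation suggestion above amounts to running a command's collect through the same kind of action wrapper that Dataset uses, so commands reached via spark.sql (CTAS, for example) also get an execution ID and appear once in the SQL tab. The stand-alone sketch below only mirrors that intent; the real withAction is a private helper inside Dataset.scala, and the object and method names here are hypothetical.

```scala
package org.apache.spark.sql.execution

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

object CommandActionSketch {
  // Run a command's physical plan under its own execution ID and wrap the collected rows
  // back into a LocalRelation, roughly what the suggested
  // LocalRelation(c.output, withAction("collect", queryExecution)(_.executeCollect())) does.
  def runCommand(spark: SparkSession, qe: QueryExecution): LocalRelation = {
    val rows = SQLExecution.withNewExecutionId(spark, qe) {
      qe.executedPlan.executeCollect()
    }
    LocalRelation(qe.analyzed.output, rows)
  }
}
```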