
Conversation

@rdblue
Contributor

@rdblue rdblue commented Apr 5, 2017

What changes were proposed in this pull request?

Wraps DataFrameWriter operations in SQLExecution.withNewExecutionId so that SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd are sent and the query shows up in the SQL tab of the UI.

How was this patch tested?

Tested by hand that insertInto results in queries in the SQL tab.

@srowen
Member

srowen commented Apr 5, 2017

Looks closely related to #17535?

@rdblue
Contributor Author

rdblue commented Apr 5, 2017

@srowen, agreed. Closely related but not the same code paths. The question is: when should withNewExecutionId get called?

I'm running the test suite now, and this patch causes test failures when withNewExecutionId is called twice: once in DataFrameWriter and once in InsertIntoHadoopFsRelationCommand. It looks like the call has been littered about the codebase (e.g. in InsertIntoHadoopFsRelationCommand and other execution nodes) to fix this problem for certain operations, so we should decide where it should be used and fix tests around that.

The reason why I added it to DataFrameWriter is that it is called in Dataset actions, and it makes sense to call it once from where an action is started. I think it makes the most sense for action methods, like Dataset#collect or DataFrameWriter#insertInto to minimize the number of places we need to add it. I don't think this is a concern that should be addressed by the execution plan.
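A minimal sketch of the semantics under discussion, in plain Scala rather than Spark's actual SQLExecution (the object name and thread-local bookkeeping here are assumptions for illustration): the wrapper sets an execution id for the duration of an action, and a second call on the same thread is exactly the double-call case that broke tests.

```scala
import java.util.concurrent.atomic.AtomicLong

object ExecutionIdModel {
  private val nextId = new AtomicLong(0)
  private val currentId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }

  // Runs `body` with a fresh execution id; refuses to nest, which models the
  // failure when both DataFrameWriter and InsertIntoHadoopFsRelationCommand
  // wrap the same query.
  def withNewExecutionId[T](body: => T): T = {
    require(currentId.get.isEmpty, "execution id is already set")
    currentId.set(Some(nextId.getAndIncrement()))
    try body finally currentId.set(None)
  }

  def current: Option[Long] = currentId.get
}
```

Calling this once at the action entry point (and nowhere deeper in the plan) is the policy rdblue is arguing for.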

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75549 has finished for PR 17540 at commit f9342b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 6, 2017

@cloud-fan, can you look at this? What do you think about the question above: when should withNewExecutionId get called?

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75579 has finished for PR 17540 at commit a3296a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 6, 2017

The new test failures are caused by a check I inserted. Moving where withNewExecutionId gets called could result in missing SQL queries in the UI, so anywhere I've replaced withNewExecutionId with checkSQLExecutionId, a missing execution id will cause tests (and only tests) to fail. That way, we can catch all of the call stacks that should have it.

This caught problems in SQL command execution and I've added a patch to fix it.
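The described check can be sketched as follows (a hypothetical standalone version, not the actual patch; the real check would read the execution id from Spark's job properties): it never starts a new execution itself, only verifies one is already active, and fails hard only when testing.

```scala
// Hypothetical sketch of the check described above, not the actual patch.
def checkSQLExecutionId(activeId: Option[Long], isTesting: Boolean): Unit = {
  if (activeId.isEmpty) {
    if (isTesting) {
      // In tests, fail loudly so every call stack missing the id is caught.
      throw new IllegalStateException("no execution id set for this query")
    } else {
      // In production the query would simply not show up in the SQL tab.
      System.err.println("WARN: query will not appear in the SQL tab")
    }
  }
}
```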

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75581 has finished for PR 17540 at commit 429edfb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The withNewExecutionId was added at rxin@1b0317f#diff-89b9796aae086e790ddd9351f0db8115R134.

The execution id is used to track all jobs that belong to the same query, so I think it makes sense to call withNewExecutionId at action methods like Dataset#collect or DataFrameWriter#insertInto.

Contributor

@cloud-fan cloud-fan Apr 8, 2017


how about LocalRelation(c.output, withAction("collect", queryExecution)(_.executeCollect()))
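The suggestion reads as a per-action wrapper around the query execution. A toy sketch of that shape (the recording list and object name are illustrative assumptions, not Dataset's private API):

```scala
object ActionModel {
  // Records which actions ran, standing in for SQLExecution.withNewExecutionId
  // plus the SparkListenerSQLExecutionStart/End events it would emit.
  private var started: List[String] = Nil

  // Wraps an action so execution tracking happens exactly once, at the entry
  // point, no matter what the physical plan underneath does.
  def withAction[A, T](name: String, queryExecution: A)(action: A => T): T = {
    started = name :: started
    action(queryExecution)
  }

  def startedActions: List[String] = started
}
```

With this shape, every Dataset action funnels through one wrapper, which is the "minimize the number of places we need to add it" goal from the earlier comment.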

Contributor


Actually, do we need to do this? Most Commands are just local operations (talking with the metastore).

Contributor Author


Yeah, the check I added to ensure we get the same results in the SQL tab has several hundred failures that go through this. Looks like the path is almost always spark.sql when the SQL statement is a command like CTAS.

I like your version and will update.

@cloud-fan
Contributor

LGTM. @rdblue, the failed tests are thrift server tests, which are hard to debug. You can run the Hive tests locally and see what failed (usually failing thrift server tests mean we have failing Hive tests).

@rdblue
Contributor Author

rdblue commented Apr 8, 2017

Thanks for the review! I'll get the thrift-server tests fixed up next week.

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75675 has finished for PR 17540 at commit 7cb7d4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2017

Test build #75676 has finished for PR 17540 at commit 68cc2b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2017

Test build #75677 has finished for PR 17540 at commit ce0dbe7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2017

Test build #75681 has finished for PR 17540 at commit 7910825.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75748 has finished for PR 17540 at commit 2cee6b7.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 2cee6b7 to 4f3a02b Compare April 12, 2017 22:21
@SparkQA

SparkQA commented Apr 13, 2017

Test build #75749 has finished for PR 17540 at commit 4f3a02b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch 2 times, most recently from cfe4e2c to 36da73b Compare April 13, 2017 22:56
@SparkQA

SparkQA commented Apr 14, 2017

Test build #75780 has finished for PR 17540 at commit cfe4e2c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75781 has finished for PR 17540 at commit 36da73b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75786 has finished for PR 17540 at commit a822309.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75813 has finished for PR 17540 at commit 30fa4fc.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 14, 2017

@cloud-fan, could you have another look at this?

There are a few new changes:

  • withNewExecutionId now warns instead of throwing an exception, but still throws if spark.testing is defined
  • SQLExecution.nested allows nested execution IDs without test failures or warnings. This is needed because several places will nest when withNewExecutionId is called at the high-level operations; CacheTableCommand is an example.

Over the last week, I've fixed nearly all of the tests. The remaining failure, SQLExecutionSuite.concurrent query execution (SPARK-10548), is fixed in maven, but fails in SBT. The problem is that exceptions are now only thrown if spark.testing is defined, and for some reason adding it to the test's SparkSession or SparkContext doesn't work on Jenkins. Because this test is reproducing a case that now will never happen for two reasons (the original multi-threading fix and throw only if spark.testing), I'd like to simply remove it. Let me know what you think about that.
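A compact model of the two behaviors just described, in plain Scala (the spark.testing switch is modeled as a boolean parameter and SQLExecution.nested as a thread-local flag; all names are assumptions for illustration, not the patch itself):

```scala
import java.util.concurrent.atomic.AtomicLong

object NestedExecutionModel {
  private val nextId = new AtomicLong(0)
  private val currentId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  private val nestingAllowed = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  // Unexpected nesting warns in production but throws when `testing` is true,
  // mirroring the spark.testing behavior described above.
  def withNewExecutionId[T](testing: Boolean)(body: => T): T = {
    if (currentId.get.isDefined && !nestingAllowed.get) {
      if (testing) throw new IllegalStateException("unexpected nested execution id")
      else System.err.println("WARN: unexpected nested execution id")
    }
    val saved = currentId.get
    currentId.set(Some(nextId.getAndIncrement()))
    try body finally currentId.set(saved)
  }

  // Opts in to nesting, as commands like CacheTableCommand would need.
  def nested[T](body: => T): T = {
    val saved = nestingAllowed.get
    nestingAllowed.set(true)
    try body finally nestingAllowed.set(saved)
  }
}
```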

Other changes to look at:

  • SQLMetricsSuite.save metrics started failing because there is a nested execution ID. This is because there are two SQL physical plans. The first, ExecutedCommandExec, links in a logical plan that is turned into a second physical plan at runtime. This means that the inner plan can't report the metrics that will be collected when analyzing the outer plan, because the outer plan doesn't exist yet. The long-term solution is to fix ExecutedCommandExec, but for now this accepts any metrics created by the inner plan.
  • StreamExecution wasn't calling withNewExecutionId and was caught by the new assertion. I added the call around the entire execution so that there isn't a new SQL execution for every batch. This required creating a special queryExecution to pass in.
  • DataFrameCallbackSuite had to be updated to include commands that were previously not registered in the SQL tab. The new SQL executions are for dropping tables, so the result looks more correct than before.

Contributor


This is only called in FileFormatWriter; are there any other places we need to consider?

Contributor Author


To keep this PR from growing too big, I want to just use it where I've removed withNewExecutionId to check for regressions. I'll follow up with another PR with more checks.

@cloud-fan
Contributor

Yeah, let's remove that test.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 30fa4fc to 901cec8 Compare April 17, 2017 16:23
@rdblue
Contributor Author

rdblue commented Apr 17, 2017

Removed the failing SPARK-10548 test and rebased.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 1ce1a81 to bd324e6 Compare May 2, 2017 16:27
@SparkQA

SparkQA commented May 2, 2017

Test build #76388 has finished for PR 17540 at commit bd324e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from bd324e6 to 7131c32 Compare May 2, 2017 21:34
@SparkQA

SparkQA commented May 2, 2017

Test build #76395 has finished for PR 17540 at commit 7131c32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 7131c32 to 69ed59e Compare May 2, 2017 23:23
@SparkQA

SparkQA commented May 3, 2017

Test build #76396 has finished for PR 17540 at commit 69ed59e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 69ed59e to 4db4fc9 Compare May 3, 2017 16:47
@SparkQA

SparkQA commented May 3, 2017

Test build #76422 has finished for PR 17540 at commit 4db4fc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-20213-fix-sql-tab branch from 4db4fc9 to f63b773 Compare May 3, 2017 23:16
@zsxwing
Member

zsxwing commented May 3, 2017

@rdblue I just tested this PR and found that I could not see any SQL metrics on the Web UI. This is pretty important for many users to analyze their queries.

What's your plan to fix it? As far as I understand, you want to show the parameters of write actions, such as InsertIntoHadoopFsRelationCommand(file://..., Parquet, ...). However, that's not the running QueryExecution. Right now we just attach SQL metrics to the DAG on the UI. After your change, the DAG shown on the UI is not the running QueryExecution, and I don't know how to show SQL metrics on the wrong DAG. Is it even possible to fix this in a follow-up PR quickly?

@rdblue
Contributor Author

rdblue commented May 4, 2017

@zsxwing, there should be a fix for the metrics without waiting for all of the bad plans to be fixed (which is to basically eliminate the use of ExecutedCommandExec).

The metrics are missing because ExecutedCommandExec doesn't report them via metrics. So we need to update it to get metrics from the command that is run. That requires breaking the command into two phases, one to get a SparkPlan and one to run it. Right now, that all happens at once. This shouldn't be too difficult as a follow-up, but will be a substantial number of changes so I think this PR should be independent.
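The two-phase split described here could look roughly like this (all names are invented for illustration; per the discussion, the real RunnableCommand.run does both phases at once inside ExecutedCommandExec):

```scala
// Phase 1 output: a "plan" whose metric slots exist before anything runs, so
// the UI could register them up front.
case class PlannedCommand(
    description: String,
    metricNames: Seq[String],
    body: () => Map[String, Long])

// Phase 1: build the plan without executing it.
def planInsert(path: String, rows: Seq[Int]): PlannedCommand =
  PlannedCommand(s"InsertInto($path)", Seq("numOutputRows"),
    () => Map("numOutputRows" -> rows.length.toLong))

// Phase 2: run it; the produced values attach to the metric slots that were
// registered in phase 1.
def execute(cmd: PlannedCommand): Map[String, Long] = cmd.body()
```

The point of the split is that the metric names are known before execution, which is what the UI needs in order to display them while the query runs.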

@zsxwing
Member

zsxwing commented May 4, 2017

So we need to update it to get metrics from the command that is run. That requires breaking the command into two phases, one to get a SparkPlan and one to run it.

Yeah, but how do we show metrics from one plan on another plan's DAG, considering the two plans could be different?

@rdblue
Contributor Author

rdblue commented May 4, 2017

@zsxwing, I don't know. Sounds like we should fix the underlying problem that there are 2 physical plans.

@zsxwing
Member

zsxwing commented May 4, 2017

@zsxwing, I don't know. Sounds like we should fix the underlying problem that there are 2 physical plans.

SQL metrics won't work without fixing it. IMO, that's more serious than the problem you are fixing.

@rdblue
Contributor Author

rdblue commented May 4, 2017

@zsxwing, you don't think there's a way to fix metrics? I don't know exactly how to fix the UI to show two plans' worth of metrics, but it seems like it can be done. What about updating ExecutedCommandExec to also report the plan from its child command?

Having two physical plans is a pretty bad problem for a SQL engine to have. If the work-around is to ignore that some parts of the UI don't work, I don't think that's a good plan. Sure, this is going to be a short-term regression for metrics, but what is the alternative to fix the underlying problem?

@zsxwing
Member

zsxwing commented May 4, 2017

I was not saying there is no way to fix metrics, just asking your thoughts. If we don't have a concrete plan, it might become a long-term regression if we just merge this PR.

I just want to ensure that it's going to be fixed soon after merging this PR.

@rdblue
Contributor Author

rdblue commented May 4, 2017

I'm not an expert on the metrics path, but I think we should be able to join up the actual physical plans well enough to display everything. I doubt it will be a long-term regression, but I don't think the fix is small enough that we should include it here. I also think that it is important for this to go in so we can fix the problem. Otherwise, I think it is too easy to ignore it. Not having queries show up in the SQL tab is just as bad as the metrics issue, so I think we're trading up.

@SparkQA

SparkQA commented May 4, 2017

Test build #76431 has finished for PR 17540 at commit f63b773.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

How about we make InsertIntoHadoopFsRelation and InsertIntoHiveTable physical plans instead of RunnableCommands? Someone also asked for this refactor in https://issues.apache.org/jira/browse/SPARK-19256

@rdblue
Contributor Author

rdblue commented May 4, 2017

@cloud-fan, all of the RunnableCommand instances that are currently run through ExecutedCommandExec need to be fixed so that there is only one physical plan. But the scope of those changes is larger than what's needed here. And as @zsxwing notes, the next step is probably to fix metrics for those broken plans.

@rdblue
Contributor Author

rdblue commented May 5, 2017

@cloud-fan, @zsxwing, tests are passing now. Should we commit this so we can start fixing the metrics?

@zsxwing
Member

zsxwing commented May 5, 2017

I don't think we need to rush. As far as I can tell, this PR breaks two things:

  • SQL metrics on the Web UI are broken.
  • It doesn't display the batch queries inside a Structured Streaming query. Right now it always shows one SQL query.

What benefits do we gain from this PR that are worth breaking the above things?

@zsxwing
Member

zsxwing commented May 5, 2017

I suggest that you just fix them in this PR. If it has to be a large PR, I'm okay with that.

currentBatchId,
offsetSeqMetadata)

SQLExecution.withNewExecutionId(sparkSessionToRunBatches, genericStreamExecution) {
Contributor


I think we should have an execution id for each streaming batch.

@cloud-fan
Contributor

cloud-fan commented May 8, 2017

Hi @rdblue, I have sent you a PR (rdblue#1) to fix the missing metrics issue of ExecutedCommandExec. We also need to fix the streaming batch metrics issue, then we are ready to go.

@asfgit asfgit closed this in 10e526e May 31, 2017
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
Currently the `DataFrameWriter` operations have several problems:

1. non-file-format data source writing action doesn't show up in the SQL tab in Spark UI
2. file-format data source writing action shows a scan node in the SQL tab, without saying anything about writing. (Streaming also has this issue, but it is not fixed in this PR.)
3. Spark SQL CLI actions don't show up in the SQL tab.

This PR fixes all of them, by refactoring `ExecutedCommandExec` to make it have children.

Closes apache#17540

Existing tests.

Also test the UI manually. For a simple command: `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`

before this PR:
<img width="266" alt="qq20170523-035840 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326050/24e18ba2-3f6c-11e7-8817-6dd275bf6ac5.png">
after this PR:
<img width="287" alt="qq20170523-035708 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326054/2ad7f460-3f6c-11e7-8053-d68325beb28f.png">

Author: Wenchen Fan <[email protected]>

Closes apache#18064 from cloud-fan/execution.

This also includes the following commits:
0795c16 introduce SQLExecution.ignoreNestedExecutionId
cd6e3f0 address comments