@vanzin
Owner

vanzin commented May 30, 2017

There are two main changes to speed up rendering of the tasks list
when rendering the stage page.

The first one makes the code only load the tasks being shown in the
current page of the tasks table, and information related to only
those tasks. One side-effect of this change is that the graph that
shows task-related events now only shows events for the tasks in
the current page, instead of the previously hardcoded limit of "events
for the first 1000 tasks". That ends up helping with readability,
though.

To make sorting efficient when using a disk store, the task wrapper
was extended to include many new indices, one for each of the sortable
columns in the UI, and metrics for which quantiles are calculated.
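As a rough illustration of why a per-column index helps, here is a minimal Python sketch (not the actual Spark code; `Task`, `TaskIndex`, and `page` are hypothetical names). It keeps one ordered list of (sort key, task id) pairs and slices out a single page, so only the tasks on that page are materialized.

```python
from bisect import insort
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    duration_ms: int


class TaskIndex:
    """Hypothetical index: keeps (sort key, task id) pairs in sorted order."""

    def __init__(self, key):
        self._key = key
        self._entries = []  # always sorted by (key value, task id)
        self._tasks = {}    # task id -> task, standing in for the store

    def add(self, task):
        # Insert in order so no full sort is needed at read time.
        insort(self._entries, (self._key(task), task.task_id))
        self._tasks[task.task_id] = task

    def page(self, page, page_size):
        # Pages are 1-based; only the tasks in the slice are looked up.
        start = (page - 1) * page_size
        ids = [tid for _, tid in self._entries[start:start + page_size]]
        return [self._tasks[tid] for tid in ids]


by_duration = TaskIndex(key=lambda t: t.duration_ms)
for task_id, duration in enumerate([50, 10, 40, 20, 30]):
    by_duration.add(Task(task_id, duration))

second_page = by_duration.page(page=2, page_size=2)
```

In a disk-backed store the equivalent slice would presumably be an index range scan, which is where the savings over loading every task would come from.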

The second changes the way metric quantiles are calculated for stages.
Instead of using the "Distribution" class to process data for all task
metrics, which requires scanning all tasks of a stage, the code now
uses the KVStore "skip()" functionality to only read tasks that contain
interesting information for the quantiles that are desired.

This is still not cheap; because there are many metrics that the UI
and API track, the code needs to scan the index for each metric to
gather the information. Savings come mainly from skipping deserialization
when using the disk store, but the in-memory code also seems to be
faster than before (most probably because of other changes in this
patch).
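The skip-based idea can be sketched in Python as follows. This is an illustration under assumptions, not the patch itself: `SortedIndex`, `read_at`, and the position formula `min(int(q * count), count - 1)` are hypothetical stand-ins for a sorted KVStore index and its "skip()" call. The point is that each quantile costs one read instead of a scan over every task.

```python
class SortedIndex:
    """Hypothetical stand-in for a store index sorted by one metric."""

    def __init__(self, values):
        self._values = sorted(values)
        self.reads = 0  # counts how many entries were actually read

    def __len__(self):
        return len(self._values)

    def read_at(self, position):
        # Models "skip `position` entries, then read the next one".
        self.reads += 1
        return self._values[position]


def quantiles_by_skip(index, quantiles):
    """Read one entry per requested quantile from a sorted index."""
    count = len(index)
    if count == 0:
        return []
    # Assumed mapping from a quantile in [0, 1] to a position in the index.
    positions = [min(int(q * count), count - 1) for q in quantiles]
    return [index.read_at(p) for p in positions]


durations = SortedIndex(range(100, 0, -1))  # 100 task durations, unsorted input
summary = quantiles_by_skip(durations, [0.0, 0.25, 0.5, 0.75, 1.0])
```

With several metrics this still means one pass over the quantile positions per metric index, which matches the "still not cheap" caveat above.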

With the above changes, a lot of code in the UI layer could be simplified.

@libratiger

Is this branch stable enough now?

@libratiger

I just ran the unit tests and found that some of them failed:

stage task summary w shuffle write
stage task summary w shuffle read
stage task list w/ sortBy
stage task list w/ sortBy short names
job progress bars / cells reflect skipped stages

@vanzin
Owner Author

vanzin commented Jun 6, 2017

@djvulee there's a couple of things I need to fix in this last patch... if you just reset the branch to the previous commit things should be more stable.

@vanzin
Owner Author

vanzin commented Jun 6, 2017

Unit tests should be fixed in this patch too, now.

@libratiger

libratiger commented Jun 7, 2017

OK, thanks! I found that the current branch does not handle failed stages well; it produces the following error:

java.lang.IndexOutOfBoundsException: Page 1 is out of range. Please select a page number between 1 and 0.
at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:56)
at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:108)
at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:702)
at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:295)
at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:284)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:88)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at org.apache.spark.deploy.history.ApplicationCacheCheckFilter.doFilter(ApplicationCache.scala:437)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
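The "between 1 and 0" in the message suggests the task table had zero rows when page 1 was requested. A minimal Python sketch of that failure mode, assuming the page count is computed by ceiling division (a guess about how `PagedDataSource.pageData` works, with hypothetical function names):

```python
def total_pages(data_size, page_size):
    # Ceiling division: zero rows give zero pages.
    return (data_size + page_size - 1) // page_size


def page_bounds(page, data_size, page_size):
    """Return the [start, end) row range for a 1-based page, or raise."""
    pages = total_pages(data_size, page_size)
    if page <= 0 or page > pages:
        raise IndexError(
            f"Page {page} is out of range. "
            f"Please select a page number between 1 and {pages}."
        )
    start = (page - 1) * page_size
    return start, min(start + page_size, data_size)


try:
    page_bounds(page=1, data_size=0, page_size=100)
except IndexError as error:
    message = str(error)
```

Under that assumption, any stage whose task list comes back empty would fail the check even for the default first page.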

@libratiger

Another issue is that the SQL tab page throws a NullPointerException (M8 branch).

@vanzin
Owner Author

vanzin commented Jun 7, 2017

@djvulee do you have some code that can reproduce the failed stage you're having trouble with? I can't see any issues in my local build. The SQL tab and individual executions also render fine for me.

vanzin force-pushed the shs-ng/M8 branch 2 times, most recently from 85b9ca1 to 172d0bb December 5, 2017 20:54
Marcelo Vanzin added 2 commits December 11, 2017 11:51

Detect the deletion of event log files from storage, and remove
data about the related application attempt in the SHS.

There are two main changes to speed up rendering of the tasks list
when rendering the stage page.

The first one makes the code only load the tasks being shown in the
current page of the tasks table, and information related to only
those tasks. One side-effect of this change is that the graph that
shows task-related events now only shows events for the tasks in
the current page, instead of the previously hardcoded limit of "events
for the first 1000 tasks". That ends up helping with readability,
though.

To make sorting efficient when using a disk store, the task wrapper
was extended to include many new indices, one for each of the sortable
columns in the UI, and metrics for which quantiles are calculated.

The second changes the way metric quantiles are calculated for stages.
Instead of using the "Distribution" class to process data for all task
metrics, which requires scanning all tasks of a stage, the code now
uses the KVStore "skip()" functionality to only read tasks that contain
interesting information for the quantiles that are desired.

This is still not cheap; because there are many metrics that the UI
and API track, the code needs to scan the index for each metric to
gather the information. Savings come mainly from skipping deserialization
when using the disk store, but the in-memory code also seems to be
faster than before (most probably because of other changes in this
patch).

To make subsequent calls faster, some quantiles are cached in the
status store. This makes the UI much faster after the first time a
stage has been loaded.
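The caching step can be pictured with a small Python sketch. The class name, cache key, and `compute` callback are assumptions made for illustration; the commit message does not say how the status store keys or stores the cached values.

```python
class CachedSummaries:
    """Hypothetical cache of quantile summaries in front of a slow computation."""

    def __init__(self, compute):
        self._compute = compute  # the expensive per-index scan
        self._cache = {}
        self.misses = 0

    def get(self, stage_id, attempt_id, quantiles):
        # Assumed key: one entry per stage attempt and quantile set.
        key = (stage_id, attempt_id, tuple(quantiles))
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute(stage_id, attempt_id, quantiles)
        return self._cache[key]


def slow_summary(stage_id, attempt_id, quantiles):
    # Placeholder for reading the metric indices; returns dummy values.
    return [q * 100 for q in quantiles]


store = CachedSummaries(slow_summary)
first = store.get(1, 0, [0.0, 0.5, 1.0])
second = store.get(1, 0, [0.0, 0.5, 1.0])  # served from the cache
```

Only the first request for a given stage attempt pays for the index reads; later page loads reuse the stored result.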

With the above changes, a lot of code in the UI layer could be simplified.