-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21809] : Change Stage Page to use datatables to support sorting columns and searching #19270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…g columns and searching Converted static html tables to datatables on stage page [SPARK-21809] : Change Stage Page to use datatables to support sorting columns and searching
[SPARK-21809] : Fixing code to pass ScalaStyle Check
|
@ajbozarth Thank you for your comment on the previous PR. I have closed that one. Apologies for the confusion caused in the previous PR! |
|
Thanks, I'll try to review this by EOD tomorrow |
|
I'll look at the html/js code tomorrow, but it looks like there still unrelated code that adds new fields, is that code supposed to be there or is it for another task? |
|
All of the code is part of the same task. Can you please be more specific about the code that you have doubts about, and I can elaborate further on it. |
|
ok to test |
|
Test build #81936 has finished for PR 19270 at commit
|
|
The error logs for test build #81683 state that method this(Long,Int,Int,Long,Long,Long,Long,Long,Long)Unit in class org.apache.spark.status.api.v1.ExecutorStageSummary does not have a correspondent in current version. All I have done is add new fields in the api ExecutorStageSummary and have not modified any existing ones. It should be fine but please let me know if it is not. |
|
On a second look I think I figured out my misunderstanding, and I've realized a through review will take quite a bit of time, I'll do my best to finish by the end of the week but no promises. As for the MiMa failure, any change to a public api (even additions) must be added to the MiMa excludes. |
|
No problem. Thank you for your valuable comments. |
Adding Problem filter for ExecutorStageSummary to MiMa Excludes
|
Test build #82003 has finished for PR 19270 at commit
|
|
I'm still going through the code but I also checked out, built and ran you changes and found that the page doesn't work in the web UI only in the SHS. Did you test this on both the Web UI and SHS? I'll continue my read through and testing of your code while you fix this. |
|
I believe you mean the opposite of what you wrote. My changes are visible in the web ui(while the app is running) and not in the SHS(once the job is done). Yep I see that and am working on the fix. Thank you. |
|
For me it's definitely the UI that doesn't work and the SHS that does, I'' see if I can recreate and screenshot the js error I'm getting for you |
|
I just tried this out and it appears to be working for me for a running application, haven't tried the history UI yet. @ajbozarth What browser are you using and what are you running (a wordcount, or similar)? I tried both in chrome and firefox. I just checked out his pull request and built that. one thing I noticed was if you select everything under the "Show additional metrics", then refresh the page, is saves the settings but the check boxes aren't checked anymore. "Select All" should be called what it used to be "(De)select All" It also seems to be missing some of the options there when you do a shuffle there should be Shuffle read/write metrics: "Shuffle Read Blocked Time" and "Shuffle Remote Reads" |
|
Ok so I'm still doing more testing but I've narrowed the above problem. The above error is occurring when using either local or standalone, the error doesn't appear when using yarn. I'll continue my testing and review. |
|
Ok, I will look into it. I am currently fixing ui bugs and unit tests, so will commit those changes first, then will look into the above issue. Thank you. |
…g columns and searching [SPARK-21809] : Fixed a couple of ui issues; the changes are visible both in webui and SHS when running on yarn. Fixed unit tests. Removed two obsolete unit tests.
|
Test build #82252 has finished for PR 19270 at commit
|
|
@ajbozarth There were two unit tests in StagePageSuite.scala that were failing as they are no longer valid for the modified ui that generate datatables dynamically from Javascript. I have removed them for the time being, but if you have any other suggestions, let me know. |
…ui and history on local and standalone [SPARK-21809] : Fixing issue of not being able to see changes in web ui and history on local and standalone mode as well as on yarn
|
Test build #82321 has finished for PR 19270 at commit
|
|
@ajbozarth I have fixed the issue of my changes not working in the web ui for local, standalone and yarn. Let me know if you are still facing issues with the testing. |
|
Thanks, I'll pull the latest changes and keep testing. And thanks for your quick responses, I understand large changes like this take forever to review and can get frustrating for the submitter. |
|
No problem, @ajbozarth , and thank you for your valuable feedback. I really appreciate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gotten through tasks pages.js completely yet but wanted to post my comments so far. (also I think taskspages.js should either be taskspage.js or stagepage.js)
Two bits of functionality I found missing:
- Show additional metrics don't toggle the metrics in the summary table, they're just always on
# Complete Tasksdoesn't link to the Tasks table anymore
| } | ||
|
|
||
| def convertTaskData(uiData: TaskUIData, lastUpdateTime: Option[Long]): TaskData = { | ||
| private def getGettingResultTime(info: TaskInfo, currentTime: Long): Long = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currentTime is never used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it
| def activeStorageStatusList: Seq[StorageStatus] = storageStatusListener.storageStatusList | ||
|
|
||
| def deadStorageStatusList: Seq[StorageStatus] = storageStatusListener.deadStorageStatusList | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove these lines? They don't seem to be an issue
| execTaskSummary.isBlacklisted = isBlacklisted | ||
| } | ||
|
|
||
| def getExecutorHost(eid: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this how getExecutorHost works in other classes? How did you decide to implement it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the list active StorageStatusList stores the info of all the active executors at any point, I thought it could be used in that manner. I tested it and it seems to work fine. If you think there could be a potential issue with this approach, do let me know and we can discuss further.
| for { | ||
| stageId <- 0 to 1 | ||
| attemptId <- 0 to 1 | ||
| attemptId <- 1 to 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this test was initially failing due to following reason. For stage id 1 and attempt id 0, the stage is designed to fail. So ideally, for this case, when the test tries to connect to the backend to get the json file in line 352, it would exit. But as there is no code writen to handle the exception, the test would quit and fail as the last case never ran. So, by changing the order, I ensured that all the stage success cases would run and the last one would fail and test will pass. I went as far as I could to debug and this was what I could find. If you think there is something more to this, let me know and we can discuss further.
|
|
||
| $(document).ajaxStop($.unblockUI); | ||
| $(document).ajaxStart(function () { | ||
| $.blockUI({message: '<h3>Loading Tasks Page...</h3>'}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more commonly referred to as the Stage Page than the Tasks Page
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| } | ||
| } ); | ||
|
|
||
| function createTemplateURI(appId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This think this function can be abstracted out (with the template file name added as a param) and moved to utils.js
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved them all to utils.js
|
|
||
| // This function will only parse the URL under certain formate | ||
| // e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0 | ||
| function StageEndPoint(appId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: stageEndPoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| } | ||
| } | ||
| var stageIdLen = words2[1].indexOf('&'); | ||
| var stageId = words2[1].substr(3, stageIdLen - 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc this doesn't account for anchors in the url (#tasks-section), I'd look into some url parsing js functions libs, I've seen some good ones before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'm not quite sure what the 3s are doing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the javascript substr function takes the first param as the starting index and the second param as the length of characters. And the split happens way before so it is independent of any anchors. I have tested it for many cases and it seems to look well, but if you are still unsure, I can look into better ways of parsing the url.
| } | ||
|
|
||
| $(document).ready(function () { | ||
| $.extend($.fn.dataTable.defaults, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might also be worth moving to utils.js since we set the same defaults on all our pages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| pageLength: 20 | ||
| }); | ||
|
|
||
| $("#showAdditionalMetrics").append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just adding static html we can probably leave it in StagePage.scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I wrote that piece of code in javascript is because I thought if in future, the data-column indices needed to be rearranged dynamically, it would make more sense to do it in javascript. But, if you feel otherwise, I shall move it to html.
|
Thanks, I'll take a look later today |
|
Also could you update the description with new (and more) screen shots? |
|
Yep I did that! Thanks! |
|
Test build #82563 has finished for PR 19270 at commit
|
|
So I think I know why the appId was handled the way it was, the live app ui no longer works because the appId var is "undefined" in all the api calls |
|
@ajbozarth I do not quite understand what you are saying. Everything seems to be working fine on my test setup. Can you please let me know how do I replicate the issue? Thank you. |
ajbozarth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some time to figure out the source of the issues I've run into and made a handful of comments below
| $.getJSON(location.origin + "/api/v1/applications", function(response, status, jqXHR) { | ||
| if (response && response.length > 0) { | ||
| var appId = response[0].id | ||
| return appId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is inside a getJSON this value will never return. This line here is why this function took in a function before. So we'll need to change it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed that
| //Let's get application-id using REST End Point | ||
| $.getJSON(location.origin + "/api/v1/applications", function(response, status, jqXHR) { | ||
| if (response && response.length > 0) { | ||
| var appId = response[0].id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing ;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed now.
| } | ||
| } ); | ||
|
|
||
| // This function will only parse the URL under certain formate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| dagViz ++ <div id="showAdditionalMetrics"></div> ++ | ||
| makeTimeline( | ||
| // Only show the tasks in the table | ||
| stageData.taskData.values.toSeq.filter(t => taskIdsInPage.contains(t.taskInfo.taskId)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(commenting here since it won't let me comment on the actual line)
taskIdsInPage is only used here, so following up where it's created taskTableHTML is no longer needed, maybe clean up that code up there a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| { | ||
| data : function (row, type) { | ||
| if (accumulator_table.length > 0) { | ||
| return row.accumulatorUpdates[0].name + ' : ' + row.accumulatorUpdates[0].update; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting row.accumulatorUpdates empty, on inspection it seems response.accumulatorUpdates is what's checked by if (accumulator_table.length > 0) though I'm not sure why I'm getting this (note: I only have one accumulator)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why is that happening, however, have introduced an additional check whether row.accumulatorUpdates is empty or not. Let me know if that does not work!
Reverted the function getStandAloneAppId() to take in function as parameter, removing unused variable taskTableHtml in StagePage.scala and adding additional check for displaying accumulator column in task table
|
Test build #82646 has finished for PR 19270 at commit
|
ajbozarth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wapping up nicely, just one more comment
| if (accumulator_table.length > 0 && row.accumulatorUpdates.length > 0) { | ||
| return row.accumulatorUpdates[0].name + ' : ' + row.accumulatorUpdates[0].update; | ||
| } else { | ||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The should be "" instead of 0 to match the previous page (similar to the error column below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
| }, | ||
| {data : "executorLogs", render: formatLogsCells}, | ||
| {data : "launchTime"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formatDate function in historypage.js should be moved to utils.js and used here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it
|
I'm confused about the PR description. It mentions "sorting was disabled if there is any application that has more than one attempt" which as far as I know has nothing to do with the stage page. Also what's a "Stage Server" (from "sort and search for Stage Server")? As far as the change itself, I assume it's only related to the stage page, and from the little javascript I know, you seem to be downloading the whole stage with all its tasks from the REST API. How fast is that with really large stages (e.g. 100k tasks which I think is the default retention)? You can try |
|
@vanzin Actually the JIRA was reported by somebody else and I copied the description in the first para. so I just knew what needed to be done but missed out on the specifics of the description. I have updated the PR description. Thank you for pointing that out. Regarding your comment on the speed of JS execution for really large stages, the initial page loading time may be slightly high, but it is greatly compensated by the time spent in retrieving subsequent task info in the data table as opposed to the traditional HTML table, where time spent retrieving task information for every 10 tasks or so is higher. And also the initial page loading time overhead is compensated for the convenience that the Search functionality provides across all columns and the speed with which it supports sorting of the data. So, in the end, it boils down to how useful this change is as opposed to the cost involved in implementing the change. |
Moved function formatDate to utils.js and used it in stagepage.js to format launch time column
But how high is high? Does datatables support requesting pages dynamically from the server side instead of having to load the full data set from the get go? |
|
Also, there's a second question of how much memory does it take to generate the JSON for such a large stage. While working on SPARK-18085 I remember measuring this and one time requesting the stage page caused more than 600MB of memory to be used (which caused my small driver to OOM). |
If that is the case, then my change serves no purpose. As I have mentioned in the PR, we would like to support sort over all the entries rather than the 20 entries in the current page and this is not possible if the data is loaded dynamically.
For around 100000 tasks as you suggested in the above comment, the loading time is approximately 50-55 seconds which is compensated by the extremely short times for sorting, searching and pagination for a hundred thousand tasks! |
Sort is over all entries, isn't it? If it isn't, the work I have on SPARK-18085 sorts over all entries and is pretty fast at that.
That's pretty slow. My code in SPARK-18085 takes a few seconds to load each page, and can be sped up a lot by caching some data that has to be computed on every load. It doesn't support search, but that can be added. |
|
@vanzin I haven't followed the prs from SPARK-18085 closely, you keep mentioning it here but don't give anymore context as to what that has to do with this change. Did you change the format of tables in there? Please point to specifics. Yes there is obviously a trade off here since its loading all the data at first. This is useful in many situations but not in others. the memory usage can be an issue but isn't worse then anyone hitting the rest api directly (other then your browser side). Personally I hate the tables as is now, they are not friendly at all when looking for things in the task tables so we need to improve them somehow. If you have other design ideas for this perhaps you can talk about those. There are definitely other approaches/frameworks, we can definitely improve upon this to make things happen more on the server side if its needed and we support in the rest api. We could also change back away from the rest api, but I thought that is the way we were wanting to go but maybe that has changed again? |
|
The main change I'm talking about is https://issues.apache.org/jira/browse/SPARK-20657 (code at vanzin#41). I did not change the format of the tables, but how data is loaded into them. The new backing store is much faster at sorting than the current code; and it tries to only load the data that will be shown, so it's also light on memory. The only thing it doesn't support, as I mentioned, is searching, but that could be added at a cost. (And it currently doesn't cache metrics, so each page load scans all metrics which is a little slow, but an order of magnitude faster than the numbers shown above.) I'm not a fan of the tables currently, nor am I saying they should stay as is. But my main concern here is really SHS memory usage. Your point of hitting the rest api is valid, and I think it should be considered a bug that no default limit is imposed for large lists. The way we've generally solved the "arbitrarily large" lists in other apps here is to use infinite scrolling + server side search. I'm not really a front end dev so I don't know what the current opinions on that approach are, but it's definitely lighter as far as memory usage on the server side goes, and load times on the client side are pretty fast too. |
|
Test build #82693 has finished for PR 19270 at commit
|
|
Right so from my understanding https://issues.apache.org/jira/browse/SPARK-20657 is proposing to not using the rest api and force more on the backing store which exists only for history server but not a running application. Running application already have this data in memory anyway. Or are you proposing changing the rest api to use the new backing store and the stage page to the rest api? If not then you are splitting the ui pages to be more specific to running vs history or at least having a different backend store to fetch them from. If its specific to the ui pages then it doesn't help the rest api. If it helps the rest api then why not use that for the web pages too. Either way the data has to get to the web browser, I would think not using the rest api would be memory efficient if it can read directly from existing objects (or the backing store) without having to create a new objects in the rest api to send out. Maybe that doesn't matter so much if we make it more server side to send out small bits at a time. The rest api would be a nice way to abstract the backend out from the UI pages but if it doesn't perform well enough we shouldn't do it. Either one is doable I think we should just choose a direction. Let me know your thoughts. |
Not really; I just don't know enough JS to actually make the change to use the REST api, but I do believe using it is a better approach. The main thing about that change is to generate the view the client wants without having the whole data set in memory (either in the client or server side).
I'm not really against the idea of the changes in this PR, but if we're changing things, I'd like at least a little more thought put into how to do things right. I don't think the "transfer everything from the server to the client" approach is the right thing to do in the long term; it will make looking at small stages better, yes, but it will make looking at larger stages worse. Even if afterwards it's quicker to resort columns, I have little patience to wait 1min for a page to load. And so do not others who have filed bugs about that kind of thing. Another point I'd like to make, and this is selfish me talking, is that large changes in these areas really slow down progress in SPARK-18085. Everybody tries to make local fixes to the problems I've tried to holistically fix in that project, and very few people seem interested in reviewing the code I've been putting out. I believe the changes in that project will make changes like this a lot easier, so I'd really appreciate getting those changes reviewed and committed. |
|
disclaimer is I personally haven't tried this out on larger stages so I'll try to do that tomorrow to really see how the user experience is with larger # of tasks. If its really that bad then definitely agree doing the server side stuff makes sense. I wouldn't expect it to take 55 seconds. if its really taking that long seems like something would be wrong. especially on a running application or after initial history load. I was expecting the change even for large applications to better then the current if you are actually doing something more then the very basic. Sorry I've unfortunately been busy doing internal projects so I haven't had time to help with many reviews. It should get better in a week or two and I'll try to look at yours prs to help. I also didn't think the initial part of that jira was doing anything with the UI so didn't expect these things to conflict and of course this has taken a lot longer then expected to get up too. |
|
seems to be issues with the pr still. Getting errors trying to run on larger jobs: Also I see the accumulators table on jobs that shouldn't have it, this was on a running application. |
|
ok trying this out it is taking a lot longer then I expected. retrieving the data is relatively quick (2 seconds for 40000 tasks) but then is taking a lot longer to parse and then display. Total time for 40000 was 17 seconds. Either way it sounds like this needs more work and if we have other jira with large changes perhaps we should hold off on this until some of that gets in. |
|
After reading through Tom and Marcello's comments I'll defer to them on whether this should be merged in or not. I will note that moving from a hard-coded, server-side, scala generated web ui to a dynamic, client-side, js generated web ui utilizing the metrics api is the direction we've been moving the ui for some time, first with the SHS summery page then with the executors page. I still think this is the right direction, but of all the remaining pages to convert this page at the same time has both the most to gain and the most potential to drag down the UI. If you guys think holding off on this until Marcello's work is done and re-optimizing then is better I'll stand by that, and even though I never did heavy testing for speed I'll stand by Tom that this still needs optimization work before merging. |
|
Thank you everybody for their valuable comments. I do agree with Tom that optimizations need to be done on this issue. We need to move sorting to server side and try reducing the number of loops used to parse the json files and a couple of other things. So I shall close this PR for the time being. |

Support column sort and search for Stage Page using jQuery DataTable and REST API. Before this commit, the Stage page generated a hard-coded HTML table that could not support search. Supporting search and sort (over all applications rather than the 20 entries in the current page) in any case will greatly improve the user experience.
Created the stagespage-template.html for displaying application information in datables. Added REST api endpoint and javascript code to fetch data from the endpoint and display it on the data table.
Because of the above change, certain functionalities in the page had to be modified to support the addition of datatables. For example, the toggle checkbox 'Select All' previously would add the checked fields as columns in the Task table and as rows in the Summary Metrics table, but after the change, only columns are added in the Task Table as it got tricky to add rows dynamically in the datatables.
How was this patch tested?
I have attached the screenshots of the Stage Page UI before and after the fix.

Before:
Accumulators Table:

After:

Accumulators Table:
