[SPARK-3446] Expose underlying job ids in FutureAction. #2337
Conversation
FutureAction is the only type exposed through the async APIs, so for job IDs to be useful they need to be exposed there. The complication is that some async actions run more than one job (e.g. takeAsync), so what gets exposed has to be a list of IDs that can change over time. So the interface doesn't look very nice, but... The change itself is small; I just added a basic test to make sure it works.
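For context, here is a minimal sketch of how a caller might consume this, assuming the accessor added by this patch is named `jobIds` and returns the ids of all jobs spawned so far by the action:

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings countAsync/takeAsync etc. into scope on Spark 1.x

object JobIdsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-ids-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10000, 20)

    // takeAsync may launch several Spark jobs (it keeps submitting jobs over
    // more partitions until it has enough elements), so the id list can grow
    // while the action is running.
    val action = rdd.takeAsync(5000)
    Await.result(action, Duration.Inf)

    // Once the action completes, the full list of underlying job ids is known.
    println(s"takeAsync ran Spark jobs: ${action.jobIds.mkString(", ")}")
    sc.stop()
  }
}
```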
I don't understand this claim: "...for job IDs to be useful they need to be exposed there." Could you clarify, please?
The point of adding the "jobId" method to SimpleFutureAction was so that code calling these async methods knows the IDs of the jobs it is triggering (see SPARK-2636). Except the job ID is not really exposed at all, since SimpleFutureAction is not exposed through the async APIs. (Sure, you could cast the result, but that's ugly, and it also doesn't cover ComplexFutureAction.)
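For illustration, the casting workaround being dismissed here looks roughly like this (a sketch; it assumes an `rdd` and the imports from the example above, and that `jobId` is the accessor SPARK-2636 added to SimpleFutureAction):

```scala
import org.apache.spark.SimpleFutureAction

// Ugly: the async API only promises a FutureAction, so reaching jobId means
// casting to the concrete SimpleFutureAction type.
val future = rdd.countAsync().asInstanceOf[SimpleFutureAction[Long]]
val id: Int = future.jobId

// ...and it doesn't work for actions backed by ComplexFutureAction:
// rdd.takeAsync(10).asInstanceOf[SimpleFutureAction[Seq[Int]]] // ClassCastException at runtime
```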
QA tests have started for PR 2337 at commit
QA tests have finished for PR 2337 at commit
Ping.
It would be good to test the complex case with multiple job ids, but overall looks good. @rxin you added this interface - can you take a look (this is a very small patch)?
QA tests have started for PR 2337 at commit
QA tests have finished for PR 2337 at commit
The API is slightly awkward, as you suggested. Is this intended to get job progress? If so, maybe we can do that through the "job group" to get the list of job ids?
My initial thought was that a "job group"-based approach might be a bit cleaner, but there are a few subtleties with that proposal that we need to consider. What if we had an API that accepts a "job group" and returns the ids of jobs in that group? e.g.:

```scala
// in SparkContext:
def getJobsForGroup(jobGroup: String): Seq[Int]
```

What should we return here? The list of active jobs? All jobs, including failed and completed ones? It looks like this pull request addresses a case where you run some asynchronous action and want to retrieve the ids of all jobs associated with that action. There is another subtle problem with this approach as well.
@rxin @pwendell Since we have job groups and the ability to cancel all jobs running in a job group (SparkContext.cancelJobGroup), I imagine that many developers would like to be able to fire off an entire workflow, potentially comprising multiple actions, monitor its overall progress, and cancel the whole thing.

It seems like job groups offer a strictly more powerful set of features that allows users to perform progress-monitoring and cancellation on entire workflows, not just individual actions.

If the motivation for FutureAction is that job groups are inconvenient for simple things, then I think we can address that by adding convenience wrappers that act like Python context managers and make it easy to run a block of code inside of a particular job group. Or, we could add an API that executes an arbitrary user-defined code block using a specified job group and returns a cancelable future.
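A sketch of the kind of convenience wrapper being suggested here, built on the real setJobGroup / clearJobGroup / cancelJobGroup APIs; `withJobGroup` itself is hypothetical, not an existing Spark method, and the usage assumes a SparkContext `sc` is in scope:

```scala
import org.apache.spark.SparkContext

// Hypothetical helper: run a block of actions under a single job group so the
// whole workflow can be monitored and cancelled as one unit.
def withJobGroup[T](sc: SparkContext, groupId: String, description: String)(body: => T): T = {
  sc.setJobGroup(groupId, description, interruptOnCancel = true)
  try body finally sc.clearJobGroup()
}

// Every job triggered inside the block belongs to "workflow-42"; another
// thread can cancel the entire workflow with sc.cancelJobGroup("workflow-42").
val total = withJobGroup(sc, "workflow-42", "example multi-action workflow") {
  val doubled = sc.parallelize(1 to 100).map(_ * 2)
  doubled.count() + doubled.filter(_ % 3 == 0).count()
}
```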
@vanzin it would be helpful to hear what the needs are for Hive on Spark. Other applications I've seen have been using the job group for this purpose, and it actually works even when a query involves multiple jobs (which would be much harder with this Future interface). At the beginning of each query you set the job group before calling any Spark actions; then, in another thread, you can read the job ids associated with the group for progress tracking.
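The pattern described above could look roughly like this (a sketch; it assumes, per the exchange further down, that the job-group id rides along in the job's properties under the internal `spark.jobGroup.id` key):

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Collects the ids of all jobs started under a given job group, so another
// thread can poll them for progress tracking.
class JobGroupTracker(groupId: String) extends SparkListener {
  private val ids = new ConcurrentLinkedQueue[Int]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // "spark.jobGroup.id" is the value of SparkContext.SPARK_JOB_GROUP_ID,
    // an internal / undocumented key (see the exchange below).
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id")).orNull
    if (group == groupId) ids.add(jobStart.jobId)
  }

  def jobIds: Seq[Int] = ids.asScala.toSeq
}

// Usage (assuming sc: SparkContext):
//   sc.addSparkListener(new JobGroupTracker("query-1"))
//   sc.setJobGroup("query-1", "my query")
//   ...run actions, poll the tracker's jobIds from a monitoring thread...
```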
I've opened #2482, a WIP pull request illustrating my proposal to remove the FutureAction-based async APIs in favor of a more general job-group-based mechanism.
Lots of questions, let's go one by one.

**Motivation**

This is discussed in SPARK-2636 (and probably a couple of others), but I'll try to summarize it quickly here. Hive-on-Spark generates multiple jobs for a single query, and needs to monitor and collect metrics for each of those jobs, separately. The way to do this in Spark is through the use of...

**Job Groups**

I was not familiar with the API and it sounds great to me. It would make monitoring jobs in my remote API prototype (SPARK-3215) much cleaner. The only missing piece, from looking at the API, is that I don't see "job group" anywhere in the events sent to listeners (e.g. SparkListenerJobStart). Unless...

**Async API vs. Something Else**

I'm not sold on using the async API, and in fact its use in my remote client prototype looks sort of hacky and ugly. But currently that's the only way to gather the information HoS needs. Any substitute needs to allow the caller to match events to the job that was submitted, which is not possible via other means today (or, at least, not that I can see).

I assume that job groups still work OK with the current async API, since the thread-local data is kept in an InheritableThreadLocal.
It does, actually; the property is named `spark.jobGroup.id` (SparkContext.SPARK_JOB_GROUP_ID). You can use something like

```scala
val jobGroupId = Option(properties).map(_.get(SparkContext.SPARK_JOB_GROUP_ID)).orNull
```

to check the job group. This is kind of messy, though (this isn't a documented / stable API).
More than that, it's `private[spark]`.
Just to be clear, I'm OK with switching to job groups to achieve what HoS needs (and closing this PR/bug), but even that path seems like it could use some changes to make the lives of people using the API easier.
Yeah, I wasn't suggesting that as a substitute for a real public API.
I think there are two separate design issues here: 1) giving callers of the async actions access to the ids of the jobs they spawn (what this PR does), and 2) designing a more general mechanism, e.g. job groups, for monitoring and cancelling entire multi-job workflows.

Let's keep this open for now, since this PR sounds like an okay way to address 1) and these two concerns are largely orthogonal.
I've given it some thought, and I don't think we should merge the more general async mechanism that I described in #2482. It had some confusing semantics surrounding cancellation (see the discussion of Thread.interrupt) and was probably more general than what most users need. Given that we should probably keep the current async APIs, this PR's change looks good. I'm going to merge this into master.
Thanks Josh. I meant to comment on your other PR (also about the weird cancellation semantics), but life got in the way. :-)