Conversation

@WeichenXu123 (Contributor)

## What changes were proposed in this pull request?

Add parallelism support for ML tuning in pyspark.

## How was this patch tested?

Test updated.

Contributor Author

This probably needs discussion. Currently PySpark caches neither the training dataset nor the validation dataset, while the Scala implementation caches both.
I would prefer to cache the validation dataset but not the training dataset: the validation dataset is only 1/numFolds of the input, so it deserves caching, otherwise we scan the input dataset again. The training dataset, however, is (numFolds - 1)/numFolds of the input, so we can generate it by scanning the input dataset directly without slowing things down much.
@BryanCutler @MLnick What do you think? Thanks!
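
A rough sketch of the proposal (variable names like `df`, `randCol`, and `h = 1.0 / nFolds` are assumed from the existing `_fit` loop; this is not the merged code):

```python
for i in range(nFolds):
    validateLB = i * h
    validateUB = (i + 1) * h
    condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
    validation = df.filter(condition).cache()  # small: only 1/numFolds of the input
    train = df.filter(~condition)              # large: (numFolds - 1)/numFolds, left uncached
    # ... fit each param map on train and evaluate on validation ...
    validation.unpersist()
```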

Member

We do multi-model training when fitting on the estimator, so I think it's still beneficial to cache the training dataset?

Contributor Author

Suppose the input dataset is already cached; generating the "training dataset" then only needs a "map" operation on the cached DataFrame, filtering out the 1/numFolds used for validation. So the cost shouldn't be much higher than caching the "training dataset" itself.

Member

That's right, but it seems we don't check whether the input dataset is cached here? Should we cache it if it is not?

Contributor Author

Hmm... checking the caching status of the input dataset isn't easy; there are still open discussions in SPARK-18608. For now I think we can stay consistent with the Scala side and cache both the training and validation datasets, so I've updated the code.

@SparkQA commented Sep 4, 2017

Test build #81386 has finished for PR 19122 at commit 57cf534.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasParallelism(Params):
  • class CrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):
  • class TrainValidationSplit(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):
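
For reference, a minimal sketch of what a shared `HasParallelism` mixin like the one listed above might look like (the doc string and method set here are assumptions, not copied from the patch):

```python
from pyspark.ml.param import Param, Params, TypeConverters

class HasParallelism(Params):
    """Sketch of a mixin for the shared param `parallelism` (not the merged code)."""

    parallelism = Param(Params._dummy(), "parallelism",
                        "number of threads to use when running parallel algorithms (>= 1)",
                        typeConverter=TypeConverters.toInt)

    def __init__(self):
        super(HasParallelism, self).__init__()
        self._setDefault(parallelism=1)

    def setParallelism(self, value):
        return self._set(parallelism=value)

    def getParallelism(self):
        return self.getOrDefault(self.parallelism)
```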

@SparkQA commented Sep 4, 2017

Test build #81387 has finished for PR 19122 at commit be2f3d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Sep 5, 2017

Can we have a benchmark for this? Could the Python GIL be a problem here and degrade performance in the end?

Contributor Author

Do you mean the line metrics[index] += metric / nFolds will degrade performance because of lock contention?
I can change the code to avoid this. Thanks!
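
One way to avoid shared-state updates from the worker threads, as a sketch (names `est`, `eva`, `train`, `validation`, `epm`, `parallelism`, `nFolds`, and `metrics` are assumed): let each task return its metric and accumulate on the main thread after the pool finishes.

```python
from multiprocessing.pool import ThreadPool

def singleTrain(paramMap):
    # est.fit and eva.evaluate run in the JVM via Py4J, releasing the GIL.
    model = est.fit(train, paramMap)
    return eva.evaluate(model.transform(validation, paramMap))

pool = ThreadPool(processes=parallelism)
foldMetrics = pool.map(singleTrain, epm)   # results come back in epm's order
for j, metric in enumerate(foldMetrics):
    metrics[j] += metric / nFolds          # accumulate on the main thread only
```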

Member

The actual fitting and evaluation methods run here may include CPU-bound code, so I'm not sure multithreading can really boost performance.

Contributor Author

Oh, I think this works well; we already have PRs doing similar things, #19110 and #16774. The heavy lifting in fit and evaluate happens in the JVM via Py4J, so the Python threads mostly block on socket I/O, during which the GIL is released.

@SparkQA commented Sep 5, 2017

Test build #81411 has finished for PR 19122 at commit 596ce35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 6, 2017

Test build #81457 has finished for PR 19122 at commit 0a94344.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment

I had a few questions, but why don't we get OneVsRest with the shared param merged in first?

Member

Since the param is not being passed to Java, should we check that it is >=1 here and in setParam?

Contributor Author

I added a check when creating the thread pool.
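
For reference, a minimal sketch of such a guard (the exact message and variable names are assumptions):

```python
from multiprocessing.pool import ThreadPool

parallelism = self.getParallelism()  # from the shared HasParallelism mixin
if parallelism < 1:
    raise ValueError("parallelism should be >= 1, got %d" % parallelism)
pool = ThreadPool(processes=min(parallelism, numModels))
```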

Member

Could you just use epm as the argument to the function instead of an index? e.g. pool.map(singleTrain, epm)

Member

I think the description should be more general. Is the plan to put the shared param here or in OneVsRest first?

Contributor Author

No worries, #19110 will be merged first, and then I will merge it into this PR.

Member

Are you planning to add a unit test to verify that parallel fitting gives the same results as serial?

Contributor Author

Test added.
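
The shape of such a test, as a sketch (fixture names `lr`, `grid`, and `dataset` are assumed; the evaluator choice is illustrative): fit once serially and once in parallel, then assert identical metrics.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())
cvSerialModel = cv.copy({cv.parallelism: 1}).fit(dataset)
cvParallelModel = cv.copy({cv.parallelism: 2}).fit(dataset)
# Parallel and serial runs must produce identical average metrics.
self.assertEqual(cvSerialModel.avgMetrics, cvParallelModel.avgMetrics)
```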

@WeichenXu123 (Contributor Author)

@BryanCutler Code updated, thanks!

@SparkQA commented Sep 13, 2017

Test build #81705 has finished for PR 19122 at commit d6cf103.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 13, 2017

Test build #81707 has finished for PR 19122 at commit 7122884.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

@BryanCutler Do you have more comments? I can check it out now but don't want to review at the same time.

@BryanCutler (Member) left a comment

I had just a few minor suggestions, otherwise LGTM

Member

minor: import order

Member

I think it would be a little better to check that the bestModel chosen is the same in both cases; same for the TrainValidationSplit test.

Contributor Author

Hmm... I tried, but how do I get the models' parents?

Member

Oh right, I guess you'd have to check cvSerialModel.bestModel.weights, which isn't too ideal either. It's fine as it is.
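
i.e. something like the following sketch (whether the attribute is `weights` or `coefficients` depends on the model class):

```python
self.assertEqual(cvSerialModel.bestModel.coefficients,
                 cvParallelModel.bestModel.coefficients)
```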

Member

Sorry, I asked for this check, but I don't think we really need it. The ValueError from creating a ThreadPool with processes < 1 makes the problem easy enough to see. If you want to leave it in, that's fine too.

Member

Not a big deal, but you could use the variable j instead of k here now.

@jkbradley (Member) left a comment

I started to review a little... and then realized I made an error when reviewing the original patches in this work. I posted here: https://issues.apache.org/jira/browse/SPARK-19357. Please let me know your thoughts on the best way to fix this.

Member

Don't sort the metrics. They are guaranteed to be returned in the same order as the estimatorParamMaps (pool.map preserves input order), so they already match up.

Member

ditto: don't sort the metrics

Member

style: This should be grouped with the other 3rd-party library imports

@SparkQA commented Sep 22, 2017

Test build #82058 has finished for PR 19122 at commit 3464dfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor Author)

ping @jkbradley

@SparkQA commented Sep 28, 2017

Test build #82275 has finished for PR 19122 at commit 93ab39a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Discussed elsewhere: we'll delay the multi-model fitting optimization in favor of getting this in. Taking a look now...

@jkbradley (Member) left a comment

Looks basically ready; just a tiny comment.

["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
Member

With only 0 or 1 iterations, I don't think we could expect to see real differences between parallelism 1 and 2, even if there were bugs in our implementation. How about trying more, say 5 and 6 iterations?
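
i.e., a sketch of the suggested change:

```python
grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 6]).build()
```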

Member

Same for TrainValidationSplit

@SparkQA commented Oct 25, 2017

Test build #3961 has finished for PR 19122 at commit 93ab39a.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 26, 2017

Test build #83075 has finished for PR 19122 at commit 8b3ef97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

LGTM
Merging with master
Thanks @WeichenXu123 , @BryanCutler and @viirya !

@asfgit closed this in 20eb95e Oct 27, 2017
@jkbradley (Member)

Whoops, could you please send a follow-up PR to make one doc update?

@WeichenXu123 (Contributor Author)

@jkbradley Sure I will!

@WeichenXu123 deleted the par-ml-tuning-py branch October 28, 2017 01:50
asfgit pushed a commit that referenced this pull request Nov 14, 2017
## What changes were proposed in this pull request?

Fix doc issue mentioned here: #19122 (comment)

## How was this patch tested?

N/A

Author: WeichenXu <[email protected]>

Closes #19641 from WeichenXu123/fix_doc.