@PhillHenry
Contributor

What changes were proposed in this pull request?

This PR adds code that generates random parameter values for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

Why are the changes needed?

Randomization can be a more effective technique than a grid search, since minima and maxima can fall between grid points and never be found. Randomization is not restricted in this way, although the probability of finding a minimum or maximum depends on the number of attempts.
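
To make the point concrete, here is a minimal sketch in plain Python (it does not use Spark). The toy `loss` function, its minimum at 0.37, and the helper names `grid_search` and `random_search` are all assumptions made up for illustration; they are not part of this PR.

```python
import random


def loss(x):
    # Toy objective (an assumption for illustration): minimum at x = 0.37,
    # which lies between the grid points 0.25 and 0.5.
    return (x - 0.37) ** 2


def grid_search(lo, hi, n):
    # Evaluate n evenly spaced candidates, endpoints included.
    step = (hi - lo) / (n - 1)
    candidates = [lo + i * step for i in range(n)]
    return min(candidates, key=loss)


def random_search(lo, hi, n, seed=0):
    # Evaluate n candidates drawn uniformly at random from [lo, hi].
    rng = random.Random(seed)
    candidates = [rng.uniform(lo, hi) for _ in range(n)]
    return min(candidates, key=loss)


# The grid only ever sees 0.0, 0.25, 0.5, 0.75 and 1.0.
grid_best = grid_search(0.0, 1.0, 5)
# Random sampling can land anywhere in the range; more draws make a
# near-optimal candidate more likely, but never guarantee one.
random_best = random_search(0.0, 1.0, 200)
```

With only five grid points the best the grid can do is 0.25, however the model behaves in between. The random variant has no such blind spot, but its result depends on the number of draws.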

Alice Zheng has an accessible description of how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

Does this PR introduce any user-facing change?

A new class (ParamRandomBuilder.scala) and its tests have been created but there is no change to existing code. This class offers an alternative to ParamGridBuilder and can be dropped into the code wherever ParamGridBuilder appears. Indeed, it extends ParamGridBuilder and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
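
The relationship between the two builders can be sketched in plain Python. This is only an illustration of the design (a subclass that adds one random-range method to a grid builder); `ToyGridBuilder`, `ToyRandomBuilder`, `add_grid` and `add_random` are hypothetical stand-ins and do not reproduce the actual Spark classes or their signatures.

```python
import itertools
import random


class ToyGridBuilder:
    """Hypothetical stand-in for a grid builder: named lists of candidates."""

    def __init__(self):
        self._grid = {}

    def add_grid(self, name, values):
        self._grid[name] = list(values)
        return self

    def build(self):
        # Cartesian product of all candidate lists, one dict per combination.
        names = list(self._grid)
        combos = itertools.product(*(self._grid[n] for n in names))
        return [dict(zip(names, combo)) for combo in combos]


class ToyRandomBuilder(ToyGridBuilder):
    """Same interface, plus one method that samples values from a range."""

    def __init__(self, seed=None):
        super().__init__()
        self._rng = random.Random(seed)

    def add_random(self, name, lo, hi, n):
        # Draw n values uniformly from [lo, hi] and register them as a grid.
        samples = [self._rng.uniform(lo, hi) for _ in range(n)]
        return self.add_grid(name, samples)


# Fixed and random parameters can be mixed, as with a plain grid builder.
param_maps = (
    ToyRandomBuilder(seed=42)
    .add_grid("max_iter", [10, 100])
    .add_random("reg_param", 0.0, 0.5, 3)
    .build()
)
```

Because the random values are simply registered as another list of candidates, anything that consumes the output of the grid builder can consume the output of the random one unchanged.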

How was this patch tested?

Two test suites, ParamRandomBuilderSuite.scala and RandomRangesSuite.scala, were added.

ParamRandomBuilderSuite is the analogue of the existing ParamGridBuilderSuite, which tests the user-facing interface.

RandomRangesSuite uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
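
As a rough idea of what a property-style test of random ranges looks like, here is a small standard-library Python sketch. It is not the ScalaCheck suite from this PR: the function names and the chosen property (every sample stays within its range) are assumptions for illustration only.

```python
import random


def sample_uniform(rng, lo, hi, n):
    # Hypothetical sampler under test: n uniform draws from [lo, hi].
    return [rng.uniform(lo, hi) for _ in range(n)]


def check_samples_stay_in_range(trials=200, seed=1):
    # Property-style check: generate many random ranges and verify that
    # every sample drawn from a range lies within its bounds.
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(-1e6, 1e6), rng.uniform(-1e6, 1e6)
        lo, hi = min(a, b), max(a, b)
        for x in sample_uniform(rng, lo, hi, 20):
            assert lo <= x <= hi, (lo, hi, x)
    return trials


checked = check_samples_stay_in_range()
```

The value of this style is that the ranges themselves are generated, so the test is not tied to a handful of hand-picked bounds.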

@SparkQA

SparkQA commented Feb 24, 2021

Test build #135389 has finished for PR 31535 at commit 259edfe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@PhillHenry
Contributor Author

@srowen Cool. Will do.

@srowen
Member

srowen commented Feb 25, 2021

Jenkins retest this please

@SparkQA

SparkQA commented Feb 25, 2021

Test build #135473 has started for PR 31535 at commit 183c2cd.

@SparkQA

SparkQA commented Feb 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/

@SparkQA

SparkQA commented Feb 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/

@PhillHenry PhillHenry requested a review from srowen February 26, 2021 10:51
@srowen
Member

srowen commented Feb 26, 2021

Jenkins retest this please

@SparkQA

SparkQA commented Feb 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/

@SparkQA

SparkQA commented Feb 26, 2021

Test build #135518 has finished for PR 31535 at commit ddfe4a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/

@srowen
Member

srowen commented Feb 27, 2021

Merged to master

@srowen srowen closed this in 397b843 Feb 27, 2021
@dongjoon-hyun
Member

Hi, @PhillHenry and @srowen .
This seems to break the GitHub Actions linter job.
Could you check the doc generation part?

@srowen
Member

srowen commented Feb 28, 2021

Oh OK let me figure that out - can probably fix forward with a patch. Er, where can I see the output? I don't see it here.

@srowen
Member

srowen commented Feb 28, 2021

Ah I see it:
https://github.com/apache/spark/pull/31681/checks?check_run_id=1998307491
No such file or directory @ rb_sysopen - /__w/spark/spark/docs/../examples/src/main/python/ml/model_selection_random_hyperparameters_example.py

@PhillHenry Was there an additional example file that was meant to be included in the PR? If so, just open another PR and I'll add it. If the linter needs to be restored sooner, I can temporarily remove the reference to this example.

@dongjoon-hyun
Member

Thanks for the analysis, @srowen !

@PhillHenry
Contributor Author

@srowen Odd. The file does not seem to have been pushed (multiple IntelliJ instances open on the same codebase, one for Python and one for Scala?). My bad. I've created a new PR at:
#31687

srowen pushed a commit that referenced this pull request Feb 28, 2021
Missing Python example file for [SPARK-34415][ML] Randomization in hyperparameter optimization
 (#31535)

### What changes were proposed in this pull request?
For some reason (probably me being silly), the file examples/src/main/python/ml/model_selection_random_hyperparameters_example.py was not pushed in a previous PR.
This PR restores that file.

### Why are the changes needed?
A single file (examples/src/main/python/ml/model_selection_random_hyperparameters_example.py) should have been pushed as part of SPARK-34415 but was not. This was causing lint errors, as highlighted by dongjoon-hyun. Consequently, srowen asked for a new PR.

### Does this PR introduce _any_ user-facing change?
No, it merely restores a file that was overlooked in SPARK-34415.

### How was this patch tested?
By running:
`bin/spark-submit examples/src/main/python/ml/model_selection_random_hyperparameters_example.py`

Closes #31687 from PhillHenry/SPARK-34415_model_selection_random_hyperparameters_example.

Authored-by: Phillip Henry <[email protected]>
Signed-off-by: Sean Owen <[email protected]>