56 commits
606b959
Random integers within a range.
PhillHenry Feb 1, 2021
c518e5f
Refactored.
PhillHenry Feb 1, 2021
37f32c2
Random longs.
PhillHenry Feb 1, 2021
77cf678
Better use of type classes.
PhillHenry Feb 1, 2021
8bd0dd7
Checks distribution.
PhillHenry Feb 1, 2021
6c9a918
Refactored.
PhillHenry Feb 2, 2021
46540bc
Merge branch 'master' of https://github.com/apache/spark
PhillHenry Feb 4, 2021
5489de6
Formatting.
PhillHenry Feb 4, 2021
3d1f46c
Linear random doubles added.
PhillHenry Feb 4, 2021
290c815
Refactored.
PhillHenry Feb 4, 2021
dadbd54
Added random floats.
PhillHenry Feb 4, 2021
f8339d7
Refactored.
PhillHenry Feb 4, 2021
70cdf24
Even more refactoring.
PhillHenry Feb 4, 2021
8148a69
RandomRange tests in a separate class.
PhillHenry Feb 7, 2021
0d9fd66
Merge branch 'master' of https://github.com/apache/spark
PhillHenry Feb 7, 2021
ea945c5
Linear numerics supported.
PhillHenry Feb 7, 2021
cb4e4b8
Checkstyle.
PhillHenry Feb 7, 2021
f558060
Log space.
PhillHenry Feb 7, 2021
d0c8eaa
Log space.
PhillHenry Feb 7, 2021
affb9e4
Logarithm for any base.
PhillHenry Feb 7, 2021
7808b7c
Logarithm for any base.
PhillHenry Feb 7, 2021
b433ab1
Still a problem with double conversion.
PhillHenry Feb 8, 2021
f12cf9a
Merge branch 'master' of https://github.com/apache/spark
PhillHenry Feb 8, 2021
844f706
Extreme Long/Int ranges may cause trouble with being converted to a d…
PhillHenry Feb 8, 2021
3d03565
Restored a tag that was making my IntelliJ upset.
PhillHenry Feb 8, 2021
27b323e
Removed test println.
PhillHenry Feb 8, 2021
5e870ea
Merge branch 'master' of https://github.com/apache/spark
PhillHenry Feb 9, 2021
ef7bfd7
Commit re. Hyperopt and its ilk.
PhillHenry Feb 9, 2021
0de1ba5
@Since tags added.
PhillHenry Feb 9, 2021
9e7b7bd
Code style.
PhillHenry Feb 9, 2021
fd8ac9f
Code style.
PhillHenry Feb 9, 2021
e402df5
Code style.
PhillHenry Feb 9, 2021
e54a58f
Code style.
PhillHenry Feb 9, 2021
b0455a1
Code style.
PhillHenry Feb 9, 2021
f641d51
Superfluous parentheses.
PhillHenry Feb 9, 2021
6f6dabd
Merge branch 'master' of https://github.com/apache/spark into ParamRa…
PhillHenry Feb 12, 2021
86a781d
[SPARK-34415][ML] Made private anything that wasn't and that was not …
PhillHenry Feb 12, 2021
b805ea5
[SPARK-34415][ML] Oops. The user needs Limits and added log methods.
PhillHenry Feb 12, 2021
b02f64e
[SPARK-34415][ML] Oops. The user needs Limits and added log methods.
PhillHenry Feb 12, 2021
44061a4
[SPARK-34415][ML] Oops. Base 10.
PhillHenry Feb 12, 2021
07a1e01
[SPARK-34415][ML] Added Java specific API and tests.
PhillHenry Feb 12, 2021
232d359
Merge branch 'master' of https://github.com/apache/spark into ParamRa…
PhillHenry Feb 13, 2021
2344495
[SPARK-34415][ML] Random Long generated removed as superfluous (per c…
PhillHenry Feb 13, 2021
25737d7
[SPARK-34415][ML] Documentation and Scala example.
PhillHenry Feb 13, 2021
308f1c3
[SPARK-34415][ML] Documentation, Scala and Java examples.
PhillHenry Feb 15, 2021
e88f907
[SPARK-34415][ML] Removed random log2 space, fixed error in documenta…
PhillHenry Feb 16, 2021
4e48759
[SPARK-34415][ML] Everything that can be made private as srowen recom…
PhillHenry Feb 22, 2021
4fab7ac
Merge branch 'master' of https://github.com/apache/spark into ParamRa…
PhillHenry Feb 22, 2021
259edfe
[SPARK-34415][ML] ScalaStyle violation.
PhillHenry Feb 22, 2021
62f305c
Merge branch 'master' of https://github.com/apache/spark into ParamRa…
PhillHenry Feb 24, 2021
a41c8f3
[SPARK-34415][ML] Very hacky first draft of the Python version of Par…
PhillHenry Feb 24, 2021
73d077b
[SPARK-34415][ML] Added ParamRandomBuilder to the .pyi file. More tes…
PhillHenry Feb 25, 2021
5d89774
[SPARK-34415][ML] Python log10 space.
PhillHenry Feb 25, 2021
183c2cd
[SPARK-34415][ML] More tests.
PhillHenry Feb 25, 2021
73e3b0c
Merge branch 'master' of https://github.com/apache/spark into ParamRa…
PhillHenry Feb 26, 2021
ddfe4a9
[SPARK-34415][ML] Python example.
PhillHenry Feb 26, 2021
36 changes: 35 additions & 1 deletion docs/ml-tuning.md
@@ -71,10 +71,44 @@ for multiclass problems, a [`MultilabelClassificationEvaluator`](api/scala/org/a
[`RankingEvaluator`](api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html) for ranking problems. The default metric used to
choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility (see the *Cross-Validation* section below for an example).
By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` with a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.

Alternatively, users can use the [`ParamRandomBuilder`](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) utility.
This has the same properties as `ParamGridBuilder` mentioned above, but hyperparameters are chosen at random within a user-defined range.
The mathematical principle behind this is that, given enough samples, the probability that *no* sample lies near the optimum within a range tends to zero.
Irrespective of the machine learning model, about 60 random samples are enough to have a 95% probability that at least one of them lies within the best 5% of the search space.
If this 5% volume lies between the parameters defined in a grid search, it will *never* be found by `ParamGridBuilder`.
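
The figure of roughly 60 samples can be checked with a short calculation. The sketch below assumes that each random sample independently lands in the best 5% of the search space with probability 0.05; the function name `samples_needed` is illustrative and is not part of Spark.

```python
import math


def samples_needed(top_fraction: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    # The probability that none of n independent draws lands in the
    # top fraction is (1 - top_fraction)**n; solve for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


n = samples_needed(0.05, 0.95)
print(n)  # 59, i.e. "about 60"
```

Under the same assumption, halving the target region to 2.5% roughly doubles the number of samples required, which is why the size of the region matters more than the model being tuned.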

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [`ParamRandomBuilder` Scala docs](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`ParamRandomBuilder` Java docs](api/java/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java %}
</div>

<div data-lang="python" markdown="1">

Python users may also want to consider libraries written specifically for hyperparameter tuning, such as Hyperopt.

Refer to the [`ParamRandomBuilder` Python docs](api/python/reference/api/pyspark.ml.tuning.ParamRandomBuilder.html) for details on the API.

{% include_example python/ml/model_selection_random_hyperparameters_example.py %}

</div>

</div>

# Cross-Validation

`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
@@ -0,0 +1,83 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.ml;

// $example on$
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.tuning.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// $example off$

/**
 * A simple example demonstrating model selection using ParamRandomBuilder.
 *
 * Run with
 * <pre>
 * bin/run-example ml.JavaModelSelectionViaRandomHyperparametersExample
 * </pre>
 */
public class JavaModelSelectionViaRandomHyperparametersExample {

  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaModelSelectionViaRandomHyperparametersExample")
      .getOrCreate();

    // $example on$
    Dataset<Row> data = spark.read().format("libsvm")
      .load("data/mllib/sample_linear_regression_data.txt");

    LinearRegression lr = new LinearRegression();

    // We sample the regularization parameter logarithmically over the range [0.01, 1.0].
    // This means that values around 0.01, 0.1 and 1.0 are roughly equally likely.
    // Note that both limits must be greater than zero, as the logarithm of zero is infinite.
    // We sample the ElasticNet mixing parameter uniformly over the range [0, 1].
    // Note that in real life, you'd choose more than the 5 samples we see below.
    ParamMap[] hyperparameters = new ParamRandomBuilder()
      .addLog10Random(lr.regParam(), 0.01, 1.0, 5)
      .addRandom(lr.elasticNetParam(), 0.0, 1.0, 5)
      .addGrid(lr.fitIntercept())
      .build();

    System.out.println("hyperparameters:");
    for (ParamMap param : hyperparameters) {
      System.out.println(param);
    }

    CrossValidator cv = new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(hyperparameters)
      .setEvaluator(new RegressionEvaluator())
      .setNumFolds(3);
    CrossValidatorModel cvModel = cv.fit(data);
    LinearRegression parent = (LinearRegression) cvModel.bestModel().parent();

    System.out.println("Optimal model has\n" + lr.regParam() + " = " + parent.getRegParam()
      + "\n" + lr.elasticNetParam() + " = " + parent.getElasticNetParam()
      + "\n" + lr.fitIntercept() + " = " + parent.getFitIntercept());
    // $example off$

    spark.stop();
  }
}
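
The comments in the example above say that logarithmic sampling makes values around 0.01, 0.1 and 1.0 roughly equally likely. A minimal sketch of that idea in plain Python (no Spark required) follows; `log10_uniform` is a hypothetical helper, and raising an error for non-positive limits is a choice made here, not the behaviour of `addLog10Random`.

```python
import math
import random


def log10_uniform(lower: float, upper: float, rng: random.Random) -> float:
    """Draw a value whose base-10 logarithm is uniform between the limits."""
    if lower <= 0.0 or upper <= 0.0:
        # log10(0) is -infinity and log10 of a negative number is undefined.
        raise ValueError("both limits must be greater than zero")
    exponent = rng.uniform(math.log10(lower), math.log10(upper))
    return 10.0 ** exponent


rng = random.Random(42)
samples = [log10_uniform(0.01, 1.0, rng) for _ in range(10_000)]

# Each decade, [0.01, 0.1) and [0.1, 1.0], should receive about half the draws.
share_low_decade = sum(s < 0.1 for s in samples) / len(samples)
print(f"share of samples below 0.1: {share_low_decade:.2f}")
```

A uniform draw over [0.01, 1.0] would instead put only about 9% of the samples below 0.1, so small regularization values would rarely be tried.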
@@ -0,0 +1,79 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, Limits, ParamRandomBuilder}
import org.apache.spark.ml.tuning.RandomRanges._
// $example off$
import org.apache.spark.sql.SparkSession

/**
* A simple example demonstrating model selection using ParamRandomBuilder.
*
* Run with
* {{{
* bin/run-example ml.ModelSelectionViaRandomHyperparametersExample
* }}}
*/
object ModelSelectionViaRandomHyperparametersExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("ModelSelectionViaRandomHyperparametersExample")
      .getOrCreate()
    // scalastyle:off println
    // $example on$
    // Load the training data.
    val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")

    val lr = new LinearRegression().setMaxIter(10)

    // We sample the regularization parameter logarithmically over the range [0.01, 1.0].
    // This means that values around 0.01, 0.1 and 1.0 are roughly equally likely.
    // Note that both limits must be greater than zero, as the logarithm of zero is infinite.
    // We sample the ElasticNet mixing parameter uniformly over the range [0, 1].
    // Note that in real life, you'd choose more than the 5 samples we see below.
    val hyperparameters = new ParamRandomBuilder()
      .addLog10Random(lr.regParam, Limits(0.01, 1.0), 5)
      .addGrid(lr.fitIntercept)
      .addRandom(lr.elasticNetParam, Limits(0.0, 1.0), 5)
      .build()

    println(s"hyperparameters:\n${hyperparameters.mkString("\n")}")

    val cv: CrossValidator = new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(hyperparameters)
      .setEvaluator(new RegressionEvaluator)
      .setNumFolds(3)
    val cvModel: CrossValidatorModel = cv.fit(data)
    val parent: LinearRegression = cvModel.bestModel.parent.asInstanceOf[LinearRegression]

    println(s"""Optimal model has:
         |${lr.regParam} = ${parent.getRegParam}
         |${lr.elasticNetParam} = ${parent.getElasticNetParam}
         |${lr.fitIntercept} = ${parent.getFitIntercept}""".stripMargin)
    // $example off$

    spark.stop()
  }
  // scalastyle:on println
}
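
Because `ParamRandomBuilder` extends `ParamGridBuilder`, the randomly drawn values for each parameter are combined in the same way as any other grid, so the example above yields 5 x 2 x 5 = 50 parameter maps (`addGrid` on a boolean parameter contributes both `true` and `false`). The plain-Python sketch below mimics that combination step; `build_param_maps` and the dictionary layout are illustrative assumptions, not Spark APIs.

```python
import itertools
import random


def build_param_maps(grid: dict) -> list:
    """Return one dict per combination of the candidate values (a Cartesian product)."""
    names = list(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


rng = random.Random(0)
grid = {
    # Five log-uniform draws over [0.01, 1.0].
    "regParam": [10.0 ** rng.uniform(-2.0, 0.0) for _ in range(5)],
    # A boolean parameter contributes both of its values.
    "fitIntercept": [True, False],
    # Five uniform draws over [0, 1].
    "elasticNetParam": [rng.uniform(0.0, 1.0) for _ in range(5)],
}
param_maps = build_param_maps(grid)
print(len(param_maps))  # 50
```

With three folds, `CrossValidator` would fit each of these 50 candidates three times, so the number of samples per parameter multiplies the training cost quickly.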
@@ -0,0 +1,160 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.ml.tuning

import org.apache.spark.annotation.Since
import org.apache.spark.ml.param._
import org.apache.spark.ml.tuning.RandomRanges._

case class Limits[T: Numeric](x: T, y: T)

private[ml] abstract class RandomT[T: Numeric] {
  def randomT(): T
  def randomTLog(n: Int): T
}

abstract class Generator[T: Numeric] {
  def apply(lim: Limits[T]): RandomT[T]
}

object RandomRanges {

  private val rnd = new scala.util.Random

  // Rejection sampling: draws x.bitLength random bits until the value does not exceed x,
  // which gives a uniform value in [0, x].
  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
    var randVal = BigInt(x.bitLength, rnd)
    while (randVal > x) {
      randVal = BigInt(x.bitLength, rnd)
    }
    randVal
  }

  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
    val diff: BigInt = upper - lower
    randomBigInt0To(diff) + lower
  }

  // Scales a random value in [-0.5, 0.5) by the range and shifts it to the midpoint.
  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
    val range: BigDecimal = upper - lower
    val halfWay: BigDecimal = lower + range / 2
    (zeroCenteredRnd * range) + halfWay
  }

  implicit object DoubleGenerator extends Generator[Double] {
    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
      import limits._
      val lower: Double = math.min(x, y)
      val upper: Double = math.max(x, y)

      override def randomTLog(n: Int): Double =
        RandomRanges.randomLog(lower, upper, n)

      override def randomT(): Double =
        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
    }
  }

  implicit object FloatGenerator extends Generator[Float] {
    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
      import limits._
      val lower: Float = math.min(x, y)
      val upper: Float = math.max(x, y)

      override def randomTLog(n: Int): Float =
        RandomRanges.randomLog(lower, upper, n).toFloat

      override def randomT(): Float =
        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
    }
  }

  implicit object IntGenerator extends Generator[Int] {
    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
      import limits._
      val lower: Int = math.min(x, y)
      val upper: Int = math.max(x, y)

      override def randomTLog(n: Int): Int =
        RandomRanges.randomLog(lower, upper, n).toInt

      override def randomT(): Int =
        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
    }
  }

  private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)

  // Samples uniformly between the base-n logarithms of the limits, then raises n to that power.
  private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
    val logLower: Double = logN(lower, n)
    val logUpper: Double = logN(upper, n)
    val logLimits: Limits[Double] = Limits(logLower, logUpper)
    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
    math.pow(n, rndLogged.randomT())
  }

  private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)

}

/**
 * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
 * observations lies within the top 5% of the true maximum, with 95% probability"
 * - Evaluating Machine Learning Models by Alice Zheng
 * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
 *
 * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
 * such as Hyperopt.
 */
@Since("3.2.0")
class ParamRandomBuilder extends ParamGridBuilder {
  def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
    val gen: RandomT[T] = RandomRanges(lim)
    addGrid(param, (1 to n).map { _: Int => gen.randomT() })
  }

  def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
    addLogRandom(param, lim, n, 10)

  private def addLogRandom[T: Generator](param: Param[T], lim: Limits[T],
      n: Int, base: Int): this.type = {
    val gen: RandomT[T] = RandomRanges(lim)
    addGrid(param, (1 to n).map { _: Int => gen.randomTLog(base) })
  }

  // specialized versions for Java.

  def addRandom(param: DoubleParam, x: Double, y: Double, n: Int): this.type =
    addRandom(param, Limits(x, y), n)(DoubleGenerator)

  def addLog10Random(param: DoubleParam, x: Double, y: Double, n: Int): this.type =
    addLogRandom(param, Limits(x, y), n, 10)(DoubleGenerator)

  def addRandom(param: FloatParam, x: Float, y: Float, n: Int): this.type =
    addRandom(param, Limits(x, y), n)(FloatGenerator)

  def addLog10Random(param: FloatParam, x: Float, y: Float, n: Int): this.type =
    addLogRandom(param, Limits(x, y), n, 10)(FloatGenerator)

  def addRandom(param: IntParam, x: Int, y: Int, n: Int): this.type =
    addRandom(param, Limits(x, y), n)(IntGenerator)

  def addLog10Random(param: IntParam, x: Int, y: Int, n: Int): this.type =
    addLogRandom(param, Limits(x, y), n, 10)(IntGenerator)

}
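
The integer generator in `RandomRanges` relies on rejection sampling (`randomBigInt0To` followed by a shift in `bigIntBetween`). The sketch below restates that loop in Python to make the idea easier to follow; the function names are illustrative, and the explicit checks for `x <= 0` are additions made here rather than part of the Scala code.

```python
import random


def random_int_0_to(x: int, rng: random.Random) -> int:
    """Uniform integer in [0, x], drawn by rejection sampling."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0
    bits = x.bit_length()
    # Draw `bits` random bits and retry while the candidate exceeds x.
    # Since x >= 2**(bits - 1), more than half of all candidates are accepted.
    candidate = rng.getrandbits(bits)
    while candidate > x:
        candidate = rng.getrandbits(bits)
    return candidate


def random_int_between(lower: int, upper: int, rng: random.Random) -> int:
    """Uniform integer in [lower, upper], obtained by shifting a draw from [0, upper - lower]."""
    return lower + random_int_0_to(upper - lower, rng)


rng = random.Random(7)
int_min, int_max = -2**31, 2**31 - 1
draws = [random_int_between(int_min, int_max, rng) for _ in range(1_000)]
print(min(draws) >= int_min and max(draws) <= int_max)  # True
```

Working with the difference `upper - lower` as an arbitrary-precision integer is what lets the full 32-bit range be sampled without overflow, which is presumably why the Scala code uses `BigInt` for this step.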