From 184d1133d1557211d8bb309883af60613c3e77fb Mon Sep 17 00:00:00 2001
From: Nick Pentreath <nickp@za.ibm.com>
Date: Tue, 23 Jan 2018 16:05:51 +0200
Subject: [PATCH 1/5] Add highlights and migration guide for 2.3

---
 docs/ml-guide.md            | 78 ++++++++++++++-----------------------
 docs/ml-migration-guides.md | 23 +++++++++++
 2 files changed, 52 insertions(+), 49 deletions(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index f6288e7c32d9..b957445579ff 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -72,32 +72,31 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
     watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
 
-# Highlights in 2.2
+# Highlights in 2.3
 
-The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
+The list below highlights some of the new features and enhancements added to MLlib in the `2.3`
 release of Spark:
 
-* [`ALS`](ml-collaborative-filtering.html) methods for _top-k_ recommendations for all
- users or items, matching the functionality in `mllib`
- ([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)).
- Performance was also improved for both `ml` and `mllib`
- ([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
- [SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
-* [`Correlation`](ml-statistics.html#correlation) and
- [`ChiSquareTest`](ml-statistics.html#hypothesis-testing) stats functions for `DataFrames`
- ([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
- [SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
-* [`FPGrowth`](ml-frequent-pattern-mining.html#fp-growth) algorithm for frequent pattern mining
- ([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
-* `GLM` now supports the full `Tweedie` family
- ([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
-* [`Imputer`](ml-features.html#imputer) feature transformer to impute missing values in a dataset
- ([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
-* [`LinearSVC`](ml-classification-regression.html#linear-support-vector-machine)
- for linear Support Vector Machine classification
- ([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
-* Logistic regression now supports constraints on the coefficients during training
- ([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
+* Built-in support for reading images into a `DataFrame` was added
+([SPARK-21866](https://issues.apache.org/jira/browse/SPARK-21866)).
+* [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) was added, and should be
+used instead of the existing `OneHotEncoder` transformer. The new estimator supports
+transforming multiple columns.
+* Multiple column support was also added to `QuantileDiscretizer` and `Bucketizer`
+([SPARK-22397](https://issues.apache.org/jira/browse/SPARK-22397) and
+[SPARK-20542](https://issues.apache.org/jira/browse/SPARK-20542)).
+* A new [`FeatureHasher`](ml-features.html#featurehasher) transformer was added
+([SPARK-13969](https://issues.apache.org/jira/browse/SPARK-13969)).
+* Added support for evaluating multiple models in parallel when tuning hyperparameters using
+[`TrainValidationSplit` or `CrossValidator`](ml-tuning.html)
+([SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357)).
+* Improved support for custom pipeline components in Python (see
+[SPARK-21633](https://issues.apache.org/jira/browse/SPARK-21633) and
+[SPARK-21542](https://issues.apache.org/jira/browse/SPARK-21542)).
+* `DataFrame` functions for descriptive summary statistics over vector columns were added
+([SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634)).
+* Robust linear regression with Huber loss was added
+([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).
 
 # Migration guide
 
@@ -115,36 +114,17 @@ There are no breaking changes.
 
 **Deprecations**
 
-There are no deprecations.
+* `OneHotEncoder` has been deprecated and will be removed in `3.0`. It has been replaced by the
+new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
+(see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030)). **Note** that
+`OneHotEncoderEstimator` will be renamed to `OneHotEncoder` in `3.0` (but
+`OneHotEncoderEstimator` will be kept as an alias).
 
 **Changes of behavior**
 
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
- We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
- the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
-
-## From 2.1 to 2.2
-
-### Breaking changes
-
-There are no breaking changes.
-
-### Deprecations and changes of behavior
-
-**Deprecations**
-
-There are no deprecations.
-
-**Changes of behavior**
-
-* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
- Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
- **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
-* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
- Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
-* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
- `StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
- would always be thrown regardless of the setting of the `handleInvalid` parameter.
+ We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial). In 2.2 and
+ earlier versions, the level of parallelism was set to the default threadpool size in Scala.
   
 ## Previous Spark versions
 
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
index 687d7c893036..f4b0df58cf63 100644
--- a/docs/ml-migration-guides.md
+++ b/docs/ml-migration-guides.md
@@ -7,6 +7,29 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
 
+## From 2.1 to 2.2
+
+### Breaking changes
+
+There are no breaking changes.
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+There are no deprecations.
+
+**Changes of behavior**
+
+* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
+ Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
+ **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
+* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
+ Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
+* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
+ `StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
+ would always be thrown regardless of the setting of the `handleInvalid` parameter.
+ 
 ## From 2.0 to 2.1
 
 ### Breaking changes

From b6fcf15dec51575bef0b1df787d57ba36b26ef63 Mon Sep 17 00:00:00 2001
From: Nick Pentreath <nickp@za.ibm.com>
Date: Mon, 29 Jan 2018 12:02:24 +0200
Subject: [PATCH 2/5] Update migration guide with breaking changes

---
 docs/ml-guide.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index b957445579ff..ca727889476b 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -108,7 +108,12 @@ and the migration guide below will explain all changes between releases.
 
 ### Breaking changes
 
-There are no breaking changes.
+* The class and trait hierarchy for logistic regression model summaries was changed to be cleaner
+and better accommodate the addition of the multi-class summary. This is a breaking change for user
+code that casts a `LogisticRegressionTrainingSummary` to a
+`BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary`
+method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail 
+(_note_ this is an `Experimental` API).
 
 ### Deprecations and changes of behavior
 

From 4433d9cb70bd7a3257aef4e23f8c85f57c7999a6 Mon Sep 17 00:00:00 2001
From: Nick Pentreath <nickp@za.ibm.com>
Date: Tue, 30 Jan 2018 09:33:36 +0200
Subject: [PATCH 3/5] Update behavior change items

---
 docs/ml-guide.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index ca727889476b..b61bbdcdede5 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -128,8 +128,11 @@ new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
 **Changes of behavior**
 
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
- We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial). In 2.2 and
+ The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In `2.2` and
  earlier versions, the level of parallelism was set to the default threadpool size in Scala.
+* [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156):
+ The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than
+ `1`. This will cause training results to be different between `2.3` and earlier versions.
   
 ## Previous Spark versions
 

From 4a957f677eadfa5345a62f78b254c999869a1940 Mon Sep 17 00:00:00 2001
From: Nick Pentreath <nickp@za.ibm.com>
Date: Tue, 30 Jan 2018 10:13:58 +0200
Subject: [PATCH 4/5] Add more behavior change items

---
 docs/ml-guide.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index b61bbdcdede5..bccb161b0cd7 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -133,6 +133,14 @@ new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
 * [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156):
  The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than
  `1`. This will cause training results to be different between `2.3` and earlier versions.
+* [SPARK-21681](https://issues.apache.org/jira/browse/SPARK-21681):
+ Fixed an edge-case bug in multinomial logistic regression that resulted in incorrect coefficients
+ when some features had zero variance.
+* [SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957):
+ Tree algorithms now use midpoints for split values. This may change results from model training.
+* [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657):
+ Fixed an issue where the features generated by `RFormula` without an intercept were inconsistent
+ with the output in R. This may change results from model training in this scenario.
   
 ## Previous Spark versions
 

From 469d87db6278da7f157d8e6c81e7a26c1b969e7c Mon Sep 17 00:00:00 2001
From: Nick Pentreath <nickp@za.ibm.com>
Date: Tue, 30 Jan 2018 10:18:31 +0200
Subject: [PATCH 5/5] Update MLOR summary note

---
 docs/ml-guide.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index bccb161b0cd7..702bcf748fc7 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -113,7 +113,8 @@ and better accommodate the addition of the multi-class summary. This is a breaki
 code that casts a `LogisticRegressionTrainingSummary` to a
 `BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary`
 method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail 
-(_note_ this is an `Experimental` API).
+(_note_ this is an `Experimental` API). This _does not_ affect the Python `summary` method, which
+will still work correctly for both multinomial and binary cases.
 
 ### Deprecations and changes of behavior