[SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tap in Spark documentation #25757
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| - text: Spark Core | ||
| url: core-migration-guide.html | ||
| - text: SQL, Datasets and DataFrame | ||
| url: sql-migration-guide.html | ||
| - text: Structured Streaming | ||
| url: ss-migration-guide.html | ||
| - text: MLlib (Machine Learning) | ||
| url: ml-migration-guide.html | ||
| - text: PySpark (Python on Spark) | ||
| url: pyspark-migration-guide.html | ||
| - text: SparkR (R on Spark) | ||
| url: sparkr-migration-guide.html |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| <div class="left-menu-wrapper"> | ||
| <div class="left-menu"> | ||
| <h3><a href="migration-guide.html">Migration Guide</a></h3> | ||
| {% include nav-left.html nav=include.nav-migration %} | ||
| </div> | ||
| </div> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| --- | ||
| layout: global | ||
| title: "Migration Guide: Spark Core" | ||
| displayTitle: "Migration Guide: Spark Core" | ||
| license: | | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --- | ||
|
|
||
| * Table of contents | ||
| {:toc} | ||
|
|
||
| ## Upgrading from Core 2.4 to 3.0 | ||
|
|
||
| - In Spark 3.0, the deprecated method `TaskContext.isRunningLocally` has been removed. Local execution was removed, so the method always returned `false`. | ||
|
|
||
| - In Spark 3.0, the deprecated methods `shuffleBytesWritten`, `shuffleWriteTime` and `shuffleRecordsWritten` in `ShuffleWriteMetrics` have been removed. Instead, use `bytesWritten`, `writeTime` and `recordsWritten` respectively. | ||
|
|
||
| - In Spark 3.0, the deprecated method `AccumulableInfo.apply` has been removed because creating `AccumulableInfo` is disallowed. | ||
|
|
||
|
There's probably more we can or should add, given the number of deprecations and removals in 3.0. Is your theory that this should stick to changes that require user code to change, and ones that aren't obvious? Like, a missing method is obvious. A behavior change may not be.

Yea, I just quickly skimmed so that I can leave this page non empty. For migration guide, I think we should mention something that's not a corner case and when both previous and new behaviours make sense. My thought was that basically bug fixes or minor behaviour changes shouldn't come here

|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| --- | ||
| layout: global | ||
| title: Migration Guide | ||
| displayTitle: Migration Guide | ||
| license: | | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --- | ||
|
|
||
| This page links to the migration guide for each Spark component so that | ||
| users can migrate between versions effectively. | ||
|
|
||
| * [Spark Core](core-migration-guide.html) | ||
| * [SQL, Datasets, and DataFrame](sql-migration-guide.html) | ||
| * [Structured Streaming](ss-migration-guide.html) | ||
| * [MLlib (Machine Learning)](ml-migration-guide.html) | ||
| * [PySpark (Python on Spark)](pyspark-migration-guide.html) | ||
| * [SparkR (R on Spark)](sparkr-migration-guide.html) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,7 @@ | ||
| --- | ||
| layout: global | ||
| title: Old Migration Guides - MLlib | ||
| displayTitle: Old Migration Guides - MLlib | ||
| description: MLlib migration guides from before Spark SPARK_VERSION_SHORT | ||
| title: "Migration Guide: MLlib (Machine Learning)" | ||
| displayTitle: "Migration Guide: MLlib (Machine Learning)" | ||
| license: | | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
|
|
@@ -20,15 +19,80 @@ license: | | |
| limitations under the License. | ||
| --- | ||
|
|
||
| The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide). | ||
| * Table of contents | ||
| {:toc} | ||
|
|
||
| ## From 2.1 to 2.2 | ||
| Note that this migration guide describes the items specific to MLlib. | ||
| Many items of SQL migration can be applied when migrating MLlib to higher versions for DataFrame-based APIs. | ||
| Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html). | ||
|
|
||
| ## Upgrading from MLlib 2.4 to 3.0 | ||
|
|
||
| ### Breaking changes | ||
| {:.no_toc} | ||
|
|
||
|
|
||
| * `OneHotEncoder`, which was deprecated in 2.3, is removed in 3.0, and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`. | ||
|
|
||
| ### Changes of behavior | ||
| {:.no_toc} | ||
|
|
||
| * [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215): | ||
| In Spark 2.4 and previous versions, when specifying `frequencyDesc` or `frequencyAsc` as | ||
| `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of | ||
| strings is undefined. Since Spark 3.0, the strings with equal frequency are further | ||
| sorted alphabetically. In addition, since Spark 3.0, `StringIndexer` supports encoding multiple | ||
| columns. | ||
|
|
||
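The tie-breaking rule described for SPARK-11215 can be sketched in plain Python, without Spark. This is only an illustration of the ordering, not Spark's implementation: `index_labels` is a hypothetical helper, and it returns integer indices where `StringIndexer` produces doubles.

```python
from collections import Counter


def index_labels(values, order="frequencyDesc"):
    """Hypothetical helper (not a Spark API): map each label to an index.

    Labels are ordered by frequency, and labels with equal frequency are
    further sorted alphabetically, as described for Spark 3.0.
    """
    counts = Counter(values)
    if order == "frequencyDesc":
        # Most frequent first; ties broken alphabetically.
        key = lambda label: (-counts[label], label)
    elif order == "frequencyAsc":
        # Least frequent first; ties broken alphabetically.
        key = lambda label: (counts[label], label)
    else:
        raise ValueError("unsupported order: " + order)
    ordered = sorted(counts, key=key)
    return {label: i for i, label in enumerate(ordered)}


# "a" and "b" both occur twice, so their relative order is decided alphabetically.
mapping = index_labels(["b", "a", "c", "a", "b", "c", "c"])
```

Under this rule the mapping is deterministic for a given input, whereas in 2.4 and earlier the relative order of `a` and `b` was undefined.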
| ## Upgrading from MLlib 2.2 to 2.3 | ||
|
|
||
| ### Breaking changes | ||
| {:.no_toc} | ||
|
|
||
| * The class and trait hierarchy for logistic regression model summaries was changed to be cleaner | ||
| and better accommodate the addition of the multi-class summary. This is a breaking change for user | ||
| code that casts a `LogisticRegressionTrainingSummary` to a | ||
| `BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary` | ||
| method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail | ||
| (_note_ this is an `Experimental` API). This _does not_ affect the Python `summary` method, which | ||
| will still work correctly for both multinomial and binary cases. | ||
|
|
||
| ### Deprecations and changes of behavior | ||
| {:.no_toc} | ||
|
|
||
| **Deprecations** | ||
|
|
||
| * `OneHotEncoder` has been deprecated and will be removed in `3.0`. It has been replaced by the | ||
| new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) | ||
| (see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030)). **Note** that | ||
| `OneHotEncoderEstimator` will be renamed to `OneHotEncoder` in `3.0` (but | ||
| `OneHotEncoderEstimator` will be kept as an alias). | ||
|
|
||
| **Changes of behavior** | ||
|
|
||
| * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027): | ||
| The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In `2.2` and | ||
| earlier versions, the level of parallelism was set to the default threadpool size in Scala. | ||
| * [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156): | ||
| The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than | ||
| `1`. This will cause training results to be different between `2.3` and earlier versions. | ||
| * [SPARK-21681](https://issues.apache.org/jira/browse/SPARK-21681): | ||
| Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients | ||
| when some features had zero variance. | ||
| * [SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957): | ||
| Tree algorithms now use mid-points for split values. This may change results from model training. | ||
| * [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657): | ||
| Fixed an issue where the features generated by `RFormula` without an intercept were inconsistent | ||
| with the output in R. This may change results from model training in this scenario. | ||
|
|
||
| ## Upgrading from MLlib 2.1 to 2.2 | ||
|
|
||
| ### Breaking changes | ||
| {:.no_toc} | ||
|
|
||
| There are no breaking changes. | ||
|
|
||
| ### Deprecations and changes of behavior | ||
| {:.no_toc} | ||
|
|
||
| **Deprecations** | ||
|
|
||
|
|
@@ -45,9 +109,10 @@ There are no deprecations. | |
| `StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception | ||
| would always be thrown regardless of the setting of the `handleInvalid` parameter. | ||
|
|
||
| ## From 2.0 to 2.1 | ||
| ## Upgrading from MLlib 2.0 to 2.1 | ||
|
|
||
| ### Breaking changes | ||
| {:.no_toc} | ||
|
|
||
| **Deprecated methods removed** | ||
|
|
||
|
|
@@ -59,6 +124,7 @@ There are no deprecations. | |
| * `validateParams` in `Evaluator` | ||
|
|
||
| ### Deprecations and changes of behavior | ||
| {:.no_toc} | ||
|
|
||
| **Deprecations** | ||
|
|
||
|
|
@@ -74,9 +140,10 @@ There are no deprecations. | |
| * [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389): | ||
| `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode. | ||
|
|
||
| ## From 1.6 to 2.0 | ||
| ## Upgrading from MLlib 1.6 to 2.0 | ||
|
|
||
| ### Breaking changes | ||
| {:.no_toc} | ||
|
|
||
| There were several breaking changes in Spark 2.0, which are outlined below. | ||
|
|
||
|
|
@@ -171,6 +238,7 @@ Several deprecated methods were removed in the `spark.mllib` and `spark.ml` pack | |
| A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). | ||
|
|
||
| ### Deprecations and changes of behavior | ||
| {:.no_toc} | ||
|
|
||
| **Deprecations** | ||
|
|
||
|
|
@@ -221,7 +289,7 @@ Changes of behavior in the `spark.mllib` and `spark.ml` packages include: | |
| `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic). | ||
| The output buckets will differ for same input data and params. | ||
|
|
||
| ## From 1.5 to 1.6 | ||
| ## Upgrading from MLlib 1.5 to 1.6 | ||
|
|
||
| There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are | ||
| deprecations and changes of behavior. | ||
|
|
@@ -248,7 +316,7 @@ Changes of behavior: | |
| tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the | ||
| behavior of the simpler `Tokenizer` transformer. | ||
|
|
||
| ## From 1.4 to 1.5 | ||
| ## Upgrading from MLlib 1.4 to 1.5 | ||
|
|
||
| In the `spark.mllib` package, there are no breaking API changes but several behavior changes: | ||
|
|
||
|
|
@@ -267,7 +335,7 @@ In the `spark.ml` package, there exists one breaking API change and one behavior | |
| * [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is | ||
| added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4. | ||
|
|
||
| ## From 1.3 to 1.4 | ||
| ## Upgrading from MLlib 1.3 to 1.4 | ||
|
|
||
| In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs: | ||
|
|
||
|
|
@@ -286,7 +354,7 @@ Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all | |
| However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API | ||
| changes for future releases. | ||
|
|
||
| ## From 1.2 to 1.3 | ||
| ## Upgrading from MLlib 1.2 to 1.3 | ||
|
|
||
| In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental. | ||
|
|
||
|
|
@@ -313,7 +381,7 @@ Other changes were in `LogisticRegression`: | |
| * The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future). | ||
| * In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future. | ||
|
|
||
| ## From 1.1 to 1.2 | ||
| ## Upgrading from MLlib 1.1 to 1.2 | ||
|
|
||
| The only API changes in MLlib v1.2 are in | ||
| [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree), | ||
|
|
@@ -339,7 +407,7 @@ The tree `Node` now includes more information, including the probability of the | |
| Examples in the Spark distribution and examples in the | ||
| [Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly. | ||
|
|
||
| ## From 1.0 to 1.1 | ||
| ## Upgrading from MLlib 1.0 to 1.1 | ||
|
|
||
| The only API changes in MLlib v1.1 are in | ||
| [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree), | ||
|
|
@@ -365,7 +433,7 @@ simple `String` types. | |
| Examples of the new recommended `trainClassifier` and `trainRegressor` are given in the | ||
| [Decision Trees Guide](mllib-decision-tree.html#examples). | ||
|
|
||
| ## From 0.9 to 1.0 | ||
| ## Upgrading from MLlib 0.9 to 1.0 | ||
|
|
||
| In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few | ||
| breaking changes. If your data is sparse, please store it in a sparse format instead of dense to | ||
|
|
||


Do we need to keep these old links?