
Conversation

@HyukjinKwon commented Sep 11, 2019

What changes were proposed in this pull request?

Currently, there is no migration section for PySpark, Spark Core and Structured Streaming.
It is difficult for users to know what to do when they upgrade.

This PR proposes to create a "Migration Guide" tab in the Spark documentation.

(Screenshots: the new "Migration Guide" tab as rendered in the documentation.)

This page will contain migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Core. Basically it is a refactoring.

Some new information has been added; I will leave inline comments for easier review.

  1. MLlib
    Merge ml-guide.html#migration-guide and ml-migration-guides.html

    'docs/ml-guide.md'
            ↓ Merge new/old migration guides
    'docs/ml-migration-guide.md'
    
  2. PySpark
    Extract PySpark specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html

    'docs/sql-migration-guide-upgrade.md'
           ↓ Extract PySpark specific items
    'docs/pyspark-migration-guide.md'
    
  3. SparkR
    Move sparkr.html#migration-guide into a separate file, and extract from sql-migration-guide-upgrade.html

    'docs/sparkr.md'                     'docs/sql-migration-guide-upgrade.md'
     Move migration guide section ↘     ↙ Extract SparkR specific items
                     docs/sparkr-migration-guide.md 
    
  4. Core
Newly created at 'docs/core-migration-guide.md'. I skimmed the JIRAs resolved for 3.0.0 and found some items worth noting.

  5. Structured Streaming
Newly created at 'docs/ss-migration-guide.md'. I skimmed the JIRAs resolved for 3.0.0 and found some items worth noting.

  6. SQL
Merge sql-migration-guide-upgrade.html and sql-migration-guide-hive-compatibility.html

    'docs/sql-migration-guide-hive-compatibility.md'     'docs/sql-migration-guide-upgrade.md'
     Move Hive compatibility section ↘                   ↙ Left over after filtering PySpark and SparkR items
                                  'docs/sql-migration-guide.md'
    

Why are the changes needed?

To let users in production migrate to higher versions effectively, and detect behaviour changes or breaking changes before upgrading and/or migrating.

Does this PR introduce any user-facing change?

Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html.

How was this patch tested?

Manually built the docs. This can be verified as below:

cd docs
SKIP_API=1 jekyll build
open _site/index.html

- In Spark 3.0, the deprecated methods `shuffleBytesWritten`, `shuffleWriteTime` and `shuffleRecordsWritten` in `ShuffleWriteMetrics` have been removed. Instead, use `bytesWritten`, `writeTime` and `recordsWritten` respectively.

- In Spark 3.0, the deprecated method `AccumulableInfo.apply` has been removed because creating `AccumulableInfo` is disallowed.

Member Author


cc @srowen. I added those items from 0025a83 while adding this section.

Member


There's probably more we can or should add, given the number of deprecations and removals in 3.0. Is your theory that this should stick to changes that require user code to change, and ones that aren't obvious? Like, a missing method is obvious. A behavior change may not be.

Member Author


Yeah, I just quickly skimmed so that I could leave this page non-empty. For the migration guide, I think we should mention things that are not corner cases and where both the previous and new behaviours make sense. My thought was that bug fixes or minor behaviour changes basically shouldn't go here.

## Upgrading from MLlib 2.4 to 3.0

### Breaking changes
{:.no_toc}
Member Author


`{:.no_toc}` is needed to exclude the heading from the "Table of Contents".

(Screenshots: the generated table of contents without and with `no_toc`.)


## Upgrading from Structured Streaming 2.4 to 3.0

- In Spark 3.0, Structured Streaming forces the source schema into nullable when file-based datasources such as text, json, csv, parquet and orc are used via `spark.readStream(...)`. Previously, it respected the nullability of the source schema; however, it caused issues that were tricky to debug, such as NPEs. To restore the previous behavior, set `spark.sql.streaming.fileSource.schema.forceNullable` to `false`.
Member Author


I skimmed and found one SS change to note in the migration guide while adding this section. cc @zsxwing
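
For illustration, a minimal PySpark sketch of restoring the previous behaviour; the config name comes from the entry above, while the input path and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("ss-nullability-sketch").getOrCreate()

# Keep the user-specified nullability instead of forcing the streaming
# source schema to nullable (the pre-3.0 behaviour).
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")

# Streaming file sources need an explicit schema; this one is hypothetical.
schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=False),
])

stream_df = spark.readStream.schema(schema).json("/tmp/stream-input")
stream_df.printSchema()  # nullability follows the declared schema with the flag off
```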



SparkQA commented Sep 11, 2019

Test build #110475 has finished for PR 25757 at commit 7a8d639.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * *(Breaking change)* The `apply` and `copy` methods for the case class [BoostingStrategy](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use BoostingStrategy to set GBT parameters.
  • * *(Breaking change)* The return value of [LDA.run](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
  • * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
  • * The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
  • - In Spark 3.0, the deprecated `HiveContext` class has been removed. Use `SparkSession.builder.enableHiveSupport()` instead.

@dongjoon-hyun left a comment


For me, this looks like a good start. We can add more.
BTW, which version of Jekyll do you use, @HyukjinKwon?
In the master branch (and this PR), I saw misaligned tabs when I use Jekyll 4.0.

dongjoon-hyun commented Sep 13, 2019

Ur, it seems not to be a Jekyll issue. When I use the spark-rm docker image, the master branch still seems to be broken. Maybe something went wrong. cc @gatorsmile

spark-rm@f327b886c883:/spark/docs$ jekyll --version
jekyll 3.8.6
spark-rm@f327b886c883:/spark/docs$ ruby --version
ruby 2.3.8p459 (2018-10-18 revision 65136) [x86_64-linux-gnu]

(Screenshot: the broken rendering seen when building the docs from the master branch.)

BTW, this is irrelevant to this PR~


- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.

@viirya Sep 13, 2019


"spark.sql.execution.arrow.enabled=true" => `spark.sql.execution.arrow.enabled`=true or `spark.sql.execution.arrow.enabled` set to `True`?

access nested values. For example `df['table.column.nestedField']`. However, this means that if
your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).

- DataFrame.withColumn method in PySpark supports adding a new column or replacing existing columns of the same name.

@viirya Sep 13, 2019


`DataFrame.withColumn`
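
For context, a minimal PySpark sketch of the two behaviours described in the quoted lines; the DataFrame and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("column-name-sketch").getOrCreate()

# A flat column whose name happens to contain dots.
df = spark.createDataFrame([(1, "a")], ["id", "col.with.dots"])

# Dots are interpreted as nested-field access, so a literal dot in a
# column name has to be escaped with backticks.
df.select(col("`col.with.dots`")).show()

# DataFrame.withColumn adds a new column, or replaces an existing one
# when a column with the same name already exists.
df.withColumn("flag", lit(True)).withColumn("id", col("id") + 1).show()
```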

- text: SQL Reserved/Non-Reserved Keywords
url: sql-reserved-and-non-reserved-keywords.html

url: sql-migration-old.html
Member


Do we need to keep these old links?

@viirya left a comment


Thanks for doing this. It looks good.


dongjoon-hyun commented Sep 15, 2019

Merged to master.
cc @jiangxb1987 since he is a release manager for 3.0.0-preview.
BTW, @jiangxb1987, the broken Migration Guide tab sidebar, as seen in the image in this PR description, is an existing issue in the master branch.


HyukjinKwon commented Sep 15, 2019

Thanks @srowen @dongjoon-hyun @viirya. I was on vacation, so my reaction was late.

@HyukjinKwon deleted the migration-doc branch March 3, 2020 01:17