
Conversation

@HyukjinKwon commented Sep 11, 2019

What changes were proposed in this pull request?

Currently, there is no migration section for PySpark, Spark Core and Structured Streaming.
It is difficult for users to know what to do when they upgrade.

This PR proposes to create a "Migration Guide" tab in the Spark documentation.

(Screenshots: the new "Migration Guide" tab as rendered in the documentation.)

This page will contain migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Core. Basically it is a refactoring.

Some new information has been added; I will leave inline comments for easier review.

  1. MLlib
    Merge ml-guide.html#migration-guide and ml-migration-guides.html

    'docs/ml-guide.md'
            ↓ Merge new/old migration guides
    'docs/ml-migration-guide.md'
    
  2. PySpark
    Extract PySpark specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html

    'docs/sql-migration-guide-upgrade.md'
           ↓ Extract PySpark specific items
    'docs/pyspark-migration-guide.md'
    
  3. SparkR
    Move sparkr.html#migration-guide into a separate file, and extract from sql-migration-guide-upgrade.html

    'docs/sparkr.md'                     'docs/sql-migration-guide-upgrade.md'
     Move migration guide section ↘     ↙ Extract SparkR specific items
                     docs/sparkr-migration-guide.md 
    
  4. Core
Newly created at 'docs/core-migration-guide.md'. I skimmed the JIRAs resolved for 3.0.0 and found some items worth noting.

  5. Structured Streaming
Newly created at 'docs/ss-migration-guide.md'. I skimmed the JIRAs resolved for 3.0.0 and found some items worth noting.

  6. SQL
Merge sql-migration-guide-upgrade.html and sql-migration-guide-hive-compatibility.html

    'docs/sql-migration-guide-hive-compatibility.md'     'docs/sql-migration-guide-upgrade.md'
     Move Hive compatibility section ↘                   ↙ Left over after filtering PySpark and SparkR items
                                  'docs/sql-migration-guide.md'
    

Why are the changes needed?

To let users in production migrate to higher versions effectively, and detect behaviour changes or breaking changes before upgrading and/or migrating.

Does this PR introduce any user-facing change?

Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html.

How was this patch tested?

Manually built the docs. This can be verified as below:

cd docs
SKIP_API=1 jekyll build
open _site/index.html

- In Spark 3.0, the deprecated methods `shuffleBytesWritten`, `shuffleWriteTime` and `shuffleRecordsWritten` in `ShuffleWriteMetrics` have been removed. Instead, use `bytesWritten`, `writeTime` and `recordsWritten` respectively.

- In Spark 3.0, the deprecated method `AccumulableInfo.apply` has been removed because creating `AccumulableInfo` is disallowed.

Member Author


cc @srowen. I added those items from 0025a83 while adding this section.

Member


There's probably more we can or should add, given the number of deprecations and removals in 3.0. Is your theory that this should stick to changes that require user code to change, and ones that aren't obvious? Like, a missing method is obvious. A behavior change may not be.

Member Author


Yeah, I just quickly skimmed so that I could leave this page non-empty. For the migration guide, I think we should mention things that are not corner cases and where both the previous and new behaviours make sense. My thought was that bug fixes or minor behaviour changes basically shouldn't go here.

## Upgrading from MLlib 2.4 to 3.0

### Breaking changes
{:.no_toc}
Member Author


`{:.no_toc}` is needed to exclude the heading from the "Table of Contents".

(Screenshots: the generated table of contents without and with `no_toc`.)


## Upgrading from Structured Streaming 2.4 to 3.0

- In Spark 3.0, Structured Streaming forces the source schema into nullable when file-based datasources such as text, json, csv, parquet and orc are used via `spark.readStream(...)`. Previously, it respected the nullability of the source schema; however, it caused issues that were tricky to debug, such as NPEs. To restore the previous behavior, set `spark.sql.streaming.fileSource.schema.forceNullable` to `false`.
Member Author


I skimmed and found one SS change to note in the migration guide while adding this section. cc @zsxwing
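
For illustration, a minimal PySpark sketch of restoring the previous behaviour; the config name comes from the entry above, while the input path and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("ss-nullability-sketch").getOrCreate()

# Keep the user-specified nullability instead of forcing the streaming
# source schema to nullable (the pre-3.0 behaviour).
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")

# Streaming file sources need an explicit schema; this one is hypothetical.
schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=False),
])

stream_df = spark.readStream.schema(schema).json("/tmp/stream-input")
stream_df.printSchema()  # nullability follows the declared schema with the flag off
```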



SparkQA commented Sep 11, 2019

Test build #110475 has finished for PR 25757 at commit 7a8d639.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * *(Breaking change)* The `apply` and `copy` methods for the case class [BoostingStrategy](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use BoostingStrategy to set GBT parameters.
  • * *(Breaking change)* The return value of [LDA.run](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
  • * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
  • * The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
  • - In Spark 3.0, the deprecated `HiveContext` class has been removed. Use `SparkSession.builder.enableHiveSupport()` instead.

@dongjoon-hyun left a comment


For me, this looks like a good start. We can add more.
BTW, which version of Jekyll do you use, @HyukjinKwon?
In the master branch (and this PR), I saw misaligned tabs when I use Jekyll 4.0.

dongjoon-hyun commented Sep 13, 2019

Ur, it seems not to be a Jekyll issue. When I use the spark-rm docker image, the master branch still seems to be broken. Maybe something went wrong. cc @gatorsmile

spark-rm@f327b886c883:/spark/docs$ jekyll --version
jekyll 3.8.6
spark-rm@f327b886c883:/spark/docs$ ruby --version
ruby 2.3.8p459 (2018-10-18 revision 65136) [x86_64-linux-gnu]

(Screenshot: the broken rendering seen when building the docs from the master branch.)

BTW, this is irrelevant to this PR~


- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.

@viirya Sep 13, 2019


"spark.sql.execution.arrow.enabled=true" => `spark.sql.execution.arrow.enabled`=true or `spark.sql.execution.arrow.enabled` set to `True`?

access nested values. For example `df['table.column.nestedField']`. However, this means that if
your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).

- DataFrame.withColumn method in PySpark supports adding a new column or replacing existing columns of the same name.

@viirya Sep 13, 2019


`DataFrame.withColumn`
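
For context, a minimal PySpark sketch of the two behaviours described in the quoted lines; the DataFrame and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("column-name-sketch").getOrCreate()

# A flat column whose name happens to contain dots.
df = spark.createDataFrame([(1, "a")], ["id", "col.with.dots"])

# Dots are interpreted as nested-field access, so a literal dot in a
# column name has to be escaped with backticks.
df.select(col("`col.with.dots`")).show()

# DataFrame.withColumn adds a new column, or replaces an existing one
# when a column with the same name already exists.
df.withColumn("flag", lit(True)).withColumn("id", col("id") + 1).show()
```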

- text: SQL Reserved/Non-Reserved Keywords
url: sql-reserved-and-non-reserved-keywords.html

url: sql-migration-old.html
Member


Do we need to keep these old links?

@viirya left a comment


Thanks for doing this. It looks good.


dongjoon-hyun commented Sep 15, 2019

Merged to master.
cc @jiangxb1987 since he is a release manager for 3.0.0-preview.
BTW, @jiangxb1987, the broken Migration Guide tab sidebar, as seen in the image in this PR description, is an existing issue in the master branch.


HyukjinKwon commented Sep 15, 2019

Thanks @srowen @dongjoon-hyun @viirya. I was on vacation, so my reaction was late.

@HyukjinKwon deleted the migration-doc branch March 3, 2020 01:17