
Conversation

@gatorsmile
Member

@gatorsmile gatorsmile commented Apr 5, 2020

What changes were proposed in this pull request?

This PR audits the migration guides for the Spark 3.0 release:

  • correct grammar errors
  • clean up some items
  • replace HTML tables with markdown tables

Why are the changes needed?

N/A

Does this PR introduce any user-facing change?

No

How was this patch tested?

Screenshot:

screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-04-04-21_36_29
Screen Shot 2020-04-04 at 9 28 13 PM
Screen Shot 2020-04-04 at 9 28 02 PM
Screen Shot 2020-04-04 at 9 21 40 PM

@gatorsmile
Member Author

cc @cloud-fan @srowen @HyukjinKwon @dongjoon-hyun @maropu

@SparkQA

SparkQA commented Apr 5, 2020

Test build #120826 has finished for PR 28125 at commit 506fcfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2020

Test build #120827 has finished for PR 28125 at commit e24fe8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```css
/* Author's custom styles
   ========================================================================== */

table {
```
Member


Ah, nice. +1

@gatorsmile gatorsmile force-pushed the updateMigrationGuide3.0 branch from e24fe8e to 98fe6dc Compare April 5, 2020 05:37
@SparkQA

SparkQA commented Apr 5, 2020

Test build #120828 has finished for PR 28125 at commit 98fe6dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- In Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and earlier, this function returns int values. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.

- Since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
- In Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
Member


Please be careful during backporting; this section is only for the master branch.
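
(Editorial aside, not part of the review: a minimal, hypothetical sketch of the two configs quoted in this hunk, `spark.sql.legacy.integerGroupingId` and `spark.sql.ui.explainMode`, assuming a local `SparkSession`; the data and query are illustrative only.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping_id, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Restore the pre-3.1 int result type of grouping_id() (config name from the guide text).
spark.conf.set("spark.sql.legacy.integerGroupingId", "true")

// Restore the pre-3.1 SQL UI plan format (config name from the guide text).
spark.conf.set("spark.sql.ui.explainMode", "extended")

// grouping_id() is only meaningful under cube/rollup/grouping sets.
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.cube($"k").agg(grouping_id(), sum($"v")).show()
```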

- In Spark version 2.4 and below, you can create map values with map type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. In Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can use `map_entries` function to convert map to array<struct<key, value>> as a workaround. In addition, users can still read map values with map type key from data source or Java/Scala collections, though it is discouraged.

- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark will throw RuntimeException while duplicated keys are found. Users can set `spark.sql.mapKeyDedupPolicy` to LAST_WIN to deduplicate map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
- In Spark version 2.4 and below, you can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, for example, map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. In Spark 3.0, Spark throws `RuntimeException` when duplicated keys are found. You can set `spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (for example, Parquet), the behavior is undefined.
Member

@dongjoon-hyun dongjoon-hyun Apr 5, 2020


nit. you can (double space) -> you can (single space).
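
(Editorial aside for readers: a small hedged sketch of the duplicate-key behavior and the `map_entries` workaround described in the hunk above, assuming an active `spark` session; the queries and the output comment are illustrative.)

```scala
// Default policy in Spark 3.0 (EXCEPTION): building a map with duplicated keys fails
// with a RuntimeException. LAST_WIN keeps the last value per key, as the guide describes.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show(false)   // {1 -> b} under LAST_WIN

// Workaround for map-type keys: convert the map to array<struct<key, value>>.
spark.sql("SELECT map_entries(map('x', 1, 'y', 2)) AS entries").show(false)
```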

- In Spark version 2.4 and below, the `current_timestamp` function returns a timestamp with millisecond resolution only. In Spark 3.0, the function can return the result with microsecond resolution if the underlying clock available on the system offers such resolution.

- Since Spark 3.0, 0-argument Java UDF is executed in the executor side identically with other UDFs. In Spark version 2.4 and earlier, 0-argument Java UDF alone was executed in the driver side, and the result was propagated to executors, which might be more performant in some cases but caused inconsistency with a correctness issue in some cases.
- In Spark 3.0, a 0-argument Java UDF is executed in the executor side identically with other UDFs. In Spark version 2.4 and below, 0-argument Java UDF alone was executed in the driver side, and the result was propagated to executors, which might be more performant in some cases but caused inconsistency with a correctness issue in some cases.
Member


Should the second `0-argument Java UDF` take `a` or `the`, given that we add `a` for the first one?
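
(Editorial aside: a hypothetical sketch of registering a 0-argument Java UDF, the case discussed in this hunk, assuming an active `spark` session; the UDF name `now_millis` is made up for illustration.)

```scala
import org.apache.spark.sql.api.java.UDF0
import org.apache.spark.sql.types.LongType

// In Spark 3.0 this UDF runs on executors like any other UDF; in 2.4 and below a
// 0-argument Java UDF was evaluated once on the driver and the result propagated.
spark.udf.register("now_millis", new UDF0[java.lang.Long] {
  override def call(): java.lang.Long = java.lang.Long.valueOf(System.currentTimeMillis())
}, LongType)

spark.range(3).selectExpr("id", "now_millis()").show()
```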

- In Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and so on. Spark 3.0 uses Java 8 API classes from the `java.time` packages that are based on [ISO chronology](https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and below, those operations are performed using the hybrid calendar ([Julian + Gregorian](https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:

- Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from the java.time packages that based on ISO chronology (https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and earlier, those operations are performed by using the hybrid calendar (Julian + Gregorian, see https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:
* Parsing/formatting of timestamp/date strings. This effects on CSV/JSON datasources and on the `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by users is used for parsing and formatting. In Spark 3.0, we define our own pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via `java.time.format.DateTimeFormatter` under the hood. New implementation performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parse if pattern is `yyyy-MM-dd` because the parser does not consume whole input. Another example is the `31/01/2015 00:00` input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes hours in the range `1-12`. In Spark version 2.4 and below, `java.text.SimpleDateFormat` is used for timestamp/date string conversions, and the supported patterns are described in [simpleDateFormat](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html). The old behavior can be restored by setting `spark.sql.legacy.timeParserPolicy` to `LEGACY`.
Member


Shall we also briefly mention the performance difference in Parquet/ORC datasources and the related conf?

cc @cloud-fan, @MaxGekk, @bersprockets

Member


I am working on improving the performance in #28067 and #28119. We can mention the performance difference later, once we have the final numbers, I think.

Contributor


+1

Member

@dongjoon-hyun dongjoon-hyun Apr 7, 2020


Even at the final stage, I don't think we can quantify it here (x times slower or x% slower). The speed-up PRs only mitigate the slowdown, don't they?

Member Author


This doc mainly focuses on behavior changes. Performance numbers are sensitive to the workload and environment, and I am not sure how significant the difference will be after these PRs.
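
(Editorial aside, returning to the datetime hunk at the top of this thread: a minimal sketch of the strict-parsing change and the legacy fallback, assuming an active `spark` session; config name and values are taken from the guide text.)

```scala
// Spark 3.0 uses java.time-based parsing with strict pattern checking:
// the pattern must consume the whole input, and `hh` means hours 1-12.
spark.sql("SELECT to_timestamp('31/01/2015 00:00', 'dd/MM/yyyy HH:mm')").show(false)

// Fall back to the Spark 2.4 (java.text.SimpleDateFormat) parsing behavior:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```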

- In Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and below, the conversion uses the default time zone of the Java virtual machine.

- Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion uses the default time zone of the Java virtual machine.
- In Spark 3.0, Spark casts `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behavior of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString.enabled` to `true`.
Member


TimeStamp -> Timestamp?
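
(Editorial aside on the two items above: a small hedged sketch of the session-time-zone conversion and the legacy string/datetime comparison flag, assuming an active `spark` session; the configs are from the guide text and the queries are illustrative.)

```scala
// TIMESTAMP literals are rendered as strings using the session time zone in Spark 3.0.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST(TIMESTAMP '2020-04-05 00:00:00' AS STRING)").show(false)

// Restore the 2.4 behavior of casting Date/Timestamp to String in binary comparisons;
// without it, Spark 3.0 casts the String side to Date/Timestamp instead.
spark.conf.set("spark.sql.legacy.typeCoercion.datetimeToString.enabled", "true")
spark.sql("SELECT DATE '2020-04-05' > '2020-04'").show(false)
```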

@dongjoon-hyun
Member

Thank you, @gatorsmile.

@SparkQA

SparkQA commented Apr 7, 2020

Test build #120937 has finished for PR 28125 at commit ffe4584.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-3.0

HyukjinKwon pushed a commit that referenced this pull request Apr 8, 2020
Closes #28125 from gatorsmile/updateMigrationGuide3.0.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a3d8394)
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#28125 from gatorsmile/updateMigrationGuide3.0.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
