
Conversation

@gatorsmile
Member

@gatorsmile gatorsmile commented Apr 5, 2020

What changes were proposed in this pull request?

This PR audits the migration guides for the Spark 3.0 release:

  • correct grammar errors
  • clean up some items
  • replace HTML tables with markdown tables

Why are the changes needed?

N/A

Does this PR introduce any user-facing change?

No

How was this patch tested?

Screenshot:

screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-04-04-21_36_29
Screen Shot 2020-04-04 at 9 28 13 PM
Screen Shot 2020-04-04 at 9 28 02 PM
Screen Shot 2020-04-04 at 9 21 40 PM

@gatorsmile
Member Author

cc @cloud-fan @srowen @HyukjinKwon @dongjoon-hyun @maropu

@SparkQA

SparkQA commented Apr 5, 2020

Test build #120826 has finished for PR 28125 at commit 506fcfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2020

Test build #120827 has finished for PR 28125 at commit e24fe8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```css
/* Author's custom styles
   ========================================================================== */

table {
```
Member


Ah, nice. +1

@gatorsmile gatorsmile force-pushed the updateMigrationGuide3.0 branch from e24fe8e to 98fe6dc Compare April 5, 2020 05:37
@SparkQA

SparkQA commented Apr 5, 2020

Test build #120828 has finished for PR 28125 at commit 98fe6dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- In Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and earlier, this function returns int values. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.

- Since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
- In Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
Member


Please be careful during backporting; this section is only for the master branch.
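
(Editorial aside, not part of the review: a minimal, hypothetical sketch of the two configs quoted in this hunk, `spark.sql.legacy.integerGroupingId` and `spark.sql.ui.explainMode`, assuming a local `SparkSession`; the data and query are illustrative only.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping_id, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Restore the pre-3.1 int result type of grouping_id() (config name from the guide text).
spark.conf.set("spark.sql.legacy.integerGroupingId", "true")

// Restore the pre-3.1 SQL UI plan format (config name from the guide text).
spark.conf.set("spark.sql.ui.explainMode", "extended")

// grouping_id() is only meaningful under cube/rollup/grouping sets.
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.cube($"k").agg(grouping_id(), sum($"v")).show()
```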

- In Spark version 2.4 and below, you can create map values with map type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. In Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can use `map_entries` function to convert map to array<struct<key, value>> as a workaround. In addition, users can still read map values with map type key from data source or Java/Scala collections, though it is discouraged.

- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark will throw RuntimeException while duplicated keys are found. Users can set `spark.sql.mapKeyDedupPolicy` to LAST_WIN to deduplicate map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
- In Spark version 2.4 and below, you can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, for example, map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. In Spark 3.0, Spark throws `RuntimeException` when duplicated keys are found. You can set `spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (for example, Parquet), the behavior is undefined.
Member

@dongjoon-hyun dongjoon-hyun Apr 5, 2020


nit. you can (double space) -> you can (single space).
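
(Editorial aside for readers: a small hedged sketch of the duplicate-key behavior and the `map_entries` workaround described in the hunk above, assuming an active `spark` session; the queries and the output comment are illustrative.)

```scala
// Default policy in Spark 3.0 (EXCEPTION): building a map with duplicated keys fails
// with a RuntimeException. LAST_WIN keeps the last value per key, as the guide describes.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show(false)   // {1 -> b} under LAST_WIN

// Workaround for map-type keys: convert the map to array<struct<key, value>>.
spark.sql("SELECT map_entries(map('x', 1, 'y', 2)) AS entries").show(false)
```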

- In Spark version 2.4 and below, the `current_timestamp` function returns a timestamp with millisecond resolution only. In Spark 3.0, the function can return the result with microsecond resolution if the underlying clock available on the system offers such resolution.

- Since Spark 3.0, 0-argument Java UDF is executed in the executor side identically with other UDFs. In Spark version 2.4 and earlier, 0-argument Java UDF alone was executed in the driver side, and the result was propagated to executors, which might be more performant in some cases but caused inconsistency with a correctness issue in some cases.
- In Spark 3.0, a 0-argument Java UDF is executed in the executor side identically with other UDFs. In Spark version 2.4 and below, 0-argument Java UDF alone was executed in the driver side, and the result was propagated to executors, which might be more performant in some cases but caused inconsistency with a correctness issue in some cases.
Member


Should the second `0-argument Java UDF` take `a` or `the`, given that we add `a` for the first one?
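
(Editorial aside: a hypothetical sketch of registering a 0-argument Java UDF, the case discussed in this hunk, assuming an active `spark` session; the UDF name `now_millis` is made up for illustration.)

```scala
import org.apache.spark.sql.api.java.UDF0
import org.apache.spark.sql.types.LongType

// In Spark 3.0 this UDF runs on executors like any other UDF; in 2.4 and below a
// 0-argument Java UDF was evaluated once on the driver and the result propagated.
spark.udf.register("now_millis", new UDF0[java.lang.Long] {
  override def call(): java.lang.Long = java.lang.Long.valueOf(System.currentTimeMillis())
}, LongType)

spark.range(3).selectExpr("id", "now_millis()").show()
```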

- In Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and so on. Spark 3.0 uses Java 8 API classes from the `java.time` packages that are based on [ISO chronology](https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and below, those operations are performed using the hybrid calendar ([Julian + Gregorian](https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:

- Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from the java.time packages that based on ISO chronology (https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and earlier, those operations are performed by using the hybrid calendar (Julian + Gregorian, see https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:
* Parsing/formatting of timestamp/date strings. This effects on CSV/JSON datasources and on the `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by users is used for parsing and formatting. In Spark 3.0, we define our own pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via `java.time.format.DateTimeFormatter` under the hood. New implementation performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parse if pattern is `yyyy-MM-dd` because the parser does not consume whole input. Another example is the `31/01/2015 00:00` input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes hours in the range `1-12`. In Spark version 2.4 and below, `java.text.SimpleDateFormat` is used for timestamp/date string conversions, and the supported patterns are described in [simpleDateFormat](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html). The old behavior can be restored by setting `spark.sql.legacy.timeParserPolicy` to `LEGACY`.
Member


Shall we also briefly mention the performance difference in Parquet/ORC datasources and the related conf?

cc @cloud-fan, @MaxGekk, @bersprockets

Member


I am working on improving the performance in #28067 and #28119. We can mention the performance difference later, once we have the final numbers, I think.

Contributor


+1

Member

@dongjoon-hyun dongjoon-hyun Apr 7, 2020


Even at the final stage, I don't think we can quantify it here (x times slower or x% slower). The speed-up PRs only mitigate the slowdown, don't they?

Member Author


This doc mainly focuses on behavior changes. Performance numbers are sensitive to the workload and environment, and I am not sure how significant the difference will be after these PRs.
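
(Editorial aside, returning to the datetime hunk at the top of this thread: a minimal sketch of the strict-parsing change and the legacy fallback, assuming an active `spark` session; config name and values are taken from the guide text.)

```scala
// Spark 3.0 uses java.time-based parsing with strict pattern checking:
// the pattern must consume the whole input, and `hh` means hours 1-12.
spark.sql("SELECT to_timestamp('31/01/2015 00:00', 'dd/MM/yyyy HH:mm')").show(false)

// Fall back to the Spark 2.4 (java.text.SimpleDateFormat) parsing behavior:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```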

- In Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and below, the conversion uses the default time zone of the Java virtual machine.

- Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion uses the default time zone of the Java virtual machine.
- In Spark 3.0, Spark casts `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behavior of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString.enabled` to `true`.
Member


TimeStamp -> Timestamp?
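
(Editorial aside on the two items above: a small hedged sketch of the session-time-zone conversion and the legacy string/datetime comparison flag, assuming an active `spark` session; the configs are from the guide text and the queries are illustrative.)

```scala
// TIMESTAMP literals are rendered as strings using the session time zone in Spark 3.0.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST(TIMESTAMP '2020-04-05 00:00:00' AS STRING)").show(false)

// Restore the 2.4 behavior of casting Date/Timestamp to String in binary comparisons;
// without it, Spark 3.0 casts the String side to Date/Timestamp instead.
spark.conf.set("spark.sql.legacy.typeCoercion.datetimeToString.enabled", "true")
spark.sql("SELECT DATE '2020-04-05' > '2020-04'").show(false)
```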

@dongjoon-hyun
Member

Thank you, @gatorsmile.

@SparkQA

SparkQA commented Apr 7, 2020

Test build #120937 has finished for PR 28125 at commit ffe4584.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-3.0

HyukjinKwon pushed a commit that referenced this pull request Apr 8, 2020
Closes #28125 from gatorsmile/updateMigrationGuide3.0.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a3d8394)
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#28125 from gatorsmile/updateMigrationGuide3.0.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
