[SPARK-8848] [SQL] [WIP] Refactors Parquet write path to follow Parquet format spec #7679
Conversation
@rtreffer It would be great if you could help review the decimal-related parts of this PR. I further refactored the original decimal writing code, which is now moved to
This was actually a bug: it should be 18 (`CatalystSchemaConverter.MAX_PRECISION_FOR_INT64` below) rather than 8.
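For context, the parquet-format spec allows decimals of precision up to 9 to be stored as `INT32` and up to 18 as `INT64`, with a binary encoding needed beyond that. A minimal illustrative sketch of that rule (not the PR's actual code; the object and constant names here are made up):

```scala
// Illustrative only: choosing a Parquet physical type for DECIMAL(precision, scale)
// following the parquet-format spec. The constants mirror the ones referenced in the
// comment above but are hypothetical names, not Spark's actual code.
object DecimalPhysicalType {
  val MaxPrecisionForInt32 = 9   // a 4-byte signed int holds up to 9 decimal digits
  val MaxPrecisionForInt64 = 18  // an 8-byte signed long holds up to 18 decimal digits

  def forPrecision(precision: Int): String =
    if (precision <= MaxPrecisionForInt32) "INT32"
    else if (precision <= MaxPrecisionForInt64) "INT64"
    else "FIXED_LEN_BYTE_ARRAY"  // final fallback for larger precisions
}
```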
@liancheng I can transfer tables from MySQL -> Parquet, including unsigned BIGINT -> DECIMAL(20) (YEAH!).
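For anyone wanting to reproduce this kind of MySQL-to-Parquet transfer with the plain Spark 1.4/1.5 DataFrame APIs, here is a rough sketch; the connection URL, credentials, table name, and output path are made up, and a MySQL JDBC driver is assumed to be on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Rough sketch only: copy a MySQL table (including DECIMAL columns mapped from
// unsigned BIGINT) into Parquet files using the JDBC data source.
val sc = new SparkContext(new SparkConf().setAppName("mysql-to-parquet").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val table = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb?user=reader&password=secret")
  .option("dbtable", "events")
  .load()

table.write.parquet("/tmp/events.parquet")
```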
@rtreffer Cool, thanks for the review! I know we still lack sufficient compatibility tests for decimals. I'll try to add more comprehensive Parquet compatibility tests during the Spark 1.5 QA phase (starting next week).
The 1.5 code freeze deadline has already passed, and this issue wasn't targeted at 1.5 anyway, so I'm not going to get this merged into branch-1.5. The other thing is that I squeezed too many changes into this single PR, so I'll split it into multiple ones to ease review. I'm leaving it open for now to make sure all changes merge cleanly and pass tests.
I'm closing this. Will break it into several smaller PRs.
This PR refactors the Parquet write path to follow the parquet-format spec. It's a successor of PR #7679, but with fewer non-essential changes. Major changes include:

1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`
   - Writes Parquet data using the standard layout defined in parquet-format. Specifically, we are now writing...
     - ...arrays and maps in the standard 3-level structure with proper annotations and field names
     - ...decimals as `INT32` and `INT64` whenever possible, taking `FIXED_LEN_BYTE_ARRAY` as the final fallback
   - Supports a legacy mode which is compatible with Spark 1.4 and prior versions. The legacy mode is off by default and can be turned on by flipping the SQL option `spark.sql.parquet.writeLegacyFormat` to `true`.
   - Eliminates per-value data type dispatching costs via prebuilt, composed writer functions
2. Cleans up the last pieces of the old Parquet support code

As pointed out by @rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity, but I'd like to do this in a follow-up PR to minimize code review noise in this one.

Author: Cheng Lian <[email protected]>

Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.
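For reference, a minimal sketch of how the `spark.sql.parquet.writeLegacyFormat` option described above can be exercised, assuming Spark 1.5-era APIs; the app name and output paths are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-layouts").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A DataFrame with an array column.
val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "f")

// Default mode: standard parquet-format layout (3-level lists; INT32/INT64 decimals
// where the precision allows, FIXED_LEN_BYTE_ARRAY as the final fallback).
df.write.parquet("/tmp/parquet-standard")

// Legacy mode: reproduce the layout written by Spark 1.4 and earlier.
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/parquet-legacy")
```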
This PR refactors the Parquet write path to follow the parquet-format spec. Major changes include:

1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `ParquetWriteSupport`
2. Uses `ArrayData`, `MapData`, and `SpecificMutableRow` internally to minimize boxing costs
3. Renames the `Catalyst*` classes under the `parquet` package to `Parquet*`. Although the original names conform to Parquet data model conventions, they are not intuitive to Spark SQL developers. Considering this piece of code will be read by more SQL devs than Parquet devs, we decided to rename them.
4. Renames `spark.sql.parquet.followParquetFormatSpec` to `spark.sql.parquet.writeLegacyParquetFormat` and turns it off by default. As pointed out by @rdblue, the original option name looks confusing since there is no intuitive reason not to follow the spec.
5. Addresses some PR comments made by @rdblue in #6617
TODO

- More tests for standard mode, and turn standard mode on by default
- Fix Parquet log redirection

  The old Parquet log redirection code path was buggy and only tried to suppress Parquet logs (written via `java.util.logging`). This PR simply removes it together with the old Parquet code. A better solution would be to use SLF4J to redirect Parquet's internal logs, though that might be done in a follow-up PR.
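As one possible follow-up direction hinted at above, a common way to route `java.util.logging` output (which Parquet used at the time) through SLF4J is the jul-to-slf4j bridge. A minimal sketch, assuming the `org.slf4j:jul-to-slf4j` artifact is on the classpath:

```scala
import org.slf4j.bridge.SLF4JBridgeHandler

// Remove the default java.util.logging root handlers, then install the SLF4J bridge
// so that j.u.l. records (including Parquet's) are forwarded to the SLF4J backend.
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()
```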