[SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format #8988
Conversation
This method is not necessary anymore since we don't support unlimited decimal precision now.
Test build #43258 has finished for PR 8988 at commit
Test build #43272 has finished for PR 8988 at commit
Not sure whether this method is useful enough to be added as a method on all complex data types.
Maybe not.
Fixed a bug related to UDTs: an exception was thrown when reading Parquet files containing UDT values under standard mode. Regression tests are added in … In 1.5 and earlier versions, when reading Parquet files containing UDT values, we passed a schema containing UDTs to …
Expands UDTs early so that CatalystRowConverter always receives a Catalyst schema without UDTs.
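For context, a minimal sketch of what "expanding UDTs early" could look like; the helper name `expandUDT` is hypothetical (the actual change lives inside the Parquet read path, not a standalone function):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: recursively replace every UDT with its underlying
// Catalyst sqlType, so downstream converters never see a UDT.
def expandUDT(dataType: DataType): DataType = dataType match {
  case udt: UserDefinedType[_] => expandUDT(udt.sqlType)
  case ArrayType(elementType, containsNull) =>
    ArrayType(expandUDT(elementType), containsNull)
  case MapType(keyType, valueType, valueContainsNull) =>
    MapType(expandUDT(keyType), expandUDT(valueType), valueContainsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = expandUDT(f.dataType))))
  case other => other
}
```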
Test build #43284 has finished for PR 8988 at commit
The last Jenkins build failure was caused by an artifact download failure.
retest this please
Test build #43288 has finished for PR 8988 at commit
Test build #43291 has finished for PR 8988 at commit
Note that this is because Parquet doesn't allow writing empty fields. (But empty groups are OK.) The same applies to similar code below.
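To make the constraint concrete, here is an illustrative sketch against Parquet's `RecordConsumer` API; the `writeOptionalInt` helper is made up for this example:

```scala
import org.apache.parquet.io.api.RecordConsumer

// Illustration of the rule: a field may only be started when there is a value
// to write, because Parquet forbids empty fields. Empty groups are fine:
// startGroup()/endGroup() with nothing in between is legal.
def writeOptionalInt(
    consumer: RecordConsumer, name: String, index: Int, value: Option[Int]): Unit = {
  value.foreach { v =>
    consumer.startField(name, index)
    consumer.addInteger(v)
    consumer.endField(name, index)
  }
  // For None we emit nothing at all: an empty startField/endField pair
  // would be rejected by Parquet.
}
```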
@liancheng As we discussed offline, we should turn the legacy mode (which is compatible with 1.4 and prior versions) off by default.
Test build #43355 has finished for PR 8988 at commit
@davies Thanks for the review. Turned legacy mode off by default, and made it a public option. Other offline comments are also addressed.
The last build failure was caused by a flaky artifact download failure.
This part of the code is unrelated to this PR, but it has been dead for a while, so I removed it.
Test build #43361 has finished for PR 8988 at commit
Test build #43360 has finished for PR 8988 at commit
Test build #43365 has finished for PR 8988 at commit
Test build #43372 has finished for PR 8988 at commit
@davies The last build failure was because Hive only recognizes decimals written as …
Fixed the failing Hive test cases by enabling legacy mode explicitly within those two test cases.
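A hedged sketch of what such a test fix could look like with the Spark 1.5-era API (the surrounding test body is omitted; `sqlContext` is assumed to be in scope):

```scala
// Force the legacy (Hive-compatible) Parquet layout just for the affected
// test cases, restoring the previous setting afterwards.
val key = "spark.sql.parquet.writeLegacyFormat"
val previous = sqlContext.getConf(key, "false")
sqlContext.setConf(key, "true")
try {
  // ... write decimals and verify that Hive can read them back ...
} finally {
  sqlContext.setConf(key, previous)
}
```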
Test build #43387 has finished for PR 8988 at commit
This should be 1.5
The default decimal type will be (10, 0); should we use a larger scale (otherwise the numbers will be rounded)?
Never mind, we already specify the precision and scale.
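For illustration only (not the PR's actual test code), the difference between the default decimal type and an explicit one:

```scala
import org.apache.spark.sql.types.DecimalType

// Default decimal type: 10 total digits, 0 fractional digits, so fractional
// values would be rounded away.
val defaultType = DecimalType(10, 0)

// Explicit precision and scale keep the fractional digits intact.
val explicitType = DecimalType(25, 5) // 25 digits total, 5 after the point
```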
LGTM, except for some minor comments. Could you also update the PR description? (1.5 -> 1.4)
@davies All comments addressed. Thanks!
Test build #43417 has finished for PR 8988 at commit
The last build failure was caused by #8983, which broke master and has just been reverted.
retest this please
A note about interoperability: Hive 1.2.1 can read Parquet arrays and maps written in standard format. However, it still doesn't recognize Parquet decimals stored as …

Legacy mode is turned off by default in this PR. This PR hasn't implemented this option yet; if we prefer this approach, I can do it in another PR. We probably want this option to be … I'd vote for 2.
Since we already have an option for being compatible with Hive (the legacy mode), we should not worry about that (no need to change anything in this PR). Hive 1.2 and Spark 1.4 will exist for a long time. If we plan to be compatible with them out of the box (without any configuration), then we can't move forward. Parquet format 2 will have the same (compatibility) issue.
@davies Although Hive doesn't write using the standard Parquet format, it can read standard LIST and MAP. It just doesn't recognize compact decimals. So even if we turn off legacy mode, we can still interoperate with Hive as long as no compact decimals are written (either by disabling them explicitly using an extra SQL option, or by writing decimals with precision larger than 18). The benefit of adding an extra option is that we can still let Spark write standard Parquet files by default.
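For reference, a sketch of the precision-based physical type choice described in the parquet-format spec, which is why decimals with precision larger than 18 avoid the compact encodings Hive 1.2.1 can't read (explanatory only, not code from the PR):

```scala
// Per the parquet-format DECIMAL spec: up to 9 digits fit in INT32, up to 18
// in INT64 (the "compact" encodings), and anything larger falls back to
// FIXED_LEN_BYTE_ARRAY, which Hive 1.2.1 does understand.
def decimalPhysicalType(precision: Int): String =
  if (precision <= 9) "INT32"
  else if (precision <= 18) "INT64"
  else "FIXED_LEN_BYTE_ARRAY"
```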
Test build #43427 has finished for PR 8988 at commit
Merged to master, thanks @davies for the detailed review! Finally fixed all the Parquet compatibility issues after 6 months!
This PR refactors the Parquet write path to follow the parquet-format spec. It's a successor of PR #7679, but with fewer non-essential changes.
Major changes include:
- Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`, which writes Parquet data using the standard layout defined in parquet-format. Specifically, we are now writing … `INT32` and `INT64` whenever possible, taking `FIXED_LEN_BYTE_ARRAY` as the final fallback.
- Supports a legacy mode that is compatible with Spark 1.4 and prior versions. The legacy mode is off by default, and can be turned on by flipping the SQL option `spark.sql.parquet.writeLegacyFormat` to `true` (see the usage sketch at the end of this description).
- Eliminates per-value data type dispatching costs via prebuilt composed writer functions.
As pointed out by @rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity. But I'd like to do this in a follow-up PR to minimize code review noise in this one.
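A minimal usage sketch of the option described above, assuming a Spark 1.5-era `sqlContext` and an existing DataFrame `df`:

```scala
// Default: write Parquet using the standard parquet-format layout.
df.write.parquet("/tmp/standard-layout")

// Opt back into the Spark 1.4-compatible legacy layout when older readers
// must consume the files.
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/legacy-layout")
```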