[SPARK-8848] [SQL] [WIP] Refactors Parquet write path to follow Parquet format spec #7679
Conversation
@rtreffer It would be great if you could help review the decimal-related parts of this PR. I further refactored the original decimal writing code, which is now moved to
This was actually a bug: it should be 18 (`CatalystSchemaConverter.MAX_PRECISION_FOR_INT64` below) rather than 8.
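For context, the parquet-format spec allows decimals of precision up to 9 to be stored as `INT32` and up to 18 as `INT64`, with a binary encoding needed beyond that. A minimal illustrative sketch of that rule (not the PR's actual code; the object and constant names here are made up):

```scala
// Illustrative only: choosing a Parquet physical type for DECIMAL(precision, scale)
// following the parquet-format spec. The constants mirror the ones referenced in the
// comment above but are hypothetical names, not Spark's actual code.
object DecimalPhysicalType {
  val MaxPrecisionForInt32 = 9   // a 4-byte signed int holds up to 9 decimal digits
  val MaxPrecisionForInt64 = 18  // an 8-byte signed long holds up to 18 decimal digits

  def forPrecision(precision: Int): String =
    if (precision <= MaxPrecisionForInt32) "INT32"
    else if (precision <= MaxPrecisionForInt64) "INT64"
    else "FIXED_LEN_BYTE_ARRAY"  // final fallback for larger precisions
}
```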
@liancheng I can transfer tables from MySQL -> Parquet, including unsigned BIGINT -> DECIMAL(20) (YEAH!).
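For anyone wanting to reproduce this kind of MySQL-to-Parquet transfer with the plain Spark 1.4/1.5 DataFrame APIs, here is a rough sketch; the connection URL, credentials, table name, and output path are made up, and a MySQL JDBC driver is assumed to be on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Rough sketch only: copy a MySQL table (including DECIMAL columns mapped from
// unsigned BIGINT) into Parquet files using the JDBC data source.
val sc = new SparkContext(new SparkConf().setAppName("mysql-to-parquet").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val table = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb?user=reader&password=secret")
  .option("dbtable", "events")
  .load()

table.write.parquet("/tmp/events.parquet")
```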
@rtreffer Cool, thanks for the review! I know we still lack sufficient compatibility tests for decimals. I'll try to add more comprehensive Parquet compatibility tests during the Spark 1.5 QA phase (starting next week).
The 1.5 code freeze deadline has already passed, and this issue wasn't targeted at 1.5 anyway, so I'm not going to get this merged into branch-1.5. The other thing is that I squeezed too many changes into this single PR, so I'll split it into multiple ones to ease review. I'm leaving it open for now to make sure all changes merge cleanly and pass tests.
I'm closing this. Will break it into several smaller PRs.
This PR refactors the Parquet write path to follow the parquet-format spec. It's a successor of PR #7679, but with fewer non-essential changes. Major changes include:

1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`
   - Writes Parquet data using the standard layout defined in parquet-format. Specifically, we are now writing...
     - ...arrays and maps in the standard 3-level structure with proper annotations and field names
     - ...decimals as `INT32` and `INT64` whenever possible, taking `FIXED_LEN_BYTE_ARRAY` as the final fallback
   - Supports a legacy mode which is compatible with Spark 1.4 and prior versions. The legacy mode is off by default and can be turned on by flipping the SQL option `spark.sql.parquet.writeLegacyFormat` to `true`.
   - Eliminates per-value data type dispatching costs via prebuilt, composed writer functions
2. Cleans up the last pieces of the old Parquet support code

As pointed out by @rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity, but I'd like to do this in a follow-up PR to minimize code review noise in this one.

Author: Cheng Lian <[email protected]>

Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.
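For reference, a minimal sketch of how the `spark.sql.parquet.writeLegacyFormat` option described above can be exercised, assuming Spark 1.5-era APIs; the app name and output paths are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-layouts").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A DataFrame with an array column.
val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "f")

// Default mode: standard parquet-format layout (3-level lists; INT32/INT64 decimals
// where the precision allows, FIXED_LEN_BYTE_ARRAY as the final fallback).
df.write.parquet("/tmp/parquet-standard")

// Legacy mode: reproduce the layout written by Spark 1.4 and earlier.
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/parquet-legacy")
```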
This PR refactors the Parquet write path to follow the parquet-format spec. Major changes include:

1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `ParquetWriteSupport`
2. Uses `ArrayData`, `MapData`, and `SpecificMutableRow` internally to minimize boxing costs
3. Renames the `Catalyst*` classes under the `parquet` package to `Parquet*`. Although the original names conform to Parquet data model conventions, they are not intuitive to Spark SQL developers. Considering this piece of code will be read by more SQL devs than Parquet devs, we decided to rename them.
4. Renames `spark.sql.parquet.followParquetFormatSpec` to `spark.sql.parquet.writeLegacyParquetFormat` and turns it off by default. As pointed out by @rdblue, the original option name looks confusing since there is no intuitive reason not to follow the spec.
5. Addresses some PR comments made by @rdblue in #6617
TODO

- More tests for standard mode, and turn standard mode on by default
- Fix Parquet log redirection

  The old Parquet log redirection code path was buggy and only tried to suppress Parquet logs (written via `java.util.logging`). This PR simply removes it together with the old Parquet code. A better solution would be to use SLF4J to redirect Parquet's internal logs, though that might be done in a follow-up PR.
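As one possible follow-up direction hinted at above, a common way to route `java.util.logging` output (which Parquet used at the time) through SLF4J is the jul-to-slf4j bridge. A minimal sketch, assuming the `org.slf4j:jul-to-slf4j` artifact is on the classpath:

```scala
import org.slf4j.bridge.SLF4JBridgeHandler

// Remove the default java.util.logging root handlers, then install the SLF4J bridge
// so that j.u.l. records (including Parquet's) are forwarded to the SLF4J backend.
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()
```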