[SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format #8988
Conversation
This method is not necessary anymore since we don't support unlimited decimal precision now.
Test build #43258 has finished for PR 8988 at commit
Test build #43272 has finished for PR 8988 at commit
Not sure whether this method is useful enough to be added as a method on all complex data types.
Maybe not.
Fixed a bug related to UDTs: an exception was thrown when reading Parquet files containing UDT values under standard mode. Regression tests are added in … In 1.5 and earlier versions, when reading Parquet files containing UDT values, we passed a schema containing UDTs to …
Expands UDTs early so that CatalystRowConverter always receives a Catalyst schema without UDTs.
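For context, a minimal sketch of what "expanding UDTs early" could look like; the helper name `expandUDT` is hypothetical (the actual change lives inside the Parquet read path, not a standalone function):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: recursively replace every UDT with its underlying
// Catalyst sqlType, so downstream converters never see a UDT.
def expandUDT(dataType: DataType): DataType = dataType match {
  case udt: UserDefinedType[_] => expandUDT(udt.sqlType)
  case ArrayType(elementType, containsNull) =>
    ArrayType(expandUDT(elementType), containsNull)
  case MapType(keyType, valueType, valueContainsNull) =>
    MapType(expandUDT(keyType), expandUDT(valueType), valueContainsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = expandUDT(f.dataType))))
  case other => other
}
```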
Test build #43284 has finished for PR 8988 at commit
The last Jenkins build failure was caused by an artifact download failure.
retest this please
Test build #43288 has finished for PR 8988 at commit
Test build #43291 has finished for PR 8988 at commit
Note that this is because Parquet doesn't allow writing empty fields. (But empty groups are OK.) The same applies to similar code below.
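To make the constraint concrete, here is an illustrative sketch against Parquet's `RecordConsumer` API; the `writeOptionalInt` helper is made up for this example:

```scala
import org.apache.parquet.io.api.RecordConsumer

// Illustration of the rule: a field may only be started when there is a value
// to write, because Parquet forbids empty fields. Empty groups are fine:
// startGroup()/endGroup() with nothing in between is legal.
def writeOptionalInt(
    consumer: RecordConsumer, name: String, index: Int, value: Option[Int]): Unit = {
  value.foreach { v =>
    consumer.startField(name, index)
    consumer.addInteger(v)
    consumer.endField(name, index)
  }
  // For None we emit nothing at all: an empty startField/endField pair
  // would be rejected by Parquet.
}
```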
@liancheng As we discussed offline, we should turn the legacy mode (which is compatible with 1.4 and prior versions) off by default.
Test build #43355 has finished for PR 8988 at commit
@davies Thanks for the review. Turned legacy mode off by default, and made it a public option. Other offline comments are also addressed.
The last build failure was caused by a flaky artifact download failure.
This part of the code is unrelated to this PR, but it has been dead for a while, so I removed it.
Test build #43361 has finished for PR 8988 at commit
Test build #43360 has finished for PR 8988 at commit
Test build #43365 has finished for PR 8988 at commit
Test build #43372 has finished for PR 8988 at commit
@davies The last build failure was because Hive only recognizes decimals written as …
Fixed the failing Hive test cases by enabling legacy mode explicitly within those two test cases.
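A hedged sketch of what such a test fix could look like with the Spark 1.5-era API (the surrounding test body is omitted; `sqlContext` is assumed to be in scope):

```scala
// Force the legacy (Hive-compatible) Parquet layout just for the affected
// test cases, restoring the previous setting afterwards.
val key = "spark.sql.parquet.writeLegacyFormat"
val previous = sqlContext.getConf(key, "false")
sqlContext.setConf(key, "true")
try {
  // ... write decimals and verify that Hive can read them back ...
} finally {
  sqlContext.setConf(key, previous)
}
```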
Test build #43387 has finished for PR 8988 at commit
This should be 1.5
The default decimal type will be (10, 0); should we use a larger scale (otherwise the numbers will be rounded)?
Never mind, we already specify the precision and scale.
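For illustration only (not the PR's actual test code), the difference between the default decimal type and an explicit one:

```scala
import org.apache.spark.sql.types.DecimalType

// Default decimal type: 10 total digits, 0 fractional digits, so fractional
// values would be rounded away.
val defaultType = DecimalType(10, 0)

// Explicit precision and scale keep the fractional digits intact.
val explicitType = DecimalType(25, 5) // 25 digits total, 5 after the point
```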
LGTM, except for some minor comments. Could you also update the PR description? (1.5 -> 1.4)
@davies All comments addressed. Thanks!
Test build #43417 has finished for PR 8988 at commit
The last build failure was caused by #8983, which broke master and has just been reverted.
retest this please
A note about interoperability: Hive 1.2.1 can read Parquet arrays and maps written in standard format. However, it still doesn't recognize Parquet decimals stored as …

Legacy mode is turned off by default in this PR. This PR hasn't implemented this option yet; if we prefer this approach, I can do it in another PR. We probably want this option to be … I'd vote for 2.
Since we already have an option for being compatible with Hive (the legacy mode), we should not worry about that (no need to change anything in this PR). Hive 1.2 and Spark 1.4 will exist for a long time. If we plan to be compatible with them out of the box (without any configuration), then we can't move forward. Parquet format 2 will have the same (compatibility) issue.
@davies Although Hive doesn't write using the standard Parquet format, it can read standard LIST and MAP. It just doesn't recognize compact decimals. So even if we turn off legacy mode, we can still interoperate with Hive as long as no compact decimals are written (either by disabling them explicitly using an extra SQL option, or by writing decimals with precision larger than 18). The benefit of adding an extra option is that we can still let Spark write standard Parquet files by default.
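For reference, a sketch of the precision-based physical type choice described in the parquet-format spec, which is why decimals with precision larger than 18 avoid the compact encodings Hive 1.2.1 can't read (explanatory only, not code from the PR):

```scala
// Per the parquet-format DECIMAL spec: up to 9 digits fit in INT32, up to 18
// in INT64 (the "compact" encodings), and anything larger falls back to
// FIXED_LEN_BYTE_ARRAY, which Hive 1.2.1 does understand.
def decimalPhysicalType(precision: Int): String =
  if (precision <= 9) "INT32"
  else if (precision <= 18) "INT64"
  else "FIXED_LEN_BYTE_ARRAY"
```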
Test build #43427 has finished for PR 8988 at commit
Merged to master, thanks @davies for the detailed review! Finally fixed all the Parquet compatibility issues after 6 months!
This PR refactors the Parquet write path to follow the parquet-format spec. It's a successor of PR #7679, but with fewer non-essential changes.
Major changes include:
- Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`, which writes Parquet data using the standard layout defined in parquet-format. Specifically, we are now writing … `INT32` and `INT64` whenever possible, taking `FIXED_LEN_BYTE_ARRAY` as the final fallback.
- Supports a legacy mode that is compatible with Spark 1.4 and prior versions. The legacy mode is off by default, and can be turned on by flipping the SQL option `spark.sql.parquet.writeLegacyFormat` to `true` (see the usage sketch at the end of this description).
- Eliminates per-value data type dispatching costs via prebuilt composed writer functions.
As pointed out by @rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity. But I'd like to do this in a follow-up PR to minimize code review noise in this one.
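A minimal usage sketch of the option described above, assuming a Spark 1.5-era `sqlContext` and an existing DataFrame `df`:

```scala
// Default: write Parquet using the standard parquet-format layout.
df.write.parquet("/tmp/standard-layout")

// Opt back into the Spark 1.4-compatible legacy layout when older readers
// must consume the files.
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/legacy-layout")
```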