Conversation

@liancheng
Contributor

This PR refactors the Parquet write path to follow the parquet-format spec. It's a successor of PR #7679, but with fewer non-essential changes.

Major changes include:

  1. Replaces RowWriteSupport and MutableRowWriteSupport with CatalystWriteSupport
  • Writes Parquet data using the standard layout defined in parquet-format

    Specifically, we are now writing ...

    • ... arrays and maps in the standard 3-level structure with proper annotations and field names (see the schema sketch after this list)
    • ... decimals as INT32 and INT64 whenever possible, taking FIXED_LEN_BYTE_ARRAY as the final fallback
  • Supports a legacy mode which is compatible with Spark 1.4 and prior versions

    The legacy mode is off by default, and can be turned on by flipping the SQL option spark.sql.parquet.writeLegacyFormat to true.

  • Eliminates per-value data type dispatching costs via prebuilt, composed writer functions (sketched below)

  2. Cleans up the last pieces of old Parquet support code
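
A minimal sketch of the layout difference (Spark 1.5-era API, assuming a sqlContext in scope; the paths and column name are made up, and the schemas in the comments reflect the two layouts as discussed in this PR):

```scala
// Standard mode (the new default) writes an ARRAY<BIGINT> column `f` in the
// parquet-format 3-level structure:
sqlContext.range(10).selectExpr("array(id) AS f").write.parquet("/tmp/standard")
//   optional group f (LIST) {
//     repeated group list { required int64 element; }
//   }

// Legacy mode reproduces the Spark 1.4 2-level layout with the old field name:
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
sqlContext.range(10).selectExpr("array(id) AS f").write.parquet("/tmp/legacy")
//   optional group f (LIST) {
//     repeated int64 array;
//   }
```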

As pointed out by @rxin previously, we probably want to rename all those Catalyst* Parquet classes to Parquet* for clarity. But I'd like to do this in a follow-up PR to minimize code review noise in this one.
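
To illustrate the writer-function bullet above: a minimal sketch (not the actual CatalystWriteSupport code; the emit* sinks are hypothetical stand-ins for Parquet RecordConsumer calls) of how prebuilt, composed writers hoist type dispatch out of the per-value hot path:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

type ValueWriter = (InternalRow, Int) => Unit

// Hypothetical sinks standing in for RecordConsumer.addInteger/addLong etc.
def emitInt32(v: Int): Unit = println(v)
def emitInt64(v: Long): Unit = println(v)

// The match on DataType runs once per field, when the writers are composed ...
def makeWriter(dataType: DataType): ValueWriter = dataType match {
  case IntegerType => (row, i) => emitInt32(row.getInt(i))
  case LongType    => (row, i) => emitInt64(row.getLong(i))
  case other       => sys.error(s"not covered in this sketch: $other")
}

// ... so writing a row only invokes the precomposed closures, with no
// per-value pattern matching.
def makeRowWriter(schema: StructType): InternalRow => Unit = {
  val writers = schema.fields.map(f => makeWriter(f.dataType))
  row => {
    var i = 0
    while (i < writers.length) {
      if (!row.isNullAt(i)) writers(i)(row, i)
      i += 1
    }
  }
}
```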

Contributor Author

This method is no longer necessary since we don't support unlimited decimal precision.

@SparkQA

SparkQA commented Oct 6, 2015

Test build #43258 has finished for PR 8988 at commit d1583f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 6, 2015

Test build #43272 has finished for PR 8988 at commit 6fd20f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Not sure whether this method is useful enough to be added as a method on all complex data types.

Contributor

Maybe not.

@liancheng
Contributor Author

Fixed a UDT-related bug: an exception was thrown when reading Parquet files containing UDT values under standard mode. Regression tests were added in UserDefinedTypeSuite.

In 1.5 and earlier versions, when reading Parquet files containing UDT values, we pass a schema containing UDTs to CatalystRowConverter and translate the UDTs into their corresponding Parquet types there to create value converters. The problem is that CatalystRowConverter isn't aware of standard/legacy mode, and always generates a Parquet type in legacy format. The last commit fixes this issue by expanding UDTs early in CatalystRowMaterializer, so that CatalystRowConverter doesn't need to worry about UDTs anymore.
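
A sketch of the idea (a hypothetical standalone helper, not the PR's exact code): recursively replace every UDT in the schema with its underlying SQL type before any converter sees it.

```scala
import org.apache.spark.sql.types._

def expandUDT(dt: DataType): DataType = dt match {
  case udt: UserDefinedType[_] => expandUDT(udt.sqlType)  // unwrap, then recurse
  case ArrayType(et, containsNull) => ArrayType(expandUDT(et), containsNull)
  case MapType(kt, vt, valueContainsNull) =>
    MapType(expandUDT(kt), expandUDT(vt), valueContainsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = expandUDT(f.dataType))))
  case other => other  // atomic types pass through unchanged
}
```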

Contributor Author

Expands UDTs early so that CatalystRowConverter always receives a Catalyst schema without UDTs.

@SparkQA

SparkQA commented Oct 6, 2015

Test build #43284 has finished for PR 8988 at commit d395bfd.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

The last Jenkins build failure was caused by an artifact download failure.

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 6, 2015

Test build #43288 has finished for PR 8988 at commit d395bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 6, 2015

Test build #43291 has finished for PR 8988 at commit a680f09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Note that this is because Parquet doesn't allow writing empty fields. (But empty groups are OK.) The same applies to similar code below.
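
A sketch of the constraint (assuming `consumer` is the org.apache.parquet.io.api.RecordConsumer held by the write support and `row` is the InternalRow being written):

```scala
// Null fields must be omitted entirely; starting and then ending a field with
// no value in between is rejected by Parquet.
if (!row.isNullAt(0)) {
  consumer.startField("f", 0)
  consumer.addInteger(row.getInt(0))
  consumer.endField("f", 0)
}

// Empty groups, by contrast, are legal:
consumer.startGroup()
consumer.endGroup()
```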

@davies
Contributor

davies commented Oct 7, 2015

@liancheng As we discussed offline, we should turn legacy mode (the format compatible with 1.4 and prior versions) off by default.

@SparkQA

SparkQA commented Oct 7, 2015

Test build #43355 has finished for PR 8988 at commit 5b08a20.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@davies Thanks for the review. Turned legacy mode off by default, and made it a public option. Other offline comments are also addressed.

@liancheng
Contributor Author

The last build failure was caused by a flaky artifact download.

Contributor Author

This part of the code is unrelated to this PR, but it has been dead for a while, so I removed it.

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43361 has finished for PR 8988 at commit 2bc5ebc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43360 has finished for PR 8988 at commit f03ef93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43365 has finished for PR 8988 at commit c542ae9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43372 has finished for PR 8988 at commit af50f9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@davies The last build failure was because Hive only recognizes decimals written as FIXED_LEN_BYTE_ARRAY. This might be a good reason for turning legacy mode on by default?

@liancheng
Contributor Author

Fixed the failed Hive test cases by enabling legacy mode explicitly within those two test cases.
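
Roughly like this (a sketch assuming Spark's SQLTestUtils.withSQLConf helper; the query and output path are illustrative):

```scala
val path = "/tmp/hive-compat-decimals"  // illustrative output path

withSQLConf("spark.sql.parquet.writeLegacyFormat" -> "true") {
  // Decimals are written as FIXED_LEN_BYTE_ARRAY here, which Hive 1.2.1 reads.
  sqlContext.sql("SELECT CAST(1 AS DECIMAL(10, 0)) AS d").write.parquet(path)
}
```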

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43387 has finished for PR 8988 at commit e67d0b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This should be 1.5

Contributor

The default decimal type will be (10, 0), should we use a larger scale (or the numbers will be rounded)?

Contributor

Never mind, we already specify the precision and scale.

@davies
Contributor

davies commented Oct 8, 2015

LGTM except for some minor comments. Could you also update the PR description? (1.5 -> 1.4)

@liancheng
Contributor Author

@davies All comments addressed. Thanks!

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43417 has finished for PR 8988 at commit fb6ee9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

The last build failure was caused by #8983, which broke master and has just been reverted.

@liancheng
Contributor Author

retest this please

@liancheng
Contributor Author

A note about interoperability:

Hive 1.2.1 can read Parquet arrays and maps written in standard format. However, it still doesn't recognize Parquet decimals stored as INT32 or INT64 (HIVE-12069). There are two options to work around this issue:

  1. Always turn on legacy mode if the written Parquet files are supposed to be used together with Hive.

     Legacy mode is turned off by default in this PR.

  2. Add a separate SQL option spark.sql.parquet.writeCompactDecimal to indicate whether decimals can be written as INT32 and INT64 (the precision thresholds are sketched below).

     This PR hasn't implemented this option yet. If we prefer this approach, I can do it in another PR. We probably want this option to be false by default.
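
For reference, the parquet-format rule behind "compact" decimals (a sketch; the function name is made up):

```scala
// A decimal's unscaled value fits in INT32 up to 9 digits of precision and in
// INT64 up to 18 digits; beyond that, FIXED_LEN_BYTE_ARRAY is the fallback.
def parquetPhysicalType(precision: Int): String =
  if (precision <= 9) "INT32"
  else if (precision <= 18) "INT64"
  else "FIXED_LEN_BYTE_ARRAY"
```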

I'd vote for 2.

@davies @marmbrus @rxin Thoughts?

@davies
Contributor

davies commented Oct 8, 2015

Since we already have an option for being compatible with Hive (the legacy mode), we shouldn't worry about that (no need to change anything in this PR).

Hive 1.2 and Spark 1.4 will exist for a long time. If we plan to be compatible with them out of the box (without any configuration), then we can't move forward.

Parquet format 2 will have the same compatibility issue.

@liancheng
Contributor Author

@davies Although Hive doesn't write using the standard Parquet format, it can read standard LIST and MAP. It just doesn't recognize compact decimals. So even if we turn off legacy mode, we can still interoperate with Hive as long as no compact decimals are written (either by disabling them explicitly via an extra SQL option, or by writing decimals with precision greater than 18). The benefit of adding an extra option is that we can still let Spark write standard Parquet files by default.

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43427 has finished for PR 8988 at commit fb6ee9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 02149ff on Oct 8, 2015
@liancheng
Contributor Author

Merged to master, thanks @davies for the detailed review!

Finally fixed all the Parquet compatibility issues after 6 months!

@liancheng deleted the spark-8848/standard-parquet-write-path branch on October 8, 2015 at 23:21.