
@liancheng
Contributor

This PR refactors the Parquet write path to follow the parquet-format spec.

Major changes include:

  1. Replaces RowWriteSupport and MutableRowWriteSupport with ParquetWriteSupport
  • Uses ArrayData, MapData, and SpecificMutableRow internally to minimize boxing costs
  • Eliminates per-value data type dispatching costs via prebuilt composed writer functions
  • Supports a legacy mode compatible with the Parquet format used by Spark 1.4 and prior versions
  2. Migrates the large decimal precision write support introduced in [SPARK-4176] [SQL] Supports decimal types with precision > 18 in Parquet #7455
  3. Removes more old Parquet support code
  4. Renames Catalyst* classes under the parquet package to Parquet*

Although the original names conform to Parquet data model conventions, they are not intuitive to Spark SQL developers. Since this code will be read by more SQL developers than Parquet developers, we decided to rename them.

  5. Renames spark.sql.parquet.followParquetFormatSpec to spark.sql.parquet.writeLegacyParquetFormat, and turns it off by default

As pointed out by @rdblue, the original option name is confusing, since there is no intuitive reason not to follow the spec.

  6. Addresses some PR comments in #6617 made by @rdblue
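The "prebuilt composed writer functions" idea in item 1 can be sketched in a few lines of Scala. This is a simplified, hypothetical illustration, not the actual ParquetWriteSupport code: each field's data type is resolved once while building a closure per field, and the per-row write path then calls the closures without any further type dispatch.

```scala
// A minimal sketch of prebuilt composed writer functions (hypothetical,
// simplified; not the actual Spark code). The idea: resolve each field's
// data type ONCE up front, producing a closure per field, instead of
// pattern-matching on the type for every single value.

sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

object WriterDemo {
  type Row = Seq[Any]
  type ValueWriter = Any => String // stands in for the real record consumer

  // Build one writer function per data type, recursively for structs.
  def makeWriter(dt: DataType): ValueWriter = dt match {
    case IntegerType => v => s"int:${v.asInstanceOf[Int]}"
    case StringType  => v => s"str:${v.asInstanceOf[String]}"
    case StructType(fields) =>
      // Compose the field writers once; no per-value dispatch afterwards.
      val fieldWriters = fields.map { case (_, t) => makeWriter(t) }
      v => {
        val row = v.asInstanceOf[Row]
        row.zip(fieldWriters)
          .map { case (cell, w) => w(cell) }
          .mkString("{", ", ", "}")
      }
  }

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq("id" -> IntegerType, "name" -> StringType))
    val writeRow = makeWriter(schema)    // built once per schema
    println(writeRow(Seq(1, "parquet"))) // prints {int:1, str:parquet}
  }
}
```

The composed closure is reused for every row, which is where the dispatching cost disappears.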

TODO

  • More tests for standard mode, and turning standard mode on by default

  • Fixes Parquet log redirection

    The old Parquet log redirection code path was buggy and only tried to suppress Parquet logs (written via java.util.logging). This PR simply removes it together with the old Parquet code. A better solution would be to use SLF4J to redirect Parquet's internal logs; that might be done in a follow-up PR, though.
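For what it's worth, one way such a redirection could look is via the jul-to-slf4j bridge (the org.slf4j:jul-to-slf4j artifact). This is a sketch under that assumption, not code from this PR:

```scala
// Sketch: route java.util.logging (used internally by Parquet) through
// SLF4J via the jul-to-slf4j bridge. Assumes org.slf4j:jul-to-slf4j is
// on the classpath; this is an illustration, not code from this PR.
import org.slf4j.bridge.SLF4JBridgeHandler

object ParquetLogRedirection {
  def install(): Unit = {
    // Drop the default JUL console handlers so messages are not printed twice.
    SLF4JBridgeHandler.removeHandlersForRootLogger()
    // From here on, JUL records are forwarded to the bound SLF4J backend.
    SLF4JBridgeHandler.install()
  }
}
```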

@SparkQA

SparkQA commented Jul 26, 2015

Test build #38479 has finished for PR 7679 at commit 9f35aac.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-8848-parquet-write-support branch from 9f35aac to 5285243 Compare July 27, 2015 04:52
@SparkQA

SparkQA commented Jul 27, 2015

Test build #38504 has finished for PR 7679 at commit 5285243.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38538 has finished for PR 7679 at commit 70af37c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-8848-parquet-write-support branch from 70af37c to fc35aa4 Compare July 27, 2015 12:09
@SparkQA

SparkQA commented Jul 27, 2015

Test build #38540 has finished for PR 7679 at commit fc35aa4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-8848-parquet-write-support branch from 001c076 to 1e11412 Compare July 27, 2015 12:47
@SparkQA

SparkQA commented Jul 27, 2015

Test build #38543 has finished for PR 7679 at commit 1e11412.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38544 has finished for PR 7679 at commit 925ce84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38547 has finished for PR 7679 at commit bb34aa4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@rtreffer It would be great if you could help review the decimal-related parts of this PR. I further refactored the original decimal writing code, which is now moved to CatalystWriteSupport. Another important thing to note is that we recently removed the unlimited decimal type and restricted the maximum decimal precision to 38.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38706 has finished for PR 7679 at commit a6c1502.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

This was actually a bug: it should be 18 (CatalystSchemaConverter.MAX_PRECISION_FOR_INT64 below) rather than 8.
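The threshold being discussed comes from the parquet-format spec: decimals with precision up to 9 fit in an INT32, up to 18 fit in an INT64, and anything larger needs FIXED_LEN_BYTE_ARRAY. A minimal sketch of that selection logic follows; the names are illustrative, not the actual CatalystSchemaConverter code:

```scala
// Illustrative sketch of choosing a physical Parquet type for a decimal
// column, using the parquet-format spec thresholds. Names are hypothetical;
// this is not the actual CatalystSchemaConverter code.
object DecimalPhysicalType {
  val MaxPrecisionForInt32 = 9  // 999999999 fits in a signed 32-bit int
  val MaxPrecisionForInt64 = 18 // 999999999999999999 fits in a signed 64-bit int

  def choose(precision: Int): String =
    if (precision <= MaxPrecisionForInt32) "INT32"
    else if (precision <= MaxPrecisionForInt64) "INT64"
    else "FIXED_LEN_BYTE_ARRAY"
}
```

Using 8 instead of 18 for the INT64 cutoff would have forced precision-9-to-18 decimals into the byte-array fallback unnecessarily.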

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38713 has finished for PR 7679 at commit 149ac5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@rdblue This should be my last major piece of refactoring related to parquet-format spec.

@viirya @saucam There are relatively few people in the community who are familiar with all of Spark SQL, Parquet, and Scala. So please feel free to comment :)

Thank you all in advance!

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38733 has finished for PR 7679 at commit b40848a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38835 has finished for PR 7679 at commit 136c9c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rtreffer
Contributor

@liancheng I can transfer tables from MySQL to Parquet, including unsigned bigint -> DECIMAL(20) (YEAH!).
I couldn't find any problems by reading the code (yet).

@liancheng
Contributor Author

@rtreffer Cool, thanks for the review! I know we still lack sufficient compatibility tests for decimals. I'll try to add more comprehensive Parquet compatibility tests during the Spark 1.5 QA phase (starting next week).

@liancheng liancheng force-pushed the spark-8848-parquet-write-support branch from 136c9c2 to d84fe92 Compare July 31, 2015 10:02
@SparkQA

SparkQA commented Jul 31, 2015

Test build #39208 has finished for PR 7679 at commit d84fe92.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39212 has finished for PR 7679 at commit bb64487.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39218 has finished for PR 7679 at commit f5006bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-8848-parquet-write-support branch from f5006bf to 5127b8d Compare August 1, 2015 16:05
@SparkQA

SparkQA commented Aug 1, 2015

Test build #39362 has finished for PR 7679 at commit 5127b8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39422 has finished for PR 7679 at commit 66419b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39423 has finished for PR 7679 at commit 6bda94b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #39546 has finished for PR 7679 at commit 679888a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

The 1.5 code freeze deadline has already passed, and this issue wasn't targeted at 1.5 anyway, so I'm not going to merge this into branch-1.5. Also, I squeezed too many changes into this single PR and will split it into multiple smaller ones to ease review. I'm leaving it open for now to make sure all the changes merge cleanly and pass tests.

@liancheng
Contributor Author

I'm closing this. Will break it into several smaller PRs.

@liancheng liancheng closed this Aug 25, 2015
@liancheng liancheng deleted the spark-8848-parquet-write-support branch October 6, 2015 16:45
asfgit pushed a commit that referenced this pull request Oct 8, 2015
This PR refactors the Parquet write path to follow the parquet-format spec.  It's a successor of PR #7679, but with fewer non-essential changes.

Major changes include:

1.  Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`

    - Writes Parquet data using standard layout defined in parquet-format

      Specifically, we are now writing ...

      - ... arrays and maps in standard 3-level structure with proper annotations and field names
      - ... decimals as `INT32` and `INT64` whenever possible, and taking `FIXED_LEN_BYTE_ARRAY` as the final fallback

    - Supports legacy mode which is compatible with Spark 1.4 and prior versions

      The legacy mode is by default off, and can be turned on by flipping SQL option `spark.sql.parquet.writeLegacyFormat` to `true`.

    - Eliminates per value data type dispatching costs via prebuilt composed writer functions

2.  Cleans up the last pieces of old Parquet support code
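For reference, the "standard 3-level structure" mentioned above versus the legacy layout can be sketched as Parquet schemas. This is a simplified illustration based on the parquet-format LIST specification; the legacy field names follow Spark 1.4's conventions as I understand them:

```
// Standard layout per parquet-format (this PR's default):
optional group my_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}

// Legacy Spark 1.4-style layout (written when
// spark.sql.parquet.writeLegacyFormat is true):
optional group my_list (LIST) {
  repeated group bag {
    optional int32 array;
  }
}
```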

As pointed out by rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity.  But I'd like to do this in a follow-up PR to minimize code review noise in this one.

Author: Cheng Lian <[email protected]>

Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.