[SPARK-2446][SQL] Add BinaryType support to Parquet I/O. #1373

ueshin · 2014-07-11T09:22:30Z

To support BinaryType, the following changes are needed:

- Make StringType use OriginalType.UTF8
- Add BinaryType using PrimitiveTypeName.BINARY without OriginalType
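
The two changes amount to a mapping between Spark SQL types and a Parquet primitive type plus an optional annotation. The following is a minimal sketch in plain Python, not the actual Spark code: the functions `to_parquet` and `from_parquet` and the tuple representation are hypothetical, and only the names StringType, BinaryType, BINARY and UTF8 come from this PR.

```python
from typing import Optional, Tuple

# (primitive type name, original type annotation or None).
# This tuple layout is an illustrative assumption, not a Parquet API.
ParquetField = Tuple[str, Optional[str]]


def to_parquet(sql_type: str) -> ParquetField:
    """Hypothetical writer-side mapping described in this PR."""
    if sql_type == "StringType":
        # Strings are stored as BINARY and annotated with UTF8.
        return ("BINARY", "UTF8")
    if sql_type == "BinaryType":
        # Raw bytes are stored as BINARY with no annotation.
        return ("BINARY", None)
    raise ValueError(f"type not covered by this sketch: {sql_type}")


def from_parquet(field: ParquetField) -> str:
    """Hypothetical reader-side mapping: the annotation decides the type."""
    primitive, original = field
    if primitive != "BINARY":
        raise ValueError(f"primitive not covered by this sketch: {primitive}")
    return "StringType" if original == "UTF8" else "BinaryType"
```

Under this mapping, a BINARY column without an annotation reads back as BinaryType, which is why string data written before this change is no longer read as StringType.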

SparkQA · 2014-07-11T09:27:25Z

QA tests have started for PR 1373. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16565/consoleFull

SparkQA · 2014-07-11T11:04:33Z

QA results for PR 1373:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16565/consoleFull

marmbrus · 2014-07-11T17:39:21Z

Thanks for the patch! One quick question: will this change the behavior when loading in string data that was saved with previous versions of Spark SQL?

ueshin · 2014-07-11T20:51:12Z

@marmbrus Yes, I think so.
But this new behavior is the same as Avro, Thrift and the next Hive (0.14).
To load string data saved with previous versions, a Cast to StringType will be needed.
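
Conceptually, such a cast decodes each binary value back into a string. The sketch below is plain Python rather than Spark code; `cast_binary_to_string` is a hypothetical helper, and it assumes the legacy strings were written as UTF-8 bytes.

```python
from typing import List, Optional


def cast_binary_to_string(column: List[Optional[bytes]]) -> List[Optional[str]]:
    """Decode each non-null binary value as UTF-8; nulls stay null.

    Assumption: the legacy data holds UTF-8 encoded strings.
    """
    return [None if value is None else value.decode("utf-8") for value in column]


# A column as it would be read after this change: raw bytes, not strings.
legacy_column = [b"spark", None, "caf\u00e9".encode("utf-8")]
strings = cast_binary_to_string(legacy_column)
```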

marmbrus · 2014-07-14T22:41:43Z

Okay, I figured that was the case, but you are right that compatibility with other systems is the right thing to do here.

marmbrus · 2014-07-14T22:47:43Z

Thanks for adding this. I've merged this into master, but not branch-1.0 due to the change in semantics. I also updated the commit message to include a disclaimer about the new semantics.

Note that this commit changes the semantics when loading in data that was created with prior versions of Spark SQL. Before, we were writing out strings as Binary data without adding any other annotations. Thus, when data is read in from prior versions, data that was StringType will now become BinaryType. Users that need strings can CAST that column to a String. It was decided that while this breaks compatibility, it does make us compatible with other systems (Hive, Thrift, etc) and adds support for Binary data, so this is the right decision long term.

To support `BinaryType`, the following changes are needed:
- Make `StringType` use `OriginalType.UTF8`
- Add `BinaryType` using `PrimitiveTypeName.BINARY` without `OriginalType`

Author: Takuya UESHIN <[email protected]>

Closes apache#1373 from ueshin/issues/SPARK-2446 and squashes the following commits:

ecacb92 [Takuya UESHIN] Add BinaryType support to Parquet I/O.
616e04a [Takuya UESHIN] Make StringType use OriginalType.UTF8.

ueshin added 2 commits July 11, 2014 18:12

Make StringType use OriginalType.UTF8.

616e04a

Add BinaryType support to Parquet I/O.

ecacb92

asfgit closed this in 9fe693b Jul 14, 2014
