Skip to content

Conversation

@ueshin
Copy link
Member

@ueshin ueshin commented Jul 11, 2014

To support BinaryType, the following changes are needed:

  • Make StringType use OriginalType.UTF8
  • Add BinaryType using PrimitiveTypeName.BINARY without OriginalType

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA tests have started for PR 1373. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16565/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA results for PR 1373:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16565/consoleFull

@marmbrus
Copy link
Contributor

Thanks for the patch! One quick question: will this change the behavior when loading in string data that was saved with previous versions of Spark SQL?

@ueshin
Copy link
Member Author

ueshin commented Jul 11, 2014

@marmbrus Yes, I think so.
But this new behavior is the same as Avro, Thrift and the next Hive (0.14).
To load the string data saved with previous versions, Cast to StringType will be needed.

@marmbrus
Copy link
Contributor

Okay, I figured that was the case, but you are right that compatibility with other systems is the right thing to do here.

@asfgit asfgit closed this in 9fe693b Jul 14, 2014
@marmbrus
Copy link
Contributor

Thanks for adding this. I've merge this into master, but not branch-1.0 due to the change in semantics. I also updated the commit message to include a disclaimer about the new semantics.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Note that this commit changes the semantics when loading in data that was created with prior versions of Spark SQL.  Before, we were writing out strings as Binary data without adding any other annotations. Thus, when data is read in from prior versions, data that was StringType will now become BinaryType.  Users that need strings can CAST that column to a String.  It was decided that while this breaks compatibility, it does make us compatible with other systems (Hive, Thrift, etc) and adds support for Binary data, so this is the right decision long term.

To support `BinaryType`, the following changes are needed:
- Make `StringType` use `OriginalType.UTF8`
- Add `BinaryType` using `PrimitiveTypeName.BINARY` without `OriginalType`

Author: Takuya UESHIN <[email protected]>

Closes apache#1373 from ueshin/issues/SPARK-2446 and squashes the following commits:

ecacb92 [Takuya UESHIN] Add BinaryType support to Parquet I/O.
616e04a [Takuya UESHIN] Make StringType use OriginalType.UTF8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants