
Conversation

@wgtmac (Member) commented Sep 27, 2016

What changes were proposed in this pull request?

When using SparkSession in Spark 2.0 to read a Hive table stored as Parquet files, if a column's type has evolved from int to long, we get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. To be specific, the exception is thrown when some old Parquet files use int for the column, some new Parquet files use long, and the Hive metastore declares the column as long. Because Hive and Presto treat this kind of schema evolution as valid, this PR allows writing an int value into a long field when the table schema in the Hive metastore is long.

This applies to the non-vectorized Parquet reader only.
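The fix described above boils down to a widening conversion in the reader's primitive converter: when a file's physical type for the column is INT32 but the table (metastore) type is long, widen the int into the long slot instead of expecting an int slot. A minimal, self-contained sketch of the idea; all class and method names below are illustrative stand-ins, not Spark's actual ParquetRowConverter API:

```java
// Sketch: when the Parquet physical type is INT32 but the table type is LONG,
// the primitive converter can widen each int to a long instead of writing it
// into an int-typed slot and failing with a ClassCastException.
interface RowUpdater {                  // illustrative stand-in for Spark's row updater
    void setLong(long v);
}

final class LongColumn implements RowUpdater {
    long value;
    public void setLong(long v) { value = v; }
}

// Converter for a column whose files may contain INT32 *or* INT64 values.
final class WideningConverter {
    private final RowUpdater updater;
    WideningConverter(RowUpdater updater) { this.updater = updater; }

    // Called for files where the column's physical type is INT64.
    void addLong(long v) { updater.setLong(v); }

    // Called for files where the column's physical type is INT32:
    // widen int -> long rather than requiring an int-typed slot.
    void addInt(int v) { updater.setLong((long) v); }
}

public class Main {
    public static void main(String[] args) {
        LongColumn col = new LongColumn();
        WideningConverter c = new WideningConverter(col);
        c.addInt(42);                    // value from an "old" file with int type
        System.out.println(col.value);   // 42
        c.addLong(5_000_000_000L);       // value from a "new" file with long type
        System.out.println(col.value);   // 5000000000
    }
}
```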

How was this patch tested?

Manually tested: created Parquet files with an int column, created a Hive table declaring that column as long, then ran spark.sql("select * from table") to query all data from the table.

…> Long when parquet files have Int as its type while hive metastore has Long as its type
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

Please refer to the discussion in #15155 (for other reviewers).
@wgtmac BTW, you don't have to open a new PR when you want to update your PR. Just push more commits, or rebase if there is a conflict.

@wgtmac (Member, Author) commented Sep 27, 2016

@HyukjinKwon Yup. I made a mistake managing my branches, so I decided to create the PR again. Sorry for the confusion...

@sameeragarwal (Member)

@HyukjinKwon thanks for taking a look at the previous patch. I suggested this fix to @wgtmac offline, but perhaps I didn't quite understand the "old"/"new" file issue that you mentioned. Can you please elaborate here?

@HyukjinKwon (Member) commented Sep 28, 2016

Hi @sameeragarwal, I just meant that if we want to read new and old Parquet files (one having int and one having long for the same column) as described in the JIRA and PR description, we could do one of the following:

  • Read with the "inferred" schema. (-1 if this PR tries to deal with this)
    We pick up a single Parquet file to infer the Spark-side schema (in most cases, because I remember we disabled writing the summary file by default). In this case, it is ambiguous which file is "new" and which is "old". So sometimes reading long as int (downcasting) would fail, and sometimes reading int as long (upcasting) would succeed.
  • Read with a user-specified schema. (I am neutral on this)
    This would turn into a subset of SPARK-16544. I already submitted a PR for that case, [SPARK-16544][SQL][WIP] Support for conversion from compatible schema for Parquet data source when data types are not matched #14215, but decided to close it in favor of a better approach. If this approach looks good, I'd like to bring back and re-open my old PR. I guess the approach here is virtually the same as my old one.
  • Read with merged schemas. (-1 if this PR tries to deal with this)
    This case would face SPARK-15516.

If this PR deals with only the second case, we should take care of more types. I mean, I guess we don't want multiple PRs and JIRAs for each type, each data source, and the vectorized/non-vectorized readers. I might be wrong, but this is what I thought.
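As a side note on the asymmetry in the first bullet: upcasting int to long is always lossless, while downcasting long to int can overflow, which is why the "read long as int" direction has to fail. A tiny illustrative sketch (not Spark code; the helper names are hypothetical):

```java
public class CastAsymmetry {
    // Widening int -> long: always exact, since every int value fits in a long.
    static long upcast(int v) { return v; }

    // Narrowing long -> int: only valid when the value fits; otherwise the
    // reader has no correct answer, which is why reading a long column
    // through an int-typed schema has to fail rather than truncate.
    static int downcast(long v) {
        if (v < Integer.MIN_VALUE || v > Integer.MAX_VALUE) {
            throw new ArithmeticException("long value " + v + " does not fit in an int");
        }
        return (int) v;
    }

    public static void main(String[] args) {
        System.out.println(upcast(Integer.MAX_VALUE));   // 2147483647, exact
        System.out.println(downcast(123L));              // 123, fits
        try {
            downcast(5_000_000_000L);                    // too large for an int
        } catch (ArithmeticException e) {
            System.out.println("downcast failed: " + e.getMessage());
        }
    }
}
```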

BTW, if this change looks okay, I'd try to reopen my previous one rather than duplicating effort.

@HyukjinKwon (Member) commented Sep 28, 2016

I wouldn't stand against it, but rather stay neutral, if you strongly feel this is the right fix. We could merge this and then introduce general upcasting logic later.

@sameeragarwal (Member)

That's a great explanation. Yes, this patch is really targeted at fixing the second issue, and it does seem like a subset of SPARK-16544. If you have cycles to work on a broader fix, it'd be great to have your patch instead; otherwise we can just merge this smaller fix for now (along with a small test case).

@HyukjinKwon (Member) commented Sep 29, 2016

@sameeragarwal Thanks. One thing I'd like to be sure of, though: do you think this is the right place to fix? I just want to be very sure about this before I reopen my old PR and proceed further.

(BTW, I don't mind merging this first, as I think it'd take a while to get the general approach merged anyway.)

@sameeragarwal (Member)

^ @davies what do you think? Would ParquetRowConverter be a good place to handle this schema evolution between logical and physical types?

@sameeragarwal (Member)

@HyukjinKwon @wgtmac I confirmed with Davies that ParquetRowConverter should be the right place for this fix. Given that it's not a regression and will only be released in 2.1, I'm more inclined to close this patch in favor of #14215 which has a much broader scope. Please let me know what you guys think.

@wgtmac (Member, Author) commented Sep 29, 2016

@sameeragarwal I agree. I'm looking forward to a comprehensive patch for this. Thanks!

@HyukjinKwon (Member)

@wgtmac @sameeragarwal Yup. I don't mind keeping it open or closing it. Please feel free to review #14215 then. I will rebase it and proceed further. Thank you both.

@HyukjinKwon (Member)

BTW, I just wonder if this PR can be closed, given that we want to do this for more types :).

@sameeragarwal (Member)

@HyukjinKwon I agree. Would you have cycles to re-open #14215 by any chance? It'd be great to have that in 2.2.

@HyukjinKwon (Member)

Yep, I will try to get back to that after finishing up a few issues I am currently working on. I just realised it'd take a bit of time for me to proceed (as I noticed it needs a more careful touch). Please feel free to take it over if anyone is interested. Otherwise, let me try to proceed even if it takes a while.

