
Conversation

@wgtmac (Member) commented Sep 27, 2016

What changes were proposed in this pull request?

When using SparkSession in Spark 2.0 to read a Hive table stored as Parquet files, if a column's type has evolved from int to long, we get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. To be specific, the exception is thrown when some old Parquet files use int for the column, some new Parquet files use long, and the Hive metastore declares the column as long. Because Hive and Presto treat this kind of schema evolution as valid, this PR allows writing an int value into a long field when the table schema in the Hive metastore is long.

This applies to the non-vectorized Parquet reader only.
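The fix described above boils down to a widening conversion in the reader's primitive converter: when a file's physical type for the column is INT32 but the table (metastore) type is long, widen the int into the long slot instead of expecting an int slot. A minimal, self-contained sketch of the idea; all class and method names below are illustrative stand-ins, not Spark's actual ParquetRowConverter API:

```java
// Sketch: when the Parquet physical type is INT32 but the table type is LONG,
// the primitive converter can widen each int to a long instead of writing it
// into an int-typed slot and failing with a ClassCastException.
interface RowUpdater {                  // illustrative stand-in for Spark's row updater
    void setLong(long v);
}

final class LongColumn implements RowUpdater {
    long value;
    public void setLong(long v) { value = v; }
}

// Converter for a column whose files may contain INT32 *or* INT64 values.
final class WideningConverter {
    private final RowUpdater updater;
    WideningConverter(RowUpdater updater) { this.updater = updater; }

    // Called for files where the column's physical type is INT64.
    void addLong(long v) { updater.setLong(v); }

    // Called for files where the column's physical type is INT32:
    // widen int -> long rather than requiring an int-typed slot.
    void addInt(int v) { updater.setLong((long) v); }
}

public class Main {
    public static void main(String[] args) {
        LongColumn col = new LongColumn();
        WideningConverter c = new WideningConverter(col);
        c.addInt(42);                    // value from an "old" file with int type
        System.out.println(col.value);   // 42
        c.addLong(5_000_000_000L);       // value from a "new" file with long type
        System.out.println(col.value);   // 5000000000
    }
}
```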

How was this patch tested?

Manually tested: created Parquet files with an int column, created a Hive table declaring that column as long, then ran spark.sql("select * from table") to query all data from the table.

…> Long when parquet files have Int as its type while hive metastore has Long as its type
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

Please refer to the discussion in #15155 (for other reviewers).
@wgtmac BTW, you don't have to open a new PR when you want to update your PR. Just push more commits, or rebase if there is a conflict.

@wgtmac (Member, Author) commented Sep 27, 2016

@HyukjinKwon Yup. I made a mistake managing my branches, so I decided to create the PR again. Sorry for the confusion...

@sameeragarwal (Member)

@HyukjinKwon thanks for taking a look at the previous patch. I suggested this fix to @wgtmac offline, but perhaps I didn't quite understand the "old"/"new" file issue that you mentioned. Can you please elaborate here?

@HyukjinKwon (Member) commented Sep 28, 2016

Hi @sameeragarwal, I just meant that if we want to read new and old Parquet files (one having int and one having long for the same column) as described in the JIRA and PR description, we could do one of the following:

  • Read with the "inferred" schema. (-1 if this PR tries to deal with this)
    We pick up a single Parquet file to infer the Spark-side schema (in most cases, because I remember we disabled writing the summary file by default). In this case, it is ambiguous which file is "new" and which is "old". So sometimes reading long as int (downcasting) would fail, and sometimes reading int as long (upcasting) would succeed.
  • Read with a user-specified schema. (I am neutral on this)
    This would turn into a subset of SPARK-16544. I already submitted a PR for that case, [SPARK-16544][SQL][WIP] Support for conversion from compatible schema for Parquet data source when data types are not matched #14215, but decided to close it in favor of a better approach. If this approach looks good, I'd like to bring back and re-open my old PR. I guess the approach here is virtually the same as my old one.
  • Read with merged schemas. (-1 if this PR tries to deal with this)
    This case would face SPARK-15516.

If this PR deals with only the second case, we should take care of more types. I mean, I guess we don't want multiple PRs and JIRAs for each type, each data source, and the vectorized/non-vectorized readers. I might be wrong, but this is what I thought.
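As a side note on the asymmetry in the first bullet: upcasting int to long is always lossless, while downcasting long to int can overflow, which is why the "read long as int" direction has to fail. A tiny illustrative sketch (not Spark code; the helper names are hypothetical):

```java
public class CastAsymmetry {
    // Widening int -> long: always exact, since every int value fits in a long.
    static long upcast(int v) { return v; }

    // Narrowing long -> int: only valid when the value fits; otherwise the
    // reader has no correct answer, which is why reading a long column
    // through an int-typed schema has to fail rather than truncate.
    static int downcast(long v) {
        if (v < Integer.MIN_VALUE || v > Integer.MAX_VALUE) {
            throw new ArithmeticException("long value " + v + " does not fit in an int");
        }
        return (int) v;
    }

    public static void main(String[] args) {
        System.out.println(upcast(Integer.MAX_VALUE));   // 2147483647, exact
        System.out.println(downcast(123L));              // 123, fits
        try {
            downcast(5_000_000_000L);                    // too large for an int
        } catch (ArithmeticException e) {
            System.out.println("downcast failed: " + e.getMessage());
        }
    }
}
```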

BTW, if this change looks okay, I'd try to reopen my previous one rather than duplicating effort.

@HyukjinKwon (Member) commented Sep 28, 2016

I wouldn't stand against it, but rather stay neutral, if you strongly feel this is the right fix. We could merge this and then introduce general upcasting logic later.

@sameeragarwal (Member)

That's a great explanation. Yes, this patch is really targeted at fixing the second issue, and it does seem like a subset of SPARK-16544. If you have cycles to work on a broader fix, it'd be great to have your patch instead; otherwise we can just merge this smaller fix for now (along with a small test case).

@HyukjinKwon (Member) commented Sep 29, 2016

@sameeragarwal Thanks. One thing I'd like to be sure of, though: do you think this is the right place to fix? I just want to be very sure about this before I reopen my old PR and proceed further.

(BTW, I don't mind merging this first, as I think it'd take a while to get the general approach merged anyway.)

@sameeragarwal (Member)

^ @davies what do you think? Would ParquetRowConverter be a good place to handle this schema evolution between logical and physical types?

@sameeragarwal (Member)

@HyukjinKwon @wgtmac I confirmed with Davies that ParquetRowConverter should be the right place for this fix. Given that it's not a regression and will only be released in 2.1, I'm more inclined to close this patch in favor of #14215 which has a much broader scope. Please let me know what you guys think.

@wgtmac (Member, Author) commented Sep 29, 2016

@sameeragarwal I agree. I'm looking forward to a comprehensive patch for this. Thanks!

@HyukjinKwon (Member)

@wgtmac @sameeragarwal Yup. I don't mind keeping it open or closing it. Please feel free to review #14215 then. I will rebase it and proceed further. Thank you both.

@HyukjinKwon (Member)

BTW, I just wonder if this PR can be closed, given that we want to do this for more types :).

@sameeragarwal (Member)

@HyukjinKwon I agree. Would you have cycles to re-open #14215 by any chance? It'd be great to have that in 2.2.

@HyukjinKwon (Member)

Yep, I will try to get back to that after finishing up a few issues I am currently working on. I just realised it'd take a bit of time for me to proceed (as I noticed it needs a more careful touch). Please feel free to take it over if anyone is interested. Otherwise, let me try to proceed even if it takes a while.

