[SPARK-17477][SQL] SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type #15264
Conversation
Can one of the admins verify this patch?
@HyukjinKwon Yup. I made a mistake in managing my branches, so I decided to create the PR again. Sorry for the confusion...
@HyukjinKwon thanks for taking a look at the previous patch. I suggested this fix to @wgtmac offline, but perhaps I didn't quite understand the "old"/"new" file issue that you mentioned. Can you please elaborate on it here?
Hi @sameeragarwal, I just meant that if we want to read new and old Parquet files (one having int and one having long for the same column) as described in the JIRA and PR description, we could do one of the following.
If this PR is dealing with only the second case, we should take care of more types. I mean, I guess we don't want multiple PRs and JIRAs for each type, each data source, and the vectorized/non-vectorized readers. I might be wrong, but this was what I thought. BTW, if this change looks okay, I'd try to reopen my previous one rather than making duplicate efforts.
I wouldn't be against it, but rather neutral, if you strongly feel this one is the right fix. We could go ahead and merge this and then introduce a general upcasting logic later.
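For context, here is a minimal, self-contained Scala sketch of the "general upcasting logic" idea mentioned above (the type names and the `upcast` helper are hypothetical illustrations, not Spark's internal converter API): when a file stores a narrower type than the table schema declares, widen the value at read time.

```scala
// Minimal, self-contained sketch of read-time upcasting (hypothetical names,
// not Spark's internal converter API).
sealed trait SqlType
case object IntType    extends SqlType
case object LongType   extends SqlType
case object FloatType  extends SqlType
case object DoubleType extends SqlType

object UpcastSketch extends App {
  // Widen a value decoded with the file's (narrower) type to the type the
  // table schema declares.
  def upcast(value: Any, fileType: SqlType, tableType: SqlType): Any =
    (fileType, tableType) match {
      case (f, t) if f == t        => value                          // no widening needed
      case (IntType, LongType)     => value.asInstanceOf[Int].toLong // the case in this PR
      case (IntType, DoubleType)   => value.asInstanceOf[Int].toDouble
      case (FloatType, DoubleType) => value.asInstanceOf[Float].toDouble
      case (f, t) =>
        throw new UnsupportedOperationException(s"cannot upcast $f to $t")
    }

  // A file wrote 42 as int, but the metastore declares the column long.
  println(upcast(42, IntType, LongType)) // 42 (as a Long)
}
```

A general version of this, applied per column across all supported types, would cover the other datasource/reader combinations in one place instead of one PR per type.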
That's a great explanation. Yes, this patch is really targeted at fixing the second issue, and it does seem like a subset of SPARK-16544. If you have cycles to work on a broader fix, it'd be great to have your patch instead; otherwise we can just merge this smaller fix for now (along with a small test case).
@sameeragarwal Thanks. One thing I'd like to be sure of, though, is: do you think this is the right place to fix it? I just want to be very sure about this before I reopen my old PR and proceed further. (BTW, I don't mind merging this first, as I think it'd take a bit of time to get the general approach merged anyway.)
^ @davies what do you think? Would …
@HyukjinKwon @wgtmac I confirmed with Davies that …
@sameeragarwal I agree. I'm looking forward to a comprehensive patch for this. Thanks!
@wgtmac @sameeragarwal Yup. For me, I don't mind keeping it open or closed. Please feel free to review #14215 then. I will rebase it and then proceed further. Thank you both.
BTW, I just wonder if this PR is closable if we want to do this with more types :).
@HyukjinKwon I agree. Would you have cycles to re-open #14215 by any chance? This is something that'd be great to have in 2.2.
Yeap, I will try to get back to that after finishing up a few issues I am currently working on. I just realised that it'd take a bit of time for me to proceed (as I noticed it needs a more careful touch). Please feel free to take it over if anyone is interested. Otherwise, let me try to proceed even if it takes a while.
What changes were proposed in this pull request?
Using SparkSession in Spark 2.0 to read a Hive table stored as Parquet files, if a column's type has evolved from int to long, we will get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. To be specific, if some old Parquet files use int for the column while some new Parquet files use long, and the Hive metastore declares the column as long, the aforementioned exception is thrown. Because Hive and Presto deem this kind of schema evolution valid, this PR allows writing an int value when the table schema in the Hive metastore is long.
This is for the non-vectorized Parquet reader only.
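To illustrate the failure mode, here is a simplified, self-contained sketch (these are stand-in classes for illustration, not Spark's actual catalyst ones): the mutable row slot is allocated from the metastore schema (long), while the converter generated from the file schema (int) expects an int slot.

```scala
// Simplified stand-ins for the catalyst mutable-value classes (illustration only).
sealed trait MutableValue
final class MutableInt  extends MutableValue { var value: Int  = 0 }
final class MutableLong extends MutableValue { var value: Long = 0L }

object CastFailureSketch extends App {
  // The row slot is a MutableLong because the Hive metastore declares the column long...
  val slot: MutableValue = new MutableLong
  // ...but a converter built from the Parquet file schema (int) does the equivalent of:
  slot.asInstanceOf[MutableInt].value = 42 // throws java.lang.ClassCastException
}
```

The fix makes the converter upcast the decoded int to long before storing it, so the long slot is filled directly instead of being cast to an int slot.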
How was this patch tested?
Manually tested: create Parquet files with int type in the schema, create a Hive table declaring the column as long, then run spark.sql("select * from table") to query all data from the table.
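A sketch of that manual test, assuming a Hive-enabled SparkSession and a writable /tmp path (the table name and location here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object SchemaEvolutionRepro extends App {
  val spark = SparkSession.builder()
    .appName("int-to-long-schema-evolution")
    // Exercise the non-vectorized Parquet path that this PR targets.
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .enableHiveSupport()
    .getOrCreate()

  // Write Parquet files whose column is physically an int.
  spark.range(5)
    .selectExpr("cast(id as int) as col")
    .write.mode("overwrite").parquet("/tmp/evolve")

  // Declare the same column as bigint (long) in the Hive metastore.
  spark.sql("DROP TABLE IF EXISTS evolve")
  spark.sql(
    "CREATE EXTERNAL TABLE evolve (col bigint) STORED AS PARQUET LOCATION '/tmp/evolve'")

  // Without the fix this fails with the MutableLong -> MutableInt
  // ClassCastException; with it, the int values are upcast to long.
  spark.sql("SELECT * FROM evolve").show()
}
```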