Conversation

@wgtmac (Member) commented Sep 19, 2016

What changes were proposed in this pull request?

When using SparkSession in Spark 2.0 to read a Hive table stored as Parquet files, if a column's type has evolved from int to long we get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. Specifically, the exception is thrown when some old Parquet files use int for the column, some newer files use long, and the Hive metastore declares the column as long. Because Hive and Presto consider this kind of schema evolution valid, this PR allows writing an int value into the row holder when the table schema in the Hive metastore says long.

This fix covers the non-vectorized Parquet reader; I will create a separate JIRA for the vectorized Parquet reader and follow up with a fix later.
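To illustrate the failure mode described above, here is a minimal sketch in plain Python, not Spark code: the class and function names are hypothetical analogues of Spark's MutableInt/MutableLong row holders, and the "fix" simply widens the int value when the holder's declared type is long.

```python
# Hypothetical analogues of Spark's typed mutable row holders.
class MutableInt:
    def __init__(self):
        self.value = 0

class MutableLong:
    def __init__(self):
        self.value = 0

def write_int_strict(holder, v):
    # Mirrors the failing behaviour: the converter for an int column assumes
    # the holder is a MutableInt, analogous to the ClassCastException in Spark.
    if not isinstance(holder, MutableInt):
        raise TypeError("MutableLong cannot be cast to MutableInt")
    holder.value = v

def write_int_upcasting(holder, v):
    # The proposed behaviour: when the table schema says long, accept an int
    # value from an old file and store it widened into the MutableLong holder.
    if isinstance(holder, (MutableInt, MutableLong)):
        holder.value = int(v)  # int -> long widening is lossless
    else:
        raise TypeError(f"unexpected holder {type(holder).__name__}")

# The metastore schema says long, so the row holder is a MutableLong,
# but an old Parquet file delivers int values for this column.
holder = MutableLong()
try:
    write_int_strict(holder, 42)
except TypeError as e:
    print("strict:", e)

write_int_upcasting(holder, 42)
print("upcast:", holder.value)
```

The sketch only shows why the strict per-type holder check fails and why int-to-long widening is safe; the actual PR changes the Catalyst Parquet converter, not user-facing APIs.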

How was this patch tested?

Manually tested: created Parquet files with an int column, created a Hive table declaring that column as long, then ran spark.sql("select * from table") to query all data from the table.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Do you mind if I ask you to fix the title so it is complete, without ... ?

@HyukjinKwon (Member) commented Sep 20, 2016

-1: As far as I know, we pick up a single Parquet file to infer the Spark-side schema. In that case it is ambiguous which file is "new" and which is "old", so reading long as int would sometimes fail while reading int as long would succeed.

I guess we would need to enable the schema-merging option so the schema is inferred from all Parquet files first, but we do not support merging schemas with upcasting - SPARK-15516. So, IMHO, SPARK-15516 blocks this.

If instead we are talking about setting the schema explicitly, then this becomes a subset of SPARK-16544. I already submitted a PR for that case, #14215, but closed it in favor of a better approach. If this looks good, I'd like to re-open my old PR; the approach here is virtually the same as mine.

@wgtmac (Member, Author) commented Sep 20, 2016

@HyukjinKwon Yup, this PR is very similar to yours.

Merging the Parquet schemas won't help. Consider a table containing two Parquet files, one with int and one with long for the same column. The DataFrame schema uses long (mergeSchema would also produce long), so when reading the file whose column is int we still run into this problem.
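The argument above can be sketched in a few lines of plain Python. This is a simplified stand-in, not Parquet or Spark internals: the merge rule and file/type names are hypothetical, but it shows that even after merging resolves the column to long, each file is still decoded with its own physical type, so int files need an upcast at read time regardless.

```python
def merge_types(a, b):
    # Simplified upcasting merge rule: int + long -> long.
    order = {"int": 0, "long": 1}
    return a if order[a] >= order[b] else b

# Hypothetical table: one old file with int, one new file with long.
file_schemas = {"old.parquet": "int", "new.parquet": "long"}

merged = "int"
for t in file_schemas.values():
    merged = merge_types(merged, t)
print("merged schema:", merged)  # long

def read_file(path, table_type):
    # Each file is decoded with its own physical type; the table-level
    # (merged) type only tells us what the row holder expects.
    physical = file_schemas[path]
    value = 7  # pretend this value was decoded from the file
    if physical == table_type:
        return value
    if physical == "int" and table_type == "long":
        return int(value)  # the per-file read still sees int; upcast it
    raise TypeError(f"cannot read {physical} as {table_type}")

rows = [read_file(p, merged) for p in file_schemas]
print("rows:", rows)
```

In other words, schema merging only decides the target type; the per-file converter still has to tolerate the narrower physical type, which is exactly what the PR adds.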

@HyukjinKwon (Member) commented Sep 20, 2016

Yea. I meant that if we want to read "old"/"new" Parquet files without a user-given schema by enabling schema merging, we'd hit SPARK-15516 first. That's why I thought that JIRA blocks this case.
