Conversation

@wgtmac (Member) commented Sep 19, 2016

What changes were proposed in this pull request?

When using SparkSession in Spark 2.0 to read a Hive table stored as Parquet files, if a column's type has evolved from int to long we get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. Specifically, the exception is thrown when some old Parquet files use int for the column, some newer files use long, and the Hive metastore declares the column as long. Because Hive and Presto consider this kind of schema evolution valid, this PR allows writing an int value into the row holder when the table schema in the Hive metastore says long.

This fix covers the non-vectorized Parquet reader; I will create a separate JIRA for the vectorized Parquet reader and follow up with a fix later.
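To illustrate the failure mode described above, here is a minimal sketch in plain Python, not Spark code: the class and function names are hypothetical analogues of Spark's MutableInt/MutableLong row holders, and the "fix" simply widens the int value when the holder's declared type is long.

```python
# Hypothetical analogues of Spark's typed mutable row holders.
class MutableInt:
    def __init__(self):
        self.value = 0

class MutableLong:
    def __init__(self):
        self.value = 0

def write_int_strict(holder, v):
    # Mirrors the failing behaviour: the converter for an int column assumes
    # the holder is a MutableInt, analogous to the ClassCastException in Spark.
    if not isinstance(holder, MutableInt):
        raise TypeError("MutableLong cannot be cast to MutableInt")
    holder.value = v

def write_int_upcasting(holder, v):
    # The proposed behaviour: when the table schema says long, accept an int
    # value from an old file and store it widened into the MutableLong holder.
    if isinstance(holder, (MutableInt, MutableLong)):
        holder.value = int(v)  # int -> long widening is lossless
    else:
        raise TypeError(f"unexpected holder {type(holder).__name__}")

# The metastore schema says long, so the row holder is a MutableLong,
# but an old Parquet file delivers int values for this column.
holder = MutableLong()
try:
    write_int_strict(holder, 42)
except TypeError as e:
    print("strict:", e)

write_int_upcasting(holder, 42)
print("upcast:", holder.value)
```

The sketch only shows why the strict per-type holder check fails and why int-to-long widening is safe; the actual PR changes the Catalyst Parquet converter, not user-facing APIs.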

How was this patch tested?

Manually tested: created Parquet files with an int column, created a Hive table declaring that column as long, then ran spark.sql("select * from table") to query all data from the table.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Do you mind if I ask you to fix the title so it is complete, without ... ?

@HyukjinKwon (Member) commented Sep 20, 2016

-1: As far as I know, we pick up a single Parquet file to infer the Spark-side schema. In that case it is ambiguous which file is "new" and which is "old", so reading long as int would sometimes fail while reading int as long would succeed.

I guess we would need to enable the schema-merging option so the schema is inferred from all Parquet files first, but we do not support merging schemas with upcasting - SPARK-15516. So, IMHO, SPARK-15516 blocks this.

If instead we are talking about setting the schema explicitly, then this becomes a subset of SPARK-16544. I already submitted a PR for that case, #14215, but closed it in favor of a better approach. If this looks good, I'd like to re-open my old PR; the approach here is virtually the same as mine.

@wgtmac (Member, Author) commented Sep 20, 2016

@HyukjinKwon Yup, this PR is very similar to yours.

Merging the Parquet schemas won't help. Consider a table containing two Parquet files, one with int and one with long for the same column. The DataFrame schema uses long (mergeSchema would also produce long), so when reading the file whose column is int we still run into this problem.
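The argument above can be sketched in a few lines of plain Python. This is a simplified stand-in, not Parquet or Spark internals: the merge rule and file/type names are hypothetical, but it shows that even after merging resolves the column to long, each file is still decoded with its own physical type, so int files need an upcast at read time regardless.

```python
def merge_types(a, b):
    # Simplified upcasting merge rule: int + long -> long.
    order = {"int": 0, "long": 1}
    return a if order[a] >= order[b] else b

# Hypothetical table: one old file with int, one new file with long.
file_schemas = {"old.parquet": "int", "new.parquet": "long"}

merged = "int"
for t in file_schemas.values():
    merged = merge_types(merged, t)
print("merged schema:", merged)  # long

def read_file(path, table_type):
    # Each file is decoded with its own physical type; the table-level
    # (merged) type only tells us what the row holder expects.
    physical = file_schemas[path]
    value = 7  # pretend this value was decoded from the file
    if physical == table_type:
        return value
    if physical == "int" and table_type == "long":
        return int(value)  # the per-file read still sees int; upcast it
    raise TypeError(f"cannot read {physical} as {table_type}")

rows = [read_file(p, merged) for p in file_schemas]
print("rows:", rows)
```

In other words, schema merging only decides the target type; the per-file converter still has to tolerate the narrower physical type, which is exactly what the PR adds.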

@HyukjinKwon (Member) commented Sep 20, 2016

Yea. I meant that if we want to read "old"/"new" Parquet files without a user-given schema by enabling schema merging, we'd hit SPARK-15516 first. That's why I thought that JIRA blocks this case.
