[SPARK-16544][SQL][WIP] Support for conversion from compatible schema for Parquet data source when data types are not matched #14215

HyukjinKwon · 2016-07-15T04:09:10Z

What changes were proposed in this pull request?

This PR adds schema compatibility for Parquet.

Currently, if the user-given schema differs from the Parquet schema, reading throws an exception even when the user-given schema is compatible with the Parquet schema.

For example, executing the code below:

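import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Write a Parquet file whose column "a" is stored as an integer.
val path = "/tmp/test.parquet"
val data = (1 to 4).map(Tuple1(_))
spark.createDataFrame(data).toDF("a").write.parquet(path)
// Read it back while requesting "a" as a long.
val schema = StructType(StructField("a", LongType, true) :: Nil)
spark.read.schema(schema).parquet(path).show()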

throws an exception as below:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 
...

This PR lets Parquet support this schema compatibility:

Schema compatibility for NumericType except DecimalType.
Schema compatibility for other AtomicType.
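
To illustrate what "compatible" could mean for the numeric case, here is a minimal sketch that is not taken from the patch. The type names (`NumType`, `IntT`, and so on) and the `Upcast.isCompatible` helper are hypothetical stand-ins rather than Spark classes, and the rule set (widening conversions that cannot lose information) is an assumption about the intended behaviour.

```scala
// Hypothetical stand-ins for numeric data types; these are not Spark's classes.
sealed trait NumType
case object ByteT extends NumType
case object ShortT extends NumType
case object IntT extends NumType
case object LongT extends NumType
case object FloatT extends NumType
case object DoubleT extends NumType

object Upcast {
  // Integral types ranked from narrowest to widest.
  private val integralRank: Map[NumType, Int] =
    Map(ByteT -> 0, ShortT -> 1, IntT -> 2, LongT -> 3)

  // True when a value stored as `file` can be read as `requested`
  // without losing information (assumed rule set, for illustration only).
  def isCompatible(file: NumType, requested: NumType): Boolean =
    (file, requested) match {
      case (a, b) if a == b => true
      // Integral widening, e.g. Int -> Long.
      case (a, b) if integralRank.contains(a) && integralRank.contains(b) =>
        integralRank(a) < integralRank(b)
      case (FloatT, DoubleT) => true
      // Byte, Short and Int are exactly representable as Double.
      case (ByteT | ShortT | IntT, DoubleT) => true
      // Byte and Short are exactly representable as Float.
      case (ByteT | ShortT, FloatT) => true
      // Everything else (narrowing, Long -> Double, Int -> Float) is rejected.
      case _ => false
    }
}
```

Under these assumed rules, the example in the description (file column stored as an integer, requested as a long) is compatible, while the reverse direction is not.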

How was this patch tested?

Unit tests in ParquetIOSuite.

HyukjinKwon · 2016-07-15T04:10:40Z

Hi @gatorsmile @dongjoon-hyun @liancheng, this currently handles only upcasting for NumericType (except DecimalType), and only in the non-vectorized reader.

Before proceeding further, I want to be sure that this approach looks good. Could I ask for some feedback, please? (Should this perhaps be handled as a single PR, with follow-ups for the rest?)

SparkQA · 2016-07-15T05:49:07Z

Test build #62365 has finished for PR 14215 at commit b45f2ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-07-15T06:25:30Z

Currently, the error message is still confusing.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0

Could we first improve the error handling, that is, detect the schema mismatch and issue an appropriate error message? This is not only a Parquet problem; the other data sources face the same issue.

Regarding the current implementation, it is very specific to Parquet. I am wondering if the other data sources face the same issue? We need a better design for resolving all of them.
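
As an illustration of the kind of up-front check suggested here (a sketch only, not code from Spark or from this PR), the mismatch could be detected by comparing the two schemas before any value is decoded. `Field` and `SchemaCheck.mismatches` are hypothetical names, and types are modelled as plain strings to keep the example independent of any data source.

```scala
// Hypothetical minimal schema model: a field is a name plus a type name.
final case class Field(name: String, dataType: String)

object SchemaCheck {
  // Returns one readable message per requested field whose type differs
  // from the type recorded in the file; an empty result means no mismatch.
  // Fields missing from the file are ignored in this sketch.
  def mismatches(fileSchema: Seq[Field], requested: Seq[Field]): Seq[String] = {
    val fileTypes = fileSchema.map(f => f.name -> f.dataType).toMap
    requested.flatMap { r =>
      fileTypes.get(r.name) match {
        case Some(t) if t != r.dataType =>
          Some(s"Field '${r.name}': file has type $t but requested schema has ${r.dataType}")
        case _ => None
      }
    }
  }
}
```

A caller could raise a single exception listing these messages instead of surfacing a decoding failure from deep inside the reader.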

HyukjinKwon · 2016-07-15T07:00:41Z

I see, yes I will think of a better way to fix the message. Yea it is still happening across other data sources and this implementation is currently specific to Parquet.

However, I just wonder if we can implement them step by step. Actually, I put the possibly generalizable pieces together in ParquetSchemaCompatibility. For example, ORC does this very similarly to Parquet; see HiveInspectors.scala#L630-L649.

I just want to do this bit by bit rather than changing a lot of code at once (partly because such a large change would be really hard to review and, to be honest, I believe it would not get reviewed for a really long time).

BTW, does that look okay anyway (I mean converting the value before setting the value to the row)?
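
For readers unfamiliar with the idea, "converting the value before setting the value to the row" can be pictured roughly as follows. This is a simplified sketch and not the PR's code: `SimpleRow` and `IntToLongConverter` are hypothetical classes standing in for the reader's row and its per-column converter.

```scala
// Hypothetical row: an array of slots standing in for the reader's mutable row.
final class SimpleRow(size: Int) {
  private val values = new Array[Any](size)
  def setLong(ordinal: Int, v: Long): Unit = { values(ordinal) = v }
  def get(ordinal: Int): Any = values(ordinal)
}

// Receives integer values decoded from the file but stores them as Long,
// because the requested schema declares the column as a long.
final class IntToLongConverter(row: SimpleRow, ordinal: Int) {
  def addInt(value: Int): Unit = row.setLong(ordinal, value.toLong)
}
```

The point of the design is that the widening happens once, at the boundary between the file's type and the requested type, so the rest of the pipeline only ever sees the requested type.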

HyukjinKwon · 2016-07-15T07:07:35Z

For handling messages, I will open a separate PR soon!

HyukjinKwon · 2016-09-06T06:59:31Z

I am closing this for now. I will reopen it or suggest a better way later.

HyukjinKwon · 2016-09-29T05:12:22Z

I am reopening this. Please refer to the discussion in #15264.

SparkQA · 2016-09-29T07:31:39Z

Test build #66086 has finished for PR 14215 at commit b45f2ea.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-09-30T17:34:11Z

Test build #66177 has finished for PR 14215 at commit 371a067.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wgtmac · 2016-10-07T00:15:24Z

@HyukjinKwon Do you have a timeline for this patch?
Also, what's your plan on vectorized parquet reader?

HyukjinKwon · 2016-10-07T00:32:03Z

@wgtmac Thanks for pinging. I think I can proceed with this over the weekend. I haven't looked into the vectorized reader closely yet. If you have already looked into it, I think it would also make sense to leave the vectorized reader out of this PR and handle it in another PR that you might open.

HyukjinKwon · 2016-10-07T00:33:29Z

@wgtmac BTW, as you might already know, my plan and thought is to implement each first and then unify them under a common parent at the end, if that is possible and makes sense. I would like to avoid a lot of changes in a single PR.

wgtmac · 2016-10-07T19:45:42Z

@HyukjinKwon Yep, keeping each PR as small as possible is a good idea. BTW, may I know the target version of your non-vectorized fix? Our production job needs this fix.

Separating the vectorized and non-vectorized readers also makes sense to me. Since you're working on the non-vectorized one, I will take a look at the vectorized side when I have time, but I'm not sure I can make it. I'll keep an eye on your progress; feel free to add me as a subscriber to your relevant fixes. Thanks!

HyukjinKwon · 2016-10-09T05:21:19Z

@wgtmac I hope this one gets merged into 2.1, but I believe that is not for me to decide. In any case, I will then take the vectorized reader out of the scope described in the PR.

HyukjinKwon · 2016-10-09T14:56:49Z

@wgtmac Sorry, I will try to complete this within this week. I have been busy.

wgtmac · 2016-10-09T17:24:58Z

@HyukjinKwon no problem. Take your time.

HyukjinKwon · 2016-11-07T04:39:31Z

Hm, I am trying to make another clean version, but it seems to be taking a bit of time. I will close this and open it again when I am ready. Please feel free to take this over in the meantime.

HyukjinKwon closed this Sep 6, 2016

HyukjinKwon mentioned this pull request Sep 10, 2016

[SPARK-17477]: SparkSQL cannot handle schema evolution from Int -> Lo… #15035

Closed

HyukjinKwon mentioned this pull request Sep 20, 2016

[SPARK-17477][SQL] SparkSQL cannot handle schema evolution from Int -… #15155

Closed

HyukjinKwon mentioned this pull request Sep 28, 2016

[SPARK-17477][SQL] SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type #15264

Closed

HyukjinKwon reopened this Sep 29, 2016

Support for conversion from compatible schema for Parquet data source…

371a067

… when data types are not matched

HyukjinKwon force-pushed the SPARK-16544 branch from b45f2ea to 371a067 Compare September 30, 2016 15:39

HyukjinKwon closed this Nov 7, 2016
