Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jul 15, 2016

What changes were proposed in this pull request?

This PR adds schema compatibility for Parquet.

Currently if user-given schema is different with the Parquet schema, it throws an exception even when the user-given schema is compatible with Parquet schema.

For example, executing the codes below:

val path = "/tmp/test.parquet"
val data = (1 to 4).map(Tuple1(_))
spark.createDataFrame(data).toDF("a").write.parquet(path)
val schema = StructType(StructField("a", LongType, true) :: Nil)
spark.read.schema(schema).parquet(path).show()

throws an exception as below:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 
...

This PR lets Parqet supports this schema compatibility.

  • Schema compatibility for NumericType except DecimalType.
  • Schema compatibility for other AtomicType.

How was this patch tested?

Unit tests in ParquetIOSuite.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jul 15, 2016

Hi @gatorsmile @dongjoon-hyun @liancheng , currently this deals with only NumericType except DecimalType for upcasting only for non-vectorized reader.

Before proceeding further, I want to be sure that this approach looks good. Could I ask some feedback please (should this be maybe handled as single PR and be other follow-ups for the other stuff?) ?

@SparkQA
Copy link

SparkQA commented Jul 15, 2016

Test build #62365 has finished for PR 14215 at commit b45f2ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 15, 2016

Currently, the error message is still confusing.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 

Could we first improve the error handling? Detecting the schema mismatching and issue an appropriate error message. You know, this is not only for Parquet. The other data sources face the same issue.

Regarding the current implementation, it is very specific to Parquet. I am wondering if the other data sources face the same issue? We need a better design for resolving all of them.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jul 15, 2016

I see, yes I will think of a better way to fix the message. Yea it is still happening across other data sources and this implementation is currently specific to Parquet.

However, I just wonder if we can implement them step by step. Actually, I kind of put possibly generalizable things together in ParquetSchemaCompatibility. For example, ORC is doing this very similarly with Parquet, HiveInspectors.scala#L630-L649.

I just want to do this bit by bit rather than changing a bunch of codes at once (it is also because changing the codes like that would make it really hard to be reviewed and, to be honest, I believe it does not really get reviewed for really long time)..

BTW, does that look okay anyway (I mean converting the value before setting the value to the row)?

@HyukjinKwon
Copy link
Member Author

For handling messages, I will open a separate PR soon!

@HyukjinKwon
Copy link
Member Author

I am closing this for now. I will reopen or suggest better way later.

@HyukjinKwon
Copy link
Member Author

I am reopening this. Please refer the discussion in #15264

@HyukjinKwon HyukjinKwon reopened this Sep 29, 2016
@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66086 has finished for PR 14215 at commit b45f2ea.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 30, 2016

Test build #66177 has finished for PR 14215 at commit 371a067.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wgtmac
Copy link
Member

wgtmac commented Oct 7, 2016

@HyukjinKwon Do you have a timeline for this patch?
Also, what's your plan on vectorized parquet reader?

@HyukjinKwon
Copy link
Member Author

@wgtmac Thanks for pinging. I think I can proceed this on this weekend. I haven't looked into vectorized one closely yet. If you have already looked into that, I think it'd also make sense not to deal with the vectorized one but in another PR you might open.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Oct 7, 2016

@wgtmac BTW, as you might already know, my plan and though is, to implement each first and then unify them within a common parent at the end if possible and it makes sense. I would like to avoid a lot of changes in a single PR.

@wgtmac
Copy link
Member

wgtmac commented Oct 7, 2016

@HyukjinKwon yep, keep each PR as small as possible is a good idea. BTW, may I know the target version of your non-vectorize fix? Our production job is in need of this fix.

Separating vectorized and non-vectorized one also makes sense to me. Since you're working on non-vectorized one, I will take a look at vectorized side when I have time but not sure if I can make it. I'll keep an eye on your progress and feel free to add me as a subscriber into your relevant fixes. Thanks!

@HyukjinKwon
Copy link
Member Author

@wgtmac I hope this one is merged into 2.1 but I believe I am not supposed to decide it. I will anyway take out of the vectorized one described in the PR then.

@HyukjinKwon
Copy link
Member Author

@wgtmac Sorry, I will try to make this complete this within this week. I was busy for some reasons.

@wgtmac
Copy link
Member

wgtmac commented Oct 9, 2016

@HyukjinKwon no problem. Take your time.

@HyukjinKwon
Copy link
Member Author

Hm, I am trying to make another clean version but it seems taking a bit of time. I will close this and open again when I am ready. Please feel free to take over this meanwhile.

@HyukjinKwon HyukjinKwon closed this Nov 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants