[SPARK-16544][SQL][WIP] Support for conversion from compatible schema for Parquet data source when data types are not matched #14215
|
Hi @gatorsmile @dongjoon-hyun @liancheng, currently this deals with only Before proceeding further, I want to be sure that this approach looks good. Could I ask for some feedback, please? (Should this maybe be handled as a single PR, with follow-ups for the other stuff?) |
|
Test build #62365 has finished for PR 14215 at commit
|
|
Currently, the error message is still confusing. Could we first improve the error handling, that is, detect the schema mismatch and issue an appropriate error message? You know, this is not only for Parquet. The other data sources face the same issue. The current implementation is very specific to Parquet. I am wondering if the other data sources face the same issue. We need a better design for resolving all of them. |
|
I see. Yes, I will think of a better way to fix the message. Yes, it still happens across other data sources, and this implementation is currently specific to Parquet. However, I wonder if we can implement them step by step. Actually, I kind of put possibly generalizable things together in I just want to do this bit by bit rather than changing a bunch of code at once (changing the code like that would make it really hard to review and, to be honest, I believe it would not get reviewed for a really long time). BTW, does the approach look okay anyway (I mean converting the value before setting it in the row)? |
|
For handling messages, I will open a separate PR soon! |
|
I am closing this for now. I will reopen it or suggest a better way later. |
|
I am reopening this. Please refer to the discussion in #15264. |
|
Test build #66086 has finished for PR 14215 at commit
|
… when data types are not matched
b45f2ea to 371a067
|
Test build #66177 has finished for PR 14215 at commit
|
|
@HyukjinKwon Do you have a timeline for this patch? |
|
@wgtmac Thanks for pinging. I think I can proceed with this over the weekend. I haven't looked into the vectorized reader closely yet. If you have already looked into that, I think it would also make sense for me not to deal with the vectorized one here and to leave it to another PR that you might open. |
|
@wgtmac BTW, as you might already know, my plan and thought is to implement each one first and then unify them within a common parent at the end, if possible and if it makes sense. I would like to avoid a lot of changes in a single PR. |
|
@HyukjinKwon yep, keeping each PR as small as possible is a good idea. BTW, may I know the target version of your non-vectorized fix? Our production job needs this fix. Separating the vectorized and non-vectorized ones also makes sense to me. Since you're working on the non-vectorized one, I will take a look at the vectorized side when I have time, but I'm not sure if I can make it. I'll keep an eye on your progress, and feel free to add me as a subscriber to your relevant fixes. Thanks! |
|
@wgtmac I hope this one is merged into 2.1, but I believe that is not for me to decide. In any case, I will take the vectorized one described in the PR out then. |
|
@wgtmac Sorry, I will try to complete this within this week. I have been busy. |
|
@HyukjinKwon no problem. Take your time. |
|
Hm, I am trying to make another clean version, but it seems to be taking a bit of time. I will close this and open it again when I am ready. Please feel free to take this over in the meantime. |
What changes were proposed in this pull request?
This PR adds schema compatibility for Parquet.
Currently, if the user-given schema differs from the Parquet schema, an exception is thrown even when the user-given schema is compatible with the Parquet schema.
For example, executing the code below:
throws an exception as below:
This PR lets Parquet support this schema compatibility.
NumericType except DecimalType.
AtomicType.
How was this patch tested?
Unit tests in ParquetIOSuite.
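To make the idea discussed in this thread concrete, here is a minimal, hypothetical sketch in plain Python of the two pieces mentioned above: checking whether a file type can be upcast to the user-given type (numeric types except decimal), and converting the value before setting it in the row. It is not the code from this PR and does not use Spark or Parquet APIs. The type names, the widening order, and the helpers `can_upcast` and `read_row` are all assumptions made for illustration.

```python
# Hypothetical sketch only: plain strings stand in for Spark SQL data types.
# Assumed widening order for numeric types; decimal is deliberately excluded.
NUMERIC_WIDENING = ["byte", "short", "int", "long", "float", "double"]


def can_upcast(file_type, requested_type):
    """Return True if a value stored as file_type may be read as requested_type."""
    if file_type == requested_type:
        return True
    if file_type in NUMERIC_WIDENING and requested_type in NUMERIC_WIDENING:
        # Only widening is allowed, never narrowing.
        return NUMERIC_WIDENING.index(file_type) < NUMERIC_WIDENING.index(requested_type)
    return False


def read_row(values, file_schema, requested_schema):
    """Build a row for requested_schema, converting each value before setting it."""
    row = {}
    for name, requested_type in requested_schema.items():
        file_type = file_schema[name]
        if not can_upcast(file_type, requested_type):
            # A clear message naming the column and both types.
            raise TypeError(
                f"Column '{name}': cannot read {file_type} as {requested_type}"
            )
        value = values[name]
        # Convert before setting the value in the row.
        if value is not None and requested_type in ("float", "double"):
            value = float(value)
        row[name] = value
    return row


if __name__ == "__main__":
    print(read_row({"a": 1}, {"a": "int"}, {"a": "double"}))
```

Note that under this assumed ordering, reading a long as a float is permitted even though it can lose precision. Whether that should count as compatible is a design decision the sketch does not settle.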