[SPARK-12300][SQL][PYSPARK] fix schema inference on local collections #10275
Test build #47608 has finished for PR 10275 at commit
Test build #47611 has finished for PR 10275 at commit
cc @nchammas - this is the fix for the local version of SPARK-2870.
ping @davies if you have a chance to look at this.
Test build #48491 has finished for PR 10275 at commit
LGTM, merging this into master and 1.6, thanks!
Current schema inference for local Python collections halts as soon as there are no NullTypes. This is different from what happens when we specify a sampling ratio of 1.0 on a distributed collection, and it could result in incomplete schema information.
Author: Holden Karau <[email protected]>
Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
(cherry picked from commit d1ca634)
Signed-off-by: Davies Liu <[email protected]>
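To make the difference concrete, here is a minimal pure-Python sketch of the two strategies described above. It is not the PySpark implementation: the helper names (`infer_row`, `merge`, `infer_early_stop`, `infer_full`) and the use of plain type-name strings are assumptions made for illustration only. It shows how stopping as soon as no null-typed field remains can miss a field that first appears in a later row, while merging over every row does not.

```python
from functools import reduce

# Hypothetical helpers for illustration; these are NOT PySpark's internals.

def infer_row(row):
    """Map each field of a dict row to a type name; None becomes 'null'."""
    return {k: ("null" if v is None else type(v).__name__) for k, v in row.items()}

def merge(a, b):
    """Merge two row schemas: union of fields, 'null' yields to the other type.

    Conflicting concrete types are not resolved here; the first one seen wins.
    """
    merged = dict(a)
    for name, typ in b.items():
        if merged.get(name, "null") == "null":
            merged[name] = typ
    return merged

def has_null(schema):
    """True if any field is still only known as 'null'."""
    return "null" in schema.values()

def infer_early_stop(rows):
    """Stop scanning as soon as the schema contains no 'null' fields."""
    schema = infer_row(rows[0])
    for row in rows[1:]:
        if not has_null(schema):
            break
        schema = merge(schema, infer_row(row))
    return schema

def infer_full(rows):
    """Merge the schema of every row, like sampling the whole collection."""
    return reduce(merge, map(infer_row, rows))

rows = [{"a": 1}, {"a": 2, "b": "x"}]
early = infer_early_stop(rows)  # field "b" is never seen
full = infer_full(rows)         # field "b" is picked up from the second row
```

In this sketch the first row has no null-typed fields, so the early-stop variant returns after looking at a single row and the resulting schema lacks `b`.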
@holdenk Could you take a look at the test failure? I also hit this issue in my local environment without any code change.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48605/consoleFull
It sounds like this is caused by the inferred schema:
Thanks!
I'm on vacation with less than great internet but I'll try and repro locally.
OK, reproduced, and I've got a fix (it seems to be just a test issue from a parallel change in how missing fields are handled).