Conversation

@holdenk (Contributor) commented Dec 12, 2015

Schema inference for local Python collections currently halts as soon as the inferred schema contains no NullTypes. This differs from the behavior when a sampling ratio of 1.0 is specified on a distributed collection, and can result in incomplete schema information.
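The gap can be sketched with a small pure-Python model. This is not Spark's actual inference code; the helpers and the simplified int/float widening rule are illustrative assumptions, with `None` standing in for NullType:

```python
# Sketch of why halting inference as soon as no NullType remains
# can miss type information contributed by later rows.

def infer(value):
    # None stands in for NullType; otherwise record the Python type
    return None if value is None else type(value)

def merge(t1, t2):
    # resolve an unknown (NullType) against a concrete type,
    # and widen int with float, loosely like Spark's numeric merge
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    if t1 is t2:
        return t1
    if {t1, t2} == {int, float}:
        return float
    raise TypeError(f"incompatible types: {t1}, {t2}")

rows = [
    {"a": None, "b": 1},    # "a" is NullType here, so inference continues
    {"a": 1,    "b": 1},    # no NullType left -> the old code stopped here
    {"a": 1,    "b": 1.5},  # "b" should widen to float; the old code missed it
]

def infer_schema(rows, scan_all):
    schema = {k: infer(v) for k, v in rows[0].items()}
    for row in rows[1:]:
        if not scan_all and None not in schema.values():
            break  # old behavior: halt once no NullTypes remain
        for k, v in row.items():
            schema[k] = merge(schema[k], infer(v))
    return schema

print(infer_schema(rows, scan_all=False))  # {'a': int, 'b': int}   -- incomplete
print(infer_schema(rows, scan_all=True))   # {'a': int, 'b': float} -- complete
```

Scanning every row of the local collection, as this patch does, matches what `samplingRatio=1.0` already did for a distributed collection.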

@SparkQA commented Dec 12, 2015

Test build #47608 has finished for PR 10275 at commit 6e463fd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [SPARK-12300][SQL][PYSPARK] fix schmea inferance on local collections [SPARK-12300][SQL][PYSPARK] fix schema inferance on local collections Dec 12, 2015
@SparkQA commented Dec 12, 2015

Test build #47611 has finished for PR 10275 at commit b863cc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Dec 14, 2015

cc @nchammas — this is the fix for the local version of SPARK-2870.

@holdenk (Contributor, Author) commented Dec 30, 2015

Ping @davies, if you have a chance to look at this.

@SparkQA commented Dec 30, 2015

Test build #48491 has finished for PR 10275 at commit 3f2b825.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Dec 30, 2015

LGTM, merging this into master and 1.6, thanks!

asfgit pushed a commit that referenced this pull request Dec 30, 2015
Current schema inference for local python collections halts as soon as there are no NullTypes. This is different than when we specify a sampling ratio of 1.0 on a distributed collection. This could result in incomplete schema information.

Author: Holden Karau <[email protected]>

Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.

(cherry picked from commit d1ca634)
Signed-off-by: Davies Liu <[email protected]>
@asfgit closed this in d1ca634 on Dec 30, 2015
@gatorsmile (Member) commented
@holdenk Could you take a look at the test failure? I also hit this issue in my local environment without any code change.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48605/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48609/consoleFull

It looks like this is caused by the inferred schema:

py4j.protocol.Py4JJavaError: An error occurred while calling o1979.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 211.0 failed 1 times, most recent failure: Lost task 2.0 in stage 211.0 (TID 886, localhost): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.isNullAt(rows.scala:221)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

Thanks!
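One plausible reading of that trace (an assumption, not confirmed in this thread) is that the inferred schema ended up with more fields than some rows actually carry, so an ordinal-based field access runs past the end of the row. A tiny pure-Python sketch of that failure mode, with all names hypothetical:

```python
# Hypothetical sketch of the suspected row/schema mismatch: the schema
# has two fields, but one row only carries one, so indexing by ordinal
# (like GenericInternalRow.genericGet in the trace above) walks off the
# end of the row.

schema = ["name", "age"]   # inferred from a row that had both fields
short_row = ("Alice",)     # a row that is missing the "age" field

def is_null_at(row, ordinal):
    # mirrors isNullAt(ordinal) in the trace: index first, then null-check
    return row[ordinal] is None

try:
    is_null_at(short_row, schema.index("age"))
except IndexError as exc:
    # Python's analogue of the ArrayIndexOutOfBoundsException above
    print("row/schema mismatch:", exc)
```

The fix holdenk describes below (adjusting the test for how missing fields are handled) is consistent with this reading.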

@holdenk (Contributor, Author) commented Jan 3, 2016

I'm on vacation with less-than-great internet, but I'll try to repro locally.

@holdenk (Contributor, Author) commented Jan 3, 2016

OK, reproduced, and I've got a fix (it seems to be just a test issue, caused by a parallel change in how missing fields can be handled).
