Conversation

@kalpit
Contributor

@kalpit kalpit commented Apr 25, 2014

I have added a unit test that validates the fix. We no longer hit the NPE.

Inline review comment (Contributor):

It's more obvious to just do

if (other == null) {
} else {
}

than pattern matching.
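
For context, a minimal sketch of the two styles being compared; `processMatch`, `processIf`, and their bodies are illustrative stand-ins, not the patch's actual code:

```scala
// Pattern-matching form used in the patch:
def processMatch(other: String): String = other match {
  case null => "<null>"
  case o    => o
}

// Plain null check, which the reviewer finds more obvious:
def processIf(other: String): String =
  if (other == null) "<null>"
  else other
```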

@AmplabJenkins

Can one of the admins verify this patch?

Inline review comment (Contributor):

Same here.

@rxin
Contributor

rxin commented Apr 25, 2014

Thanks, @kalpit. This looks pretty good. I left a couple comments on style.

Inline review comment (Contributor):

The indentation is off here (Spark uses 2-space indents).

Inline review comment (Contributor):

Can you update this one also?

@rxin
Contributor

rxin commented Apr 25, 2014

Thanks - just one more tiny thing about indent ...

@mateiz
Contributor

mateiz commented Apr 25, 2014

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mateiz
Contributor

mateiz commented Apr 25, 2014

I'm curious, when did you get nulls in practice? Wouldn't it be better to pass a null to Python and have it display as None?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14500/

@kalpit
Contributor Author

kalpit commented Apr 26, 2014

@mateiz I ran into this when my custom RDD produced nulls for some elements within a partition/split (during compute()).

It would indeed be better to pass a null to Python and have it displayed as None. One solution is to pick a TOKEN that we write into the tmp file and then translate into a "None" during read. This, however, is not failsafe, because there is a remote possibility of string data being identical to the TOKEN. Perhaps we could address that by fencing regular data with a special character and treating any data lacking that fence as a token.
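
To make the fencing idea concrete, here is a minimal sketch; the TOKEN and FENCE values and the function names are assumptions for illustration only:

```scala
// Assumed sentinel written to the tmp file in place of a null element.
val TOKEN = "__NULL__"
// Assumed fence character prepended to all real data, so that a real string
// can never collide with TOKEN even if it happens to equal "__NULL__".
val FENCE = '\u0001'

def encode(s: String): String =
  if (s == null) TOKEN else s"$FENCE$s"

// During read: fenced values are real data; anything unfenced is the token.
def decode(s: String): String =
  if (s.nonEmpty && s.charAt(0) == FENCE) s.substring(1)
  else null
```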

In any case, the above solution (or an alternative) would be a relatively large change, and I preferred fixing at least the NPEs in PythonRDD in the short term (the stack trace is in the JIRA ticket).

What do you think?

@mateiz
Contributor

mateiz commented Apr 26, 2014

But that means that the NPEs are only happening with your custom RDD, right? They won't happen for regular Spark users.

I think we should pass None here. One way to do it is to select a negative length (e.g. -3) to represent null, and pass that to Python. We already use other negative lengths for other special flags.
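
A minimal sketch of that idea on the write side; the flag value and names are assumptions (the next comment notes that -1 to -3 are already taken, so -4 is used here), not PythonRDD's actual code:

```scala
import java.io.DataOutputStream

// Assumed special length signalling a null element.
val NULL_ELEMENT = -4

def writeElement(elem: Array[Byte], out: DataOutputStream): Unit = {
  if (elem == null) {
    out.writeInt(NULL_ELEMENT)  // the reader sees the flag and can yield None
  } else {
    out.writeInt(elem.length)   // normal case: length-prefixed bytes
    out.write(elem)
  }
}
```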

@kalpit
Contributor Author

kalpit commented Apr 27, 2014

I suspect that the NPEs will happen for any PySpark user whose RDD returns null for some input "x" based on the lambda/transform. Check out the test case I added to "PythonRDDSuite.scala" to reproduce the NPE.

I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as the end of the data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?
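
To illustrate the tricky part, here is the shape of the read loop being described, written in Scala for consistency (the real reader is the Python worker); the names and the Option-based return are assumptions:

```scala
import java.io.DataInputStream

val NULL_ELEMENT = -4  // assumed flag from the write side

// Some(bytes) for data, Some(null) for the null flag, None at end of section.
def readElement(in: DataInputStream): Option[Array[Byte]] = {
  in.readInt() match {
    case length if length >= 0 =>
      val obj = new Array[Byte](length)  // a zero length yields an empty array,
      in.readFully(obj)                  // which existing code treats as end of stream
      Some(obj)
    case NULL_ELEMENT => Some(null)      // keeps null distinct from empty/end
    case _            => None            // other special flags end the section
  }
}
```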

@mateiz
Contributor

mateiz commented Apr 28, 2014

Lambdas in Python that return None will work fine because we use pickling for all data after that. The only way this problem can happen is if a Java RDD has null in it. Do you have an example in Python only (with the current PySpark) where this happens?

@kalpit
Contributor Author

kalpit commented Apr 28, 2014

I see your point. I don't have a Python-only use case that can trigger the NPE.

My custom RDD implementation had a corner case in which the RDD's compute() method returned a null in the iterator stream. I have fixed my custom RDD implementation to not do that, so I don't run into this NPE anymore. However, should anyone else ever implement a custom RDD of a similar nature (one that has nulls for some elements in a partition's iterator stream) and try accessing such an RDD from PySpark, they would run into the NPE, so I thought it would be nicer if we handled nulls in the stream gracefully.
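
For concreteness, a sketch of the kind of custom RDD being described; the class is hypothetical, but any RDD whose compute() yields nulls like this would have triggered the NPE when accessed from PySpark:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical single-partition RDD whose iterator contains a null element.
class NullElementRDD(sc: SparkContext) extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator("a", null, "b")  // the null in the stream is the corner case
}
```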

@mateiz
Contributor

mateiz commented Apr 28, 2014

Yeah, but in that case I think we have to figure out a way with the lengths. I haven't had time to look into it, but basically the UTF decoder in Python needs to deal with negative lengths sent from Scala.

@kanzhang
Contributor

kanzhang commented May 8, 2014

> I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as the end of the data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?

@kalpit please take a look at #644, where I propose using null to signal end of stream instead of an empty array.

pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014

SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone. (…che#554.)

Author: Sandy Ryza <[email protected]>

== Merge branch commits ==

commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
Author: Sandy Ryza <[email protected]>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone

@JoshRosen
Contributor

Hi @kalpit,

Since this PR has been superseded by #644, do you mind closing it? Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

@mateiz
Contributor

mateiz commented Sep 5, 2014

I've closed this since it was fixed separately. Thanks for sending a patch here.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@asfgit asfgit closed this in d112a6c Sep 21, 2014

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Perform apt-get update before install

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

…`SparkThrowableHelper.isInternalError` (apache#554)

### What changes were proposed in this pull request?

Handle null input for `SparkThrowableHelper.isInternalError` method.

### Why are the changes needed?

The `SparkThrowableHelper.isInternalError` method doesn't handle null input, which can lead to a NullPointerException. This happens when `isInternalError` is invoked on a `SparkException` that has no `errorClass`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added 2 assertions to existing test cases to cover this issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47946 from jshmchenxi/SPARK-49480/null-pointer-is-internal-error.

Authored-by: Xi Chen <[email protected]>

(cherry picked from commit cef3c86)

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Xi Chen <[email protected]>
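
For illustration, a hedged reconstruction of the guard this commit describes; the real method lives in `SparkThrowableHelper`, and this standalone signature and prefix check are assumptions:

```scala
// A SparkException created without an errorClass passes null here; without
// the null guard, startsWith would throw a NullPointerException.
def isInternalError(errorClass: String): Boolean =
  errorClass != null && errorClass.startsWith("INTERNAL_ERROR")
```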