Conversation

@kalpit
Contributor

@kalpit kalpit commented Apr 25, 2014

I have added a unit test that validates the fix. We no longer hit the NPE.

Inline review comment (Contributor):

It's more obvious to just do

if (other == null) {
} else {
}

than pattern matching.
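
For context, a minimal sketch of the two styles being compared; `processMatch`, `processIf`, and their bodies are illustrative stand-ins, not the patch's actual code:

```scala
// Pattern-matching form used in the patch:
def processMatch(other: String): String = other match {
  case null => "<null>"
  case o    => o
}

// Plain null check, which the reviewer finds more obvious:
def processIf(other: String): String =
  if (other == null) "<null>"
  else other
```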

@AmplabJenkins

Can one of the admins verify this patch?

Inline review comment (Contributor):

Same here.

@rxin
Contributor

rxin commented Apr 25, 2014

Thanks, @kalpit. This looks pretty good. I left a couple comments on style.

Inline review comment (Contributor):

The indentation is off here (Spark uses 2-space indents).

Inline review comment (Contributor):

Can you update this one also?

@rxin
Contributor

rxin commented Apr 25, 2014

Thanks - just one more tiny thing about indent ...

@mateiz
Contributor

mateiz commented Apr 25, 2014

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mateiz
Contributor

mateiz commented Apr 25, 2014

I'm curious, when did you get nulls in practice? Wouldn't it be better to pass a null to Python and have it display as None?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14500/

@kalpit
Contributor Author

kalpit commented Apr 26, 2014

@mateiz I ran into this when my custom RDD produced nulls for some elements within a partition/split (during compute()).

It would indeed be better to pass a null to Python and have it displayed as None. One solution is to pick a TOKEN that we write into the tmp file and then translate into a "None" during read. This, however, is not failsafe, because there is a remote possibility of string data being identical to the TOKEN. Perhaps we could address that by fencing regular data with a special character and treating any data lacking that fence as a token.
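
To make the fencing idea concrete, here is a minimal sketch; the TOKEN and FENCE values and the function names are assumptions for illustration only:

```scala
// Assumed sentinel written to the tmp file in place of a null element.
val TOKEN = "__NULL__"
// Assumed fence character prepended to all real data, so that a real string
// can never collide with TOKEN even if it happens to equal "__NULL__".
val FENCE = '\u0001'

def encode(s: String): String =
  if (s == null) TOKEN else s"$FENCE$s"

// During read: fenced values are real data; anything unfenced is the token.
def decode(s: String): String =
  if (s.nonEmpty && s.charAt(0) == FENCE) s.substring(1)
  else null
```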

In any case, the above solution (or an alternative) would be a relatively large change, and I preferred fixing at least the NPEs in PythonRDD in the short term (the stack trace is in the JIRA ticket).

What do you think?

@mateiz
Contributor

mateiz commented Apr 26, 2014

But that means that the NPEs are only happening with your custom RDD, right? They won't happen for regular Spark users.

I think we should pass None here. One way to do it is to select a negative length (e.g. -3) to represent null, and pass that to Python. We already use other negative lengths for other special flags.
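
A minimal sketch of that idea on the write side; the flag value and names are assumptions (the next comment notes that -1 to -3 are already taken, so -4 is used here), not PythonRDD's actual code:

```scala
import java.io.DataOutputStream

// Assumed special length signalling a null element.
val NULL_ELEMENT = -4

def writeElement(elem: Array[Byte], out: DataOutputStream): Unit = {
  if (elem == null) {
    out.writeInt(NULL_ELEMENT)  // the reader sees the flag and can yield None
  } else {
    out.writeInt(elem.length)   // normal case: length-prefixed bytes
    out.write(elem)
  }
}
```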

@kalpit
Contributor Author

kalpit commented Apr 27, 2014

I suspect that the NPEs will happen for any PySpark user whose RDD returns null for some input "x" based on the lambda/transform. Check out the test case I added to "PythonRDDSuite.scala" to reproduce the NPE.

I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as the end of the data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?
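
To illustrate the tricky part, here is the shape of the read loop being described, written in Scala for consistency (the real reader is the Python worker); the names and the Option-based return are assumptions:

```scala
import java.io.DataInputStream

val NULL_ELEMENT = -4  // assumed flag from the write side

// Some(bytes) for data, Some(null) for the null flag, None at end of section.
def readElement(in: DataInputStream): Option[Array[Byte]] = {
  in.readInt() match {
    case length if length >= 0 =>
      val obj = new Array[Byte](length)  // a zero length yields an empty array,
      in.readFully(obj)                  // which existing code treats as end of stream
      Some(obj)
    case NULL_ELEMENT => Some(null)      // keeps null distinct from empty/end
    case _            => None            // other special flags end the section
  }
}
```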

@mateiz
Contributor

mateiz commented Apr 28, 2014

Lambdas in Python that return None will work fine because we use pickling for all data after that. The only way this problem can happen is if a Java RDD has null in it. Do you have an example in Python only (with the current PySpark) where this happens?

@kalpit
Contributor Author

kalpit commented Apr 28, 2014

I see your point. I don't have a Python-only use case that can trigger the NPE.

My custom RDD implementation had a corner case in which the RDD's compute() method returned a null in the iterator stream. I have fixed my custom RDD implementation to not do that, so I don't run into this NPE anymore. However, should anyone else ever implement a custom RDD of a similar nature (one that has nulls for some elements in a partition's iterator stream) and try accessing such an RDD from PySpark, they would run into the NPE, so I thought it would be nicer if we handled nulls in the stream gracefully.
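
For concreteness, a sketch of the kind of custom RDD being described; the class is hypothetical, but any RDD whose compute() yields nulls like this would have triggered the NPE when accessed from PySpark:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical single-partition RDD whose iterator contains a null element.
class NullElementRDD(sc: SparkContext) extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator("a", null, "b")  // the null in the stream is the corner case
}
```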

@mateiz
Contributor

mateiz commented Apr 28, 2014

Yeah, but in that case I think we have to figure out a way with the lengths. I haven't had time to look into it, but basically the UTF decoder in Python needs to deal with negative lengths sent from Scala.

@kanzhang
Contributor

kanzhang commented May 8, 2014

> I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as the end of the data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?

@kalpit please take a look at #644, where I propose using null to signal end of stream instead of an empty array.

pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014

SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone. (…che#554.)

Author: Sandy Ryza <[email protected]>

== Merge branch commits ==

commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
Author: Sandy Ryza <[email protected]>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone

@JoshRosen
Contributor

Hi @kalpit,

Since this PR has been superseded by #644, do you mind closing it? Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

@mateiz
Contributor

mateiz commented Sep 5, 2014

I've closed this since it was fixed separately. Thanks for sending a patch here.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@asfgit asfgit closed this in d112a6c Sep 21, 2014

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Perform apt-get update before install

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

…`SparkThrowableHelper.isInternalError` (apache#554)

### What changes were proposed in this pull request?

Handle null input for `SparkThrowableHelper.isInternalError` method.

### Why are the changes needed?

The `SparkThrowableHelper.isInternalError` method doesn't handle null input, which can lead to a NullPointerException. This happens when `isInternalError` is invoked on a `SparkException` that has no `errorClass`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added 2 assertions to existing test cases to cover this issue.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47946 from jshmchenxi/SPARK-49480/null-pointer-is-internal-error.

Authored-by: Xi Chen <[email protected]>

(cherry picked from commit cef3c86)

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Xi Chen <[email protected]>
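
For illustration, a hedged reconstruction of the guard this commit describes; the real method lives in `SparkThrowableHelper`, and this standalone signature and prefix check are assumptions:

```scala
// A SparkException created without an errorClass passes null here; without
// the null guard, startsWith would throw a NullPointerException.
def isInternalError(errorClass: String): Boolean =
  errorClass != null && errorClass.startsWith("INTERNAL_ERROR")
```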