[SPARK-19163][PYTHON][SQL] Delay _judf initialization to the __call__ #16536
Conversation
Test build #71163 has finished for PR 16536 at commit
Test build #71254 has finished for PR 16536 at commit
Test build #71351 has finished for PR 16536 at commit
Test build #71350 has finished for PR 16536 at commit
Test build #71679 has finished for PR 16536 at commit
Test build #71688 has finished for PR 16536 at commit
+1 Looks good to me.
holdenk left a comment:
Thanks for working on this! I think this is going to be useful for Python UDF libraries. I've got a few questions - let me know what your thoughts are :)
python/pyspark/sql/functions.py (outdated)
Maybe add a comment explaining the purpose of this, just for future readers of the code.
python/pyspark/sql/tests.py (outdated)
This seems like a good test, but maybe a bit too focused on testing the implementation specifics?
Maybe it would make more sense to also have a test which verifies that creating a UDF doesn't create a SparkSession, since that is the intended purpose (we don't really care about delaying the initialization of _judf that much per se, but we do care about verifying that we don't eagerly create the SparkSession on import). What do you think?
I thought about it, but I have the impression, maybe incorrect, that we avoid creating new contexts to keep total execution time manageable. If you think this justifies a separate TestCase, I am more than fine with that (SPARK-19224 [PYSPARK] Python tests organization, right?).
If not, we could mock this and put an assert on the number of calls.
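A mock-based check along those lines might look like the sketch below. This is a hypothetical illustration, not the PR's actual test: it assumes `_create_judf` is the method that builds the Java-side UDF, and the class and test names are made up.

```python
import unittest
from unittest import mock  # on Python 2, the external `mock` package instead

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType


class UDFLazyInitTest(unittest.TestCase):
    def test_create_judf_not_called_on_construction(self):
        # Patch the (assumed) factory method so no Spark machinery is touched.
        with mock.patch.object(UserDefinedFunction, "_create_judf") as create_judf:
            UserDefinedFunction(lambda x: x, StringType())
            # Merely constructing the UDF should not build the Java-side UDF.
            create_judf.assert_not_called()
```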
I think a separate test case would be able to be pretty lightweight, since it doesn't need to create a SparkContext or anything that traditionally takes longer to set up. What do you think?
@holdenk Separate case it is. As long as the implementation is correct, the overhead is negligible.
Let's keep these tests, to make sure that _judf is initialized when necessary.
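For illustration, such an implementation-level check might look like the following sketch. It assumes the lazily created value is cached in a `_judf_placeholder` attribute (an assumed name) and that the test runs in a suite with a SparkSession already available.

```python
def test_judf_is_initialized_lazily(self):
    from pyspark.sql.functions import UserDefinedFunction

    f = UserDefinedFunction(lambda x: x, StringType())
    self.assertIsNone(f._judf_placeholder)    # nothing built yet
    f._judf                                   # first access builds the Java UDF
    self.assertIsNotNone(f._judf_placeholder)
```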
python/pyspark/sql/tests.py (outdated)
There is an assertIsInstance function that could simplify this.
python/pyspark/sql/functions.py (outdated)
So there isn't any lock around this - I suspect we aren't too likely to have concurrent calls to this - but just to be safe we should maybe think through what would happen if that does occur (and then leave a comment about it)?
Could you elaborate a bit? I am not sure I understand the issue.
Assignment is atomic (so we don't have to worry about corruption); for any practical purpose the operation is idempotent (we can return expressions using different Java objects, but as far as I am concerned this is just an implementation detail); access to Py4J is thread safe; and as far as I remember, function registries are synchronized. Is there any issue I missed here?
Thanks for looking into this.
I think @holdenk's concern is that this would allow concurrent calls to _create_udf. That would create two UserDefinedPythonFunction objects, but I don't see anything concerning about that on the Scala side.
@rdblue I get this part, and this is a possible scenario. The question is whether it justifies a preventive lock. As far as I am aware, there should be no correctness issues here. SparkSession already locks during initialization, so we are safe there.
Yeah, I don't think it should require a lock. I think concurrent calls are very unlikely and safe.
I stress tested this a bit and haven't found any abnormalities, but I found a small problem with __call__ along the way. Fixed now.
Ok, I'd maybe just leave a comment saying that we've left out the lock since double creation is both unlikely and OK.
@holdenk Done.
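For context, the lock-free lazy initialization under discussion amounts to a pattern like the sketch below. This is a simplified rendering, not the exact merged code; `_judf_placeholder` is the assumed cache attribute.

```python
class UserDefinedFunction(object):
    """Simplified sketch of a lazily initialized Python UDF wrapper."""

    def __init__(self, func, returnType, name=None):
        self.func = func
        self.returnType = returnType
        self._name = name or func.__name__
        self._judf_placeholder = None

    @property
    def _judf(self):
        # Deliberately no lock: concurrent first accesses may build the
        # Java-side UDF twice, but creation is idempotent and attribute
        # assignment is atomic, so the worst case is a little wasted work.
        if self._judf_placeholder is None:
            self._judf_placeholder = self._create_judf()
        return self._judf_placeholder
```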
Test build #72021 has finished for PR 16536 at commit
Test build #72034 has finished for PR 16536 at commit
Test build #72039 has finished for PR 16536 at commit
python/pyspark/sql/tests.py (outdated)
@holdenk I believe that for a full test the UDF would have to create a SparkContext. But a mock is cheap.
You can create a test case without the Spark base class and verify that creating a UDF doesn't create a SparkContext. This does not require making a SparkContext.
@holdenk Do you mean something like checking SparkContext._active_spark_context is None? Then we need to make sure it is torn down if it was initialized after all, right? Isn't mocking cleaner?
```python
import unittest

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType


class UDFInitializationTestCase(unittest.TestCase):
    def tearDown(self):
        # Tear down any session / context created during the test after all.
        if SparkSession._instantiatedSession is not None:
            SparkSession._instantiatedSession.stop()
        if SparkContext._active_spark_context is not None:
            SparkContext._active_spark_context.stop()

    def test_udf_context_access(self):
        from pyspark.sql.functions import UserDefinedFunction

        f = UserDefinedFunction(lambda x: x, StringType())
        self.assertIsNone(SparkContext._active_spark_context)
        self.assertIsNone(SparkSession._instantiatedSession)
```
python/pyspark/sql/tests.py (outdated)
And add a separate test case checking SparkContext and SparkSession state.
Test build #72170 has finished for PR 16536 at commit
Test build #72172 has finished for PR 16536 at commit
Test build #72173 has finished for PR 16536 at commit
The changes look good to me, I'll take a quick pass at the formatting to make sure - but otherwise I'll try and merge this tomorrow :)
holdenk left a comment:
Ok, it looks good; just one minor comment about having getOrCreate, which acquires a lock, in the hot path for __call__. Thanks for adding the tests :)
python/pyspark/sql/functions.py (outdated)
So by switching this to getOrCreate we put a lock acquisition in the path of __call__, which is maybe not ideal. We could maybe fix this by getting _judf first (e.g. judf = self._judf). (It should be a mostly uncontended lock, so it shouldn't be that bad, but if we ended up having a multi-threaded PySpark DataFrame UDF application this could degrade things a little.)
@holdenk Sounds reasonable.
Though I am not sure it really matters here. If _instantiatedContext is not None we'll do the same thing; otherwise we fall back to initialization in _judf.
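The suggested shape of __call__ would be roughly the sketch below, a simplified rendering rather than the merged code; the helpers `_to_seq` and `_to_java_column` are the private converters from pyspark.sql.column.

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq


class UserDefinedFunction(object):
    # ... __init__ and the lazy _judf property as sketched earlier ...

    def __call__(self, *cols):
        # Read the lazy property once, so any lock taken during the first
        # initialization is acquired a single time, up front.
        judf = self._judf
        sc = SparkContext._active_spark_context
        return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
```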
@holdenk I have one more suggestion. Shouldn't we replace

```python
def _create_judf(self):
    from pyspark.sql import SparkSession

    sc = SparkContext.getOrCreate()
    spark = SparkSession.builder.getOrCreate()
```

with

```python
def _create_judf(self):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
```

I left it as is but I think it could be cleaner.
Test build #72218 has finished for PR 16536 at commit
Test build #72219 has finished for PR 16536 at commit
@zero323 that sounds like a good improvement.
@holdenk Done :)
Great, I'll wait for jenkins then :)
Test build #72222 has finished for PR 16536 at commit
Going to go ahead and merge. Still need to sort out the JIRA permissions, so it will take a bit for me to get that updated for you.
Thanks a bunch @holdenk
## What changes were proposed in this pull request?

Defer `UserDefinedFunction._judf` initialization to the first call. This prevents unintended `SparkSession` initialization and allows users to define and import UDFs without creating a context / session as a side effect. [SPARK-19163](https://issues.apache.org/jira/browse/SPARK-19163)

## How was this patch tested?

Unit tests.

Author: zero323 <[email protected]>

Closes apache#16536 from zero323/SPARK-19163.
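As a quick illustration of the intended behavior (a hypothetical snippet, not part of the PR itself):

```python
from pyspark import SparkContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Defining a UDF is now side-effect free: no context or session is
# started until the UDF is actually used in a query.
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

assert SparkContext._active_spark_context is None
```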