[SPARK-24822][PySpark] Python support for barrier execution mode #22011

jiangxb1987 · 2018-08-06T18:09:51Z

What changes were proposed in this pull request?

This PR adds Python support for barrier execution mode, which makes it possible to launch a job containing barrier stage(s) from PySpark.

We simply mirror the existing RDDBarrier and RDD.barrier() in the Python API.

How was this patch tested?

Manually tested:

>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(iterator): yield sum(iterator)
... 
>>> rdd.barrier().mapPartitions(f).isBarrier() == True
True
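
For readers without a Spark shell at hand, the per-partition behaviour of `f` in the manual test can be imitated in plain Python. This is only a sketch: `map_partitions` and the two-way split of the data are assumptions made for illustration, and no barrier scheduling is modelled.

```python
def f(iterator):
    # Same function as in the manual test: one sum per partition.
    yield sum(iterator)


def map_partitions(partitions, func):
    # Hypothetical stand-in for RDD.mapPartitions: apply func to each
    # partition's iterator and concatenate the results.
    out = []
    for part in partitions:
        out.extend(func(iter(part)))
    return out


# Assumed partitioning of [1, 2, 3, 4] into two partitions.
partitions = [[1, 2], [3, 4]]
result = map_partitions(partitions, f)
print(result)  # [3, 7]
```

The real partitioning depends on the SparkContext's default parallelism, so the actual per-partition sums may differ from this two-partition assumption.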

Unit tests will be added in a follow-up PR that implements BarrierTaskContext on the Python side.

SparkQA · 2018-08-06T18:15:49Z

Test build #94302 has finished for PR 22011 at commit ec2f668.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class JavaRDDBarrier[T: ClassTag](javaRdd: JavaRDD[T])
class RDDBarrier(object):

SparkQA · 2018-08-06T22:20:39Z

Test build #94308 has finished for PR 22011 at commit b0b2f86.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2018-08-07T00:42:51Z

python/pyspark/rdd.py

If we expose a package private method to get the annotated RDD with isBarrier=True in RDDBarrier, we can implement mapPartitions easily here:

jBarrierRdd = self._jrdd.rdd.barrier().barrierRdd.javaRdd
pyBarrierRdd = RDD(self._jrdd.rdd.barrier().barrierRdd.javaRdd)
pyBarrierRdd.mapPartitions(f, preservesPartitioning)

mengxr · 2018-08-07T00:44:59Z

core/src/main/scala/org/apache/spark/api/java/JavaRDDBarrier.scala

This is not necessary to implement Python support.

HyukjinKwon · 2018-08-07T03:56:44Z

python/pyspark/rdd.py

I don't know why we haven't marked the version here so far, but we really should add .. versionadded:: 2.4.0 here, or:

@since(2.4)
def barrier(self):
    ...
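
To make the two options concrete: a decorator in the style of `@since` can simply append a `.. versionadded::` directive to the docstring, so both spellings end up as the same rendered note. The sketch below is a minimal stand-in written for illustration, not PySpark's own implementation; the `RDD` class and its docstring are placeholders.

```python
import re


def since(version):
    """Hypothetical decorator: append a versionadded note to a docstring."""
    indent_pattern = re.compile(r"\n( +)")

    def deco(func):
        doc = func.__doc__ or ""
        # Reuse the docstring's smallest indentation so the directive lines up.
        indents = indent_pattern.findall(doc)
        indent = " " * (min(len(i) for i in indents) if indents else 0)
        func.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return func

    return deco


class RDD(object):
    # Placeholder class; only the docstring handling matters here.
    @since(2.4)
    def barrier(self):
        """
        Placeholder docstring for illustration.
        """
        return None


print(RDD.barrier.__doc__)
```

Writing the directive by hand gives the same result and allows the full `2.4.0` form; the decorator mainly saves repetition across many methods.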

HyukjinKwon · 2018-08-07T03:59:11Z

python/pyspark/rdd.py

Ditto, let's add .. versionadded:: 2.4.0 at the end. Optionally, I guess, add it to each API exposed here as well.

HyukjinKwon · 2018-08-07T04:04:38Z

python/pyspark/rdd.py

nit: RDDBarrier -> RDD barrier

HyukjinKwon · 2018-08-07T04:07:21Z

python/pyspark/rdd.py

Shall we match the documentation? Or why is it different?

FWIW, for a code block, just `blabla` should be good enough. It's nicer if linked properly, e.g. :class:`ClassName`.

felixcheung · 2018-08-07T07:19:40Z

python/pyspark/rdd.py

SparkQA · 2018-08-09T22:28:34Z

Test build #94512 has finished for PR 22011 at commit 1ee8025.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-09T22:33:39Z

Test build #94514 has finished for PR 22011 at commit d508fc5.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-08-10T00:12:09Z

retest this please

mengxr · 2018-08-10T00:12:10Z

test this please

cloud-fan · 2018-08-10T00:41:57Z

python/pyspark/rdd.py

+        """
+        def func(s, iterator):
+            return f(iterator)
+        jBarrierRdd = self._jrdd.rdd().barrier().toJavaRDD()


This will materialize the Java RDD, which means the map functions before and after the barrier will be executed by two separate Python workers.

We should not materialize the Java RDD here, but just set an isBarrier flag in the Python PipelinedRDD.
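
The difference between the two approaches can be shown with a toy model. Everything below (`ToyRDD`, `ToyPipelinedRDD`, `ToyCut`, the `log` list) is hypothetical and written only for illustration; it is not PySpark's PipelinedRDD. It assumes that each fused chain of Python functions costs one worker launch, and that materializing the chain cuts it in two.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD that already lives on the JVM side."""

    def __init__(self, partitions):
        self.partitions = partitions

    def is_barrier(self):
        return False

    def map_partitions(self, func, barrier=False):
        return ToyPipelinedRDD(self, func, barrier)

    def compute(self, log):
        # No Python code to run; just hand back the stored partitions.
        return self.partitions


class ToyPipelinedRDD(ToyRDD):
    """Fuses consecutive Python functions so one worker runs them all."""

    def __init__(self, prev, func, barrier=False):
        if isinstance(prev, ToyPipelinedRDD):
            # Compose with the parent's function instead of materializing it.
            prev_func = prev.func
            self.func = lambda it: func(prev_func(it))
            self.source = prev.source
        else:
            self.func = func
            self.source = prev
        # The flag is sticky: a chain containing a barrier step is a barrier.
        self._barrier = barrier or prev.is_barrier()

    def is_barrier(self):
        return self._barrier

    def compute(self, log):
        log.append("python worker")  # one worker launch per fused chain
        return [list(self.func(iter(p))) for p in self.source.compute(log)]


class ToyCut(ToyRDD):
    """Models materializing the chain, e.g. via toJavaRDD()."""

    def __init__(self, pipelined):
        self.pipelined = pipelined

    def is_barrier(self):
        return self.pipelined.is_barrier()

    def compute(self, log):
        return self.pipelined.compute(log)


def double(it):
    return (x * 2 for x in it)


def total(it):
    yield sum(it)


base = ToyRDD([[1, 2], [3, 4]])

# Flag approach: the barrier step stays inside the fused chain.
flagged = base.map_partitions(double).map_partitions(total, barrier=True)
flag_log = []
flag_result = flagged.compute(flag_log)

# Materializing approach: the chain is cut before the barrier step.
cut = ToyCut(base.map_partitions(double)).map_partitions(total, barrier=True)
cut_log = []
cut_result = cut.compute(cut_log)

print(flag_result, len(flag_log))
print(cut_result, len(cut_log))
```

Both variants compute the same values; only the number of recorded worker launches differs, which is the cost the comment above is pointing at.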

SparkQA · 2018-08-10T05:09:52Z

Test build #94530 has finished for PR 22011 at commit d508fc5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-10T12:16:46Z

Test build #94549 has finished for PR 22011 at commit ea2330b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2018-08-10T14:09:09Z

test this please

mengxr · 2018-08-10T14:10:05Z

@jiangxb1987 Please mention that tests will be added in a follow-up PR that implements BarrierTaskContext.

cloud-fan · 2018-08-10T14:54:49Z

python/pyspark/rdd.py

+        """
+        return RDDBarrier(self)
+
+    def isBarrier(self):


Do we have this API in the JVM RDD?

In the Scala RDD there is a private[spark] isBarrier() function; we don't add this to JavaRDD.

SparkQA · 2018-08-10T19:03:00Z

Test build #94565 has finished for PR 22011 at commit ea2330b.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-10T21:30:20Z

Test build #94575 has finished for PR 22011 at commit cf38531.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-08-11T00:57:21Z

retest this please

SparkQA · 2018-08-11T05:41:32Z

Test build #94590 has finished for PR 22011 at commit cf38531.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-08-11T07:43:22Z

retest this please

SparkQA · 2018-08-11T12:19:38Z

Test build #94600 has finished for PR 22011 at commit cf38531.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-08-11T13:45:11Z

LGTM, merging to master!

mengxr reviewed Aug 7, 2018

HyukjinKwon reviewed Aug 7, 2018

HyukjinKwon reviewed Aug 7, 2018

felixcheung reviewed Aug 7, 2018

python/pyspark/rdd.py Outdated

docstring?

jiangxb1987 added 3 commits August 9, 2018 21:53

init.

c3210f5

update

4140472

update

1ee8025

jiangxb1987 force-pushed the python branch from b0b2f86 to 1ee8025 on August 9, 2018 17:25

jiangxb1987 added 2 commits August 10, 2018 01:27

update

b6f4847

update

d508fc5

jiangxb1987 changed the title from "[WIP][SPARK-24822][PySpark] Python support for barrier execution mode" to "[SPARK-24822][PySpark] Python support for barrier execution mode" on Aug 9, 2018

cloud-fan reviewed Aug 10, 2018

HyukjinKwon mentioned this pull request Aug 10, 2018

[SPARK-24886][INFRA] Fix the testing script to increase timeout for Jenkins build (from 300m to 340m) #21845

Closed

update

ea2330b

cloud-fan reviewed Aug 10, 2018

update

cf38531

asfgit closed this in 4855d5c Aug 11, 2018
