[SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2 #14467
```diff
@@ -173,9 +173,8 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         # they will be passed back to us through a TCP server
         self._accumulatorServer = accumulators._start_update_server()
         (host, port) = self._accumulatorServer.server_address
-        self._javaAccumulator = self._jsc.accumulator(
-            self._jvm.java.util.ArrayList(),
-            self._jvm.PythonAccumulatorParam(host, port))
+        self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
+        self._jsc.sc().register(self._javaAccumulator)
```
**Contributor** commented:

> I cannot fully understand why an accumulator is created for every instance of `SparkContext`. I see it is used when the attribute
**Contributor (Author)** replied:

> So in general you would have one `SparkContext` and many RDDs. The accumulator here doesn't represent a specific accumulator; rather, it is the general mechanism that all of the Python accumulators are built on top of. The design is certainly a bit confusing if you try to think of it as a regular accumulator - I found it helped to look at how the Scala-side `merge` is implemented.
Unchanged context lines from the same hunk:

```diff
         self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python')
         self.pythonVer = "%d.%d" % sys.version_info[:2]
```
> On a bit of a side note - we could consider using the callback server here, if we wanted to enable it in general rather than just for streaming, once Py4J has its performance improvements in.
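For context on the alternative being discussed: the current design has the Python driver run a plain TCP server (started via `accumulators._start_update_server()` in the diff above) that the JVM connects back to with accumulator updates. Below is a minimal, self-contained sketch of that style of update channel, using illustrative names and a simplified line-delimited protocol rather than Spark's actual framing:

```python
# Hypothetical sketch of a TCP-based accumulator update channel, in the
# spirit of PySpark's _start_update_server(). Names and the newline-
# delimited protocol are illustrative; the real server uses a different
# wire format.
import socket
import socketserver
import threading

received = []  # updates collected by the driver-side server


class _UpdateHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one update per connection for simplicity.
        data = self.rfile.readline().strip()
        received.append(data)


# Bind to an ephemeral port, as the real server does.
server = socketserver.TCPServer(("127.0.0.1", 0), _UpdateHandler)
host, port = server.server_address

# Handle a single request in the background, then exit.
t = threading.Thread(target=server.handle_request, daemon=True)
t.start()

# Simulate the JVM side connecting back and sending an update.
with socket.create_connection((host, port)) as sock:
    sock.sendall(b"update-1\n")

t.join(timeout=5)
server.server_close()
```

The callback-server suggestion would replace this hand-rolled socket plumbing with Py4J's built-in Java-to-Python call mechanism.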