[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation #29491
Conversation
|
cc @Fokko, @holdenk, @nchammas, @srowen, @dongjoon-hyun, @viirya, @BryanCutler, @ueshin, @felixcheung FYI. Could you help review this when you find some time? I am going to proofread it myself a few more times, but it would be nice to have some feedback from you. |
|
cc @rohitmishr1484 and @fhoering who are also working on other pages. |
|
This is a cool DOC feature to improve the adoption of PySpark! |
| Unidecode==0.04.19 | ||
| sphinx | ||
| pydata_sphinx_theme | ||
| ipython |
Are there other implications to this, like, our python packaging now requires ipython when installing pyspark? (sorry ignorant question)
Nope, it means nothing special, and this file is used nowhere within Spark. It's just for some dev people, but I doubt it is actually used often. We should leverage this file and standardize the dependencies somehow, like @nchammas and @dongjoon-hyun tried before, but I currently don't have a good idea about how to handle it.
The packages that are installed when installing pyspark are defined in setup.py: https://github.com/apache/spark/blob/master/python/setup.py#L206
The de facto way is to add an extras_require entry called dev where you add the development requirements, preferably with versions attached, so you know that Jenkins and the local dev environments are the same.
Happy to migrate this to the setup.py if you like.
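For illustration, a minimal sketch of what such a `dev` extra could look like; the package names and pins are placeholders drawn from this PR's doc toolchain, not Spark's actual packaging metadata:

```python
# Hypothetical sketch of an extras_require "dev" entry; pins are illustrative only.
from setuptools import setup

setup(
    name="example-package",  # placeholder, not Spark's real package name
    version="0.0.1",
    extras_require={
        "dev": [
            "sphinx<3.1.0",         # doc build, pinned per SPARK-32407
            "pydata_sphinx_theme",
            "ipython",
            "nbsphinx",
        ],
    },
)
```

Developers could then pull in the doc/dev toolchain with something like `pip install -e '.[dev]'`, keeping local environments and CI in sync.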
|
Test build #127693 has finished for PR 29491 at commit
|
nchammas
left a comment
Quick review of the copy.
| @@ -0,0 +1 @@ | |||
| openjdk-8-jre | |||
What does this do / where is it used?
It is used for Binder to install the dependencies. Binder launches a Docker image that runs a Jupyter server, and a JRE is required to run PySpark.
There doesn't seem to be a specific way to add a comment in this file, so I couldn't add one. One possible way is #29491 (comment), but I don't have a preference for now. Let me know if you prefer this way.
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "You can also apply a Python native function against each group by using pandas APIs." |
Ew! 😄
Wouldn't it be better / more useful to show people how to use a plain Python UDF without Pandas?
Let's avoid promoting plain Python UDFs for now, since pandas UDFs can do everything plain Python UDFs can do. There was even a discussion about deprecating plain Python UDFs somewhere before.
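For context, a minimal sketch of the grouped pandas-API pattern the notebook cell quoted above refers to, assuming Spark 3.0+ with PyArrow installed; the DataFrame and column names below are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('red', 1.0), ('red', 2.0), ('blue', 3.0)],
    schema='color string, v double')

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows of one group as a pandas DataFrame.
    return pdf.assign(v=pdf.v - pdf.v.mean())

# A plain Python function applied per group via the pandas machinery.
df.groupby('color').applyInPandas(subtract_mean, schema=df.schema).show()
```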
viirya
left a comment
A few typos found:
PySpark DataFrame is lazily evaludated and implemented on thetop of RDD.
thetop -> the top
DataFrame.collect() collects the distributed data to the driver side as Python premitive representation.
premitive?
most of column-weise operations return Columns
column-weise -> column-wise?
Suggestion:
PySpark DataFrame also provides the conversion back to a pandas DataFrame in order to leverage pandas APIs.
Should we mention that toPandas will collect data back to the driver and could throw an OOM too? Especially since you mention toPandas after a few APIs that can avoid an out-of-memory exception.
See also "Spark SQL, DataFrames and Datasets Guide" in Apache Spark documentation.
This can also be a link to the doc?
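On the toPandas point, a minimal sketch of the contrast being suggested; the DataFrame `df` is assumed to come from the notebook's earlier cells:

```python
# Collects the entire distributed dataset to the driver; can hit OOM on large data.
pandas_df = df.toPandas()

# Bounded alternatives that only bring a handful of rows back to the driver.
first_rows = df.take(2)   # list with the first 2 Row objects
last_rows = df.tail(2)    # list with the last 2 Row objects (Spark 3.0+)
```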
|
The notebook link and demo docs are also updated. Should be ready for a review. |
|
Test build #127733 has finished for PR 29491 at commit
|
| # List of patterns, relative to source directory, that match files and | ||
| # directories to ignore when looking for source files. | ||
| exclude_patterns = ['_build'] | ||
| exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints'] |
Do we really need to include .DS_Store in here? It feels nasty to add OS-specific files here.
It's an exclude pattern :D.
Great work @HyukjinKwon. I ran the notebook locally and it worked out of the box. I've added a few style suggestions on the text, and some minor typos. It might be good to mention that you need to use Spark 3.0.
Furthermore, I always do the following with date/datetime:
# The imports (assuming `spark` is the SparkSession created by the notebook/shell)
from datetime import datetime, date

import pandas as pd

# 1. From a list of rows, with an explicit DDL schema string
df = spark.createDataFrame([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')
df

# 2. From a pandas DataFrame
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})
df = spark.createDataFrame(pandas_df)
df

# 3. From an RDD of tuples, supplying only column names
rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df
This makes the code itself less verbose, as datetime.datetime eats up a lot of characters. Apart from that it is looking good; I was able to run the whole notebook locally.
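As a quick sanity check (assuming the Spark 3.0 notebook environment), all three snippets above should yield the same schema, which can be verified with:

```python
df.printSchema()
# Expected output, roughly:
# root
#  |-- a: long (nullable = true)
#  |-- b: double (nullable = true)
#  |-- c: string (nullable = true)
#  |-- d: date (nullable = true)
#  |-- e: timestamp (nullable = true)
```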
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "In order to avoid throwing an out-of-memory exception, use `DataFrame.take()` or `DataFrame.tail()`." |
|
Thank you for your review @Fokko! |
| # TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes. | ||
| # See also https://github.com/sphinx-doc/sphinx/issues/7551. | ||
| pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme | ||
| pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx |
Should we have a requirements.txt somewhere instead?
| @@ -0,0 +1 @@ | |||
| openjdk-8-jre | |||
Any particular reason for JDK8 instead of 11?
Because we're using JDK 8 by default on the OSS side :-).
Also, with JDK 11 we would have to set configurations such as -Dio.netty.tryReflectionSetAccessible=true to make it work with the Arrow optimization.
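As an illustration only, a sketch of how that flag could be passed if the image were switched to JDK 11; the conf keys are standard Spark settings, but the exact Binder setup is not defined here:

```python
from pyspark.sql import SparkSession

# Needed on JDK 9+ so Arrow-based conversions (e.g. toPandas) keep working.
netty_flag = "-Dio.netty.tryReflectionSetAccessible=true"

spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", netty_flag)
    .config("spark.executor.extraJavaOptions", netty_flag)
    .getOrCreate()
)
```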
| # See also https://github.com/sphinx-doc/sphinx/issues/7551. | ||
| # We should use the latest Sphinx version once this is fixed. | ||
| ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1" | ||
| ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1" |
this makes me think we should have a shared requirements.txt somewhere.
We tried that in #27928 but couldn't get buy-in from a release manager.
Also discussed briefly here: #29491 (comment)
Yeah, let's do it separately - there's already a JIRA for it, SPARK-31167, as @nchammas pointed out. It's a bit more complicated than I thought.
…st of dict

### What changes were proposed in this pull request?

As discussed in #29491 (comment) and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.

### Why are the changes needed?

This change clarifies to users that this behavior is supported and is not going away in the near future.

### Does this PR introduce _any_ user-facing change?

Yes. There used to be a `UserWarning` for this, but now there isn't.

### How was this patch tested?

I tested this manually.

Before:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```

After:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```

Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
ca89dbf to 8e08cd1
8e08cd1 to 0e114b6
|
I believe all comments are addressed and it's ready for a review, or possibly good to go. |
|
Test build #127881 has finished for PR 29491 at commit
|
BryanCutler
left a comment
I took a quick look through the notebook and it looks great. Thanks for doing this @HyukjinKwon, it will be great for new PySpark users!
|
Thank you @BryanCutler and all! Merged to master! |


What changes were proposed in this pull request?
This PR proposes to:
Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on a Docker image. Once somebody builds it, other people can reuse the image against a specific commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.
I made a simple demo to make it easier to review. Please see:
When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
Edit the `.ipynb` file and attach the fixed `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting correctly, of course.

References:
Why are the changes needed?
To improve PySpark's usability. The current quickstart for Python users is not very friendly.
Does this PR introduce any user-facing change?
Yes, it will add a documentation page, and expose a live notebook to PySpark users.
How was this patch tested?
Manually tested, and GitHub Actions builds will test.