
Conversation

@HyukjinKwon (Member) commented Aug 20, 2020

What changes were proposed in this pull request?

This PR proposes to:

  • add a notebook with Binder integration that allows users to try PySpark in a live notebook. Please try it here.
  • reuse this notebook as a quickstart guide in the PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works by building a Docker image: once somebody builds the image for a specific commit, other people can reuse it.
Therefore, if we run Binder with images built from released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.


I made a simple demo to make it easier to review. Please see:

  • Main page. Note that the link ("Live Notebook") on the main page won't work since this PR is not merged yet.
  • Quickstart page

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be to:

  • open here.
  • edit / change / update the notebook. Please feel free to change it however you want. I can apply it as-is, or update it a bit more, when I apply it to this PR.
  • download it as a .ipynb file.
  • upload the .ipynb file here in a GitHub comment. Then I will push a commit with that file, crediting you correctly, of course.
  • alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:

Why are the changes needed?

To improve PySpark's usability. The current quickstart is not very friendly for Python users.

Does this PR introduce any user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

How was this patch tested?

Manually tested, and GitHub Actions builds will also test it.

@HyukjinKwon (Member, Author) commented Aug 20, 2020

cc @Fokko, @holdenk, @nchammas, @srowen, @dongjoon-hyun, @viirya, @BryanCutler, @ueshin, @felixcheung FYI. Could you help review this when you find some time? I am going to proofread it a few more times myself, but it would be nice to get some feedback from you.

@HyukjinKwon (Member, Author)

cc @rohitmishr1484 and @fhoering, who are also working on other pages.

@HyukjinKwon HyukjinKwon changed the title [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation Aug 20, 2020
@gatorsmile (Member) commented Aug 20, 2020

This is a cool DOC feature to improve the adoption of PySpark!

Unidecode==0.04.19
sphinx
pydata_sphinx_theme
ipython
Member

Are there other implications to this, like our Python packaging now requiring ipython when installing pyspark? (Sorry, ignorant question.)

@HyukjinKwon (Member, Author) Aug 21, 2020

Nope, it means nothing special, and this file is used nowhere within Spark. It's just for some developers, and I doubt it is actually used often. We should leverage this file and standardize the dependencies somehow, as @nchammas and @dongjoon-hyun tried before, but I currently don't have a good idea of how to handle it.

Contributor

The packages that are installed when installing pyspark are defined in setup.py: https://github.com/apache/spark/blob/master/python/setup.py#L206

The de facto way is to add an extras_require entry called dev where you add the development requirements, preferably with versions attached, so you know that Jenkins and the local dev environments are the same.

Happy to migrate this to the setup.py if you like.
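For reference, a minimal sketch of what such an extras_require entry could look like; the package metadata and pins below are illustrative assumptions, not Spark's actual setup.py:

```python
# Hedged sketch of an extras_require-based dev dependency list.
# Names and versions are illustrative only.
from setuptools import setup

setup(
    name='pyspark',
    packages=['pyspark'],
    install_requires=['py4j'],  # runtime dependency
    extras_require={
        # dev-only dependencies, installed via `pip install pyspark[dev]`
        'dev': [
            'sphinx<3.1.0',
            'pydata_sphinx_theme',
            'ipython',
            'nbsphinx',
        ],
    },
)
```

With something like this, `pip install -e ".[dev]"` would give Jenkins and local environments the same documentation and test tooling.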

@SparkQA commented Aug 20, 2020

Test build #127693 has finished for PR 29491 at commit 185b7ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas (Contributor) left a comment

Quick review of the copy.

@@ -0,0 +1 @@
openjdk-8-jre
Contributor

What does this do / where is it used?

@HyukjinKwon (Member, Author) Aug 21, 2020

It is used by Binder to install the dependencies. Binder launches a Docker image that runs a Jupyter server, and a JRE is required to run PySpark.

There doesn't seem to be a specific way to add a comment in this file, so I couldn't add one. One possible way is #29491 (comment), but I don't have a preference for now. Let me know if you prefer that way.

"cell_type": "markdown",
"metadata": {},
"source": [
"You can also apply a Python native function against each group by using pandas APIs."
Contributor

Ew! 😄

Wouldn't it be better / more useful to show people how to use a plain Python UDF without Pandas?

Member Author

Let's avoid promoting plain Python UDFs for now, since pandas UDFs can do everything plain Python UDFs can do. There was even a discussion somewhere before about deprecating plain Python UDFs.
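For context, here is a minimal sketch (not the actual notebook cell) of the kind of grouped pandas API the quoted cell refers to, assuming Spark 3.0+; the column names and function are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('red', 1.0), ('red', 3.0), ('blue', 2.0)], schema='color string, v double')

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one group as a pandas DataFrame
    return pdf.assign(v=pdf.v - pdf.v.mean())

# Apply a plain Python function to each group through the pandas API
df.groupby('color').applyInPandas(subtract_mean, schema=df.schema).show()
```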

@HyukjinKwon (Member, Author)

Thank you @srowen and @nchammas. I will try to address them.

@viirya (Member) left a comment

A few typos found:

PySpark DataFrame is lazily evaludated and implemented on thetop of RDD.

thetop -> the top

DataFrame.collect() collects the distributed data to the driver side as Python premitive representation.

premitive?

most of column-weise operations return Columns

column-weise -> column-wise?

Suggestion:

PySpark DataFrame also provides the conversion back to a pandas DataFrame in order to leverage pandas APIs.

Should we also mention that toPandas collects the data back to the driver and could throw an OOM error, too? Especially since you mention toPandas after a few APIs that can avoid an out-of-memory exception. (See the sketch after this comment.)

See also "Spark SQL, DataFrames and Datasets Guide" in Apache Spark documentation.

Could this also be a link to the doc?
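As a possible way to word it, here is a small sketch (not from the notebook, assuming a running SparkSession) of the difference being discussed; the DataFrame is only an example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

rows = df.take(5)          # transfers only 5 rows to the driver
pandas_df = df.toPandas()  # collects the entire DataFrame to the driver,
                           # so it can run out of memory for large datasets
```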

@HyukjinKwon (Member, Author)

The notebook link and demo docs are also updated. Should be ready for a review.

@SparkQA commented Aug 21, 2020

Test build #127733 has finished for PR 29491 at commit ca89dbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']
exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints']
Contributor

Do we really need to include .DS_Store here? It feels nasty to add OS-specific files here.

Member Author

It's an excluding pattern :D.

@Fokko (Contributor) left a comment

Great work @HyukjinKwon. I ran the notebook locally and it worked out of the box. I've added a few style suggestions on the text and some minor typos. It might be good to mention that you need to use Spark 3.0.

Furthermore, I always do the following with date/datetime:

# Import date and datetime directly so the literals stay short
from datetime import datetime, date
import pandas as pd

# From a list of tuples, with an explicit DDL schema
df = spark.createDataFrame([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')
df

# From a pandas DataFrame
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})
df = spark.createDataFrame(pandas_df)
df

# From an RDD, with column names only
rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df

This makes the code itself less verbose, as datetime.datetime eats up a lot of characters. Apart from that, it is looking good; I was able to run the whole notebook locally.

"cell_type": "markdown",
"metadata": {},
"source": [
"In order to avoid throwing an out-of-memory exception, use `DataFrame.take()` or `DataFrame.tail()`."
Contributor

The tail() method does not exist for me.

Member Author

Maybe your PySpark was lower than 3.0 (?). It seems to work fine in Binder with Spark 3.0.
This API was added in Spark 3.0.
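A quick sketch of the version dependency being discussed (the code below is illustrative, not part of the notebook):

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(pyspark.__version__)   # DataFrame.tail() exists only from 3.0.0
df = spark.range(5)
print(df.take(2))            # available in older releases as well
print(df.tail(2))            # raises AttributeError on Spark < 3.0
```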

@HyukjinKwon (Member, Author)

Thank you for your review @Fokko!

# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx
Contributor

Should we have a requirements.txt somewhere instead?

@@ -0,0 +1 @@
openjdk-8-jre
Contributor

Any particular reason for JDK8 instead of 11?

Member Author

Because we're using JDK 8 by default on the OSS side :-).
Also, with JDK 11 we would have to set configurations such as -Dio.netty.tryReflectionSetAccessible=true to make it work with the Arrow optimization.
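For illustration only, a hedged sketch of how that flag is typically passed when running PySpark with Arrow on JDK 11; the configuration keys are standard Spark settings, and this snippet is not part of this PR:

```python
from pyspark.sql import SparkSession

netty_flag = '-Dio.netty.tryReflectionSetAccessible=true'
spark = (
    SparkSession.builder
    .config('spark.driver.extraJavaOptions', netty_flag)
    .config('spark.executor.extraJavaOptions', netty_flag)
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')  # Arrow optimization
    .getOrCreate()
)
```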

# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# We should use the latest Sphinx version once this is fixed.
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1"
Contributor

This makes me think we should have a shared requirements.txt somewhere.

Contributor

We tried that in #27928 but couldn't get buy-in from a release manager.

Also discussed briefly here: #29491 (comment)

@HyukjinKwon (Member, Author) Aug 25, 2020

Yeah, let's handle it separately - there's already a JIRA for it, SPARK-31167, as @nchammas pointed out. It's a bit more complicated than I thought.

BryanCutler pushed a commit that referenced this pull request Aug 24, 2020
…st of dict

### What changes were proposed in this pull request?

As discussed in #29491 (comment) and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.

### Why are the changes needed?

This change clarifies to users that this behavior is supported and is not going away in the near future.

### Does this PR introduce _any_ user-facing change?

Yes. There used to be a `UserWarning` for this, but now there isn't.

### How was this patch tested?

I tested this manually.

Before:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```

After:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```

Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
@HyukjinKwon (Member, Author)

I believe all comments are addressed and it's ready for a review, or possibly good to go.

@SparkQA commented Aug 25, 2020

Test build #127881 has finished for PR 29491 at commit 0e114b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment

I took a quick look through the notebook and it looks great. Thanks for doing this, @HyukjinKwon; it will be great for new PySpark users!

@HyukjinKwon (Member, Author)

Thank you @BryanCutler and all!

Merged to master!

@HyukjinKwon HyukjinKwon deleted the SPARK-32204 branch December 7, 2020 02:05