
Conversation

@HyukjinKwon (Member) commented Aug 20, 2020

What changes were proposed in this pull request?

This PR proposes to:

  • add a notebook with Binder integration that allows users to try PySpark in a live notebook. Please try it here.
  • reuse this notebook as a quickstart guide in the PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works by building a Docker image: once somebody builds the image for a specific commit, other people can reuse it.
Therefore, if we run Binder with images built from released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.


I made a simple demo to make it easier to review. Please see:

  • Main page. Note that the link ("Live Notebook") on the main page won't work since this PR is not merged yet.
  • Quickstart page

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be to:

  • open here.
  • edit / change / update the notebook. Please feel free to change it however you want. I can apply it as-is, or update it a bit more, when I apply it to this PR.
  • download it as a .ipynb file.
  • upload the .ipynb file here in a GitHub comment. Then I will push a commit with that file, crediting you correctly, of course.
  • alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:

Why are the changes needed?

To improve PySpark's usability. The current quickstart is not very friendly for Python users.

Does this PR introduce any user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

How was this patch tested?

Manually tested, and GitHub Actions builds will also test it.

@HyukjinKwon (Member, Author) commented Aug 20, 2020

cc @Fokko, @holdenk, @nchammas, @srowen, @dongjoon-hyun, @viirya, @BryanCutler, @ueshin, @felixcheung FYI. Could you help review this when you find some time? I am going to proofread it a few more times myself, but it would be nice to get some feedback from you.

@HyukjinKwon (Member, Author)

cc @rohitmishr1484 and @fhoering, who are also working on other pages.

@HyukjinKwon HyukjinKwon changed the title [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation Aug 20, 2020
@gatorsmile (Member) commented Aug 20, 2020

This is a cool DOC feature to improve the adoption of PySpark!

Unidecode==0.04.19
sphinx
pydata_sphinx_theme
ipython
Member

Are there other implications to this, like our Python packaging now requiring ipython when installing pyspark? (Sorry, ignorant question.)

@HyukjinKwon (Member, Author) Aug 21, 2020

Nope, it means nothing special, and this file is used nowhere within Spark. It's just for some developers, and I doubt it is actually used often. We should leverage this file and standardize the dependencies somehow, as @nchammas and @dongjoon-hyun tried before, but I currently don't have a good idea of how to handle it.

Contributor

The packages that are installed when installing pyspark are defined in setup.py: https://github.com/apache/spark/blob/master/python/setup.py#L206

The de facto way is to add an extras_require entry called dev where you add the development requirements, preferably with versions attached, so you know that Jenkins and the local dev environments are the same.

Happy to migrate this to the setup.py if you like.
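For reference, a minimal sketch of what such an extras_require entry could look like; the package metadata and pins below are illustrative assumptions, not Spark's actual setup.py:

```python
# Hedged sketch of an extras_require-based dev dependency list.
# Names and versions are illustrative only.
from setuptools import setup

setup(
    name='pyspark',
    packages=['pyspark'],
    install_requires=['py4j'],  # runtime dependency
    extras_require={
        # dev-only dependencies, installed via `pip install pyspark[dev]`
        'dev': [
            'sphinx<3.1.0',
            'pydata_sphinx_theme',
            'ipython',
            'nbsphinx',
        ],
    },
)
```

With something like this, `pip install -e ".[dev]"` would give Jenkins and local environments the same documentation and test tooling.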

@SparkQA commented Aug 20, 2020

Test build #127693 has finished for PR 29491 at commit 185b7ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas (Contributor) left a comment

Quick review of the copy.

@@ -0,0 +1 @@
openjdk-8-jre
Contributor

What does this do / where is it used?

@HyukjinKwon (Member, Author) Aug 21, 2020

It is used by Binder to install the dependencies. Binder launches a Docker image that runs a Jupyter server, and a JRE is required to run PySpark.

There doesn't seem to be a specific way to add a comment in this file, so I couldn't add one. One possible way is #29491 (comment), but I don't have a preference for now. Let me know if you prefer that way.

"cell_type": "markdown",
"metadata": {},
"source": [
"You can also apply a Python native function against each group by using pandas APIs."
Contributor

Ew! 😄

Wouldn't it be better / more useful to show people how to use a plain Python UDF without Pandas?

Member Author

Let's avoid promoting plain Python UDFs for now, since pandas UDFs can do everything plain Python UDFs can do. There was even a discussion somewhere before about deprecating plain Python UDFs.
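For context, here is a minimal sketch (not the actual notebook cell) of the kind of grouped pandas API the quoted cell refers to, assuming Spark 3.0+; the column names and function are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('red', 1.0), ('red', 3.0), ('blue', 2.0)], schema='color string, v double')

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one group as a pandas DataFrame
    return pdf.assign(v=pdf.v - pdf.v.mean())

# Apply a plain Python function to each group through the pandas API
df.groupby('color').applyInPandas(subtract_mean, schema=df.schema).show()
```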

@HyukjinKwon (Member, Author)

Thank you @srowen and @nchammas. I will try to address them.

@viirya (Member) left a comment

A few typos found:

PySpark DataFrame is lazily evaludated and implemented on thetop of RDD.

thetop -> the top

DataFrame.collect() collects the distributed data to the driver side as Python premitive representation.

premitive?

most of column-weise operations return Columns

column-weise -> column-wise?

Suggestion:

PySpark DataFrame also provides the conversion back to a pandas DataFrame in order to leverage pandas APIs.

Should we also mention that toPandas collects the data back to the driver and could throw an OOM error, too? Especially since you mention toPandas after a few APIs that can avoid an out-of-memory exception. (See the sketch after this comment.)

See also "Spark SQL, DataFrames and Datasets Guide" in Apache Spark documentation.

Could this also be a link to the doc?
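As a possible way to word it, here is a small sketch (not from the notebook, assuming a running SparkSession) of the difference being discussed; the DataFrame is only an example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

rows = df.take(5)          # transfers only 5 rows to the driver
pandas_df = df.toPandas()  # collects the entire DataFrame to the driver,
                           # so it can run out of memory for large datasets
```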

@HyukjinKwon (Member, Author)

The notebook link and demo docs are also updated. Should be ready for a review.

@SparkQA commented Aug 21, 2020

Test build #127733 has finished for PR 29491 at commit ca89dbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']
exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints']
Contributor

Do we really need to include .DS_Store here? It feels nasty to add OS-specific files here.

Member Author

It's an excluding pattern :D.

@Fokko (Contributor) left a comment

Great work @HyukjinKwon. I ran the notebook locally and it worked out of the box. I've added a few style suggestions on the text and some minor typos. It might be good to mention that you need to use Spark 3.0.

Furthermore, I always do the following with date/datetime:

# Import date and datetime directly so the literals stay short
from datetime import datetime, date
import pandas as pd

# From a list of tuples, with an explicit DDL schema
df = spark.createDataFrame([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')
df

# From a pandas DataFrame
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})
df = spark.createDataFrame(pandas_df)
df

# From an RDD, with column names only
rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df

This makes the code itself less verbose, as datetime.datetime eats up a lot of characters. Apart from that, it is looking good; I was able to run the whole notebook locally.

"cell_type": "markdown",
"metadata": {},
"source": [
"In order to avoid throwing an out-of-memory exception, use `DataFrame.take()` or `DataFrame.tail()`."
Contributor

The tail() method does not exist for me.

Member Author

Maybe your PySpark was lower than 3.0 (?). It seems to work fine in Binder with Spark 3.0.
This API was added in Spark 3.0.
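A quick sketch of the version dependency being discussed (the code below is illustrative, not part of the notebook):

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(pyspark.__version__)   # DataFrame.tail() exists only from 3.0.0
df = spark.range(5)
print(df.take(2))            # available in older releases as well
print(df.tail(2))            # raises AttributeError on Spark < 3.0
```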

@HyukjinKwon (Member, Author)

Thank you for your review @Fokko!

# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx
Contributor

Should we have a requirements.txt somewhere instead?

@@ -0,0 +1 @@
openjdk-8-jre
Contributor

Any particular reason for JDK8 instead of 11?

Member Author

Because we're using JDK 8 by default on the OSS side :-).
Also, with JDK 11 we would have to set configurations such as -Dio.netty.tryReflectionSetAccessible=true to make it work with the Arrow optimization.
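For illustration only, a hedged sketch of how that flag is typically passed when running PySpark with Arrow on JDK 11; the configuration keys are standard Spark settings, and this snippet is not part of this PR:

```python
from pyspark.sql import SparkSession

netty_flag = '-Dio.netty.tryReflectionSetAccessible=true'
spark = (
    SparkSession.builder
    .config('spark.driver.extraJavaOptions', netty_flag)
    .config('spark.executor.extraJavaOptions', netty_flag)
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')  # Arrow optimization
    .getOrCreate()
)
```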

# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# We should use the latest Sphinx version once this is fixed.
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1"
Contributor

This makes me think we should have a shared requirements.txt somewhere.

Contributor

We tried that in #27928 but couldn't get buy-in from a release manager.

Also discussed briefly here: #29491 (comment)

@HyukjinKwon (Member, Author) Aug 25, 2020

Yeah, let's handle it separately - there's already a JIRA for it, SPARK-31167, as @nchammas pointed out. It's a bit more complicated than I thought.

BryanCutler pushed a commit that referenced this pull request Aug 24, 2020
…st of dict

### What changes were proposed in this pull request?

As discussed in #29491 (comment) and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.

### Why are the changes needed?

This change clarifies to users that this behavior is supported and is not going away in the near future.

### Does this PR introduce _any_ user-facing change?

Yes. There used to be a `UserWarning` for this, but now there isn't.

### How was this patch tested?

I tested this manually.

Before:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```

After:

```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```

Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
@HyukjinKwon (Member, Author)

I believe all comments are addressed and it's ready for a review, or possibly good to go.

@SparkQA commented Aug 25, 2020

Test build #127881 has finished for PR 29491 at commit 0e114b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a comment

I took a quick look through the notebook and it looks great. Thanks for doing this, @HyukjinKwon; it will be great for new PySpark users!

@HyukjinKwon (Member, Author)

Thank you @BryanCutler and all!

Merged to master!

@HyukjinKwon HyukjinKwon deleted the SPARK-32204 branch December 7, 2020 02:05