[SPARK-18069][PYTHON] Make PySpark doctests for SQL self-contained #16824
Conversation
Force-pushed from 00c8af3 to b3acaad
Test build #72461 has finished for PR 16824 at commit
Test build #72462 has finished for PR 16824 at commit
python/pyspark/sql/dataframe.py
Outdated
    This overwrites the `how` parameter.
:param subset: optional list of column names to consider.
Oops, let me remove this extra newline when I happen to push more commits.
>>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
>>> df.schema
- StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
+ StructType(List(StructField(age,LongType,true),StructField(name,StringType,true)))
This became LongType because Python integers are inferred as a long type by default in PySpark.
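The inference rule behind this comment can be sketched in plain Python. This is a hypothetical illustration of the mapping, not PySpark's actual implementation; the function name `infer_spark_type` is invented here:

```python
# Hypothetical sketch of PySpark-style schema inference for scalar values.
# Not the real API: PySpark's own logic lives in pyspark.sql.types.
def infer_spark_type(value):
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "BooleanType"
    # Python 3 ints are arbitrary precision, so the safe default is the
    # 64-bit LongType, which prints as `bigint` in a DataFrame repr.
    if isinstance(value, int):
        return "LongType"
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(infer_spark_type(2))        # LongType
print(infer_spark_type('Alice'))  # StringType
```

This is why `age=2` yields `LongType` in the corrected doctest output rather than `IntegerType`.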
>>> df.printSchema()
root
- |-- age: integer (nullable = true)
+ |-- age: long (nullable = true)
This one too.
>>> df.explain()
== Physical Plan ==
- Scan ExistingRDD[age#0,name#1]
+ Scan ExistingRDD[age#...,name#...]
We now create multiple dataframes, so the columns are not always #0 and #1.
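The `#...` markers rely on doctest's ELLIPSIS matching, which lets `...` stand in for the run-dependent column ids. A small self-contained check of that mechanism (the function name `explain_doc` is invented for illustration):

```python
import doctest

def explain_doc():
    """
    Column ids vary between runs, so the expected output matches them
    with `...` via the ELLIPSIS directive:

    >>> print("Scan ExistingRDD[age#17,name#18]")  # doctest: +ELLIPSIS
    Scan ExistingRDD[age#...,name#...]
    """

# Collect and run the docstring's examples explicitly.
tests = doctest.DocTestFinder().find(explain_doc, name="explain_doc")
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
for test in tests:
    runner.run(test)
print(runner.failures)  # 0: the ellipsis pattern matched the concrete ids
```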
>>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
>>> df
- DataFrame[age: int, name: string]
+ DataFrame[age: bigint, name: string]
This is the same issue: the type is bigint by default when we don't explicitly provide a schema.
| 2|Alice|
| 5| Bob|
| 2|Alice|
| 2|Alice|
It seems the row order varies depending on how we create the dataframe (I saw a related JIRA, which I can point to if anyone wants). I haven't looked into this more deeply, as the variation seems unrelated to the tests for the repartition API.
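A common way to keep a doctest stable against nondeterministic row order is to sort the collected results before comparing. A minimal sketch, using plain tuples to stand in for `Row` objects since no Spark session is assumed here:

```python
# Two hypothetical runs returning the same rows in different orders,
# modeled as (age, name) tuples.
rows_run_a = [(2, 'Alice'), (5, 'Bob')]
rows_run_b = [(5, 'Bob'), (2, 'Alice')]

# Sorting canonicalizes the order, so a doctest expectation written
# against sorted output passes on every run.
assert sorted(rows_run_a) == sorted(rows_run_b)
print(sorted(rows_run_b))  # [(2, 'Alice'), (5, 'Bob')]
```

In a PySpark doctest this would look like `sorted(df.collect())` in the example line, with the sorted list as the expected output.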
>>> rdd = sc.parallelize([Row(field1=1, field2="row1"), Row(field1=2, field2="row2")])
>>> rdd.toDF().collect()
- [Row(name=u'Alice', age=1)]
+ [Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2')]
It seems this doctest does not run (it sits in a nested function), since the expected results were already wrong.
>>> rdd = sc.parallelize(
... [Row(field1=1, field2="row1"),
... Row(field1=2, field2="row2"),
... Row(field1=3, field2="row3")])
>>> rdd.toDF().collect()
[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2'), Row(field1=3, field2=u'row3')]
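The "doctest does not run" observation above can be verified in isolation: doctest discovers docstrings of module-level and class-level objects, but not of functions created inside another function at call time. A small check with hypothetical names (`outer`, `inner`):

```python
import doctest

def outer():
    """This docstring is discoverable by doctest.

    >>> 1 + 1
    2
    """
    def inner():
        """This one is NOT discoverable: `inner` only exists while
        `outer` runs, so its wrong expected output goes unnoticed.

        >>> 1 + 1
        3
        """
    return inner

# DocTestFinder walks attributes, not function bodies, so only
# outer's docstring is collected.
tests = doctest.DocTestFinder().find(outer, name="outer")
print(len(tests))  # 1
```

This is consistent with the review comment: a stale expected value inside a nested function never fails, because it is never executed.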
Test build #72496 has finished for PR 16824 at commit
gentle ping @holdenk
Just back from Spark Summit East, I'll try and take a look soon :)
I'm slightly against the work to make this change happen and would rather focus on some other Python PRs. I'm not sure it improves the readability of the test cases as much as planned, but I'll defer to @davies's judgement on this one.

What changes were proposed in this pull request?
This PR proposes to make the examples in the Python API documentation self-contained.
Self-contained examples are common in Python API documentation, for example in pandas and NumPy.
Before
After
How was this patch tested?
Manually tested the doctests.
Closes #15053