[SPARK-18069][PYTHON] Make PySpark doctests for SQL self-contained #16824
Conversation
Force-pushed from 00c8af3 to b3acaad
Test build #72461 has finished for PR 16824 at commit
Test build #72462 has finished for PR 16824 at commit
python/pyspark/sql/dataframe.py
Outdated
    This overwrites the `how` parameter.
:param subset: optional list of column names to consider.
Oops, let me remove this extra newline when I happen to push more commits.
>>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
>>> df.schema
- StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
+ StructType(List(StructField(age,LongType,true),StructField(name,StringType,true)))
This became LongType because Python integers are inferred as a long type by default in PySpark.
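The inference rule behind this comment can be sketched in plain Python. This is a hypothetical illustration of the mapping, not PySpark's actual implementation; the function name `infer_spark_type` is invented here:

```python
# Hypothetical sketch of PySpark-style schema inference for scalar values.
# Not the real API: PySpark's own logic lives in pyspark.sql.types.
def infer_spark_type(value):
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "BooleanType"
    # Python 3 ints are arbitrary precision, so the safe default is the
    # 64-bit LongType, which prints as `bigint` in a DataFrame repr.
    if isinstance(value, int):
        return "LongType"
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(infer_spark_type(2))        # LongType
print(infer_spark_type('Alice'))  # StringType
```

This is why `age=2` yields `LongType` in the corrected doctest output rather than `IntegerType`.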
>>> df.printSchema()
root
- |-- age: integer (nullable = true)
+ |-- age: long (nullable = true)
This one too.
>>> df.explain()
== Physical Plan ==
- Scan ExistingRDD[age#0,name#1]
+ Scan ExistingRDD[age#...,name#...]
We now create multiple dataframes, so the columns are not always #0 and #1.
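The `#...` markers rely on doctest's ELLIPSIS matching, which lets `...` stand in for the run-dependent column ids. A small self-contained check of that mechanism (the function name `explain_doc` is invented for illustration):

```python
import doctest

def explain_doc():
    """
    Column ids vary between runs, so the expected output matches them
    with `...` via the ELLIPSIS directive:

    >>> print("Scan ExistingRDD[age#17,name#18]")  # doctest: +ELLIPSIS
    Scan ExistingRDD[age#...,name#...]
    """

# Collect and run the docstring's examples explicitly.
tests = doctest.DocTestFinder().find(explain_doc, name="explain_doc")
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
for test in tests:
    runner.run(test)
print(runner.failures)  # 0: the ellipsis pattern matched the concrete ids
```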
>>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
>>> df
- DataFrame[age: int, name: string]
+ DataFrame[age: bigint, name: string]
This is the same issue: the type is bigint by default when we don't explicitly provide a schema.
| 2|Alice|
| 5| Bob|
| 2|Alice|
| 2|Alice|
It seems the row order varies depending on how we create the dataframe (I saw a related JIRA, which I can point to if anyone wants). I haven't looked into this more deeply, as the variation seems unrelated to the tests for the repartition API.
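A common way to keep a doctest stable against nondeterministic row order is to sort the collected results before comparing. A minimal sketch, using plain tuples to stand in for `Row` objects since no Spark session is assumed here:

```python
# Two hypothetical runs returning the same rows in different orders,
# modeled as (age, name) tuples.
rows_run_a = [(2, 'Alice'), (5, 'Bob')]
rows_run_b = [(5, 'Bob'), (2, 'Alice')]

# Sorting canonicalizes the order, so a doctest expectation written
# against sorted output passes on every run.
assert sorted(rows_run_a) == sorted(rows_run_b)
print(sorted(rows_run_b))  # [(2, 'Alice'), (5, 'Bob')]
```

In a PySpark doctest this would look like `sorted(df.collect())` in the example line, with the sorted list as the expected output.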
>>> rdd = sc.parallelize([Row(field1=1, field2="row1"), Row(field1=2, field2="row2")])
>>> rdd.toDF().collect()
- [Row(name=u'Alice', age=1)]
+ [Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2')]
It seems this doctest does not run (it sits in a nested function), since the expected results were already wrong.
>>> rdd = sc.parallelize(
... [Row(field1=1, field2="row1"),
... Row(field1=2, field2="row2"),
... Row(field1=3, field2="row3")])
>>> rdd.toDF().collect()
[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2'), Row(field1=3, field2=u'row3')]
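The "doctest does not run" observation above can be verified in isolation: doctest discovers docstrings of module-level and class-level objects, but not of functions created inside another function at call time. A small check with hypothetical names (`outer`, `inner`):

```python
import doctest

def outer():
    """This docstring is discoverable by doctest.

    >>> 1 + 1
    2
    """
    def inner():
        """This one is NOT discoverable: `inner` only exists while
        `outer` runs, so its wrong expected output goes unnoticed.

        >>> 1 + 1
        3
        """
    return inner

# DocTestFinder walks attributes, not function bodies, so only
# outer's docstring is collected.
tests = doctest.DocTestFinder().find(outer, name="outer")
print(len(tests))  # 1
```

This is consistent with the review comment: a stale expected value inside a nested function never fails, because it is never executed.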
Test build #72496 has finished for PR 16824 at commit
gentle ping @holdenk
Just back from Spark Summit East, I'll try and take a look soon :)
I'm slightly against the work to make this change happen and would rather focus on some other Python PRs. I'm not sure it improves the readability of the test cases as much as planned, but I'll defer to @davies's judgement on this one.

What changes were proposed in this pull request?
This PR proposes to make the examples in the Python API documentation self-contained.
Self-contained examples are common in Python API documentation, for example in pandas and NumPy.
Before
After
How was this patch tested?
Manually tested the doctests.
Closes #15053