
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to make the examples in the Python API documentation self-contained.

Self-contained examples seem pretty common in Python API documentation, for example in pandas and numpy.
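To illustrate the idea, here is a minimal stdlib-only sketch (with a hypothetical `add_one` function, not anything from this PR) of what "self-contained" means: the doctest sets up its own inputs instead of depending on a variable defined in an earlier example.

```python
import doctest

def add_one(x):
    """Return x plus one.

    The example below is self-contained: it defines the input it
    needs instead of reusing a variable from a previous doctest.

    >>> value = 41
    >>> add_one(value)
    42
    """
    return x + 1

# Collect and run the docstring examples for just this function.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
results = [runner.run(t) for t in finder.find(add_one)]
print(sum(r.failed for r in results))  # 0 failures
```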

Before

(screenshots of the rendered API docs before the change)

After

(screenshots of the rendered API docs after the change)

How was this patch tested?

Manually tested the doctests.

 ./run-tests.py --python-executables=python2.6 --modules=pyspark-sql
 ./run-tests.py --python-executables=python3.6 --modules=pyspark-sql

Closes #15053

@HyukjinKwon
Member Author

cc @holden, @srowen and @mortada, who were involved in the PR. Could the three of you take a look when you have some time?

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72461 has finished for PR 16824 at commit 00c8af3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72462 has finished for PR 16824 at commit b3acaad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This overwrites the `how` parameter.
:param subset: optional list of column names to consider.
Member Author

@HyukjinKwon HyukjinKwon Feb 6, 2017


Oops, let me remove this extra newline when I happen to push more commits.

 >>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
 >>> df.schema
-StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
+StructType(List(StructField(age,LongType,true),StructField(name,StringType,true)))
Member Author


This happened to become LongType, as it seems numbers become a long type by default in PySpark.
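As a rough stdlib-only sketch of why a plain Python int shows up as LongType (a hypothetical `infer_spark_type_name` helper, not PySpark's actual inference code): Python 3 ints are arbitrary precision, so a 64-bit type is the safe default when no explicit schema is given.

```python
def infer_spark_type_name(value):
    """Loosely mirror how PySpark maps Python values to Spark SQL types.

    Hypothetical helper for illustration only; the real inference
    lives inside PySpark and handles many more cases.
    """
    if isinstance(value, bool):  # check bool first: bool subclasses int
        return "BooleanType"
    if isinstance(value, int):
        return "LongType"        # 64-bit by default, shown as "bigint"
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError("unsupported type: %s" % type(value).__name__)

print(infer_spark_type_name(2))        # LongType
print(infer_spark_type_name("Alice"))  # StringType
```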

 >>> df.printSchema()
 root
-|-- age: integer (nullable = true)
+|-- age: long (nullable = true)
Member Author


This one too.

 >>> df.explain()
 == Physical Plan ==
-Scan ExistingRDD[age#0,name#1]
+Scan ExistingRDD[age#...,name#...]
Member Author


We now create multiple DataFrames, so the columns are not always #0 and #1.
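The `#...` in the expected output works because, if I understand correctly, the doctests run with doctest's ELLIPSIS option enabled, which lets `...` match the varying column IDs. A small stdlib sketch:

```python
import doctest

# An expected output containing "..." matches any text at that spot
# when the ELLIPSIS option is on, so unpredictable IDs still pass.
snippet = '''
>>> print("Scan ExistingRDD[age#42,name#43]")
Scan ExistingRDD[age#...,name#...]
'''

parser = doctest.DocTestParser()
dtest = parser.get_doctest(snippet, {}, "ellipsis_demo", None, 0)
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
results = runner.run(dtest)
print(results.failed)  # 0: the ellipses absorb the column numbers
```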

 >>> df = spark.createDataFrame([Row(name='Alice', age=2), Row(name='Bob', age=5)])
 >>> df
-DataFrame[age: int, name: string]
+DataFrame[age: bigint, name: string]
Member Author


This is the same change: it seems the type is bigint by default if we don't explicitly provide the schema.

| 2|Alice|
| 5| Bob|
| 2|Alice|
| 2|Alice|
Member Author


It seems the row order varies depending on how we create the DataFrame (I saw a related JIRA that I can point out if anyone wants). I haven't looked into this deeper, as this variation does not seem related to the tests in the repartition API.

 >>> rdd = sc.parallelize([Row(field1=1, field2="row1"), Row(field1=2, field2="row2")])
 >>> rdd.toDF().collect()
-[Row(name=u'Alice', age=1)]
+[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2')]
Member Author

@HyukjinKwon HyukjinKwon Feb 7, 2017


It seems this doctest does not actually run (it is inside a nested function), as the existing expected results were already wrong.

>>> rdd = sc.parallelize(
...          [Row(field1=1, field2="row1"),
...           Row(field1=2, field2="row2"),
...           Row(field1=3, field2="row3")])
>>> rdd.toDF().collect()
[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2'), Row(field1=3, field2=u'row3')]
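As a side note, a stdlib sketch of why a doctest inside a nested function never gets collected: doctest's finder walks modules, classes, and functions, but it cannot see a function defined inside another function's body (hypothetical `outer`/`inner` names for illustration).

```python
import doctest

def outer():
    """
    >>> 1 + 1
    2
    """
    def inner():
        """
        >>> 1 + 1
        3
        """
    return inner

# The finder only discovers outer's docstring; inner's (wrong)
# doctest is invisible, so it would never run or fail.
tests = doctest.DocTestFinder().find(outer)
print([t.name for t in tests])  # ['outer']
```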

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72496 has finished for PR 16824 at commit 59da0f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Gentle ping @holdenk (somehow writing @holdenk does not show your name in the autocomplete).

@holdenk
Contributor

holdenk commented Feb 11, 2017

Just back from Spark Summit East, I'll try and take a look soon :)

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 18, 2017

@davies and @holdenk, what do you think about this change? I am fine if we decide it is not worth it for now; I can close this if neither of you is confident in this change.

@holdenk
Contributor

holdenk commented Feb 24, 2017

I'm slightly against putting in the work to make this change happen and would rather focus on some other Python PRs. I'm not sure it improves the readability of the test cases as much as planned, but I'll defer to @davies's judgement on this one.

@HyukjinKwon
Member Author

Thanks @holdenk, let me close this for now.

@davies, please give me your opinion. If you think it is worthwhile, I will reopen this. If not, I will resolve the JIRA.
