@@ -1640,7 +1640,7 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
 You may run `./bin/spark-sql --help` for a complete list of all available
 options.

-# Usage Guide for Pandas with Arrow
+# PySpark Usage Guide for Pandas with Arrow

 ## Arrow in Spark

@@ -1651,19 +1651,19 @@ changes to configuration or code to take full advantage and ensure compatibility
 give a high-level description of how to use Arrow in Spark and highlight any differences when
 working with Arrow-enabled data.

-## Ensure pyarrow Installed
+### Ensure PyArrow Installed

-If you install pyspark using pip, then pyarrow can be brought in as an extra dependency of the sql
-module with the command "pip install pyspark[sql]". Otherwise, you must ensure that pyarrow is
-installed and available on all cluster node Python environments. The current supported version is
-0.8.0. You can install using pip or conda from the conda-forge channel. See pyarrow
+If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See PyArrow
 [installation](https://arrow.apache.org/docs/python/install.html) for details.

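A quick way to confirm that a node's environment meets this requirement is to import PyArrow and compare its version against the 0.8.0 minimum stated above; a minimal sketch, not part of the guide's own examples:

{% highlight python %}
# Illustrative sanity check only.
from distutils.version import LooseVersion

import pyarrow

# 0.8.0 is the currently supported version named in the paragraph above.
assert LooseVersion(pyarrow.__version__) >= LooseVersion("0.8.0"), \
    "PyArrow 0.8.0 or newer is required"
{% endhighlight %}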
-## How to Enable for Conversion to/from Pandas
+## Enabling for Conversion to/from Pandas

 Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
 `toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
-To use Arrow when executing these calls, it first must be enabled by setting the Spark conf
+To use Arrow when executing these calls, it first must be enabled by setting the Spark configuration
 'spark.sql.execution.arrow.enabled' to 'true'; this is disabled by default.

 <div class="codetabs">
@@ -1683,7 +1683,7 @@ pdf = pd.DataFrame(np.random.rand(100, 3))
 df = spark.createDataFrame(pdf)

 # Convert the Spark DataFrame to a local Pandas DataFrame
-selpdf = df.select("*").toPandas()
+selpdf = df.select("*").toPandas()

 {% endhighlight %}
 </div>
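The opening lines of the example above fall outside this hunk; the key step described in the preceding paragraph is setting the configuration on an existing `SparkSession` (assumed here to be named `spark`). A minimal sketch:

{% highlight python %}
# Enable Arrow-based transfers for toPandas() and createDataFrame(pandas_df);
# this configuration is disabled by default.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
{% endhighlight %}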
@@ -1751,13 +1751,14 @@ GroupBy-Apply implements the "split-apply-combine" pattern. Split-apply-combine
   input data contains all the rows and columns for each group.
 * Combine the results into a new `DataFrame`.

-To use GroupBy-Apply, user needs to define:
-* A python function that defines the computation for each group
-* A `StructType` object or a string that defines the output schema of the output `DataFrame`
+To use GroupBy-Apply, define the following:
+
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output `DataFrame`.

 Examples:

-The first example shows a simple use case: subtracting mean from each value in the group.
+The first example shows a simple use case: subtracting the mean from each value in the group.

 <div class="codetabs">
 <div data-lang="python" markdown="1">
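The guide's own code for this example sits outside the visible hunk; the following is a minimal sketch of the pattern just described, a grouped map Pandas UDF that subtracts the group mean, assuming a `SparkSession` named `spark` and toy data invented for illustration:

{% highlight python %}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Hypothetical toy data: two groups keyed by "id".
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The output schema can be given as a DDL string or a StructType.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows and columns of one group.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
{% endhighlight %}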
@@ -1864,15 +1865,14 @@ batches for processing.
 Spark internally stores timestamps as UTC values, and timestamp data that is brought in without
 a specified time zone is converted as local time to UTC with microsecond resolution. When timestamp
 data is exported or displayed in Spark, the session time zone is used to localize the timestamp
-values. The session time zone is set with the conf 'spark.sql.session.timeZone' and will default
-to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
-resolution, `datetime64[ns]`, and optional time zone that can be applied on a per-column basis.
+values. The session time zone is set with the configuration 'spark.sql.session.timeZone' and will
+default to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
+resolution, `datetime64[ns]`, with optional time zone on a per-column basis.

 When timestamp data is transferred from Spark to Pandas it will be converted to nanoseconds
-and each column will be made time zone aware using the Spark session time zone. This will occur
-when calling `toPandas()` or `pandas_udf` with a timestamp column. For example if the session time
-zone is 'America/Los_Angeles' then the Pandas timestamp column will be of type
-`datetime64[ns, America/Los_Angeles]`.
+and each column will be converted to the Spark session time zone then localized to that time
+zone, which removes the time zone and displays values as local time. This will occur
+when calling `toPandas()` or `pandas_udf` with timestamp columns.

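To make the behaviour above concrete, the session time zone can be set explicitly before converting; a minimal sketch with an assumed `SparkSession` named `spark` and made-up data:

{% highlight python %}
import datetime

# Session time zone used to localize timestamp values on export.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

df = spark.createDataFrame(
    [(1, datetime.datetime(2018, 3, 10, 12, 0, 0))], ("id", "ts"))

# The timestamp column comes back as local time in the session time zone.
pdf = df.toPandas()
print(pdf["ts"])
{% endhighlight %}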
 When timestamp data is transferred from Pandas to Spark, it will be converted to UTC microseconds. This
 occurs when calling `createDataFrame` with a Pandas DataFrame or when returning a timestamp from a