[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas #19459
Conversation
Test build #82559 has finished for PR 19459 at commit
Benchmarks running in local mode: 16 GB memory, i7-4800MQ CPU @ 2.70GHz × 8 cores.

Code:

import pandas as pd
import numpy as np

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))
%timeit spark.createDataFrame(pdf)

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
%timeit spark.createDataFrame(pdf)

Without Arrow:
With Arrow:
Speedup of ~235x

Also tested creating up to 2 million rows with Arrow and results scale linearly.
Test build #82560 has finished for PR 19459 at commit
python/pyspark/sql/tests.py
Outdated
pdf = self.createPandasDataFrameFromeData()
self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
df_no_arrow = self.spark.createDataFrame(pdf)
self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
I'd set this back to true in a finally block, just in case the test fails in df_no_arrow = self.spark.createDataFrame(pdf) and spark.sql.execution.arrow.enable remains false, affecting other test cases in the future (if I didn't miss something).
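For illustration, a minimal sketch of the suggested try/finally pattern, reusing the names from the diff above (pdf, self.spark):

self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
try:
    df_no_arrow = self.spark.createDataFrame(pdf)
finally:
    # Restore the conf even if createDataFrame raises, so later tests still run with Arrow enabled
    self.spark.conf.set("spark.sql.execution.arrow.enable", "true")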
done. I guess this would make the failure easier to see?
Hmmm, I thought the tearDownClass was there but it's actually in #18664. Maybe I should put it in here since that needs some more discussion.
  }
}

def toDataFrame(
I had to make this public to be callable with py4j. Alternatively, something could be added to o.a.s.sql.api.python.PythonSQLUtils?
Yup, I think we should put it there, o.a.s.sql.api.python.PythonSQLUtils.
I left the conversion logic in ArrowConverters because I think there is a good chance it will change, so I just added a wrapper in PythonSQLUtils. Let me know if that's OK.
Test build #82573 has finished for PR 19459 at commit
python/pyspark/sql/session.py
Outdated
if schema is None:
    schema = [str(x) for x in data.columns]
data = [r.tolist() for r in data.to_records(index=False)]
if self.conf.get("spark.sql.execution.arrow.enable", "false").lower() == "true" \
Oh thanks, I didn't see that go in. I'll update.
Test build #82601 has finished for PR 19459 at commit
python/pyspark/sql/tests.py
Outdated
    ("\n\nWithout:\n%s\n%s" % (df_without, df_without.dtypes)))
self.assertTrue(df_without.equals(df_with_arrow), msg=msg)

def createPandasDataFrameFromeData(self):
nit: typo createPandasDataFrameFromeData -> createPandasDataFrameFromData
python/pyspark/sql/session.py
Outdated
os.unlink(tempFile.name)

# Create the Spark DataFrame, there will be at least 1 batch
schema = from_arrow_schema(batches[0].schema)
What if a user specifies the schema?
Good point. We can pass the schema, if provided, into to_pandas for pyarrow to use when creating the RecordBatch.
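For illustration only, a standalone pyarrow sketch of what passing a user-supplied schema could look like; the column names and types here are made up, and RecordBatch.from_pandas is assumed to accept a schema argument as in recent pyarrow versions:

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": [1, 2], "b": [0.1, 0.2]})
# Build the Arrow schema from the user-provided types instead of inferring it from the data
arrow_schema = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.float64())])
batch = pa.RecordBatch.from_pandas(pdf, schema=arrow_schema, preserve_index=False)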
python/pyspark/sql/session.py
Outdated
data = [r.tolist() for r in data.to_records(index=False)]
if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
        and len(data) > 0:
    from pyspark.serializers import ArrowSerializer
Maybe we should split this block out into a method like _createFromPandasDataFrame, the same as the other create methods?
That's probably a good idea since it's a big block of code. The other create methods return a (rdd, schema) pair, then do further processing to create a DataFrame. Here we would have to just return a DataFrame since we don't want to do the further processing.
Oh wait, without Arrow it creates a (rdd, schema) pair like the others, so handling both the Arrow and non-Arrow cases in _createFromPandasDataFrame doesn't fit because they return different things. How about just putting the new Arrow code in a method like _createFromArrowPandas?
HyukjinKwon left a comment
Looks pretty good to me except for those comments.
python/pyspark/sql/session.py
Outdated
# Slice the DataFrame into batches
split = -(-len(data) // self.sparkContext.defaultParallelism)  # round int up
slices = (data[i:i + split] for i in xrange(0, len(data), split))
How about split -> size (or length) and i -> offset (or start)?
Sure, how about size, start, and step? That looks like the terminology used in parallelize.
Yea, sounds fine.
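For reference, a tiny standalone example of the round-up slicing above, with a literal list and parallelism value standing in for data and self.sparkContext.defaultParallelism:

data = list(range(10))
parallelism = 4                       # stand-in for self.sparkContext.defaultParallelism
step = -(-len(data) // parallelism)   # ceiling division: 10 / 4 rounded up to 3
slices = [data[start:start + step] for start in range(0, len(data), step)]
# slices == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]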
python/pyspark/sql/session.py
Outdated
# write batches to temp file, read by JVM (borrowed from context.parallelize)
import os
from tempfile import NamedTemporaryFile
I'd put those imports above, next to import pyarrow as pa, or at the top of this file ...
python/pyspark/sql/session.py
Outdated
jdf = self._jvm.PythonSQLUtils.arrowPayloadToDataFrame(
    jrdd, schema.json(), self._wrapped._jsqlContext)
df = DataFrame(jdf, self._wrapped)
df._schema = schema
I'd leave some comments here about why _schema should be manually set rather than using its own schema. Actually, mind if I ask why we should set this manually?
If the schema is not set here, then it will lazily create it through a py4j exchange with the java DataFrame. Since we already have it here, we can just set it and save some time. I don't like manually setting it like this though, it should be an optional arg in the DataFrame constructor. I'll make that change, but if you prefer not to do that I can revert.
Ahh, okay, that's fine to me.
Thanks for the reviews @ueshin and @HyukjinKwon! I added
python/pyspark/sql/session.py
Outdated
def _createFromPandasWithArrow(self, df, schema):
    """
    Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting
typo: slicing the ...
Thanks! fixed
python/pyspark/sql/session.py
Outdated
    for df_slice in df_slices]

# write batches to temp file, read by JVM (borrowed from context.parallelize)
tempFile = NamedTemporaryFile(delete=False, dir=self._sc._temp_dir)
This looks like a near-duplicate of the main logic in context.parallelize. Maybe we can extract a common function from it.
Yeah, it is - I didn't want to mess around with the parallelize() logic, so I left it alone. If we were to make a common function it would look like this:

def _dump_to_tempfile(data, serializer, parallelism):
    tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    try:
        serializer.dump_stream(data, tempFile)
        tempFile.close()
        readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
        return readRDDFromFile(self._jsc, tempFile.name, parallelism)
    finally:
        # readRDDFromFile eagerly reads the file so we can delete right after.
        os.unlink(tempFile.name)

and some changes to parallelize to call it:

# Make sure we distribute data evenly if it's smaller than self.batchSize
if "__len__" not in dir(c):
    c = list(c)  # Make it a list so we can compute its length
batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
jrdd = _dump_to_tempfile(c, serializer, numSlices)

Let me know if you all think we should change this.
I'd prefer less duplication. Let's see if others support it.
Sounds good to me.
python/pyspark/sql/session.py
Outdated
    data = [schema.toInternal(row) for row in data]
    return self._sc.parallelize(data), schema

def _createFromPandasWithArrow(self, df, schema):
nit: df -> pdf.
Test build #82764 has finished for PR 19459 at commit
LGTM with a few minor comments.
python/pyspark/sql/dataframe.py
Outdated
self._sc = sql_ctx and sql_ctx._sc
self.is_cached = False
self._schema = None  # initialized lazily
self._schema = schema  # initialized lazily if None
@BryanCutler, what do you think about taking this back out? Maybe I am too worried, but I think we should avoid having this assigned except for a few special cases ...
Sure, I can undo it. I don't really like manually assigning the schema after the constructor, but it's only done in these 2 special cases.
def arrowPayloadToDataFrame(
     payloadRDD: JavaRDD[Array[Byte]],
     schemaString: String,
     sqlContext: SQLContext): DataFrame = {
I can't believe I found this: it looks like 5 spaces instead of 4.
oh man, good catch! I don't know how that happened :\
I think it is a bug; we should fix it first. BTW, I'm fine with upgrading Arrow, just make sure we get everything we need at the Arrow version we want to upgrade to, then remove all the hacks on the Spark side. We should throw an exception if users have an old Arrow version installed.
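A hypothetical sketch of the kind of version guard being discussed; the minimum version string, function name, and error message are illustrative, not the actual Spark check:

from distutils.version import LooseVersion

def require_minimum_pyarrow_version(minimum="0.8.0"):  # illustrative minimum version
    import pyarrow as pa
    if LooseVersion(pa.__version__) < LooseVersion(minimum):
        raise ImportError("pyarrow >= %s must be installed for Arrow optimizations; found %s"
                          % (minimum, pa.__version__))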
Test build #83233 has finished for PR 19459 at commit
I made SPARK-22417 for fixing reading from timestamps without Arrow.
Test build #83569 has started for PR 19459 at commit
Jenkins, retest this please.
Test build #83579 has finished for PR 19459 at commit
python/pyspark/serializers.py
Outdated
def _create_batch(series):
def _create_batch(series, copy=False):
Do we need copy here?
I might be missing something, but it looks like every occurrence of copy=copy in this method is already preceded by a copy from s.fillna(0), so we don't need to use copy=True.
Right, I forgot that fillna returns a copy. Do you think it would be worth it to first check for any nulls and only call fillna if needed? The mask of nulls is already created, so we would just need to add a function like this in _create_batch and call it before series.astype():

def fill_series(s, mask):
    return s.fillna(0) if mask.any() else s

What do you think @ueshin ?
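As a side note, a quick standalone pandas check of the copy behavior in question (plain pandas, independent of the PR): fillna(0) always returns a new Series, so guarding on the null mask skips the copy when there is nothing to fill.

import pandas as pd

def fill_series(s, mask):
    return s.fillna(0) if mask.any() else s

s_with_nulls = pd.Series([1.0, None, 3.0])
s_clean = pd.Series([1.0, 2.0, 3.0])

assert fill_series(s_with_nulls, s_with_nulls.isnull()) is not s_with_nulls  # nulls filled into a new Series
assert fill_series(s_clean, s_clean.isnull()) is s_clean                     # no nulls, no copy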
Hmm, I guess it depends.
With that method, it can reduce the number of copies if s doesn't include null values, but it might also increase the number if s includes null values and copy=True.
Yeah, we don't want to end up double copying if copy=True. Let me try something, and if it ends up making things too complicated then we can remove the copy flag altogether and just rely on fillna(0) to always make a copy - not ideal, but simpler.
@ueshin this ended up having no effect, so I took it out. For the case of Timestamps, the timezone conversions make a copy regardless. For the case of ints being promoted to floats, that means they have null values and need fillna(0), which makes a copy anyway. So it seems this only makes copies when necessary.
python/pyspark/serializers.py
Outdated
Create an Arrow record batch from the given pandas.Series or list of Series, with optional type.
:param series: A single pandas.Series, list of Series, or list of (series, arrow_type)
:param copy: Option to make a copy of the series before performing any type casts
We can remove this description.
Test build #83635 has finished for PR 19459 at commit
Test build #83647 has finished for PR 19459 at commit
@ueshin @HyukjinKwon does this look ready to merge? cc @cloud-fan
Looks pretty solid. Will take another look today (KST) and merge this one in a few days if there are no more comments and/or other committers are too busy to take a look and merge.
HyukjinKwon left a comment
LGTM otherwise
python/pyspark/serializers.py
Outdated
# NOTE: this is not necessary with Arrow >= 0.7
def cast_series(s, t):
    if type(t) == pa.TimestampType:
    if t is not None and type(t) == pa.TimestampType:
Hm, mind if I ask why t is not None was added? I thought None is NoneType and won't be pa.TimestampType anyway.
This doesn't seem to be needed anymore. It came from an error when comparing pyarrow type instances to None.
>>> import pyarrow as pa
>>> type(None) == pa.TimestampType
False
>>> None == pa.date32()
Segmentation fault
So this check is still needed right below when we check for date32(). I can't remember if this was fixed in current versions of pyarrow, but I'll add a note here.
verified this is not an issue with pyarrow >= 0.7.1
python/pyspark/sql/session.py
Outdated
    for pdf_slice in pdf_slices]

# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if schema is None or isinstance(schema, list):
isinstance(schema, list) -> isinstance(schema, (list, tuple)) maybe?
Maybe, a test like spark.createDataFrame([[1]], ("v",)) would be great.
Looks like this check should include tuples as well for converting from unicode?
https://github.com/apache/spark/pull/19459/files#diff-3b5463566251d5b09fd328738a9e9bc5L579
I'll change that since it's related.
python/pyspark/sql/session.py
Outdated
# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if schema is None or isinstance(schema, list):
    schema_from_arrow = from_arrow_schema(batches[0].schema)
    names = pdf.columns if schema is None else schema
Could we maybe just resemble this part (spark/python/pyspark/sql/session.py, lines 403 to 411 in 1d34104):

if schema is None or isinstance(schema, (list, tuple)):
    struct = self._inferSchemaFromList(data)
    converter = _create_converter(struct)
    data = map(converter, data)
    if isinstance(schema, (list, tuple)):
        for i, name in enumerate(schema):
            struct.fields[i].name = name
            struct.names[i] = name
    schema = struct
just to be more readable in a way?
if schema is None or isinstance(schema, (list, tuple)):
    struct = from_arrow_schema(batches[0].schema)
    if isinstance(schema, (list, tuple)):
        for i, name in enumerate(schema):
            struct.fields[i].name = name
            struct.names[i] = name
    schema = struct
Sure. We still need the line

schema = [str(x) for x in pdf.columns] if schema is None else schema

for the case when schema is None, because the pdf column names are lost when creating the Arrow RecordBatch from pandas.Series with _create_batch.
Otherwise, I think this makes it easier to follow too.
python/pyspark/sql/tests.py
Outdated
def test_createDataFrame_with_incorrect_schema(self):
    pdf = self.create_pandas_data_frame()
    wrong_schema = StructType([field for field in reversed(self.schema)])
Not a big deal at all: StructType([field for field in reversed(self.schema)]) -> StructType(list(reversed(st)))
def test_createDataFrame_with_names(self):
    pdf = self.create_pandas_data_frame()
    df = self.spark.createDataFrame(pdf, schema=list('abcdefg'))
    self.assertEquals(df.schema.fieldNames(), list('abcdefg'))
As said above, let's add a test with a tuple too.
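A possible sketch of that tuple test, reusing the helper and assertion style from the test quoted above (the test name is made up):

def test_createDataFrame_with_names_tuple(self):
    pdf = self.create_pandas_data_frame()
    df = self.spark.createDataFrame(pdf, schema=tuple('abcdefg'))
    self.assertEquals(df.schema.fieldNames(), list('abcdefg'))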
Test build #83703 has finished for PR 19459 at commit
retest this please
Test build #83761 has finished for PR 19459 at commit
# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if isinstance(schema, (list, tuple)):
    struct = from_arrow_schema(batches[0].schema)
@BryanCutler, I think here we'd meet the same issue, SPARK-15244 in this code path. Mind opening a followup with a simple test if it is true?
Sure, will do
Merged to master.
Thanks @HyukjinKwon @ueshin and @viirya!
What changes were proposed in this pull request?
This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input df is sliced according to the default parallelism. The optimization is enabled with the existing conf "spark.sql.execution.arrow.enabled" and is disabled by default.
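For context, a minimal usage sketch of the feature as described, assuming an active SparkSession named spark and pyarrow installed:

import numpy as np
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"])
df = spark.createDataFrame(pdf)  # converted via Arrow record batches when the conf is enabled
df.show(3)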
How was this patch tested?
Added a new unit test that creates a DataFrame with and without the optimization enabled, then compares the results.