
Conversation

@BryanCutler BryanCutler commented Oct 9, 2017

What changes were proposed in this pull request?

This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input df is sliced according to the default parallelism. The optimization is enabled with the existing conf "spark.sql.execution.arrow.enabled" and is disabled by default.

How was this patch tested?

Added a new unit test that creates a DataFrame with and without the optimization enabled, then compares the results.

SparkQA commented Oct 9, 2017

Test build #82559 has finished for PR 19459 at commit e9c6de7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

BryanCutler commented Oct 9, 2017

Benchmarks run in local mode on 16 GB memory, i7-4800MQ CPU @ 2.70GHz × 8 cores,
using the default Spark configuration.
Data is 10 columns of doubles with 100,000 rows.

Code:

import pandas as pd
import numpy as np
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))
%timeit spark.createDataFrame(pdf)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
%timeit spark.createDataFrame(pdf)

Without Arrow:
1 loop, best of 3: 7.21 s per loop

With Arrow:
10 loops, best of 3: 30.6 ms per loop

Speedup of ~ 235x

Also tested creating up to 2 million rows with Arrow; the results scale linearly.
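
For reference, the scaling check can be reproduced by rerunning the conversion at increasing row counts. A rough sketch (the row sizes are only illustrative, and it assumes the same session and Arrow conf as the benchmark above):

import time

for n in [100000, 500000, 1000000, 2000000]:
    pdf_n = pd.DataFrame(np.random.rand(n, 10), columns=list("abcdefghij"))
    start = time.time()
    spark.createDataFrame(pdf_n)   # with Arrow enabled the conversion happens eagerly here
    print("%d rows: %.3f s" % (n, time.time() - start))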

SparkQA commented Oct 9, 2017

Test build #82560 has finished for PR 19459 at commit 06b033f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

pdf = self.createPandasDataFrameFromeData()
self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
df_no_arrow = self.spark.createDataFrame(pdf)
self.spark.conf.set("spark.sql.execution.arrow.enable", "true")

@HyukjinKwon HyukjinKwon Oct 9, 2017

I'd set this back to true in a finally block, just in case the test fails at df_no_arrow = self.spark.createDataFrame(pdf) and spark.sql.execution.arrow.enable stays false, affecting other test cases in the future (unless I missed something).
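
A minimal sketch of that pattern (the test name is assumed, and it uses the final conf name spark.sql.execution.arrow.enabled plus the helper from this test class):

def test_createDataFrame_toggle_arrow(self):
    pdf = self.create_pandas_data_frame()   # helper assumed from this test class
    self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    try:
        df_no_arrow = self.spark.createDataFrame(pdf)
    finally:
        # restore the conf even if createDataFrame raises, so later tests keep using Arrow
        self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df_arrow = self.spark.createDataFrame(pdf)
    self.assertEquals(df_no_arrow.collect(), df_arrow.collect())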

Member Author

done. I guess this would make the failure easier to see?

Member Author

Hmmm, I thought the tearDownClass was there but it's actually in #18664. Maybe I should put it in here since that needs some more discussion.

}
}

def toDataFrame(

Member Author

I had to make this public to be callable with py4j. Alternatively, something could be added to o.a.s.sql.api.python.PythonSQLUtils?

@HyukjinKwon HyukjinKwon Oct 10, 2017

Yup, I think we should put it there, o.a.s.sql.api.python.PythonSQLUtils.

Member Author

I left the conversion logic in ArrowConverters because I think there is a good chance it will change, so I just added a wrapper in PythonSQLUtils. Let me know if that's ok.

SparkQA commented Oct 10, 2017

Test build #82573 has finished for PR 19459 at commit 9d667c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if schema is None:
    schema = [str(x) for x in data.columns]
data = [r.tolist() for r in data.to_records(index=False)]
if self.conf.get("spark.sql.execution.arrow.enable", "false").lower() == "true" \

Member

The config name was modified to spark.sql.execution.arrow.enabled at d29d1e8 and af8a34c.

Member Author

Oh thanks, I didn't see that go in. I'll update.

SparkQA commented Oct 10, 2017

Test build #82601 has finished for PR 19459 at commit c7ddee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("\n\nWithout:\n%s\n%s" % (df_without, df_without.dtypes)))
self.assertTrue(df_without.equals(df_with_arrow), msg=msg)

def createPandasDataFrameFromeData(self):

Member

nit: typo createPandasDataFrameFromeData -> createPandasDataFrameFromData

os.unlink(tempFile.name)

# Create the Spark DataFrame, there will be at least 1 batch
schema = from_arrow_schema(batches[0].schema)

Member

What if a user specifies the schema?

Member Author

Good point. We can pass the schema, if provided, into to_pandas for pyarrow to use when creating the RecordBatch.

data = [r.tolist() for r in data.to_records(index=False)]
if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
        and len(data) > 0:
    from pyspark.serializers import ArrowSerializer

Member

Maybe we should split this block into a method like _createFromPandasDataFrame, the same as the other create methods?

@BryanCutler BryanCutler Oct 13, 2017

That's probably a good idea since it's a big block of code. The other create methods return a (rdd, schema) pair, then do further processing to create a DataFrame. Here we would have to just return a DataFrame since we don't want to do the further processing.

@BryanCutler BryanCutler Oct 13, 2017

Oh wait, without Arrow it creates an (rdd, schema) pair like the others, so handling both the Arrow and non-Arrow paths in _createFromPandasDataFrame doesn't fit because they return different things. How about just putting the new Arrow code in a method like _createFromArrowPandas?

@HyukjinKwon HyukjinKwon left a comment

Looks pretty good to me except for those comments.


# Slice the DataFrame into batches
split = -(-len(data) // self.sparkContext.defaultParallelism) # round int up
slices = (data[i:i + split] for i in xrange(0, len(data), split))

Member

How about split -> size (or length) and i -> offset (or start)?

Member Author

sure, how about size, start and step? That looks like the terminology used in parallelize

Member

Yea, sounds fine.
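
With those names, the slicing could read roughly like this (a sketch, assuming pdf is the input pandas.DataFrame):

# ceil division: e.g. -(-7 // 2) == 4, so every row lands in exactly one slice
size = -(-len(pdf) // self.sparkContext.defaultParallelism)
pdf_slices = (pdf[start:start + size] for start in xrange(0, len(pdf), size))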


# write batches to temp file, read by JVM (borrowed from context.parallelize)
import os
from tempfile import NamedTemporaryFile

Member

I'd put those imports above with import pyarrow as pa, or at the top of this file ...

jdf = self._jvm.PythonSQLUtils.arrowPayloadToDataFrame(
    jrdd, schema.json(), self._wrapped._jsqlContext)
df = DataFrame(jdf, self._wrapped)
df._schema = schema

Member

I'd leave some comments here about why _schema should be manually set rather than using its own schema. Actually, mind if I ask why we should set this manually?

Member Author

If the schema is not set here, it will be lazily created through a py4j exchange with the Java DataFrame. Since we already have it here, we can just set it and save some time. I don't like manually setting it like this though; it should be an optional arg in the DataFrame constructor. I'll make that change, but if you prefer not to do that I can revert.

Member

Ahh, okay, that's fine to me.

@BryanCutler
Member Author

Thanks for the reviews @ueshin and @HyukjinKwon! I added to_arrow_schema conversion for when a schema is passed into createDataFrame and added some new tests to verify it. Please take another look when you can, thanks!


def _createFromPandasWithArrow(self, df, schema):
"""
Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting

@viirya viirya Oct 15, 2017

typo: slicing the ...

Member Author

Thanks! fixed

for df_slice in df_slices]

# write batches to temp file, read by JVM (borrowed from context.parallelize)
tempFile = NamedTemporaryFile(delete=False, dir=self._sc._temp_dir)

Member

This looks like it somewhat duplicates the main logic of context.parallelize. Maybe we can extract a common function from it.

@BryanCutler BryanCutler Oct 16, 2017

Yeah, it is - I didn't want to mess around with the parallelize() logic so I left it alone. If we were to make a common function it would look like this

def _dump_to_tempfile(self, data, serializer, parallelism):
    tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    try:
        serializer.dump_stream(data, tempFile)
        tempFile.close()
        readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
        return readRDDFromFile(self._jsc, tempFile.name, parallelism)
    finally:
        # readRDDFromFile eagerly reads the file, so we can delete it right after.
        os.unlink(tempFile.name)

and some changes to parallelize to call it

# Make sure we distribute data evenly if it's smaller than self.batchSize
if "__len__" not in dir(c):
    c = list(c)    # Make it a list so we can compute its length
batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
jrdd = self._dump_to_tempfile(c, serializer, numSlices)

Let me know if you all think we should make this change.

Member

I'd prefer less duplication. Let's see if others support it.

Member

Sounds good to me.

data = [schema.toInternal(row) for row in data]
return self._sc.parallelize(data), schema

def _createFromPandasWithArrow(self, df, schema):

Member

nit: df -> pdf.

SparkQA commented Oct 15, 2017

Test build #82764 has finished for PR 19459 at commit f42e351.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya commented Oct 15, 2017

LGTM with a few minor comments.

self._sc = sql_ctx and sql_ctx._sc
self.is_cached = False
self._schema = None # initialized lazily
self._schema = schema # initialized lazily if None

Member

@BryanCutler, what do you think about taking this back out? Maybe I am worrying too much, but I think we should avoid assigning it except for a few special cases ...

Member Author

Sure, I can undo it. I don't really like manually assigning the schema after the constructor but it's just done in these 2 special cases..

def arrowPayloadToDataFrame(
payloadRDD: JavaRDD[Array[Byte]],
schemaString: String,
sqlContext: SQLContext): DataFrame = {

Member

I can't believe I found this. It looks like 5 spaces instead of 4.

Member Author

oh man, good catch! I don't know how that happened :\

@cloud-fan
Contributor

I think it is a bug, we should fix it first.

BTW I'm fine with upgrading Arrow; just make sure we get everything we need at the Arrow version we want to upgrade to, then remove all the hacks on the Spark side. We should throw an exception if users have an old Arrow version installed.
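
A rough sketch of the kind of guard being described (the function name and minimum version here are only placeholders, not what was decided in this thread):

from distutils.version import LooseVersion

def require_minimum_pyarrow_version(minimum="0.8.0"):   # placeholder minimum version
    import pyarrow
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum):
        raise ImportError("pyarrow >= %s must be installed for Arrow optimizations; "
                          "found %s" % (minimum, pyarrow.__version__))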

SparkQA commented Oct 31, 2017

Test build #83233 has finished for PR 19459 at commit cfb1c3d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

I made SPARK-22417 for fixing reading from timestamps without arrow

SparkQA commented Nov 7, 2017

Test build #83569 has started for PR 19459 at commit 99ce1e4.

ueshin commented Nov 8, 2017

Jenkins, retest this please.

SparkQA commented Nov 8, 2017

Test build #83579 has finished for PR 19459 at commit 99ce1e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



def _create_batch(series):
def _create_batch(series, copy=False):

Member

Do we need copy here?
I might be missing something, but it looks like every occurrence of copy=copy in this method is preceded by s.fillna(0), which already copies, so we don't need to use copy=True.

@BryanCutler BryanCutler Nov 8, 2017

Right, I forgot that fillna returns a copy. Do you think it would be worth it to first check for any nulls and only fillna if needed? The mask of nulls is already created so just need to add a function like this in _create_batch and call before series.astype():

def fill_series(s, mask):
    return s.fillna(0) if mask.any() else s

What do you think @ueshin ?

Member

Hmm, I guess it depends.
With that method, it can reduce the number of copies if s doesn't include null values, but it might also increase the number if s includes null values and copy=True.

Member Author

Yeah, we don't want to end up double copying if copy=True. Let me try something, and if it ends up making things too complicated then we can remove the copy flag altogether and just rely on fillna(0) to always make a copy; not ideal, but simpler.

Member Author

@ueshin this ended up having no effect, so I took it out. For Timestamps, the timezone conversions will make a copy regardless. For ints being promoted to floats, that means they had null values and fillna(0) is called, which makes a copy anyway. So it seems copies are only made when necessary.
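
A toy illustration of the pandas behavior being relied on here; fillna (without inplace) always returns a new Series, so the explicit copy flag was redundant:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0)
assert filled is not s        # fillna returned a new Series, i.e. a copy
assert np.isnan(s[1])         # the original Series is left untouched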

Create an Arrow record batch from the given pandas.Series or list of Series, with optional type.
:param series: A single pandas.Series, list of Series, or list of (series, arrow_type)
:param copy: Option to make a copy of the series before performing any type casts

Member

We can remove this description.

SparkQA commented Nov 9, 2017

Test build #83635 has finished for PR 19459 at commit 421d0be.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 9, 2017

Test build #83647 has finished for PR 19459 at commit 0ad736b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@ueshin @HyukjinKwon does this look ready to merge? cc @cloud-fan

@HyukjinKwon
Member

Looks pretty solid. Will take another look today (KST) and merge this one in a few days if there are no more comments and/or other committers are too busy to take a look and merge.

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise

# NOTE: this is not necessary with Arrow >= 0.7
def cast_series(s, t):
if type(t) == pa.TimestampType:
if t is not None and type(t) == pa.TimestampType:

Member

Hm, mind if I ask why t is not None was added? I thought None is NoneType and won't be pa.TimestampType anyway.

Member Author

This doesn't seem to be needed anymore. It came from an error when comparing pyarrow type instances to None.

>>> import pyarrow as pa
>>> type(None) == pa.TimestampType
False
>>> None == pa.date32()
Segmentation fault

So this check is still needed right below when we check for date32(). I can't remember if this was fixed in current versions of pyarrow, but I'll add a note here.

Member Author

verified this is not an issue with pyarrow >= 0.7.1

for pdf_slice in pdf_slices]

# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if schema is None or isinstance(schema, list):

Member

isinstance(schema, list) -> isinstance(schema, (list, tuple)) maybe?

Member

Maybe, a test like spark.createDataFrame([[1]], ("v",)) would be great.

Member Author

Looks like this check should include tuples as well for converting from unicode?
https://github.com/apache/spark/pull/19459/files#diff-3b5463566251d5b09fd328738a9e9bc5L579
I'll change that since it's related..
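
A sketch of the kind of tuple-schema test being suggested (the test name is assumed, and pandas is assumed to be imported as pd in the test module):

def test_createDataFrame_with_tuple_names(self):
    pdf = pd.DataFrame({'x': [1, 2]})
    df = self.spark.createDataFrame(pdf, schema=("v",))
    self.assertEqual(df.schema.fieldNames(), ["v"])
    self.assertEqual([r.v for r in df.collect()], [1, 2])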

# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if schema is None or isinstance(schema, list):
    schema_from_arrow = from_arrow_schema(batches[0].schema)
    names = pdf.columns if schema is None else schema

Member

Could we maybe just resemble

if schema is None or isinstance(schema, (list, tuple)):
    struct = self._inferSchemaFromList(data)
    converter = _create_converter(struct)
    data = map(converter, data)
    if isinstance(schema, (list, tuple)):
        for i, name in enumerate(schema):
            struct.fields[i].name = name
            struct.names[i] = name
    schema = struct

just to be more readable in a way?

if schema is None or isinstance(schema, (list, tuple)):
    struct = from_arrow_schema(batches[0].schema)
    if isinstance(schema, (list, tuple)):
        for i, name in enumerate(schema):
            struct.fields[i].name = name
            struct.names[i] = name
    schema = struct

Member Author

Sure. We still need the line

schema = [str(x) for x in pdf.columns] if schema is None else schema

for the case when schema is None, because the pdf column names are lost when creating the Arrow RecordBatch from the pandas.Series with _create_batch.
Otherwise, I think this makes it easier to follow too.
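
Putting those two pieces together, the block could read roughly like this (a sketch of the direction being discussed here, not the final code):

if schema is None or isinstance(schema, (list, tuple)):
    struct = from_arrow_schema(batches[0].schema)
    names = [str(x) for x in pdf.columns] if schema is None else schema
    for i, name in enumerate(names):
        struct.fields[i].name = name
        struct.names[i] = name
    schema = struct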

Member Author

hmm, actually that case can be pulled out of both here and the non-Arrow _create_from_pandas, which @ueshin brought up in #19646. This would simplify things quite a bit now, so I'll try that out.


def test_createDataFrame_with_incorrect_schema(self):
    pdf = self.create_pandas_data_frame()
    wrong_schema = StructType([field for field in reversed(self.schema)])

Member

Not a big deal at all: StructType([field for field in reversed(self.schema)]) -> StructType(list(reversed(self.schema)))

def test_createDataFrame_with_names(self):
    pdf = self.create_pandas_data_frame()
    df = self.spark.createDataFrame(pdf, schema=list('abcdefg'))
    self.assertEquals(df.schema.fieldNames(), list('abcdefg'))

Member

As said above, let's add a test with a tuple too.

SparkQA commented Nov 10, 2017

Test build #83703 has finished for PR 19459 at commit 6c72e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

SparkQA commented Nov 13, 2017

Test build #83761 has finished for PR 19459 at commit 6c72e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


# Create the Spark schema from the first Arrow batch (always at least 1 batch after slicing)
if isinstance(schema, (list, tuple)):
    struct = from_arrow_schema(batches[0].schema)

Member

@BryanCutler, I think we'd hit the same issue, SPARK-15244, in this code path. Mind opening a follow-up with a simple test if that's true?

Member Author

Sure, will do

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 209b936 Nov 13, 2017
@BryanCutler
Member Author

Thanks @HyukjinKwon @ueshin and @viirya !
