[SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD #1598
QA tests have started for PR 1598. This patch merges cleanly.
Can you add [SQL] to these PRs as well?
QA results for PR 1598:
With this PR, what does a
A StructType is represented in Python as a namedtuple-like class called Row. Each Row class is generated from the schema; there is no predefined Row class, so it is better to keep it internal.
Conflicts: python/pyspark/sql.py sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
QA tests have started for PR 1598. This patch merges cleanly.
QA tests have started for PR 1598. This patch merges cleanly.
QA results for PR 1598:
QA results for PR 1598:
Conflicts: python/pyspark/sql.py
QA tests have started for PR 1598. This patch merges cleanly.
python/pyspark/sql.py
Outdated
Breaking the doc comment like this is kind of weird; could you instead do a for r in srdd2.collect(): print r and get one per line?
good idea!
QA results for PR 1598:
python/pyspark/sql.py
Outdated
You should expand this comment a bit, e.g. "Convert each object in the RDD to a Row with the right class for this SchemaRDD, so that fields can be accessed as attributes." Also this needs to appear in some kind of class comment at the top, e.g. say "This class receives raw tuples from Java but assigns a class to it in all its data-collection methods (mapPartitionsWithIndex, collect, take, etc) so that PySpark sees them as Row objects with named fields".
thx
Made some comments on it from the Python side. @JoshRosen you may also want to take a look at the named tuple / class generation stuff here.
python/pyspark/sql.py
Outdated
Should we change it to The data type representing rows.?
private[spark]?
The whole PythonRDD is private, so does it still need this?
Ah, I did not realize that. It could still perhaps be marked protected (to prevent other Spark users from depending on it directly), but that's not as big of a deal.
python/pyspark/sql.py
Outdated
It would be great to also add a SQL test here to make sure that the types match those expected by the execution engine, in particular the complex nested ones like dict and list (though we might need to change the names to avoid conflicts with reserved words, as we have not implemented identifier escaping). It would also be good to add a nested Row to the input types.
Something like:
srdd.registerAsTable("pythonData")
sqlCtx.sql("SELECT dict['a'], list[0], nested.nestedField FROM pythonData").collect() ...
This is looking really good to me! I'm very excited to have much more complete support for SQL in pyspark. A few minor comments on docs and testing, but I think we can merge this soon.
Should we update the part of case (c: java.util.Map[_, _], struct: StructType) as well?
Will case (c: java.util.Map[_, _], struct: StructType) happen with your change? How do we handle inner structs?
A Row() in Python will be converted into a tuple(), so it's fine to remove this case.
QA tests have started for PR 1598. This patch merges cleanly.
QA tests have started for PR 1598. This patch merges cleanly.
QA results for PR 1598:
QA results for PR 1598:
Thanks for working on this! I've merged it to master.
Convert each Row in JavaSchemaRDD into an Array[Any], unpickle it as a tuple in Python, and then convert it into a namedtuple, so users can access fields as attributes.
This allows nested structures to be accessed as objects; it also reduces the size of the serialized data and improves performance.
root
 |-- field1: integer (nullable = true)
 |-- field2: string (nullable = true)
 |-- field3: struct (nullable = true)
 |    |-- field4: integer (nullable = true)
 |    |-- field5: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |-- field6: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field7: string (nullable = true)
Then we can access them by row.field3.field5[0] or row.field6[5].field7
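The attribute-style access described above can be sketched in plain Python, without Spark. This is only an illustration of the idea (raw tuples wrapped in namedtuple classes derived from a schema), not PySpark's actual implementation; the `make_row` helper and the list/tuple schema encoding are hypothetical names invented for this sketch.

```python
from collections import namedtuple

# Hypothetical schema encoding (not PySpark's): a struct is a list of
# (field_name, field_type) pairs; an array is ("array", element_type);
# any other value (e.g. "integer") is treated as a leaf type.
SCHEMA = [
    ("field1", "integer"),
    ("field2", "string"),
    ("field3", [("field4", "integer"), ("field5", ("array", "integer"))]),
    ("field6", ("array", [("field7", "string")])),
]


def make_row(schema, value):
    """Recursively wrap raw tuples/lists in namedtuples that follow `schema`."""
    if value is None:
        return None
    if isinstance(schema, list):  # struct: build a namedtuple class for it
        names = [name for name, _ in schema]
        row_class = namedtuple("Row", names)
        fields = [make_row(ftype, v) for (_, ftype), v in zip(schema, value)]
        return row_class(*fields)
    if isinstance(schema, tuple) and schema[0] == "array":
        return [make_row(schema[1], v) for v in value]
    return value  # leaf value: returned unchanged


# A raw tuple, shaped the way unpickled data might arrive from the JVM side.
raw = (1, "a", (2, [10, 20]), [("x",), ("y",)])
row = make_row(SCHEMA, raw)
print(row.field3.field5[0])  # 10
print(row.field6[1].field7)  # y
```

Building a new class on every call is wasteful; the commit list for this PR mentions a cache of generated classes, which a real implementation would want.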
The schema is also inferred in Python: Row, dict, namedtuple, and plain objects are converted into tuples before serialization, and then applySchema is called in the JVM. During inferSchema(), a dict at the top level of a row is inferred as a StructType, but any nested dict is inferred as a MapType.
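The top-level versus nested rule for dicts can be illustrated with a small pure-Python sketch (Python 3). The `infer_type` function and the string type descriptions it returns are hypothetical and exist only for this example; the real inference logic lives in pyspark/sql.py and may differ in detail.

```python
def infer_type(value, top_level=False):
    """Return a readable type description for a Python value (illustrative only)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        if top_level:
            # Top-level dict: each key becomes a named field of a struct.
            fields = ", ".join(
                "{}: {}".format(k, infer_type(v)) for k, v in sorted(value.items())
            )
            return "struct<{}>".format(fields)
        if not value:
            return "map<?, ?>"
        # Nested dict: treated as a map; types are taken from the first entry.
        key, val = next(iter(value.items()))
        return "map<{}, {}>".format(infer_type(key), infer_type(val))
    if isinstance(value, (list, tuple)):
        if not value:
            return "array<?>"
        return "array<{}>".format(infer_type(value[0]))
    raise TypeError("cannot infer type of {!r}".format(value))


record = {"name": "Alice", "scores": {"math": 90}, "tags": ["a", "b"]}
print(infer_type(record, top_level=True))
```

The same dict shape therefore yields two different types depending on where it appears: `record` becomes a struct, while `record["scores"]` becomes a map.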
You can use pyspark.sql.Row to convert unnamed structures into Row objects so that the schema of the RDD can be inferred. For example:
ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
Or you could use Row to create a class just like namedtuple, for example:
Person = Row("name", "age")
ctx.inferSchema(rdd.map(lambda x: Person(*x)))
Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
ctx.applySchema(rdd, schema)
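The commit list below mentions verifying the schema against the first few rows. A rough pure-Python sketch of what such a check could look like follows; it is an assumption about the general approach, not the code in this PR. The `find_schema_errors` function, the `(name, python_type, nullable)` schema encoding, and the `sample_size` parameter are all hypothetical.

```python
def find_schema_errors(rows, schema, sample_size=10):
    """Check the first `sample_size` rows against `schema`; return error messages.

    `schema` is a list of (name, python_type, nullable) triples, a stand-in
    for a StructType used only in this sketch.
    """
    errors = []
    for index, row in enumerate(rows[:sample_size]):
        if len(row) != len(schema):
            errors.append(
                "row {}: expected {} fields, got {}".format(index, len(schema), len(row))
            )
            continue
        for value, (name, expected, nullable) in zip(row, schema):
            if value is None:
                if not nullable:
                    errors.append("row {}: field {} is not nullable".format(index, name))
            elif not isinstance(value, expected):
                errors.append(
                    "row {}: field {} expected {}, got {}".format(
                        index, name, expected.__name__, type(value).__name__
                    )
                )
    return errors


schema = [("name", str, True), ("age", int, True)]
good_rows = [("Alice", 30), ("Bob", None)]
bad_rows = [("Carol", "thirty")]
print(find_schema_errors(good_rows, schema))
print(find_schema_errors(bad_rows, schema))
```

Checking only a sample keeps the cost low on the driver, at the price of missing mismatches that occur later in the data.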
PS: In order to use a namedtuple with inferSchema, you should make the namedtuple picklable.
Author: Davies Liu <[email protected]>
Closes apache#1598 from davies/nested and squashes the following commits:
f1d15b6 [Davies Liu] verify schema with the first few rows
8852aaf [Davies Liu] check type of schema
abe9e6e [Davies Liu] address comments
61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
1e5b801 [Davies Liu] improve cache of classes
51aa135 [Davies Liu] use Row to infer schema
e9c0d5c [Davies Liu] remove string typed schema
353a3f2 [Davies Liu] fix code style
63de8f8 [Davies Liu] fix typo
c79ca67 [Davies Liu] fix serialization of nested data
6b258b5 [Davies Liu] fix pep8
9d8447c [Davies Liu] apply schema provided by string of names
f5df97f [Davies Liu] refactor, address comments
9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
0eaaf56 [Davies Liu] fix doc tests
b3559b4 [Davies Liu] use generated Row instead of namedtuple
c4ddc30 [Davies Liu] fix conflict between name of fields and variables
7f6f251 [Davies Liu] address all comments
d69d397 [Davies Liu] refactor
2cc2d45 [Davies Liu] refactor
182fb46 [Davies Liu] refactor
bc6e9e1 [Davies Liu] switch to new Schema API
547bf3e [Davies Liu] Merge branch 'master' into nested
a435b5a [Davies Liu] add docs and code refactor
2c8debc [Davies Liu] Merge branch 'master' into nested
644665a [Davies Liu] use tuple and namedtuple for schemardd