[SPARK-20431][SQL] Specify a schema by using a DDL-formatted string #17719
Conversation
Test build #76034 has finished for PR 17719 at commit

Test build #76037 has finished for PR 17719 at commit

cc: @gatorsmile
python/pyspark/sql/readwriter.py
Outdated
    elif isinstance(schema, basestring):
        self._jreader = self._jreader.schema(schema)
    else:
        raise TypeError("schema should be StructType")
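For context, the dispatch this diff adds can be sketched standalone. This is a minimal sketch, not the real implementation: `StructType` and `DataFrameReader` below are simplified stand-ins for the PySpark classes, and the JVM `_jreader` forwarding is reduced to storing the value.

```python
# Minimal sketch of the type dispatch the diff adds: accept either a
# StructType-like object or a DDL-formatted string. The classes here are
# simplified stand-ins for the real PySpark ones (no JVM gateway involved).

class StructType:
    """Stand-in for pyspark.sql.types.StructType."""
    def __init__(self, fields=None):
        self.fields = fields or []

class DataFrameReader:
    """Stand-in for pyspark.sql.readwriter.DataFrameReader."""
    def __init__(self):
        self._schema = None

    def schema(self, schema):
        if isinstance(schema, StructType):
            self._schema = schema      # pass the StructType through, as before
        elif isinstance(schema, str):  # basestring on Python 2
            self._schema = schema      # forward the DDL string for parsing
        else:
            raise TypeError("schema should be StructType or a DDL-formatted string")
        return self                    # keep the builder-style chaining
```

The `else` branch is what the comment below asks to update: once strings are accepted, the error message should mention both accepted forms.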
Update this message?
yea, I'll do
     *
     * @since 2.3.0
     */
    def schema(schemaString: String): DataFrameReader = {
This change will make PySpark API inconsistent with the Scala API
Sorry, but I probably missed your point. What's the API inconsistency you're pointing out here?
I just made the Python APIs consistent with the Scala ones, like:
--- python
>>> from pyspark.sql.types import *
>>> fields = [StructField('a', IntegerType(), True), StructField('b', StringType(), True), StructField('c', DoubleType(), True)]
>>> schema = StructType(fields)
>>> spark.read.schema(schema).csv("/Users/maropu/Desktop/test.csv").show()
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
>>> spark.read.schema("a INT, b STRING, c DOUBLE").csv("/Users/maropu/Desktop/test.csv").show()
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
--- scala
scala> import org.apache.spark.sql.types._
scala> val fields = StructField("a", IntegerType) :: StructField("b", StringType) :: StructField("c", DoubleType) :: Nil
scala> val schema = StructType(fields)
scala> spark.read.schema(schema).csv("/Users/maropu/Desktop/test.csv").show
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
scala> spark.read.schema("a INT, b STRING, c DOUBLE").csv("/Users/maropu/Desktop/test.csv").show
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
Sorry, I misread the Python code.
Test build #76061 has finished for PR 17719 at commit

@gatorsmile ping
1 similar comment
@gatorsmile ping
        inference step, and thus speed up data loading.
-   :param schema: a :class:`pyspark.sql.types.StructType` object
+   :param schema: a :class:`pyspark.sql.types.StructType` object or a DDL-formatted string
Could you give an example here to users?
ok
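As an aside, the shape of the DDL-formatted string the docstring now mentions can be illustrated with a toy parser. This is illustrative only: Spark's actual parsing happens in `CatalystSqlParser` on the JVM side, which additionally handles nested and complex types, backquoted names, and more that this sketch ignores.

```python
# Toy illustration of a flat DDL-formatted schema string such as
# "a INT, b STRING, c DOUBLE". Spark's real parser (CatalystSqlParser)
# supports much more: ARRAY/MAP/STRUCT types, backquoted identifiers, etc.

def parse_flat_ddl(ddl):
    """Split a flat DDL schema string into (name, type) pairs."""
    fields = []
    for column in ddl.split(","):
        name, type_name = column.strip().split(None, 1)
        fields.append((name, type_name.strip().upper()))
    return fields

print(parse_flat_ddl("a INT, b STRING, c DOUBLE"))
# [('a', 'INT'), ('b', 'STRING'), ('c', 'DOUBLE')]
```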
In PySpark, we have multiple ways to specify the schema. However, it sounds like we silently pick one of the input schemas. Could you submit a separate PR to fix the potential schema conflicts in PySpark? Thanks!
Sure, I'd love to, though I probably missed your point. What's the scenario of the conflict you described? The current logic in …
Test build #76773 has finished for PR 17719 at commit

Test build #76774 has finished for PR 17719 at commit

nvm. I will fix it later. Thanks!

LGTM

Thanks! Merging to master.
## What changes were proposed in this pull request?

This pr supported a DDL-formatted string in `DataFrameReader.schema`. This fix could make users easily define a schema without importing `o.a.spark.sql.types._`.

## How was this patch tested?

Added tests in `DataFrameReaderWriterSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes apache#17719 from maropu/SPARK-20431.