[SPARK-20431][SQL] Specify a schema by using a DDL-formatted string #17719
Conversation
Test build #76034 has finished for PR 17719 at commit

Test build #76037 has finished for PR 17719 at commit

cc: @gatorsmile
python/pyspark/sql/readwriter.py
Outdated
    elif isinstance(schema, basestring):
        self._jreader = self._jreader.schema(schema)
    else:
        raise TypeError("schema should be StructType")
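For context, the dispatch this diff adds can be sketched standalone. This is a minimal sketch, not the real implementation: `StructType` and `DataFrameReader` below are simplified stand-ins for the PySpark classes, and the JVM `_jreader` forwarding is reduced to storing the value.

```python
# Minimal sketch of the type dispatch the diff adds: accept either a
# StructType-like object or a DDL-formatted string. The classes here are
# simplified stand-ins for the real PySpark ones (no JVM gateway involved).

class StructType:
    """Stand-in for pyspark.sql.types.StructType."""
    def __init__(self, fields=None):
        self.fields = fields or []

class DataFrameReader:
    """Stand-in for pyspark.sql.readwriter.DataFrameReader."""
    def __init__(self):
        self._schema = None

    def schema(self, schema):
        if isinstance(schema, StructType):
            self._schema = schema      # pass the StructType through, as before
        elif isinstance(schema, str):  # basestring on Python 2
            self._schema = schema      # forward the DDL string for parsing
        else:
            raise TypeError("schema should be StructType or a DDL-formatted string")
        return self                    # keep the builder-style chaining
```

The `else` branch is what the comment below asks to update: once strings are accepted, the error message should mention both accepted forms.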
Update this message?
yea, I'll do
     *
     * @since 2.3.0
     */
    def schema(schemaString: String): DataFrameReader = {
This change will make PySpark API inconsistent with the Scala API
Sorry, but I probably missed your point. What's the API inconsistency you're pointing out here?
I just made the Python APIs consistent with the Scala ones, like:
--- python
>>> from pyspark.sql.types import *
>>> fields = [StructField('a', IntegerType(), True), StructField('b', StringType(), True), StructField('c', DoubleType(), True)]
>>> schema = StructType(fields)
>>> spark.read.schema(schema).csv("/Users/maropu/Desktop/test.csv").show()
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
>>> spark.read.schema("a INT, b STRING, c DOUBLE").csv("/Users/maropu/Desktop/test.csv").show()
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
--- scala
scala> import org.apache.spark.sql.types._
scala> val fields = StructField("a", IntegerType) :: StructField("b", StringType) :: StructField("c", DoubleType) :: Nil
scala> val schema = StructType(fields)
scala> spark.read.schema(schema).csv("/Users/maropu/Desktop/test.csv").show
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
scala> spark.read.schema("a INT, b STRING, c DOUBLE").csv("/Users/maropu/Desktop/test.csv").show
+---+----+---+
| a| b| c|
+---+----+---+
| 1| aaa|0.3|
+---+----+---+
Sorry, I misread the Python code.
Test build #76061 has finished for PR 17719 at commit

@gatorsmile ping
1 similar comment
@gatorsmile ping
        inference step, and thus speed up data loading.
-   :param schema: a :class:`pyspark.sql.types.StructType` object
+   :param schema: a :class:`pyspark.sql.types.StructType` object or a DDL-formatted string
Could you give an example here to users?
ok
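As an aside, the shape of the DDL-formatted string the docstring now mentions can be illustrated with a toy parser. This is illustrative only: Spark's actual parsing happens in `CatalystSqlParser` on the JVM side, which additionally handles nested and complex types, backquoted names, and more that this sketch ignores.

```python
# Toy illustration of a flat DDL-formatted schema string such as
# "a INT, b STRING, c DOUBLE". Spark's real parser (CatalystSqlParser)
# supports much more: ARRAY/MAP/STRUCT types, backquoted identifiers, etc.

def parse_flat_ddl(ddl):
    """Split a flat DDL schema string into (name, type) pairs."""
    fields = []
    for column in ddl.split(","):
        name, type_name = column.strip().split(None, 1)
        fields.append((name, type_name.strip().upper()))
    return fields

print(parse_flat_ddl("a INT, b STRING, c DOUBLE"))
# [('a', 'INT'), ('b', 'STRING'), ('c', 'DOUBLE')]
```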
In PySpark, we have multiple ways to specify the schema. However, it sounds like we silently pick one of the input schemas. Could you submit a separate PR to fix the potential schema conflicts in PySpark? Thanks!
Sure, I'd love to, though I probably missed your point. What's the scenario of the conflict you described? The current logic in …
Test build #76773 has finished for PR 17719 at commit

Test build #76774 has finished for PR 17719 at commit

nvm. I will fix it later. Thanks!

LGTM

Thanks! Merging to master.
## What changes were proposed in this pull request?

This pr supported a DDL-formatted string in `DataFrameReader.schema`. This fix could make users easily define a schema without importing `o.a.spark.sql.types._`.

## How was this patch tested?

Added tests in `DataFrameReaderWriterSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes apache#17719 from maropu/SPARK-20431.