
Conversation

@adrian-wang (Contributor) commented Jul 13, 2016

What changes were proposed in this pull request?

In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 removed those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerDe, and the data written cannot be read back by the script. The added test case worked fine with previous versions of Spark, but fails now.

How was this patch tested?

Added a test case in SQLQuerySuite.
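
A minimal sketch of the intended fix, assuming a simple key/value conf (the getConfString stub below is a hypothetical stand-in for SQLConf's lookup, not Spark's actual class):

// Sketch: resolve the script record reader/writer against Hive's documented
// defaults when the keys are absent, instead of silently ending up with None.
def getConfString(settings: Map[String, String])(key: String, default: String): String =
  settings.getOrElse(key, default)

val settings = Map.empty[String, String] // Spark 2.0 no longer carries these keys
val lookup = getConfString(settings) _

val reader = lookup("hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
val writer = lookup("hive.script.recordwriter",
  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")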

@SparkQA commented Jul 13, 2016

Test build #62203 has finished for PR 14169 at commit 671a2ba.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 13, 2016

Test build #62205 has finished for PR 14169 at commit f493476.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang adrian-wang changed the title [SPARK-16515][SQL]set default record reader and writer for script transformation [WIP][SPARK-16515][SQL]set default record reader and writer for script transformation Jul 13, 2016
@adrian-wang (Contributor, Author):

This is strange, because the specific test passes on my local machine.

@adrian-wang (Contributor, Author):

I have updated my code and switched to using bash in the test case. Hopefully it will work on Jenkins.

@SparkQA commented Jul 13, 2016

Test build #62237 has finished for PR 14169 at commit e4c7e02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jameszhouyi:

Hi,
Cool! All of my script transformation cases passed after applying this PR. Could the Spark folks please review this code and merge the PR? Thanks a lot!

Best Regards
Yi

@adrian-wang adrian-wang changed the title [WIP][SPARK-16515][SQL]set default record reader and writer for script transformation [SPARK-16515][SQL]set default record reader and writer for script transformation Jul 14, 2016
@chenghao-intel (Contributor):

LGTM.

cc @yhuai @liancheng
This breaks existing applications that use the default delimiter; we have already verified this in TPCx-BB.

@jameszhouyi:

Hi Spark folks,
Could you please help review this PR so it can be merged into Spark 2.0.0? Thanks in advance!

Best Regards,
Yi

@rxin (Contributor) commented Jul 15, 2016

What do you mean by "Since Spark 2.0 removed those config keys from the Hive conf"?

@adrian-wang (Contributor, Author):

@rxin In Spark 2.0, those conf keys start with "hive." and have default values in HiveConf, but we can no longer pick up those defaults.

@chenghao-intel (Contributor):

HiveConf provides the default values org.apache.hadoop.hive.ql.exec.TextRecordReader and org.apache.hadoop.hive.ql.exec.TextRecordWriter for the keys hive.script.recordreader and hive.script.recordwriter respectively; however, SQLConf does not provide those keys, which means the defaults now resolve to null. This causes the backward incompatibility.
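
To illustrate the mismatch, a toy stand-in for the two conf sources (the maps below are hypothetical; the real classes are HiveConf and SQLConf):

val hiveDefaults = Map(
  "hive.script.recordreader" -> "org.apache.hadoop.hive.ql.exec.TextRecordReader",
  "hive.script.recordwriter" -> "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
val sqlConf = Map.empty[String, String] // Spark 2.0 no longer carries these keys

// The Hive-backed lookup finds a default; the SQLConf-backed lookup does not,
// so the record handler silently becomes None for LazySimpleSerDe.
val fromHive = hiveDefaults.get("hive.script.recordreader") // Some(...TextRecordReader)
val fromSpark = sqlConf.get("hive.script.recordreader")     // None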

@rxin (Contributor) commented Jul 15, 2016

Are all script transforms broken? Don't we already have a test case that actually runs script transforms?

@adrian-wang (Contributor, Author):

@rxin Only script transformation cases that use LazySimpleSerDe are affected.
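
For instance, a TRANSFORM query of roughly this shape exercises the affected path (illustrative only; the exact SQL is not quoted from the patch, and testData is a hypothetical table):

// Illustrative only: rows go through an external script ('cat') with
// LazySimpleSerDe on both sides, the code path hit by the missing defaults.
val affectedQuery =
  """SELECT TRANSFORM(a)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    |USING 'cat' AS (a STRING)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    |FROM testData""".stripMargin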


// SPARK-10310: Special cases LazySimpleSerDe
val recordHandler = if (name == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
  Try(conf.getConfString(configKey)).toOption
@yhuai (Contributor) commented Jul 16, 2016

Should we just use getConfString(key: String, defaultValue: String)? That defaultRecordHandler method seems unnecessary.

Contributor reply:

The default value differs for each key; do you mean to inline the defaultRecordHandler function?
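
Inlined, the check would look roughly like this (a sketch: name, configKey, and the getConfString function are stand-ins for the surrounding parser state, and defaultConfigValue is the per-key default supplied by the caller):

import scala.util.Try

def recordHandler(
    name: String,
    getConfString: (String, String) => String,
    configKey: String,
    defaultConfigValue: String): Option[String] =
  if (name == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
    // With an explicit default, the lookup can no longer come back empty.
    Try(getConfString(configKey, defaultConfigValue)).toOption
  } else {
    None
  }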

@rxin (Contributor) commented Jul 16, 2016

@jameszhouyi / @adrian-wang / @chenghao-intel

Related to this pull request: we want a native implementation of ScriptTransform that does not depend on Hive's serdes. Can you let me know what features are missing from the current native (non-LazySimpleSerDe) implementation? What does LazySimpleSerDe actually support that the built-in implementation does not?

}

val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(inRowFormat, "hive.script.recordreader")
Contributor:

Can we pass in the default value for the reader/writer? For example, format(inRowFormat, "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader") and format(outRowFormat, "hive.script.recordwriter", "org.apache.hadoop.hive.ql.exec.TextRecordWriter"). Then, in def format, we just use getConfString(key: String, defaultValue: String).
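
Sketched end to end, the suggestion threads a per-key default through format (RowFormatContext and the return type are stubbed stand-ins here; the real definitions live in Spark's SQL parser):

case class RowFormatContext(props: Map[String, String]) // stub for the parser context

// Stub format: only the reader/writer resolution is shown, not the full Format tuple.
def format(
    fmt: RowFormatContext,
    configKey: String,
    defaultConfigValue: String): Option[String] =
  Some(fmt.props.getOrElse(configKey, defaultConfigValue))

val inRowFormat = RowFormatContext(Map.empty)
val outRowFormat = RowFormatContext(Map.empty)

val reader = format(inRowFormat, "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
val writer = format(outRowFormat, "hive.script.recordwriter",
  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")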

@SparkQA commented Jul 18, 2016

Test build #62451 has finished for PR 14169 at commit 0edfed4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Decode and input/output format.
type Format = (Seq[(String, String)], Option[String], Seq[(String, String)], Option[String])
- def format(fmt: RowFormatContext, configKey: String): Format = fmt match {
+ def format(fmt: RowFormatContext, configKey: String, configValue: String): Format = fmt match {
Contributor:

The name should show that this value is a default value, right?

asfgit pushed a commit that referenced this pull request Jul 18, 2016
[SPARK-16515][SQL]set default record reader and writer for script transformation

## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 removed those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerDe, and the data written cannot be read back by the script. The added test case worked fine with previous versions of Spark, but fails now.

## How was this patch tested?
Added a test case in SQLQuerySuite.

Closes #14169

Author: Daoyuan Wang <[email protected]>
Author: Yin Huai <[email protected]>

Closes #14249 from yhuai/scriptTransformation.

(cherry picked from commit 96e9afa)
Signed-off-by: Yin Huai <[email protected]>
@asfgit closed this in 96e9afa on Jul 18, 2016