
Conversation

@adrian-wang (Contributor) commented Jul 13, 2016

What changes were proposed in this pull request?

In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 removed those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerDe, and the data written cannot be read back by the script. The added test case worked fine with previous versions of Spark, but fails now.

How was this patch tested?

Added a test case in SQLQuerySuite.
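
A minimal sketch of the intended fix, assuming a simple key/value conf (the getConfString stub below is a hypothetical stand-in for SQLConf's lookup, not Spark's actual class):

// Sketch: resolve the script record reader/writer against Hive's documented
// defaults when the keys are absent, instead of silently ending up with None.
def getConfString(settings: Map[String, String])(key: String, default: String): String =
  settings.getOrElse(key, default)

val settings = Map.empty[String, String] // Spark 2.0 no longer carries these keys
val lookup = getConfString(settings) _

val reader = lookup("hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
val writer = lookup("hive.script.recordwriter",
  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")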

@SparkQA commented Jul 13, 2016

Test build #62203 has finished for PR 14169 at commit 671a2ba.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 13, 2016

Test build #62205 has finished for PR 14169 at commit f493476.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang adrian-wang changed the title [SPARK-16515][SQL]set default record reader and writer for script transformation [WIP][SPARK-16515][SQL]set default record reader and writer for script transformation Jul 13, 2016
@adrian-wang (Contributor, Author):

This is strange, because the specific test passes on my local machine.

@adrian-wang (Contributor, Author):

I have updated my code and switched to using bash in the test case. Hopefully it will work on Jenkins.

@SparkQA commented Jul 13, 2016

Test build #62237 has finished for PR 14169 at commit e4c7e02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jameszhouyi:

Hi,
Cool! All of my script transformation cases passed after applying this PR. Could the Spark folks please review this code and merge the PR? Thanks a lot!

Best Regards
Yi

@adrian-wang adrian-wang changed the title [WIP][SPARK-16515][SQL]set default record reader and writer for script transformation [SPARK-16515][SQL]set default record reader and writer for script transformation Jul 14, 2016
@chenghao-intel (Contributor):

LGTM.

cc @yhuai @liancheng
This breaks existing applications that use the default delimiter; we have already verified this in TPCx-BB.

@jameszhouyi:

Hi Spark folks,
Could you please help review this PR so it can be merged into Spark 2.0.0? Thanks in advance!

Best Regards,
Yi

@rxin (Contributor) commented Jul 15, 2016

What do you mean by "Since Spark 2.0 removed those config keys from the Hive conf"?

@adrian-wang (Contributor, Author):

@rxin In Spark 2.0, those conf keys start with "hive." and have default values in HiveConf, but we can no longer pick up those defaults.

@chenghao-intel (Contributor):

HiveConf provides the default values org.apache.hadoop.hive.ql.exec.TextRecordReader and org.apache.hadoop.hive.ql.exec.TextRecordWriter for the keys hive.script.recordreader and hive.script.recordwriter respectively; however, SQLConf does not provide those keys, which means the defaults now resolve to null. This causes the backward incompatibility.
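
To illustrate the mismatch, a toy stand-in for the two conf sources (the maps below are hypothetical; the real classes are HiveConf and SQLConf):

val hiveDefaults = Map(
  "hive.script.recordreader" -> "org.apache.hadoop.hive.ql.exec.TextRecordReader",
  "hive.script.recordwriter" -> "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
val sqlConf = Map.empty[String, String] // Spark 2.0 no longer carries these keys

// The Hive-backed lookup finds a default; the SQLConf-backed lookup does not,
// so the record handler silently becomes None for LazySimpleSerDe.
val fromHive = hiveDefaults.get("hive.script.recordreader") // Some(...TextRecordReader)
val fromSpark = sqlConf.get("hive.script.recordreader")     // None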

@rxin (Contributor) commented Jul 15, 2016

Are all script transforms broken? Don't we already have a test case that actually runs script transforms?

@adrian-wang (Contributor, Author):

@rxin Only script transformation cases that use LazySimpleSerDe are affected.
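
For instance, a TRANSFORM query of roughly this shape exercises the affected path (illustrative only; the exact SQL is not quoted from the patch, and testData is a hypothetical table):

// Illustrative only: rows go through an external script ('cat') with
// LazySimpleSerDe on both sides, the code path hit by the missing defaults.
val affectedQuery =
  """SELECT TRANSFORM(a)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    |USING 'cat' AS (a STRING)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    |FROM testData""".stripMargin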


// SPARK-10310: Special cases LazySimpleSerDe
val recordHandler = if (name == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
  Try(conf.getConfString(configKey)).toOption
@yhuai (Contributor) commented Jul 16, 2016

Should we just use getConfString(key: String, defaultValue: String)? That defaultRecordHandler method seems unnecessary.

Contributor reply:

The default value differs for each key; do you mean to inline the defaultRecordHandler function?
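
Inlined, the check would look roughly like this (a sketch: name, configKey, and the getConfString function are stand-ins for the surrounding parser state, and defaultConfigValue is the per-key default supplied by the caller):

import scala.util.Try

def recordHandler(
    name: String,
    getConfString: (String, String) => String,
    configKey: String,
    defaultConfigValue: String): Option[String] =
  if (name == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
    // With an explicit default, the lookup can no longer come back empty.
    Try(getConfString(configKey, defaultConfigValue)).toOption
  } else {
    None
  }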

@rxin (Contributor) commented Jul 16, 2016

@jameszhouyi / @adrian-wang / @chenghao-intel

Related to this pull request: we want a native implementation of ScriptTransform that does not depend on Hive's serdes. Can you let me know what features are missing from the current native (non-LazySimpleSerDe) implementation? What does LazySimpleSerDe actually support that the built-in implementation does not?

}

val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(inRowFormat, "hive.script.recordreader")
Contributor:

Can we pass in the default value for the reader/writer? For example, format(inRowFormat, "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader") and format(outRowFormat, "hive.script.recordwriter", "org.apache.hadoop.hive.ql.exec.TextRecordWriter"). Then, in def format, we just use getConfString(key: String, defaultValue: String).
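
Sketched end to end, the suggestion threads a per-key default through format (RowFormatContext and the return type are stubbed stand-ins here; the real definitions live in Spark's SQL parser):

case class RowFormatContext(props: Map[String, String]) // stub for the parser context

// Stub format: only the reader/writer resolution is shown, not the full Format tuple.
def format(
    fmt: RowFormatContext,
    configKey: String,
    defaultConfigValue: String): Option[String] =
  Some(fmt.props.getOrElse(configKey, defaultConfigValue))

val inRowFormat = RowFormatContext(Map.empty)
val outRowFormat = RowFormatContext(Map.empty)

val reader = format(inRowFormat, "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
val writer = format(outRowFormat, "hive.script.recordwriter",
  "org.apache.hadoop.hive.ql.exec.TextRecordWriter")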

@SparkQA commented Jul 18, 2016

Test build #62451 has finished for PR 14169 at commit 0edfed4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Decode and input/output format.
type Format = (Seq[(String, String)], Option[String], Seq[(String, String)], Option[String])
- def format(fmt: RowFormatContext, configKey: String): Format = fmt match {
+ def format(fmt: RowFormatContext, configKey: String, configValue: String): Format = fmt match {
Contributor:

The name should show that this value is a default value, right?

asfgit pushed a commit that referenced this pull request Jul 18, 2016
[SPARK-16515][SQL]set default record reader and writer for script transformation

## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 removed those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerDe, and the data written cannot be read back by the script. The added test case worked fine with previous versions of Spark, but fails now.

## How was this patch tested?
Added a test case in SQLQuerySuite.

Closes #14169

Author: Daoyuan Wang <[email protected]>
Author: Yin Huai <[email protected]>

Closes #14249 from yhuai/scriptTransformation.

(cherry picked from commit 96e9afa)
Signed-off-by: Yin Huai <[email protected]>
@asfgit closed this in 96e9afa on Jul 18, 2016