[SPARK-16515][SQL] Set default record reader and writer for script transformation #14169
Conversation
Test build #62203 has finished for PR 14169 at commit
Test build #62205 has finished for PR 14169 at commit
This is strange because I can pass the specific test on my local machine.
I have updated my code and switched to using bash in the test case. Hope it will work on Jenkins.
Test build #62237 has finished for PR 14169 at commit
LGTM. cc @yhuai @liancheng
What do you mean by "Since Spark 2.0 has deleted those config keys from hive conf"?
@rxin In Spark 2.0, those conf keys starting with "hive.", which used to have default values in HiveConf, no longer get a default value.
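A minimal sketch of the behaviour being described, assuming a hypothetical MiniConf stand-in for Spark's conf (not the real SQLConf): once the "hive.*" keys lost their registered defaults, a plain lookup wrapped in Try silently yields None, while a lookup with an explicit default still recovers Hive's class name.

```scala
import scala.util.Try

// Hypothetical stand-in for SQLConf, for illustration only.
class MiniConf(settings: Map[String, String]) {
  // Mirrors getConfString(key): throws when the key is absent.
  def getConfString(key: String): String =
    settings.getOrElse(key, throw new java.util.NoSuchElementException(key))
  // Mirrors getConfString(key, defaultValue): falls back instead of throwing.
  def getConfString(key: String, defaultValue: String): String =
    settings.getOrElse(key, defaultValue)
}

// In Spark 2.0 the "hive.*" keys no longer carry registered defaults.
val conf = new MiniConf(Map.empty)

// Plain lookup wrapped in Try: the record reader silently becomes None.
val before = Try(conf.getConfString("hive.script.recordreader")).toOption

// Lookup with an explicit default: Hive's TextRecordReader comes back.
val after = conf.getConfString(
  "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
```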
HiveConf provides the default values.
Are all script transforms broken? Don't we already have a test case that actually runs script transforms?
@rxin Only those script transformation cases which use LazySimpleSerDe would be affected.
// SPARK-10310: Special cases LazySimpleSerDe
val recordHandler = if (name == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
  Try(conf.getConfString(configKey)).toOption
Should we just use getConfString(key: String, defaultValue: String)? That defaultRecordHandler method seems unnecessary.
The default value is different for different keys. Do you mean to inline the defaultRecordHandler function?
@jameszhouyi / @adrian-wang / @chenghao-intel related to this pull request: we want to have a native implementation for ScriptTransform that does not depend on Hive's serdes. Can you let me know what features are missing from the current native implementation compared with LazySimpleSerDe? What does LazySimpleSerDe actually support that the built-in implementation does not?
}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(inRowFormat, "hive.script.recordreader")
Can we pass in the default value for the reader/writer? like format(inRowFormat, "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader") and format(outRowFormat, "hive.script.recordwriter", "org.apache.hadoop.hive.ql.exec.TextRecordWriter"). Then, in def format, we just use getConfString(key: String, defaultValue: String)
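A sketch of the suggestion above, under the assumption that `getConfString(key, defaultValue)` simply falls back when the key is unset; MiniConf here is a hypothetical stand-in for SQLConf, not Spark's actual class.

```scala
// Hypothetical stand-in for SQLConf, for illustration only.
class MiniConf(settings: Map[String, String]) {
  def getConfString(key: String, defaultValue: String): String =
    settings.getOrElse(key, defaultValue)
}

// One key set explicitly; the other left unset so it falls back.
val conf = new MiniConf(Map("hive.script.recordwriter" -> "my.custom.Writer"))

val reader = conf.getConfString(
  "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader") // unset: default wins

val writer = conf.getConfString(
  "hive.script.recordwriter",
  "org.apache.hadoop.hive.ql.exec.TextRecordWriter") // set: explicit value wins
```

The point of the suggestion is that the fallback lives at the call site, so no separate defaultRecordHandler method is needed.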
Test build #62451 has finished for PR 14169 at commit
  // Decode and input/output format.
  type Format = (Seq[(String, String)], Option[String], Seq[(String, String)], Option[String])
- def format(fmt: RowFormatContext, configKey: String): Format = fmt match {
+ def format(fmt: RowFormatContext, configKey: String, configValue: String): Format = fmt match {
The name should show that this value is a default value, right?
…ansformation

## What changes were proposed in this pull request?

In ScriptInputOutputSchema, we read default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from hive conf, we have to set default reader/writer class names by ourselves. Otherwise we will get None for LazySimpleSerDe, and the data written would not be able to be read by the script. The test case added worked fine with previous versions of Spark, but would fail now.

## How was this patch tested?

added a test case in SQLQuerySuite.

Closes #14169

Author: Daoyuan Wang <[email protected]>
Author: Yin Huai <[email protected]>

Closes #14249 from yhuai/scriptTransformation.

(cherry picked from commit 96e9afa)
Signed-off-by: Yin Huai <[email protected]>
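Putting the pieces together, here is a hedged sketch of the merged behaviour, heavily simplified from SparkSqlParser: the parser plumbing is omitted, MiniConf stands in for SQLConf, and recordHandler is an illustrative helper, not the actual method name. Only LazySimpleSerDe gets a record handler (SPARK-10310), and after this patch the lookup carries an explicit default instead of relying on HiveConf to supply one.

```scala
// Hypothetical stand-in for SQLConf, for illustration only.
class MiniConf(settings: Map[String, String]) {
  def getConfString(key: String, defaultValue: String): String =
    settings.getOrElse(key, defaultValue)
}

val conf = new MiniConf(Map.empty)

// Illustrative helper: special-case LazySimpleSerDe, and fall back to an
// explicit default class name when the conf key is unset.
def recordHandler(serdeName: String, configKey: String, default: String): Option[String] =
  if (serdeName == "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe") {
    Some(conf.getConfString(configKey, default))
  } else {
    None
  }

// LazySimpleSerDe with no conf set: the default reader is used.
val reader = recordHandler(
  "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
  "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")

// Any other serde: no record handler, as before.
val other = recordHandler(
  "some.other.SerDe",
  "hive.script.recordreader",
  "org.apache.hadoop.hive.ql.exec.TextRecordReader")
```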