[SPARK-10310][SQL] Using \t as the field delimiter and \n as the line delimiter #8476
Conversation
Test build #41667 has finished for PR 8476 at commit
Seems never used.
retest this please.
Test build #41811 has finished for PR 8476 at commit
Test build #41813 has finished for PR 8476 at commit
Test build #41815 has finished for PR 8476 at commit
Applying the Spark master code (commit 8d2ab75) with this PR patch, the previously broken cases pass now.
cc @yhuai @liancheng, would you mind taking a look at this?
This is a real-world case using Spark SQL, and hopefully it can be fixed/merged in Spark 1.5.0. Thanks in advance!
Why should this test case be ignored? The involved SQL query doesn't contain a RECORDREADER clause, so it should fall back to TextRecordReader, shouldn't it?
Within TextRecordReader it would try to convert the writable to Text, which is not suitable for Avro.
@zhichao-li Could you please add a test case that explicitly checks the output format of a transformation query?
Shouldn't we specify the line delimiter here?
The line delimiter is controlled by the RecordWriter; see TextRecordWriter as an example:

```java
public void write(Writable row) throws IOException {
  Text text = (Text) row;
  Text escapeText = text;
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
    escapeText = HiveUtils.escapeText(text);
  }
  out.write(escapeText.getBytes(), 0, escapeText.getLength());
  out.write(Utilities.newLineCode);
}
```
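To make the point above concrete, here is a minimal Python sketch (hypothetical names, not Hive's actual API) of the idea in TextRecordWriter: the record writer, not the SerDe, supplies the line delimiter, appending one newline per record on write.

```python
import io

NEWLINE = b"\n"  # analogue of Utilities.newLineCode in the Java snippet above


class SketchTextRecordWriter:
    """Hypothetical stand-in for Hive's TextRecordWriter."""

    def __init__(self, out):
        self.out = out

    def write(self, row_bytes):
        # The serialized row carries no line terminator of its own;
        # the writer supplies it, one per record.
        self.out.write(row_bytes)
        self.out.write(NEWLINE)


buf = io.BytesIO()
writer = SketchTextRecordWriter(buf)
writer.write(b"1\talice")
writer.write(b"2\tbob")
print(buf.getvalue())  # b'1\talice\n2\tbob\n'
```

This is why a SerDe-level line-delimiter property is unnecessary here: each record is terminated by the writer regardless of how its fields were serialized.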
I'm not super familiar with the script transformation feature. If I understand this problem correctly, in prior versions we don't support … It seems that this PR not only tries to fix the delimiters issue, but also adds support for …
Previously there was no …
Test build #42082 has finished for PR 8476 at commit
Test build #42109 has finished for PR 8476 at commit
Test build #42110 has finished for PR 8476 at commit
Test build #42116 has finished for PR 8476 at commit
Test build #42127 has finished for PR 8476 at commit
I saw the issue marked as 'Target Version 1.5.1' and hopefully it can be merged in 1.5.1. Also, I have applied this PR's patch to validate it.
@zhichao-li I just opened PR #8860, which is based on this PR. Updates made there are: …
Would you mind taking a look at #8860? Thanks!
@liancheng thanks for taking care of this. |
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.** This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding the field delimiter SerDe property to the default `LazySimpleSerDe` and enabling default record reader/writer classes. Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing the record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s). Author: Cheng Lian <lian@databricks.com> Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters. (Cherry picked from commit 84f81e0, signed off by Yin Huai <yhuai@databricks.com>; also cherry picked from commit 73d0621.)
Currently we use `LazySimpleSerDe` to serialize the script input by default, but it uses '\001' as the field delimiter, which is not the same as Hive's default of '\t'.
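The delimiter mismatch can be illustrated with a small sketch (plain Python, not Spark code; the function names are hypothetical): a user script written for Hive-style TRANSFORM splits each input line on tabs, so feeding it rows serialized with the '\001' delimiter leaves the fields unseparated.

```python
def serialize(row, delimiter):
    """Join a row's fields with the given field delimiter, as a SerDe would."""
    return delimiter.join(str(f) for f in row)


def user_script(line):
    """A typical transform script: split each input line on tabs."""
    return line.rstrip("\n").split("\t")


row = ["1", "alice", "2015"]

# With the '\t' delimiter Hive uses, the script recovers the fields:
ok = user_script(serialize(row, "\t"))          # ['1', 'alice', '2015']

# With LazySimpleSerDe's default '\001', the script sees one giant field:
broken = user_script(serialize(row, "\u0001"))  # ['1\x01alice\x012015']

print(ok)
print(broken)
```

This is the behavior difference the PR addresses: setting the field delimiter SerDe property to '\t' makes the script's input match what Hive would produce.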