Conversation

@zhichao-li
Contributor

Currently we use LazySimpleSerDe to serialize the script input by default, but it uses '\001' as the field delimiter, which is not the same as Hive.
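
To make the mismatch concrete, a minimal sketch (not code from the PR):

  // Hive pipes TRANSFORM rows to the user script as tab-delimited text,
  // while LazySimpleSerDe's default field delimiter is '\001' (Ctrl-A).
  val hiveStyle  = Seq("1", "apple").mkString("\t")     // "1\tapple"
  val sparkStyle = Seq("1", "apple").mkString("\u0001") // "1\u0001apple"
  // A script such as `cut -f2` splits on tabs: it prints "apple" for
  // hiveStyle but echoes the whole undivided line for sparkStyle.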

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41667 has finished for PR 8476 at commit cafe301.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Seems never used.

@zhichao-li
Contributor Author

retest this please.

@SparkQA

SparkQA commented Aug 31, 2015

Test build #41811 has finished for PR 8476 at commit 4939ed8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 31, 2015

Test build #41813 has finished for PR 8476 at commit 517f31f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 31, 2015

Test build #41815 has finished for PR 8476 at commit 23aaa04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jameszhouyi

Applying this PR's patch on top of the Spark master code (commit 8d2ab75), the previously broken cases now pass.

@zhichao-li
Contributor Author

cc @yhuai @liancheng, would you mind taking a look at this?

@jameszhouyi

This is a real-world case using Spark SQL, and hopefully it can be fixed/merged in Spark 1.5.0. Thanks in advance!

Contributor

Why should this test case be ignored? The SQL query involved doesn't contain a RECORDREADER clause, so it should fall back to TextRecordReader, shouldn't it?

Contributor Author

Within TextRecordReader, it would try to convert the Writable to Text, which is not suitable for Avro.
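
A hedged sketch of the failure mode (paraphrasing Hive's TextRecordReader, not quoting it):

  import org.apache.hadoop.io.{Text, Writable}

  // TextRecordReader assumes the Writable it is handed is a Text and
  // reads the next raw line into it, roughly:
  def next(row: Writable): Int = {
    val text = row.asInstanceOf[Text] // ClassCastException for Avro's Writable
    // ... lineReader.readLine(text) fills `text` with the next line ...
    text.getLength
  }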

@liancheng
Contributor

@zhichao-li Could you please add a test case that explicitly checks the output format of a transformation query?

Contributor

Shouldn't we specify the line delimiter here?

Contributor Author

The line delimiter is controlled by the RecordWriter (note the trailing newline byte); see TextRecordWriter as an example:

  public void write(Writable row) throws IOException {
    Text text = (Text) row;
    Text escapeText = text;

    // Escape the row first if script output escaping is enabled.
    if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
      escapeText = HiveUtils.escapeText(text);
    }

    // Write the row bytes, then the newline that acts as the line delimiter.
    out.write(escapeText.getBytes(), 0, escapeText.getLength());
    out.write(Utilities.newLineCode);
  }

@liancheng
Contributor

I'm not super familiar with the script transformation feature. If I understand this problem correctly, in prior versions we didn't support RECORDREADER or RECORDWRITER clauses and thus always fell back to TextRecordReader and TextRecordWriter. However, we didn't specify the line delimiter or field delimiter properly. Is that right?

It seems that this PR not only tries to fix the delimiter issue but also adds support for the RECORDREADER and RECORDWRITER clauses, which I think could be moved into a separate PR to simplify this one.

@zhichao-li
Contributor Author

Previously there was no TextRecordWriter or TextRecordReader involved; we just read manually and used Writable.write() for serialization.
I'll remove the RECORDREADER and RECORDWRITER clauses to simplify this PR.
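
For clarity, the old path looked roughly like this (a sketch based on the description above, not the actual diff):

  import java.io.DataOutputStream
  import org.apache.hadoop.io.Writable

  // Old behavior (sketch): each row's Writable went straight to the
  // script's stdin via Writable.write(), with no RecordWriter and hence
  // no configurable field or line delimiters in between.
  def feedScript(row: Writable, scriptStdin: DataOutputStream): Unit =
    row.write(scriptStdin)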

@SparkQA

SparkQA commented Sep 7, 2015

Test build #42082 has finished for PR 8476 at commit c6e9134.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42109 has finished for PR 8476 at commit 9b3a8de.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42110 has finished for PR 8476 at commit 8fb75a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42116 has finished for PR 8476 at commit 2d1e604.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42127 has finished for PR 8476 at commit ae0b68e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jameszhouyi

I saw the issue is marked as 'Target Version 1.5.1', and hopefully it can be merged in 1.5.1. I have also applied this PR's patch to validate it.

@liancheng
Contributor

@zhichao-li I just opened PR #8860, which is based on this PR. Updates made there are:

  1. Setting the field delimiter and the default record reader/writer in HiveQl instead of HiveScriptIOSchema (see the sketch after this list).

    In this way, it would be easier to add user-defined record reader/writer support in the future (reading them from SQL clauses).

  2. Only using the record reader/writer when they are available

    So that we can avoid disabling other input formats like Avro.
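
Roughly, the defaults amount to something like this (an illustrative sketch; the property key and class names are Hive's, the val names are made up):

  // LazySimpleSerDe reads its field delimiter from this SerDe property:
  val defaultSerdeProps = Map("field.delim" -> "\t")
  // Hive's stock record reader/writer for script transformation I/O:
  val defaultRecordReader = "org.apache.hadoop.hive.ql.exec.TextRecordReader"
  val defaultRecordWriter = "org.apache.hadoop.hive.ql.exec.TextRecordWriter"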

Would you mind taking a look at #8860? Thanks!

@zhichao-li
Contributor Author

@liancheng thanks for taking care of this.

zhichao-li closed this Sep 22, 2015
asfgit pushed a commit that referenced this pull request Sep 23, 2015
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**

This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding field delimiter SerDe property to the default `LazySimpleSerDe`, and enabling default record reader/writer classes.

Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).

Author: Cheng Lian <[email protected]>

Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters.
asfgit pushed a commit that referenced this pull request Sep 23, 2015
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**

This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding field delimiter SerDe property to the default `LazySimpleSerDe`, and enabling default record reader/writer classes.

Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).

Author: Cheng Lian <[email protected]>

Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters.

(cherry picked from commit 84f81e0)
Signed-off-by: Yin Huai <[email protected]>
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**

This PR is based on PR apache#8476 authored by zhichao-li. It fixes SPARK-10310 by adding field delimiter SerDe property to the default `LazySimpleSerDe`, and enabling default record reader/writer classes.

Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).

Author: Cheng Lian <[email protected]>

Closes apache#8860 from liancheng/spark-10310/fix-script-trans-delimiters.

(cherry picked from commit 84f81e0)
Signed-off-by: Yin Huai <[email protected]>
(cherry picked from commit 73d0621)