[SPARK-10310][SQL] Using \t as the field delimiter and \n as the line delimiter #8476
Conversation
Test build #41667 has finished for PR 8476 at commit
Seems never used.
retest this please.
Test build #41811 has finished for PR 8476 at commit
Test build #41813 has finished for PR 8476 at commit
Test build #41815 has finished for PR 8476 at commit
Applying the Spark master code (commit 8d2ab75) with this PR patch, the previously broken cases pass now.
cc @yhuai @liancheng, would you mind taking a look at this?
This is a real-world case using Spark SQL, and hopefully it can be fixed/merged in Spark 1.5.0. Thanks in advance!
Why should this test case be ignored? The involved SQL query doesn't contain a RECORDREADER clause, so it should fall back to TextRecordReader, shouldn't it?
Within TextRecordReader it would try to convert the writable to Text, which is not suitable for Avro.
@zhichao-li Could you please add a test case that explicitly checks the output format of a transformation query?
Shouldn't we specify the line delimiter here?
The line delimiter is controlled by the RecordWriter; see TextRecordWriter as an example:

```java
public void write(Writable row) throws IOException {
  Text text = (Text) row;
  Text escapeText = text;
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
    escapeText = HiveUtils.escapeText(text);
  }
  out.write(escapeText.getBytes(), 0, escapeText.getLength());
  out.write(Utilities.newLineCode);
}
```
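To make the point above concrete, here is a minimal Python sketch (hypothetical names, not Hive's actual API) of the idea in TextRecordWriter: the record writer, not the SerDe, supplies the line delimiter, appending one newline per record on write.

```python
import io

NEWLINE = b"\n"  # analogue of Utilities.newLineCode in the Java snippet above


class SketchTextRecordWriter:
    """Hypothetical stand-in for Hive's TextRecordWriter."""

    def __init__(self, out):
        self.out = out

    def write(self, row_bytes):
        # The serialized row carries no line terminator of its own;
        # the writer supplies it, one per record.
        self.out.write(row_bytes)
        self.out.write(NEWLINE)


buf = io.BytesIO()
writer = SketchTextRecordWriter(buf)
writer.write(b"1\talice")
writer.write(b"2\tbob")
print(buf.getvalue())  # b'1\talice\n2\tbob\n'
```

This is why a SerDe-level line-delimiter property is unnecessary here: each record is terminated by the writer regardless of how its fields were serialized.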
I'm not super familiar with the script transformation feature. If I understand this problem correctly, in prior versions we don't support … It seems that this PR not only tries to fix the delimiters issue, but also adds support for …
Previously there was no …
Test build #42082 has finished for PR 8476 at commit
Test build #42109 has finished for PR 8476 at commit
Test build #42110 has finished for PR 8476 at commit
Test build #42116 has finished for PR 8476 at commit
Test build #42127 has finished for PR 8476 at commit
I saw the issue marked as 'Target Version 1.5.1' and hopefully it can be merged in 1.5.1. Also, I have applied this PR's patch to validate it.
@zhichao-li I just opened PR #8860, which is based on this PR. Updates made there are: …
Would you mind taking a look at #8860? Thanks!
@liancheng thanks for taking care of this. |
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.** This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding the field delimiter SerDe property to the default `LazySimpleSerDe` and enabling default record reader/writer classes. Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing the record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s). Author: Cheng Lian <lian@databricks.com> Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters. (Cherry picked from commit 84f81e0, signed off by Yin Huai <yhuai@databricks.com>; also cherry picked from commit 73d0621.)
Currently we use `LazySimpleSerDe` to serialize the script input by default, but it uses '\001' as the field delimiter, which is not the same as Hive's default of '\t'.
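The delimiter mismatch can be illustrated with a small sketch (plain Python, not Spark code; the function names are hypothetical): a user script written for Hive-style TRANSFORM splits each input line on tabs, so feeding it rows serialized with the '\001' delimiter leaves the fields unseparated.

```python
def serialize(row, delimiter):
    """Join a row's fields with the given field delimiter, as a SerDe would."""
    return delimiter.join(str(f) for f in row)


def user_script(line):
    """A typical transform script: split each input line on tabs."""
    return line.rstrip("\n").split("\t")


row = ["1", "alice", "2015"]

# With the '\t' delimiter Hive uses, the script recovers the fields:
ok = user_script(serialize(row, "\t"))          # ['1', 'alice', '2015']

# With LazySimpleSerDe's default '\001', the script sees one giant field:
broken = user_script(serialize(row, "\u0001"))  # ['1\x01alice\x012015']

print(ok)
print(broken)
```

This is the behavior difference the PR addresses: setting the field delimiter SerDe property to '\t' makes the script's input match what Hive would produce.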