Conversation

@mengxr
Contributor

mengxr commented Aug 25, 2014

Text records may contain in-record delimiter or newline characters. In such cases, we can either encode these characters or escape them. The latter is simpler and is what Redshift's UNLOAD command does with the ESCAPE option. The catch is that an escaped record may then span multiple physical lines, so we need an input format that can reassemble it.
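
For concreteness, a minimal sketch of the field-splitting half of the problem is below (the input-format half is deciding where a record ends when its newlines are escaped). The backslash escape and pipe delimiter are assumptions matching Redshift's defaults, and `EscapedRecordParser` is a hypothetical name rather than code from this patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: split one logical record into fields, honoring
// backslash escapes. Assumes the record has already been reassembled
// across physical lines by the input format.
object EscapedRecordParser {
  def splitFields(record: String, delimiter: Char = '|', escape: Char = '\\'): Array[String] = {
    val fields = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var i = 0
    while (i < record.length) {
      val c = record.charAt(i)
      if (c == escape && i + 1 < record.length) {
        current += record.charAt(i + 1) // the escaped char is kept literally
        i += 2
      } else if (c == delimiter) {
        fields += current.toString
        current.clear()
        i += 1
      } else {
        current += c
        i += 1
      }
    }
    fields += current.toString
    fields.toArray
  }
}
```

For example, `splitFields("a|b\\|c")` yields `Array("a", "b|c")`, and an escaped newline inside a field survives as a literal newline.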

@marmbrus
Contributor

Oh cool, this will be useful for #1351 /cc @falaki

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit ff339a5.

  • This patch merges cleanly.

@mridulm
Contributor

mridulm commented Aug 25, 2014

This does not need to be in Spark core.

Btw, since we allow any arbitrary InputFormat to be used in Spark, users can plug in any existing Hadoop InputFormat/OutputFormat for this purpose.
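
To make that concrete, below is a minimal sketch of that route, using the stock `TextInputFormat` as a stand-in for a hypothetical escape-aware format; any Hadoop `InputFormat` implementation can be dropped into the same call:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Any Hadoop InputFormat plugs into Spark through the generic entry
// points on SparkContext, so a new format does not have to live in core.
val sc = new SparkContext(new SparkConf().setAppName("escaped-text-demo"))
val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/data")
  .map { case (_, text) => text.toString }
```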

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit ff339a5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]

@mengxr
Contributor Author

mengxr commented Aug 25, 2014

@mridulm Any reference to an existing input format? I searched on Google; the closest thing I found is https://github.com/msukmanowsky/OmnitureDataFileInputFormat, but it is different.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit f0e3842.

  • This patch merges cleanly.

@mridulm
Contributor

mridulm commented Aug 25, 2014

@mengxr Other than custom input/output formats I have written, IIRC Pig and Jaql support this. Both are open source and run on top of Hadoop, so they have input/output formats for this purpose, though I'm not sure it is possible to import their code directly (it might bring in too many other dependencies, and it might sit within deep layers of their abstractions).

There are also CSV-based readers/writers out there that allow customizing the escape and delimiter characters. It might be possible to adapt one of them, though I have not investigated this in detail.
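
For example, opencsv's parser takes custom delimiter, quote, and escape characters (a sketch only; the character choices are assumptions, not settings from this patch):

```scala
import au.com.bytecode.opencsv.CSVParser

// opencsv allows customizing the delimiter, quote, and escape characters.
val parser = new CSVParser('|', '"', '\\')
val fields: Array[String] = parser.parseLine("a|\"b|c\"|plain")
// fields: Array(a, b|c, plain) -- the quoted delimiter stays in-field
```

Note that `parseLine` still consumes one physical line at a time, so records with escaped newlines would still need the multi-line handling this PR is about.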

Even assuming we can't borrow this from an external source verbatim and have to author it ourselves, I am not in favor of putting it in core.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit e35a366.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit f0e3842.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit e35a366.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable
    • class BoundedFloat(float):
    • class JoinedRow2 extends Row
    • class JoinedRow3 extends Row
    • class JoinedRow4 extends Row
    • class JoinedRow5 extends Row
    • class GenericRow(protected[sql] val values: Array[Any]) extends Row
    • abstract class MutableValue extends Serializable
    • final class MutableInt extends MutableValue
    • final class MutableFloat extends MutableValue
    • final class MutableBoolean extends MutableValue
    • final class MutableDouble extends MutableValue
    • final class MutableShort extends MutableValue
    • final class MutableLong extends MutableValue
    • final class MutableByte extends MutableValue
    • final class MutableAny extends MutableValue
    • final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
    • case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
    • case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
    • case class CollectHashSetFunction(
    • case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
    • case class CombineSetsAndCountFunction(
    • case class CountDistinctFunction(
    • case class MaxOf(left: Expression, right: Expression) extends Expression
    • case class NewSet(elementType: DataType) extends LeafExpression
    • case class AddItemToSet(item: Expression, set: Expression) extends Expression
    • case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
    • case class CountSet(child: Expression) extends UnaryExpression

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit f8d0191.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit f8d0191.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]

Contributor

It might be better to put this in the "io" package in case we also create output formats later, but no strong feelings. I guess the Hadoop 2 equivalent is called "input"; it just seems weird to make a new package for this alone.

Contributor

BTW, if you do add a new package, you'll have to fix the SBT code that generates the Javadocs and Scaladocs to make sure it appears in the right ones.

@mengxr
Contributor Author

mengxr commented Sep 1, 2014

@mridulm I moved the implementation to https://github.com/mengxr/redshift-input-format and I'm closing this PR for now. If people feel that this input format is very useful, we can put it back into Spark core later. Thanks @mridulm and @mateiz for the code review!

@mengxr closed this Sep 1, 2014

@mridulm
Contributor

mridulm commented Sep 1, 2014

Since we might keep needing to add input formats, how about creating spark-hadoop-io and having core depend on it? (Also, move the whole-text-file reader and other InputFormats in Spark core and elsewhere into it.) This would also mean non-Spark users could use the Maven artifact without needing to pull in Spark dependencies (the same reason we can't use the Pig or Jaql InputFormats).
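
If such an artifact existed, depending on it would look roughly like the sbt line below; the coordinates are entirely hypothetical:

```scala
// Hypothetical sbt coordinates for the proposed spark-hadoop-io artifact.
libraryDependencies += "org.apache.spark" %% "spark-hadoop-io" % "1.2.0"
```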

@mateiz
Contributor

mateiz commented Sep 1, 2014

I like the idea of a separate Maven artifact for this. IMO we should try to have common formats easily accessible in Spark, but if core depends on spark-hadoop-io, that will solve that problem.
