Conversation

@mengxr
Contributor

mengxr commented Aug 25, 2014

Text records may contain in-record delimiter or newline characters. In such cases, we can either encode these characters or escape them. The latter is simpler and is what Redshift's UNLOAD command does with the ESCAPE option. The catch is that an escaped record may then span multiple physical lines, so we need an input format that can reassemble it.
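
For concreteness, a minimal sketch of the field-splitting half of the problem is below (the input-format half is deciding where a record ends when its newlines are escaped). The backslash escape and pipe delimiter are assumptions matching Redshift's defaults, and `EscapedRecordParser` is a hypothetical name rather than code from this patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: split one logical record into fields, honoring
// backslash escapes. Assumes the record has already been reassembled
// across physical lines by the input format.
object EscapedRecordParser {
  def splitFields(record: String, delimiter: Char = '|', escape: Char = '\\'): Array[String] = {
    val fields = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var i = 0
    while (i < record.length) {
      val c = record.charAt(i)
      if (c == escape && i + 1 < record.length) {
        current += record.charAt(i + 1) // the escaped char is kept literally
        i += 2
      } else if (c == delimiter) {
        fields += current.toString
        current.clear()
        i += 1
      } else {
        current += c
        i += 1
      }
    }
    fields += current.toString
    fields.toArray
  }
}
```

For example, `splitFields("a|b\\|c")` yields `Array("a", "b|c")`, and an escaped newline inside a field survives as a literal newline.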

@marmbrus
Contributor

Oh cool, this will be useful for #1351 /cc @falaki

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit ff339a5.

  • This patch merges cleanly.

@mridulm
Contributor

mridulm commented Aug 25, 2014

This does not need to be in Spark core.

Btw, since we allow any arbitrary InputFormat to be used in Spark, users can plug in any existing Hadoop InputFormat/OutputFormat for this purpose.
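
To make that concrete, below is a minimal sketch of that route, using the stock `TextInputFormat` as a stand-in for a hypothetical escape-aware format; any Hadoop `InputFormat` implementation can be dropped into the same call:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Any Hadoop InputFormat plugs into Spark through the generic entry
// points on SparkContext, so a new format does not have to live in core.
val sc = new SparkContext(new SparkConf().setAppName("escaped-text-demo"))
val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/data")
  .map { case (_, text) => text.toString }
```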

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit ff339a5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]

@mengxr
Contributor Author

mengxr commented Aug 25, 2014

@mridulm Any reference to an existing input format? I searched on Google; the closest thing I found is https://github.com/msukmanowsky/OmnitureDataFileInputFormat, but it is different.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit f0e3842.

  • This patch merges cleanly.

@mridulm
Contributor

mridulm commented Aug 25, 2014

@mengxr Other than custom input/output formats I have written, IIRC Pig and Jaql support this. Both are open source and run on top of Hadoop, so they have input/output formats for this purpose, though I'm not sure it is possible to import their code directly (it might bring in too many other dependencies, and it might sit within deep layers of their abstractions).

There are also CSV-based readers/writers out there that allow customizing the escape and delimiter characters. It might be possible to adapt one of them, though I have not investigated this in detail.
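
For example, opencsv's parser takes custom delimiter, quote, and escape characters (a sketch only; the character choices are assumptions, not settings from this patch):

```scala
import au.com.bytecode.opencsv.CSVParser

// opencsv allows customizing the delimiter, quote, and escape characters.
val parser = new CSVParser('|', '"', '\\')
val fields: Array[String] = parser.parseLine("a|\"b|c\"|plain")
// fields: Array(a, b|c, plain) -- the quoted delimiter stays in-field
```

Note that `parseLine` still consumes one physical line at a time, so records with escaped newlines would still need the multi-line handling this PR is about.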

Even assuming we can't borrow this from an external source verbatim and have to author it ourselves, I am not in favor of putting it in core.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit e35a366.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit f0e3842.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit e35a366.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable
    • class BoundedFloat(float):
    • class JoinedRow2 extends Row
    • class JoinedRow3 extends Row
    • class JoinedRow4 extends Row
    • class JoinedRow5 extends Row
    • class GenericRow(protected[sql] val values: Array[Any]) extends Row
    • abstract class MutableValue extends Serializable
    • final class MutableInt extends MutableValue
    • final class MutableFloat extends MutableValue
    • final class MutableBoolean extends MutableValue
    • final class MutableDouble extends MutableValue
    • final class MutableShort extends MutableValue
    • final class MutableLong extends MutableValue
    • final class MutableByte extends MutableValue
    • final class MutableAny extends MutableValue
    • final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow
    • case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate
    • case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression
    • case class CollectHashSetFunction(
    • case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression
    • case class CombineSetsAndCountFunction(
    • case class CountDistinctFunction(
    • case class MaxOf(left: Expression, right: Expression) extends Expression
    • case class NewSet(elementType: DataType) extends LeafExpression
    • case class AddItemToSet(item: Expression, set: Expression) extends Expression
    • case class CombineSets(left: Expression, right: Expression) extends BinaryExpression
    • case class CountSet(child: Expression) extends UnaryExpression

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2118 at commit f8d0191.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2118 at commit f8d0191.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class EscapedTextInputFormat extends FileInputFormat[Long, Array[String]]

Contributor

It might be better to put this in the "io" package in case we also create output formats later, but no strong feelings. I guess the Hadoop 2 equivalent is called "input"; it just seems weird to make a new package for this alone.

Contributor

BTW, if you do add a new package, you'll have to fix the SBT code that generates the Javadocs and Scaladocs to make sure it appears in the right ones.

@mengxr
Contributor Author

mengxr commented Sep 1, 2014

@mridulm I moved the implementation to https://github.com/mengxr/redshift-input-format and I'm closing this PR for now. If people feel that this input format is very useful, we can put it back into Spark core later. Thanks @mridulm and @mateiz for the code review!

@mengxr closed this Sep 1, 2014

@mridulm
Contributor

mridulm commented Sep 1, 2014

Since we might keep needing to add input formats, how about creating spark-hadoop-io and having core depend on it? (Also, move the whole-text-file reader and other InputFormats in Spark core and elsewhere into it.) This would also mean non-Spark users could use the Maven artifact without needing to pull in Spark dependencies (the same reason we can't use the Pig or Jaql InputFormats).
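
If such an artifact existed, depending on it would look roughly like the sbt line below; the coordinates are entirely hypothetical:

```scala
// Hypothetical sbt coordinates for the proposed spark-hadoop-io artifact.
libraryDependencies += "org.apache.spark" %% "spark-hadoop-io" % "1.2.0"
```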

@mateiz
Contributor

mateiz commented Sep 1, 2014

I like the idea of a separate Maven artifact for this. IMO we should try to have common formats easily accessible in Spark, but if core depends on spark-hadoop-io, that will solve that problem.
