@rayortigas

I'm still in RDD-land, so I'd like something like this to avoid writing things like:

val rdd = sqlContext.csvFile(path, useHeader = false).map { row =>
  Foo(row.getString(0).toInt, row.getString(1).toInt, row.getString(2).toDouble)
}

Instead, we can write:

val rdd = sqlContext.csvFileToRDD[Foo](path, useHeader = false)

I tried to be minimally invasive here by building on top of csvFile. With more refactoring, I probably would've teased out some stuff in CsvRelation, but I hope this PR is useful in its present form.
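
For anyone curious how the row-to-case-class step might work, here is a minimal sketch of the general idea, not the code in this PR. It assumes Scala 2.11+ with scala-reflect on the classpath, and it works on plain string fields rather than Spark rows so it runs without Spark. `TypedRows` and `fromStrings` are hypothetical names used only for illustration, and the field names of `Foo` are assumed (only its Int, Int, Double shape comes from the example above). Like the PR, it only handles primitive-style fields and will likely fail on inner case classes.

```scala
import scala.reflect.ClassTag
import scala.reflect.runtime.{universe => ru}

// Example target type; field names are assumed, field types follow the
// Foo(Int, Int, Double) usage shown earlier in this description.
case class Foo(a: Int, b: Int, c: Double)

object TypedRows {
  // Hypothetical helper (not part of spark-csv): builds a T from one row of
  // string fields by reflecting on T's primary constructor.
  def fromStrings[T: ru.TypeTag: ClassTag](fields: Seq[String]): T = {
    val mirror = ru.runtimeMirror(implicitly[ClassTag[T]].runtimeClass.getClassLoader)
    val cls = ru.typeOf[T].typeSymbol.asClass
    require(cls.isCaseClass, s"${cls.fullName} is not a case class")

    val ctor = cls.primaryConstructor.asMethod
    val params = ctor.paramLists.head
    require(params.size == fields.size,
      s"expected ${params.size} fields, got ${fields.size}")

    // Convert each string to the type of the matching constructor parameter.
    val args = params.zip(fields).map { case (param, raw) =>
      val t = param.typeSignature
      val value: Any =
        if (t =:= ru.typeOf[Int]) raw.toInt
        else if (t =:= ru.typeOf[Long]) raw.toLong
        else if (t =:= ru.typeOf[Double]) raw.toDouble
        else if (t =:= ru.typeOf[Boolean]) raw.toBoolean
        else if (t =:= ru.typeOf[String]) raw
        else throw new IllegalArgumentException(
          s"unsupported field type $t for parameter ${param.name}")
      value
    }

    // Invoke the constructor reflectively with the converted arguments.
    mirror.reflectClass(cls).reflectConstructor(ctor)(args: _*).asInstanceOf[T]
  }
}
```

With these definitions, `TypedRows.fromStrings[Foo](Seq("1", "2", "3.5"))` should produce `Foo(1, 2, 3.5)`, while a row with the wrong number of fields or an unparsable value throws instead of yielding a partial object.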

Regards,
Ray

Squashed commit of the following:

commit e75167f
Author: Ray Ortigas <[email protected]>
Date:   Sat Apr 18 15:39:30 2015 -0700

    Test for rejection of case classes with non-primitive fields.

commit c4a1de0
Author: Ray Ortigas <[email protected]>
Date:   Sat Apr 18 11:54:53 2015 -0700

    Don't inherit from csv.CsvContext.

commit 674672d
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:37:52 2015 -0700

    Add TSV support.

commit e93ec4c
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:22:52 2015 -0700

    Add comment about not handling inner case classes.

commit 1495f51
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:22:38 2015 -0700

    Add test for headerless CSV.

commit 6f7fcf3
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:12:19 2015 -0700

    Add test for permissive mode (which is invalid).

commit ccbb6ba
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:10:54 2015 -0700

    Add test for fail-fast mode.

commit fb0f50d
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 19:04:33 2015 -0700

    Add test.

commit 51a9868
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 17:21:13 2015 -0700

    Move RDD-related methods to own package.

commit f5a2c2c
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 16:31:10 2015 -0700

    Use TypeTag and ClassTag instead of manifest.

commit ffed4fc
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 15:41:32 2015 -0700

    Express csvFileToRDD() in terms of csvFile().

commit b52f582
Author: Ray Ortigas <[email protected]>
Date:   Fri Apr 17 15:38:43 2015 -0700

    First cut at typed RDD.

@rxin
Contributor

rxin commented Apr 19, 2015

@rayortigas this seems like something that can easily live outside of the CSV package. There isn't anything specific to CSV about this one.

As a matter of fact it probably deserves to either be part of the DataFrame API, or just an implicit conversion on DataFrame to add the following:

// or called toTyped, or typedRDD
def toTypedRDD[T : scala.reflect.runtime.universe.TypeTag : scala.reflect.ClassTag]: RDD[T] = {
   ...
}

@rayortigas
Author

@rxin I'd love for DataFrames to support it directly... I picked CSV first because the conversion was more straightforward (just a row of primitives). :D

Maybe I'll put together a PR for spark proper that handles more complex objects? I see what ScalaReflection is doing (and I think I saw the latest refactoring), so I'll take a cue from that.

@rayortigas
Author

OK, I opened apache/spark#5713. Thanks for the suggestion @rxin!

@rayortigas rayortigas closed this Apr 27, 2015