
Commit 268df7e

Documentation changes per @pwendell comments
1 parent 761269b commit 268df7e

1 file changed: +11 -10 lines changed

docs/programming-guide.md

Lines changed: 11 additions & 10 deletions
@@ -359,8 +359,7 @@ Apart from text files, Spark's Java API also supports several other data formats
 
 <div data-lang="python" markdown="1">
 
-PySpark can create distributed datasets from any file system supported by Hadoop, including your local file system, HDFS, KFS, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc.
-The current API is limited to text files, but support for binary Hadoop InputFormats is expected in future versions.
+PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).
 
 Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
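The example invocation the new text refers to falls outside this hunk. The following is only a minimal sketch of such a call, assuming an existing `SparkContext` named `sc`; the file path is a placeholder.

{% highlight python %}
# Assumes an existing SparkContext `sc`; "data.txt" is a placeholder path.
distFile = sc.textFile("data.txt")

# Each element of the resulting RDD is one line of the file, so ordinary
# transformations and actions work directly on the lines.
totalLength = distFile.map(lambda line: len(line)).reduce(lambda a, b: a + b)
{% endhighlight %}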

@@ -383,8 +382,10 @@ Apart from reading files as a collection of lines,
 
 ### SequenceFile and Hadoop InputFormats
 
-In addition to reading text files, PySpark supports reading [SequenceFile](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html)
-and any arbitrary [InputFormat](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html).
+In addition to reading text files, PySpark supports reading ```SequenceFile```
+and any arbitrary ```InputFormat```.
+
+**Note** this feature is currently marked ```Experimental``` and is intended for advanced users. It may be replaced in future with read/write support based on SparkSQL, in which case SparkSQL is the preferred approach.
 
 #### Writable Support
 
@@ -409,7 +410,7 @@ PySpark SequenceFile support loads an RDD within Java, and pickles the resulting
 #### Loading SequenceFiles
 
 Similarly to text files, SequenceFiles can be loaded by specifying the path. The key and value
-classes can be specified, but for standard Writables it should work without requiring this.
+classes can be specified, but for standard Writables this is not required.
 
 {% highlight python %}
 >>> rdd = sc.sequenceFile("path/to/sequencefile/of/doubles")
@@ -422,7 +423,7 @@ classes can be specified, but for standard Writables it should work without requ
 (1.0, u'aa')]
 {% endhighlight %}
 
-#### Loading Arbitrary Hadoop InputFormats
+#### Loading Other Hadoop InputFormats
 
 PySpark can also read any Hadoop InputFormat, for both 'new' and 'old' Hadoop APIs. If required,
 a Hadoop configuration can be passed in as a Python dict. Here is an example using the
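The hunk cuts off before the example it announces. The sketch below is not that example; it merely illustrates passing a Hadoop configuration as a Python dict to `newAPIHadoopRDD`, using the standard Hadoop `TextInputFormat` classes. The input path is a placeholder, and the configuration key shown is the Hadoop 2 name, which may differ on older Hadoop versions.

{% highlight python %}
# Hedged sketch: read plain text through the 'new' Hadoop API, supplying the
# input directory via a Hadoop configuration dict.
conf = {"mapreduce.input.fileinputformat.inputdir": "hdfs://path/to/input"}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf)
{% endhighlight %}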
@@ -444,19 +445,19 @@ Note that, if the InputFormat simply depends on a Hadoop configuration and/or in
 the key and value classes can easily be converted according to the above table,
 then this approach should work well for such cases.
 
-If you have custom serialized binary data (like pulling data from Cassandra / HBase) or custom
+If you have custom serialized binary data (such as loading data from Cassandra / HBase) or custom
 classes that don't conform to the JavaBean requirements, then you will first need to
 transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.
 A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided
 for this. Simply extend this trait and implement your transformation code in the ```convert```
-method. The ensure this class is packaged into your Spark job jar and included on the PySpark
+method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark
 classpath.
 
 See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and
 the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/pythonconverters)
-for examples using HBase and Cassandra.
+for examples of using HBase and Cassandra ```InputFormat```.
 
-Future support for writing data out as SequenceFileOutputFormat and other OutputFormats,
+Future support for writing data out as ```SequenceFileOutputFormat``` and other ```OutputFormats```,
 is forthcoming.
 
 </div>
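The guide describes the Scala side of the Converter trait but not how a converter is referenced from Python. The sketch below assumes a Spark version whose loader methods accept `keyConverter`/`valueConverter` class-name arguments; the converter class names and table name are hypothetical, while the HBase `InputFormat`, key, and value classes are the standard ones.

{% highlight python %}
# Hedged sketch: the jar containing the converter classes (and the HBase
# client dependencies) is assumed to already be on the PySpark classpath,
# e.g. added when launching pyspark. "org.example.*" and "my_table" are
# placeholders.
conf = {"hbase.mapreduce.inputtable": "my_table"}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.example.ImmutableBytesWritableToStringConverter",
    valueConverter="org.example.HBaseResultToStringConverter",
    conf=conf)
{% endhighlight %}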
