docs/programming-guide.md: 11 additions & 10 deletions
@@ -359,8 +359,7 @@ Apart from text files, Spark's Java API also supports several other data formats
 <div data-lang="python" markdown="1">

-PySpark can create distributed datasets from any file system supported by Hadoop, including your local file system, HDFS, KFS, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc.
-The current API is limited to text files, but support for binary Hadoop InputFormats is expected in future versions.
+PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

 Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc. URI) and reads it as a collection of lines. Here is an example invocation:
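The example invocation referenced above lies outside this hunk. As a minimal sketch only, assuming `sc` is the `SparkContext` from the `pyspark` shell and `data.txt` is a hypothetical file, a `textFile` call looks like this:

{% highlight python %}
# `sc` is the SparkContext available in the pyspark shell; "data.txt" is a
# hypothetical local file used only for illustration.
distFile = sc.textFile("data.txt")
distFile.count()  # number of lines in the file
{% endhighlight %}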
@@ -383,8 +382,10 @@ Apart from reading files as a collection of lines,
 ### SequenceFile and Hadoop InputFormats

-In addition to reading text files, PySpark supports reading [SequenceFile](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html)
-and any arbitrary [InputFormat](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html).
+In addition to reading text files, PySpark supports reading ```SequenceFile```
+and any arbitrary ```InputFormat```.
+
+**Note:** this feature is currently marked ```Experimental``` and is intended for advanced users. It may be replaced in the future by read/write support based on SparkSQL, in which case SparkSQL would be the preferred approach.

 #### Writable Support
@@ -409,7 +410,7 @@ PySpark SequenceFile support loads an RDD within Java, and pickles the resulting
 #### Loading SequenceFiles

 Similarly to text files, SequenceFiles can be loaded by specifying the path. The key and value
-classes can be specified, but for standard Writables it should work without requiring this.
+classes can be specified, but for standard Writables this is not required.
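The guide's own SequenceFile example sits between this hunk and the next; only its output tail appears below. A minimal sketch of such a load, assuming `sc` is the shell's `SparkContext` and the path is hypothetical:

{% highlight python %}
# Standard Writables are converted automatically (e.g. DoubleWritable -> float,
# Text -> unicode), so key/value classes usually need not be given.
rdd = sc.sequenceFile("hdfs:///user/hadoop/pairs.seq")  # hypothetical path
rdd.collect()  # e.g. [(1.0, u'aa'), (2.0, u'bb'), ...]
{% endhighlight %}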
@@ -422,7 +423,7 @@ classes can be specified, but for standard Writables it should work without requ
 (1.0, u'aa')]
 {% endhighlight %}

-#### Loading Arbitrary Hadoop InputFormats
+#### Loading Other Hadoop InputFormats

 PySpark can also read any Hadoop InputFormat, for both 'new' and 'old' Hadoop APIs. If required,
 a Hadoop configuration can be passed in as a Python dict. Here is an example using the
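The specific example this sentence introduces is cut off by the hunk boundary, so it is not reproduced here. Purely as an illustrative sketch (not the guide's example), a new-API InputFormat can be read with a Hadoop configuration passed as a Python dict; the config key, path, and classes below are standard Hadoop ones but are assumptions with respect to this guide:

{% highlight python %}
# Read plain text through the new-API TextInputFormat, passing the input
# directory via a Hadoop configuration dict (the path is hypothetical).
conf = {"mapreduce.input.fileinputformat.inputdir": "hdfs:///user/hadoop/input"}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf)
{% endhighlight %}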
@@ -444,19 +445,19 @@ Note that, if the InputFormat simply depends on a Hadoop configuration and/or in
 the key and value classes can easily be converted according to the above table,
 then this approach should work well for such cases.

-If you have custom serialized binary data (like pulling data from Cassandra / HBase) or custom
+If you have custom serialized binary data (such as loading data from Cassandra / HBase) or custom
 classes that don't conform to the JavaBean requirements, then you will first need to
 transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.
 A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided
 for this. Simply extend this trait and implement your transformation code in the ```convert```
-method. The ensure this class is packaged into your Spark job jar and included on the PySpark
+method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, is packaged into your Spark job jar and included on the PySpark
 classpath.
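The `convert` implementation itself is written on the Scala/Java side; from PySpark the converter is referenced only by class name. A hedged sketch of the Python side, in which the converter class names and HBase table are hypothetical:

{% highlight python %}
# The converter classes must already be on the classpath via the job jar.
conf = {"hbase.mapreduce.inputtable": "my_table"}  # hypothetical table name
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="com.example.MyHBaseKeyConverter",     # hypothetical converter
    valueConverter="com.example.MyHBaseValueConverter", # hypothetical converter
    conf=conf)
{% endhighlight %}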

 See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and
 the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/pythonconverters)
-for examples using HBase and Cassandra.
+for examples of using HBase and Cassandra ```InputFormat```.

-Future support for writing data out as SequenceFileOutputFormat and other OutputFormats,
+Future support for writing data out as ```SequenceFileOutputFormat``` and other ```OutputFormats```,