Commit e728516

[SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python

1 parent 8919685

File tree: 1 file changed (+7, −5 lines)

docs/programming-guide.md

Lines changed: 7 additions & 5 deletions
```diff
@@ -377,13 +377,15 @@ Some notes on reading files with Spark:
 
 * The `textFile` method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
 
-Apart from reading files as a collection of lines,
-`SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+Apart from text files, Spark's Python API also supports several other data formats:
 
-### SequenceFile and Hadoop InputFormats
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+
+* `RDD.saveAsPickleFile` and `SparkContext.pickleFile` support saving and reading an RDD in a simple format consisting of pickled Python objects. Batching is used on pickle serialization, with default batch size 10.
 
-In addition to reading text files, PySpark supports reading ```SequenceFile```
-and any arbitrary ```InputFormat```.
+* Details on reading `SequenceFile` and arbitrary Hadoop `InputFormat` are given below.
+
+### SequenceFile and Hadoop InputFormats
 
 **Note** this feature is currently marked ```Experimental``` and is intended for advanced users. It may be replaced in future with read/write support based on SparkSQL, in which case SparkSQL is the preferred approach.
 
```
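The batching mentioned in the new documentation (pickled objects written in groups, with a default batch size of 10) can be sketched with the standard library alone. This is a simplified illustration of the idea, not PySpark's actual on-disk format, and the helper names `save_batched`/`load_batched` are hypothetical:

```python
import pickle

def save_batched(objs, batch_size=10):
    """Serialize a list of objects as a list of pickled blobs.

    Each blob holds up to batch_size objects, mimicking the batching
    idea behind saveAsPickleFile (illustrative only; PySpark writes
    its batches into a Hadoop SequenceFile).
    """
    blobs = []
    for i in range(0, len(objs), batch_size):
        blobs.append(pickle.dumps(objs[i:i + batch_size]))
    return blobs

def load_batched(blobs):
    """Unpickle each blob and flatten the batches back into one list."""
    out = []
    for blob in blobs:
        out.extend(pickle.loads(blob))
    return out

data = list(range(25))
blobs = save_batched(data, batch_size=10)
# 25 objects with batch_size=10 -> batches of 10, 10, and 5
assert len(blobs) == 3
assert load_batched(blobs) == data
```

Batching amortizes per-record serialization overhead: one `pickle.dumps` call covers many small records, which is why a modest default like 10 helps for RDDs of many tiny objects.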

0 commit comments
