From f8044581b35eaa33aa78c29c602fedf1d89c06b8 Mon Sep 17 00:00:00 2001 From: assafmendelson Date: Sun, 18 Jun 2017 09:20:50 +0300 Subject: [PATCH 1/2] File source options for spark 2.1 appeared under File sink --- docs/structured-streaming-programming-guide.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 9b9177d44145f..692b53020e901 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -510,7 +510,12 @@ Here are the details of all the sources in Spark. File source path: path to the input directory, and common to all file formats. -

+
+ maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max) +
+ latestFirst: whether to process the latest new files first, useful when there is a large backlog of files (default: false) +
+
For file-format-specific options, see the related methods in DataStreamReader (Scala/Java/Python/R). @@ -1235,10 +1240,6 @@ Here are the details of all the sinks in Spark. path: path to the output directory, must be specified.
- maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max) -
- latestFirst: whether to processs the latest new files first, useful when there is a large backlog of files (default: false) -
fileNameOnly: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
· "file:///dataset.txt"
From 13ff475f42f22f4bdee4b982c217feb0c8825d57 Mon Sep 17 00:00:00 2001 From: assafmendelson Date: Sun, 18 Jun 2017 09:23:31 +0300 Subject: [PATCH 2/2] Additional File source options for spark 2.2 appeared under File sink --- docs/structured-streaming-programming-guide.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 692b53020e901..d478042dea5c8 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -515,6 +515,14 @@ Here are the details of all the sources in Spark.
latestFirst: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
+ fileNameOnly: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same: +
+ · "file:///dataset.txt"
+ · "s3://a/dataset.txt"
+ · "s3n://a/b/dataset.txt"
+ · "s3a://a/b/c/dataset.txt"
+
+
For file-format-specific options, see the related methods in DataStreamReader (Scala/Java/Python/R). Append path: path to the output directory, must be specified. -
- fileNameOnly: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same: -
- · "file:///dataset.txt"
- · "s3://a/dataset.txt"
- · "s3n://a/b/dataset.txt"
- · "s3a://a/b/c/dataset.txt"
-
+

For file-format-specific options, see the related methods in DataFrameWriter (
Scala/Java/Python/R).
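As context for the `fileNameOnly` option the patch moves under File source: here is a small plain-Python illustration (not Spark code; `file_key` is a hypothetical helper, sketching only the dedup semantics the doc describes) of why the four example URIs are treated as the same file when only the filename is compared:

```python
from urllib.parse import urlparse
import posixpath

def file_key(uri, file_name_only=False):
    """Identity key a file-source-style dedup would use for a seen file."""
    path = urlparse(uri).path
    return posixpath.basename(path) if file_name_only else uri

uris = [
    "file:///dataset.txt",
    "s3://a/dataset.txt",
    "s3n://a/b/dataset.txt",
    "s3a://a/b/c/dataset.txt",
]

# fileNameOnly=false: each full URI is a distinct file (4 keys)...
assert len({file_key(u) for u in uris}) == 4
# ...fileNameOnly=true: all collapse to the shared filename "dataset.txt".
assert {file_key(u, file_name_only=True) for u in uris} == {"dataset.txt"}
```

In Spark itself these source options are passed as strings on the reader, e.g. `spark.readStream.option("latestFirst", "true").option("maxFilesPerTrigger", "10")`.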