Commit beed631

Merge remote-tracking branch 'upstream/master' into expr_binary_log

2 parents: 6089d11 + cb7ada1

File tree

9 files changed: +162 -173 lines changed

docs/streaming-custom-receivers.md

Lines changed: 12 additions & 14 deletions
@@ -4,7 +4,7 @@ title: Spark Streaming Custom Receivers
 ---
 
 Spark Streaming can receive streaming data from any arbitrary data source beyond
-the one's for which it has in-built support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.).
+the ones for which it has built-in support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.).
 This requires the developer to implement a *receiver* that is customized for receiving data from
 the concerned data source. This guide walks through the process of implementing a custom receiver
 and using it in a Spark Streaming application. Note that custom receivers can be implemented
@@ -21,15 +21,15 @@ A custom receiver must extend this abstract class by implementing two methods
 - `onStop()`: Things to do to stop receiving data.
 
 Both `onStart()` and `onStop()` must not block indefinitely. Typically, `onStart()` would start the threads
-that responsible for receiving the data and `onStop()` would ensure that the receiving by those threads
+that are responsible for receiving the data, and `onStop()` would ensure that these threads receiving the data
 are stopped. The receiving threads can also use `isStopped()`, a `Receiver` method, to check whether they
 should stop receiving data.
 
 Once the data is received, that data can be stored inside Spark
 by calling `store(data)`, which is a method provided by the Receiver class.
-There are number of flavours of `store()` which allow you store the received data
-record-at-a-time or as whole collection of objects / serialized bytes. Note that the flavour of
-`store()` used to implemented a receiver affects its reliability and fault-tolerance semantics.
+There are a number of flavors of `store()` which allow one to store the received data
+record-at-a-time or as whole collection of objects / serialized bytes. Note that the flavor of
+`store()` used to implement a receiver affects its reliability and fault-tolerance semantics.
 This is discussed [later](#receiver-reliability) in more detail.
 
 Any exception in the receiving threads should be caught and handled properly to avoid silent
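
For orientation, the pattern this hunk describes can be sketched as a minimal receiver: `onStart()` spawns a receiving thread without blocking, the thread polls `isStopped()`, each record is handed to Spark via `store()`, and any exception is reported through `restart()`. This is a sketch only; `nextRecord()` is a hypothetical stand-in for the actual source I/O:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    // Must not block: start a background thread and return immediately.
    new Thread("Sketch Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // Nothing to do here: the receiving thread watches isStopped() and exits on its own.
  }

  private def receive() {
    try {
      while (!isStopped()) {
        store(nextRecord())  // hand each received record to Spark
      }
    } catch {
      // Surface the error and let Spark decide whether to restart the receiver.
      case t: Throwable => restart("Error receiving data", t)
    }
  }

  // Hypothetical blocking read from the underlying source.
  private def nextRecord(): String = ???
}
```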
@@ -60,7 +60,7 @@ class CustomReceiver(host: String, port: Int)
 
   def onStop() {
     // There is nothing much to do as the thread calling receive()
-    // is designed to stop by itself isStopped() returns false
+    // is designed to stop by itself if isStopped() returns false
   }
 
   /** Create a socket connection and receive data until receiver is stopped */
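
The hunk above only corrects a comment in `onStop()`; for context, the `receive()` method it refers to reads lines from a socket until the receiver is stopped. A sketch along the lines of the guide's example, with reconnection handled via `restart()` (the `host` and `port` fields come from the class constructor shown in the hunk header):

```scala
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
  try {
    val socket = new java.net.Socket(host, port)
    val reader = new java.io.BufferedReader(
      new java.io.InputStreamReader(socket.getInputStream, "UTF-8"))
    var line = reader.readLine()
    while (!isStopped && line != null) {
      store(line)  // store records one at a time
      line = reader.readLine()
    }
    reader.close()
    socket.close()
    // The connection closed; ask Spark to restart the receiver and reconnect.
    restart("Trying to connect again")
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      restart("Error receiving data", t)
  }
}
```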
@@ -123,7 +123,7 @@ public class JavaCustomReceiver extends Receiver<String> {
 
   public void onStop() {
     // There is nothing much to do as the thread calling receive()
-    // is designed to stop by itself isStopped() returns false
+    // is designed to stop by itself if isStopped() returns false
   }
 
   /** Create a socket connection and receive data until receiver is stopped */
@@ -167,7 +167,7 @@ public class JavaCustomReceiver extends Receiver<String> {
 
 The custom receiver can be used in a Spark Streaming application by using
 `streamingContext.receiverStream(<instance of custom receiver>)`. This will create
-input DStream using data received by the instance of custom receiver, as shown below
+an input DStream using data received by the instance of custom receiver, as shown below:
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1" >
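
To make the wiring concrete, here is a minimal driver sketch using `receiverStream()`; `CustomReceiver` is the class from this guide, while the app name, host, and port are placeholder values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomReceiverApp")
    val ssc = new StreamingContext(conf, Seconds(1))

    // receiverStream() turns the custom receiver into an input DStream.
    val lines = ssc.receiverStream(new CustomReceiver("localhost", 9999))
    lines.flatMap(_.split(" ")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```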
@@ -206,22 +206,20 @@ there are two kinds of receivers based on their reliability and fault-tolerance
    and stored in Spark reliably (that is, replicated successfully). Usually,
    implementing this receiver involves careful consideration of the semantics of source
    acknowledgements.
-1. *Unreliable Receiver* - These are receivers for unreliable sources that do not support
-   acknowledging. Even for reliable sources, one may implement an unreliable receiver that
-   do not go into the complexity of acknowledging correctly.
+1. *Unreliable Receiver* - An *unreliable receiver* does *not* send acknowledgement to a source. This can be used for sources that do not support acknowledgement, or even for reliable sources when one does not want or need to go into the complexity of acknowledgement.
 
 To implement a *reliable receiver*, you have to use `store(multiple-records)` to store data.
-This flavour of `store` is a blocking call which returns only after all the given records have
+This flavor of `store` is a blocking call which returns only after all the given records have
 been stored inside Spark. If the receiver's configured storage level uses replication
 (enabled by default), then this call returns after replication has completed.
 Thus it ensures that the data is reliably stored, and the receiver can now acknowledge the
-source appropriately. This ensures that no data is caused when the receiver fails in the middle
+source appropriately. This ensures that no data is lost when the receiver fails in the middle
 of replicating data -- the buffered data will not be acknowledged and hence will be later resent
 by the source.
 
 An *unreliable receiver* does not have to implement any of this logic. It can simply receive
 records from the source and insert them one-at-a-time using `store(single-record)`. While it does
-not get the reliability guarantees of `store(multiple-records)`, it has the following advantages.
+not get the reliability guarantees of `store(multiple-records)`, it has the following advantages:
 
 - The system takes care of chunking that data into appropriate sized blocks (look for block
   interval in the [Spark Streaming Programming Guide](streaming-programming-guide.html)).
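
The store-then-acknowledge ordering that distinguishes a reliable receiver can be sketched as follows. `SourceClient`, `fetchBatch()`, and `ack()` are hypothetical stand-ins for a real acknowledging source; the essential point is that `ack()` runs only after the blocking bulk `store()` has returned:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical client for a source that supports acknowledgement.
trait SourceClient {
  def fetchBatch(): ArrayBuffer[String]      // blocking fetch of the next batch
  def ack(batch: ArrayBuffer[String]): Unit  // acknowledge a delivered batch
}

class ReliableSketchReceiver(client: SourceClient)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    new Thread("Reliable Sketch Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() { }

  private def receive() {
    while (!isStopped()) {
      val batch = client.fetchBatch()
      // Blocking store: returns only after the whole batch is stored in Spark
      // (and replicated, if the configured storage level replicates).
      store(batch)
      // Only now is it safe to acknowledge: if the receiver had failed before
      // this point, the unacknowledged batch would be resent by the source.
      client.ack(batch)
    }
  }
}
```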
