Skip to content

Commit daa74be

Browse files
committed
[SPARK-11290][STREAMING] Basic implementation of trackStateByKey
Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons. * Need for more optimized state management that does not scan every key * Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state The high level idea that of this PR * Introduce a new API trackStateByKey that, allows the user to update per-key state, and emit arbitrary records. The new API is necessary as this will have significantly different semantics than the existing updateStateByKey API. This API will have direct support for timeouts. * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will lookup the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the update data and if necessary, with old data. Here is the detailed design doc. Please take a look and provide feedback as comments. https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em This is still WIP. Major things left to be done. - [x] Implement basic functionality of state tracking, with initial RDD and timeouts - [x] Unit tests for state tracking - [x] Unit tests for initial RDD and timeout - [ ] Unit tests for TrackStateRDD - [x] state creating, updating, removing - [ ] emitting - [ ] checkpointing - [x] Misc unit tests for State, TrackStateSpec, etc. - [x] Update docs and experimental tags Author: Tathagata Das <[email protected]> Closes #9256 from tdas/trackStateByKey. (cherry picked from commit 99f5f98) Signed-off-by: Tathagata Das <[email protected]>
1 parent 85bc729 commit daa74be

File tree

10 files changed

+2125
-19
lines changed

10 files changed

+2125
-19
lines changed

examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

Lines changed: 10 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -44,18 +44,6 @@ object StatefulNetworkWordCount {
4444

4545
StreamingExamples.setStreamingLogLevels()
4646

47-
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
48-
val currentCount = values.sum
49-
50-
val previousCount = state.getOrElse(0)
51-
52-
Some(currentCount + previousCount)
53-
}
54-
55-
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
56-
iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
57-
}
58-
5947
val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
6048
// Create the context with a 1 second batch size
6149
val ssc = new StreamingContext(sparkConf, Seconds(1))
@@ -71,9 +59,16 @@ object StatefulNetworkWordCount {
7159
val wordDstream = words.map(x => (x, 1))
7260

7361
// Update the cumulative count using updateStateByKey
74-
// This will give a Dstream made of state (which is the cumulative count of the words)
75-
val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
76-
new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
62+
// This will give a DStream made of state (which is the cumulative count of the words)
63+
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
64+
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
65+
val output = (word, sum)
66+
state.update(sum)
67+
Some(output)
68+
}
69+
70+
val stateDstream = wordDstream.trackStateByKey(
71+
StateSpec.function(trackStateFunc).initialState(initialRDD))
7772
stateDstream.print()
7873
ssc.start()
7974
ssc.awaitTermination()
Lines changed: 193 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,193 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
package org.apache.spark.streaming
19+
20+
import scala.language.implicitConversions
21+
22+
import org.apache.spark.annotation.Experimental
23+
24+
/**
25+
* :: Experimental ::
26+
* Abstract class for getting and updating the tracked state in the `trackStateByKey` operation of
27+
* a [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]] (Scala) or a
28+
* [[org.apache.spark.streaming.api.java.JavaPairDStream JavaPairDStream]] (Java).
29+
*
30+
* Scala example of using `State`:
31+
* {{{
32+
* // A tracking function that maintains an integer state and return a String
33+
* def trackStateFunc(data: Option[Int], state: State[Int]): Option[String] = {
34+
* // Check if state exists
35+
* if (state.exists) {
36+
* val existingState = state.get // Get the existing state
37+
* val shouldRemove = ... // Decide whether to remove the state
38+
* if (shouldRemove) {
39+
* state.remove() // Remove the state
40+
* } else {
41+
* val newState = ...
42+
* state.update(newState) // Set the new state
43+
* }
44+
* } else {
45+
* val initialState = ...
46+
* state.update(initialState) // Set the initial state
47+
* }
48+
* ... // return something
49+
* }
50+
*
51+
* }}}
52+
*
53+
* Java example:
54+
* {{{
55+
* TODO(@zsxwing)
56+
* }}}
57+
*/
58+
@Experimental
59+
sealed abstract class State[S] {
60+
61+
/** Whether the state already exists */
62+
def exists(): Boolean
63+
64+
/**
65+
* Get the state if it exists, otherwise it will throw `java.util.NoSuchElementException`.
66+
* Check with `exists()` whether the state exists or not before calling `get()`.
67+
*
68+
* @throws java.util.NoSuchElementException If the state does not exist.
69+
*/
70+
def get(): S
71+
72+
/**
73+
* Update the state with a new value.
74+
*
75+
* State cannot be updated if it has been already removed (that is, `remove()` has already been
76+
* called) or it is going to be removed due to timeout (that is, `isTimingOut()` is `true`).
77+
*
78+
* @throws java.lang.IllegalArgumentException If the state has already been removed, or is
79+
* going to be removed
80+
*/
81+
def update(newState: S): Unit
82+
83+
/**
84+
* Remove the state if it exists.
85+
*
86+
* State cannot be updated if it has been already removed (that is, `remove()` has already been
87+
* called) or it is going to be removed due to timeout (that is, `isTimingOut()` is `true`).
88+
*/
89+
def remove(): Unit
90+
91+
/**
92+
* Whether the state is timing out and going to be removed by the system after the current batch.
93+
* This timeout can occur if timeout duration has been specified in the
94+
* [[org.apache.spark.streaming.StateSpec StatSpec]] and the key has not received any new data
95+
* for that timeout duration.
96+
*/
97+
def isTimingOut(): Boolean
98+
99+
/**
100+
* Get the state as an [[scala.Option]]. It will be `Some(state)` if it exists, otherwise `None`.
101+
*/
102+
@inline final def getOption(): Option[S] = if (exists) Some(get()) else None
103+
104+
@inline final override def toString(): String = {
105+
getOption.map { _.toString }.getOrElse("<state not set>")
106+
}
107+
}
108+
109+
/** Internal implementation of the [[State]] interface */
110+
private[streaming] class StateImpl[S] extends State[S] {
111+
112+
private var state: S = null.asInstanceOf[S]
113+
private var defined: Boolean = false
114+
private var timingOut: Boolean = false
115+
private var updated: Boolean = false
116+
private var removed: Boolean = false
117+
118+
// ========= Public API =========
119+
override def exists(): Boolean = {
120+
defined
121+
}
122+
123+
override def get(): S = {
124+
if (defined) {
125+
state
126+
} else {
127+
throw new NoSuchElementException("State is not set")
128+
}
129+
}
130+
131+
override def update(newState: S): Unit = {
132+
require(!removed, "Cannot update the state after it has been removed")
133+
require(!timingOut, "Cannot update the state that is timing out")
134+
state = newState
135+
defined = true
136+
updated = true
137+
}
138+
139+
override def isTimingOut(): Boolean = {
140+
timingOut
141+
}
142+
143+
override def remove(): Unit = {
144+
require(!timingOut, "Cannot remove the state that is timing out")
145+
require(!removed, "Cannot remove the state that has already been removed")
146+
defined = false
147+
updated = false
148+
removed = true
149+
}
150+
151+
// ========= Internal API =========
152+
153+
/** Whether the state has been marked for removing */
154+
def isRemoved(): Boolean = {
155+
removed
156+
}
157+
158+
/** Whether the state has been been updated */
159+
def isUpdated(): Boolean = {
160+
updated
161+
}
162+
163+
/**
164+
* Update the internal data and flags in `this` to the given state option.
165+
* This method allows `this` object to be reused across many state records.
166+
*/
167+
def wrap(optionalState: Option[S]): Unit = {
168+
optionalState match {
169+
case Some(newState) =>
170+
this.state = newState
171+
defined = true
172+
173+
case None =>
174+
this.state = null.asInstanceOf[S]
175+
defined = false
176+
}
177+
timingOut = false
178+
removed = false
179+
updated = false
180+
}
181+
182+
/**
183+
* Update the internal data and flags in `this` to the given state that is going to be timed out.
184+
* This method allows `this` object to be reused across many state records.
185+
*/
186+
def wrapTiminoutState(newState: S): Unit = {
187+
this.state = newState
188+
defined = true
189+
timingOut = true
190+
removed = false
191+
updated = false
192+
}
193+
}

0 commit comments

Comments
 (0)