File: docs/learn/documentation/versioned/connectors/kafka.md (96 additions, 72 deletions)
limitations under the License.
-->
### Kafka I/O: QuickStart

Samza offers built-in integration with Apache Kafka for stream processing. A common pattern in Samza applications is to read messages from one or more Kafka topics, process them, and emit the results to other Kafka topics or databases.

The `hello-samza` project includes multiple examples of interacting with Kafka from your Samza jobs. Each example also includes instructions on how to run it and view the results.

- [High-level API Example](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/FilterExample.java) with a corresponding [tutorial](/learn/documentation/{{site.version}}/deployment/yarn.html#starting-your-application-on-yarn)
- [Low-level API Example](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/wikipedia/task/application/WikipediaParserTaskApplication.java) with a corresponding [tutorial](https://github.com/apache/samza-hello-samza#hello-samza)
Samza refers to any I/O source (e.g. Kafka) it interacts with as a _system_, whose properties are set using a corresponding `SystemDescriptor`. The `KafkaSystemDescriptor` allows you to describe the Kafka cluster you are interacting with and specify its properties.

{% highlight java %}
KafkaSystemDescriptor kafkaSystemDescriptor =
    new KafkaSystemDescriptor("kafka").withConsumerZkConnect(KAFKA_CONSUMER_ZK_CONNECT);
{% endhighlight %}
#### KafkaInputDescriptor

A Kafka cluster usually has multiple topics (a.k.a. _streams_). The `KafkaInputDescriptor` allows you to specify the properties of each Kafka topic your application should read from. For each of your input topics, you should create a corresponding instance of `KafkaInputDescriptor` by providing a topic name and a serializer.
{% highlight java %}
KafkaInputDescriptor<PageView> pageViewStreamDescriptor =
    kafkaSystemDescriptor.getInputDescriptor("page-view-topic", new JsonSerdeV2<>(PageView.class));
{% endhighlight %}
68
53
69
-
Below is an example of how to use the descriptors in the describe method of a TaskApplication
54
+
The above example describes an input Kafka stream from the "page-view-topic" which Samza de-serializes into a JSON payload. Samza provides default serializers for common data-types like string, avro, bytes, integer etc.
55
+
56
+
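For instance, a built-in serde can be used directly when the payload is a primitive type. A minimal sketch (the `member-ids` topic name here is illustrative, not from the original examples):

{% highlight java %}
// Hypothetical topic carrying plain-string member IDs, read with the built-in StringSerde.
KafkaInputDescriptor<String> memberIdStreamDescriptor =
    kafkaSystemDescriptor.getInputDescriptor("member-ids", new StringSerde());
{% endhighlight %}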
#### KafkaOutputDescriptor

Similarly, the `KafkaOutputDescriptor` allows you to specify the output streams for your application. For each output topic you write to, you should create an instance of `KafkaOutputDescriptor`.
{% highlight java %}
KafkaOutputDescriptor<DecoratedPageView> decoratedPageView =
    kafkaSystemDescriptor.getOutputDescriptor("my-output-topic", new JsonSerdeV2<>(DecoratedPageView.class));
{% endhighlight %}

### Advanced Configuration

The `KafkaSystemDescriptor` allows you to specify any [Kafka producer](https://kafka.apache.org/documentation/#producerconfigs) or [Kafka consumer](https://kafka.apache.org/documentation/#consumerconfigs) property, which is passed directly to the underlying Kafka client. This allows for precise control over the KafkaProducer and KafkaConsumer used by Samza.
{% highlight java %}
KafkaSystemDescriptor kafkaSystemDescriptor =
    new KafkaSystemDescriptor("kafka").withConsumerZkConnect(..)
        .withProducerBootstrapServers(..)
        .withConsumerConfigs(..)
        .withProducerConfigs(..)
{% endhighlight %}
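As a sketch, the consumer and producer properties can be supplied as plain string maps. The specific property values below are illustrative, not recommendations, and the hostnames are placeholders:

{% highlight java %}
Map<String, String> consumerConfigs = new HashMap<>();
consumerConfigs.put("fetch.min.bytes", "1024");        // passed through to the KafkaConsumer

Map<String, String> producerConfigs = new HashMap<>();
producerConfigs.put("compression.type", "lz4");        // passed through to the KafkaProducer

KafkaSystemDescriptor kafkaSystemDescriptor =
    new KafkaSystemDescriptor("kafka")
        .withConsumerZkConnect(Arrays.asList("localhost:2181"))
        .withProducerBootstrapServers(Arrays.asList("localhost:9092"))
        .withConsumerConfigs(consumerConfigs)
        .withProducerConfigs(producerConfigs);
{% endhighlight %}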

#### Accessing an offset which is out-of-range

This setting determines the behavior if a consumer attempts to read an offset that is outside of the current valid range maintained by the broker. This could happen if the topic does not exist, or if a checkpoint is older than the maximum message history retained by the brokers.
{% highlight java %}
KafkaSystemDescriptor kafkaSystemDescriptor =
    new KafkaSystemDescriptor("kafka").withConsumerZkConnect(..)
        .withProducerBootstrapServers(..)
        .withConsumerAutoOffsetReset("largest")
{% endhighlight %}

##### Ignoring checkpointed offsets

Samza periodically persists the last processed Kafka offsets as a part of its checkpoint. During startup, Samza resumes consumption from the previously checkpointed offsets by default. You can override this behavior and configure Samza to ignore checkpoints with `KafkaInputDescriptor#shouldResetOffset()`. Once there are no checkpoints for a stream, `#withOffsetDefault(..)` determines whether we start consumption from the oldest or newest offset.
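A sketch of such a configuration, reusing the `page-view-topic` descriptor from earlier (the exact `OffsetType` constant is assumed from the Samza API, not shown in the surrounding text):

{% highlight java %}
// Ignore checkpointed offsets for this stream and fall back to the oldest offset.
KafkaInputDescriptor<PageView> pageViewStreamDescriptor =
    kafkaSystemDescriptor.getInputDescriptor("page-view-topic", new JsonSerdeV2<>(PageView.class))
        .shouldResetOffset()
        .withOffsetDefault(OffsetType.OLDEST);
{% endhighlight %}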
The above example configures Samza to ignore checkpointed offsets for `page-view-topic` and consume from the oldest available offset during startup. You can configure this behavior to apply to all topics in the Kafka cluster by using `KafkaSystemDescriptor#withDefaultStreamOffsetDefault`.

### Code walkthrough

In this section, we walk through a complete example.

#### High-level API

{% highlight java %}
// Define coordinates of the Kafka cluster using the KafkaSystemDescriptor
KafkaSystemDescriptor kafkaSystemDescriptor = new KafkaSystemDescriptor("kafka")
{% endhighlight %}
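Putting the descriptors together, a complete `StreamApplication` might look like the following sketch, adapted from the filtering example earlier in this page (`isValidPageView` and `addProfileInformation` are assumed application logic, not Samza APIs):

{% highlight java %}
public class PageViewFilter implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor appDescriptor) {
    // Describe the Kafka cluster and the input/output topics.
    KafkaSystemDescriptor kafkaSystemDescriptor = new KafkaSystemDescriptor("kafka");
    KafkaInputDescriptor<PageView> inputDescriptor =
        kafkaSystemDescriptor.getInputDescriptor("page-view-topic", new JsonSerdeV2<>(PageView.class));
    KafkaOutputDescriptor<DecoratedPageView> outputDescriptor =
        kafkaSystemDescriptor.getOutputDescriptor("my-output-topic", new JsonSerdeV2<>(DecoratedPageView.class));

    // Obtain the streams and define the processing pipeline.
    MessageStream<PageView> pageViews = appDescriptor.getInputStream(inputDescriptor);
    OutputStream<DecoratedPageView> decoratedPageViews = appDescriptor.getOutputStream(outputDescriptor);

    pageViews.filter(this::isValidPageView)
        .map(this::addProfileInformation)
        .sendTo(decoratedPageViews);
  }
}
{% endhighlight %}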
File: docs/learn/documentation/versioned/connectors/overview.md (16 additions, 22 deletions)
-->
Stream processing applications often read data from external sources like Kafka or HDFS. Likewise, they require processed results to be written to external systems or data stores. Samza is pluggable and designed to support a variety of [producers](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/system/SystemProducer.html) and [consumers](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/system/SystemConsumer.html) for your data. You can integrate Samza with any streaming system by implementing the [SystemFactory](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/system/SystemFactory.html) interface.
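As a sketch, a custom integration implements the factory methods on `SystemFactory`; the `MySystemConsumer`, `MySystemProducer`, and `MySystemAdmin` classes here are hypothetical implementations of the corresponding Samza interfaces:

{% highlight java %}
public class MySystemFactory implements SystemFactory {
  @Override
  public SystemConsumer getConsumer(String systemName, Config config, MetricsRegistry registry) {
    // Reads messages from the external system.
    return new MySystemConsumer(config);
  }

  @Override
  public SystemProducer getProducer(String systemName, Config config, MetricsRegistry registry) {
    // Writes messages to the external system.
    return new MySystemProducer(config);
  }

  @Override
  public SystemAdmin getAdmin(String systemName, Config config) {
    // Fetches stream metadata, e.g. partition counts and offsets.
    return new MySystemAdmin(config);
  }
}
{% endhighlight %}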