mstopa-splunk · mstopa-splunk · Aug 29, 2024 · Aug 30, 2024 · Aug 30, 2024 · Sep 3, 2024
diff --git a/docs/architecture.md b/docs/architecture.md
diff --git a/docs/architecture/detect_and_troubleshoot.md b/docs/architecture/detect_and_troubleshoot.md
@@ -0,0 +1 @@
+`sudo tcpdump -n -s 0 -S -i any -A -v 'port 514 and (tcp or udp)'`
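
A packet capture shows what reaches the interface; kernel counters can hint at whether datagrams were dropped before the receiver read them. The following is a minimal sketch, assuming a Linux host that exposes `/proc/net/snmp`. The helper name `udp_counters` and the sample text are hypothetical and not part of SC4S. A growing `RcvbufErrors` value suggests that the UDP receive buffer is overflowing.

```python
def udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a dict of ints."""
    # The file holds one header row and one value row per protocol.
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp header/value rows found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, map(int, values)))


# Hypothetical sample; on a real host use open("/proc/net/snmp").read()
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 15 40 15 0\n"
)

counters = udp_counters(SAMPLE)
print(counters["InErrors"], counters["RcvbufErrors"])
```

Sampling the counters before and after a test run and comparing the two readings gives a rough indication of host-side UDP loss.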
diff --git a/docs/architecture/finetuning_for_tcp.md b/docs/architecture/finetuning_for_tcp.md
diff --git a/docs/architecture/finetuning_for_udp.md b/docs/architecture/finetuning_for_udp.md
diff --git a/docs/architecture/ha.md b/docs/architecture/ha.md
@@ -0,0 +1,3 @@
+Load balancing for high availability does not work well for stateless, unacknowledged syslog traffic. More data is preserved when you use a simpler design, such as vMotioned VMs. With syslog, the protocol itself is prone to loss, and syslog data collection can be made "mostly available" at best.
+
+The best deployment model for high availability is a [MicroK8s](https://microk8s.io/)-based deployment with MetalLB in BGP mode. This model uses a special class of load balancer that is implemented as destination network address translation (DNAT).
diff --git a/docs/architecture/lb.md b/docs/architecture/lb.md
@@ -0,0 +1,21 @@
+# Load balancers are not a best practice for SC4S
+In syslog ingestion systems, load balancers are usually used for horizontal scaling and high availability.
+
+It is a best practice to avoid load balancing in both cases. Instead of scaling horizontally, use a single, robust server. For high availability, choose a shared-IP cluster instead.
+
+Although load balancers (LBs) are neither recommended nor supported, they remain popular among SC4S users. This section discusses various LB solutions and their possible setups, together with well-known issues.
+
+## General considerations regarding load balancers
+When using load balancers, it is recommended to:
+- Preserve the actual source IP of the sending machine. The default behavior of L4 LBs is to overwrite the source IP from the client’s IP to their own.
+- For high availability, use an LB solution that offers an HA mode.
+
+Load balancing setup differs for TCP/TLS and UDP.
+
+For TCP/TLS:
+- There are two ways of preserving the source IP: using the "PROXY" protocol or IP transparency (DNAT configuration).
+- For the "PROXY" configuration, make sure to enable it on the SC4S side with `SC4S_SOURCE_PROXYCONNECT=yes`.
+- TCP/TLS load balancers do not account for the load carried by individual connections and are frequently biased toward one instance. Vertically scale all members in a single resource pool to accommodate the full workload.
+
+For UDP:
+- Load balancers for UDP can only use DNAT, for example with DSR (Direct Server Return).
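
The PROXY protocol preserves the client address by prepending a short header to each TCP connection. The following is a minimal sketch of reading that header, assuming the balancer emits the version 1 (text) form. The function name `parse_proxy_v1` and the addresses are hypothetical, and this is not how SC4S or syslog-ng implements it.

```python
def parse_proxy_v1(header: bytes):
    """Parse a PROXY protocol v1 line.

    Returns (src_ip, src_port, dst_ip, dst_port), or None for UNKNOWN.
    """
    if not header.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = header[:-2].decode("ascii").split(" ")
    if len(parts) < 2 or parts[0] != "PROXY":
        raise ValueError("not a PROXY v1 header")
    if parts[1] == "UNKNOWN":
        return None  # the proxy could not determine the addresses
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY v1 header")
    # Field order: protocol, source address, destination address,
    # source port, destination port.
    _, _, src_ip, dst_ip, src_port, dst_port = parts
    return src_ip, int(src_port), dst_ip, int(dst_port)


# The receiver reads this line first, then treats the rest as syslog data.
example = b"PROXY TCP4 192.0.2.10 10.0.0.5 51234 514\r\n"
print(parse_proxy_v1(example))
```

The first element of the result is the address of the original sender, which is the value the receiving side needs in order to attribute events to the right host.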
diff --git a/docs/architecture/nginx.md b/docs/architecture/nginx.md
@@ -0,0 +1,178 @@
+# Nginx
+While load balancing syslog with Nginx Open Source is neither recommended nor supported by Splunk, it is still a "good enough" solution for some customers.
+
+Note the main disadvantages of Nginx Open Source:
+- Nginx Open Source offers no high-availability mode, so an Nginx LB becomes a new single point of failure.
+- Even with round-robin balancing, traffic distribution is often biased, which overloads some instances in the pool. The resulting queue growth leads to delays, data drops, and memory and disk issues.
+- Nginx Open Source doesn't provide active health checking, which is crucial for UDP DSR (Direct Server Return) load balancing.
+
+## Install Nginx
+1. Refer to Nginx documentation for instructions on installing Nginx **with the stream module**, which is necessary for TCP/UDP load balancing. For example on Ubuntu:
+```bash
+sudo apt update
+sudo apt -y install nginx libnginx-mod-stream
+```
+
+2. In the main Nginx configuration, update the `events` section to increase performance, for example:
+`/etc/nginx/nginx.conf`
+```conf
+events {
+    worker_connections 20480;
+    multi_accept on;
+    use epoll;
+}
+```
+
+## Preserving source IP
+| Method                     | Protocol   |
+|----------------------------|------------|
+| PROXY protocol             | TCP/TLS    |
+| Transparent IP             | TCP/TLS    |
+| Direct Server Return (DSR) | UDP        |
+
+## Option 1: Configure Nginx with the PROXY protocol
+Advantages:
+- easy to set up
+
+Disadvantages:
+- worse performance
+- available only for TCP/TLS, not available for UDP
+- overwriting the source IP in syslog-ng is not ideal. SOURCEIP is a hard macro and only HOST can be overwritten
+- overwriting the source IP is available only in SC4S>3.4.0
+
+1. On your LB node add a configuration similar to the following:
+`/etc/nginx/modules-enabled/sc4s.conf`
+```conf
+stream {
+    # Define upstream for each of SC4S hosts and ports
+    # Default SC4S TCP ports are 514, 601, 5425, 6514
+    # Include also your custom ports
+    upstream stream_syslog_514 {
+        server <SC4S_IP_1>:514;
+        server <SC4S_IP_2>:514;
+    }
+    upstream stream_syslog_601 {
+        server <SC4S_IP_1>:601;
+        server <SC4S_IP_2>:601;
+    }
+    upstream stream_syslog_5425 {
+        server <SC4S_IP_1>:5425;
+        server <SC4S_IP_2>:5425;
+    }
+    upstream stream_syslog_6514 {
+        server <SC4S_IP_1>:6514;
+        server <SC4S_IP_2>:6514;
+    }
+
+    # Map each listening port to its upstream so that one server block can handle all plain TCP ports
+    map $server_port $upstream_name {
+        514   stream_syslog_514;
+        601   stream_syslog_601;
+        5425  stream_syslog_5425;
+    }
+
+    # Define a virtual server for each upstream connection
+    # make sure to set 'proxy_protocol' to 'on'
+    server {
+        listen        514;
+        listen        601;
+        listen        5425;
+        proxy_pass    $upstream_name;
+
+        proxy_timeout 3s;
+        proxy_connect_timeout 3s;
+
+        proxy_protocol on;
+    }
+
+    server {
+        listen        6514;
+        proxy_pass    stream_syslog_6514;
+
+        proxy_timeout 3s;
+        proxy_connect_timeout 3s;
+
+        proxy_protocol on;
+
+        proxy_ssl on;
+    }
+}
+```
+2. Refer to Nginx documentation to find the command to reload the service, for example `sudo nginx -s reload`.
+3. Add the following parameter to SC4S configuration and restart your instances:
+`/opt/sc4s/env_file`
+```conf
+SC4S_SOURCE_PROXYCONNECT=yes
+```
+
+### Test your setup
+1. Send TCP/TLS messages to the load balancer and ensure that they are being correctly received in Splunk with the correct host IP:
+```bash
+echo "hello world" | netcat <LB_IP> 514
+```
+
+2. Run performance tests based on [Check TCP Performance](tcp_performance_tests.md):
+
+| Receiver                   | Performance                   |
+|----------------------------|-------------------------------|
+| Single SC4S Server         | 4,341,000 (71,738.98 msg/sec) |
+| Load Balancer + 2 Servers  | 5,996,000 (99,089.03 msg/sec) |
+
+
+## Option 2: Configure Nginx with DSR (Direct Server Return)
+Advantages:
+- works for UDP
+- more efficient (saves one hop)
+
+Disadvantages:
+- DSR setup requires active health checks because the LB cannot expect responses from the upstream. Active health checks are not available in Nginx Open Source. Switch to Nginx Plus or implement your own active health checking
+- requires superuser privileges
+- cloud deployments might require disabling Source/Destination Checking (tested with AWS)
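
One possible shape for a self-made active health check is sketched below. It rests on an assumption: the SC4S hosts also listen on TCP port 514, so a successful TCP connection is used as a stand-in for liveness, because UDP itself returns no response. The names `is_reachable` and `healthy_members` are hypothetical, and the sketch does not update the Nginx upstream list for you.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def healthy_members(members, port=514):
    """Filter upstream hosts down to those that accept TCP connections."""
    return [host for host in members if is_reachable(host, port)]


# Example with placeholder addresses:
# healthy_members(["<SC4S_IP_1>", "<SC4S_IP_2>"])
```

A scheduler such as cron or a systemd timer could run this periodically and regenerate the upstream block when the list of healthy members changes.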
+
+1. In the main Nginx configuration, set `user` to `root`, for example:
+`/etc/nginx/nginx.conf`
+```conf
+user root;
+```
+
+2. Add a configuration similar to the following:
+`/etc/nginx/modules-enabled/sc4s.conf`
+```conf
+stream {
+    # Define upstream for each of SC4S hosts and ports
+    # Default SC4S UDP port is 514
+    # Include also your custom ports
+    upstream stream_syslog_514 {
+        server <SC4S_IP_1>:514;
+        server <SC4S_IP_2>:514;
+    }
+
+    # Define connections to each of your upstreams.
+    # Make sure to include `proxy_bind` and `proxy_responses 0`.
+    server {
+        listen        514 udp;
+        proxy_pass    stream_syslog_514;
+
+        proxy_bind $remote_addr:$remote_port transparent;
+        proxy_responses 0;
+    }
+}
+```
+
+3. Refer to Nginx documentation to find the command to reload the service, for example `sudo nginx -s reload`.
+
+4. If you run on AWS, disable `Source/Destination Checking` on your LB's host.
+
+### Test your setup
+1. Send UDP messages to the load balancer and ensure that they are being correctly received in Splunk with the correct host IP:
+```bash
+echo "hello world" > /dev/udp/<LB_IP>/514
+```
+
+2. Run performance tests
+
+| Receiver / Drop rate at EPS (msgs/sec)   | 4,500  | 9,000  | 27,000 | 50,000 | 150,000 | 300,000 |
+|------------------------------------------|--------|--------|--------|--------|---------|---------|
+| Single SC4S Server                       | 0.33%  | 1.24%  | 52.31% | 74.71% |    --   |    --   |
+| Load Balancer + 2 Servers                | 1%     | 1.19%  | 6.11%  | 47.64% |    --   |    --   |
+| Single Finetuned SC4S Server             | 0%     | 0%     | 0%     | 0%     |  47.37% |    --   |
+| Load Balancer + 2 Finetuned Servers      | 0.98%  | 1.14%  | 1.05%  | 1.16%  |  3.56%  |  55.54% |
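
For the UDP test in step 1, a message that follows the traditional BSD syslog layout can be easier to trace than a bare string. The following is a minimal sketch, assuming an RFC 3164 style line; the hostname `test-host`, the tag `sc4s-test`, and both function names are hypothetical.

```python
import socket
from datetime import datetime


def build_syslog_message(text, hostname="test-host", when=None,
                         facility=16, severity=6):
    """Build a BSD-style (RFC 3164) syslog line as bytes."""
    when = when or datetime.now()
    pri = facility * 8 + severity  # 16 * 8 + 6 = 134 (local0.info)
    # RFC 3164 timestamps pad single-digit days with a space, not a zero.
    timestamp = f"{when:%b} {when.day:>2} {when:%H:%M:%S}"
    return f"<{pri}>{timestamp} {hostname} sc4s-test: {text}".encode("utf-8")


def send_udp(message, host, port=514):
    """Send one datagram; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))


# Example (replace the placeholder with your load balancer address):
# send_udp(build_syslog_message("hello world"), "<LB_IP>")
```

Because UDP is unacknowledged, a successful `send_udp` call only means the datagram left the sender; confirm delivery by searching for the event in Splunk.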
diff --git a/docs/architecture/performance_tests.md b/docs/architecture/performance_tests.md
diff --git a/docs/architecture/recommendations.md b/docs/architecture/recommendations.md
@@ -0,0 +1,62 @@
+# Architectural Considerations
+
+Building a performant, highly available (HA), and scalable syslog ingestion system is a non-trivial task.
+
+The syslog protocol design prioritizes speed and efficiency, which can occur at the expense of resiliency and reliability. Because of these tradeoffs, traditional methods to provide scale and resiliency do not necessarily transfer to syslog.
+
+## Syslog Architecture recommendations
+The following subsections provide recommendations and suggestions for planning your syslog ingestion system based on SC4S.
+
+### Recommended system design sequence
+1. Locate your SC4S server
+2. Choose your optimal hardware setup
+3. Fine-tune your SC4S instance
+4. Monitor and troubleshoot
+5. Build a high-availability architecture
+
+#### Locate your SC4S server
+Syslog is a "send and forget" protocol, and it does not perform well when routed through substantial network infrastructure.
+
+For centrally located syslog servers, we often observe both UDP and TCP traffic problems and data loss.
+
+Instead, provide for edge collection. Keep the client and server a few hops, ideally one hop, away from each other. Syslog should not cross a WAN, and the chance of failure increases with the number of Layer 4 devices in the path, including TCP/UDP load balancers.
+
+#### Choose your optimal hardware setup
+Hardware specification is a crucial part of designing a performant syslog ingestion system. See [Choose Your Hardware Setup](hardware.md).
+
+#### Choose between UDP and TCP and fine-tune SC4S
+While UDP is the protocol traditionally recommended for syslog, TCP is also an option provided by the standard and many vendors.
+
+UDP keeps network load low because it requires no receipt verification or window adjustment. TCP uses acknowledgement signals (ACKs) to avoid data loss; however, loss can still occur when:
+
+* The TCP session is closed: Events published while the system is creating a new session are lost. 
+* The remote side is busy and cannot send an acknowledgement signal fast enough: Events are lost due to a full local buffer.
+* A single acknowledgement signal is lost by the network and the client closes the connection: Local and remote buffer are lost.
+* The remote server restarts for any reason: Local buffer is lost.
+* The remote server restarts without closing the connection: Local buffer plus timeout time are lost.
+* The client side restarts without closing the connection.
+* Increased overhead on the network can lead to loss.
+
+For example, use TCP only if the syslog event is larger than the maximum size of a UDP packet on your network (typically limited to Web Proxy, DLP, and IDS type sources).
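
The size rule above can be checked with a short calculation. The following is a minimal sketch under stated assumptions: "maximum size of the UDP packet" is read as the largest payload that fits in one unfragmented IPv4 packet, the IPv4 header carries no options, and the MTU defaults to the common Ethernet value of 1500 bytes. The function names are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu=1500):
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def needs_tcp(event: bytes, mtu=1500):
    """Return True if the event would not fit in a single UDP datagram."""
    return len(event) > max_udp_payload(mtu)


print(max_udp_payload())          # 1472 with the default MTU
print(needs_tcp(b"x" * 4000))     # True: larger than one datagram
```

Sources that regularly emit events above this threshold are candidates for TCP; the rest can stay on UDP.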
+
+Depending on your choice, check some or all of the following subsections:
+- [Check UDP Performance](udp_performance_tests.md)
+- [Finetuning for UDP](finetuning_for_udp.md)
+- [Check TCP Performance](tcp_performance_tests.md)
+- [Finetuning for TCP](finetuning_for_tcp.md)
+
+#### Avoid load balancers in front of SC4S
+It is common to see syslog designs with various load balancers distributing traffic to multiple SC4S instances.
+
+We are aware of the popularity of this solution. The [Load Balancers](lb.md) section documents best practices for load balancers, as well as the requirements and challenges of load balancing syslog.
+
+However, Splunk does not support architectures utilizing load balancers for scaling.
+
+As a best practice, do not co-locate syslog servers for horizontal scale and do not load balance to them with a front-side load balancer. Instead, make sure that every SC4S instance in your HA cluster can accommodate the full workload.
+
+For the reasons behind this recommendation, see the [Load Balancers](lb.md) section.
+
+#### Monitor and troubleshoot
+
+#### Build a high-availability architecture
+Load balancing for high availability does not work well for stateless, unacknowledged syslog traffic. More data is preserved when you use a simpler design, such as vMotioned VMs. With syslog, the protocol itself is prone to loss, and syslog data collection can be made "mostly available" at best.
diff --git a/docs/lb.md b/docs/lb.md
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -33,8 +33,7 @@ theme:
 
 nav:
   - Home: "index.md"
-  - Architectural Considerations: "architecture.md"
-  - Load Balancers: "lb.md"
+
   - Getting Started:
       - Read First: "gettingstarted/index.md"
       - Quickstart Guide: "gettingstarted/quickstart_guide.md"
@@ -59,7 +58,23 @@ nav:
       - Read First: "sources/index.md"
       - Basic Onboarding: "sources/base"
       - Known Vendors: "sources/vendor"
-  - Performance: "performance.md"
+
+  - High Availability and Scalability:
+      - Architectural Recommendations: "architecture/recommendations.md"
+      - SC4S Performance:
+        - Check UDP Performance: "architecture/udp_performance_tests.md"
+        - Check TCP Performance: "architecture/tcp_performance_tests.md"
+        - Detect and Troubleshoot Data Losses: "architecture/detect_and_troubleshoot.md"
+      - Choose Your Hardware Setup: "architecture/hardware.md"
+      - Vertical Scaling:
+        - Finetuning for UDP: "architecture/finetuning_for_udp.md"
+        - Finetuning for TCP: "architecture/finetuning_for_tcp.md"
+      - High Availability:
+        - Recommendations: "architecture/ha.md"
+      - Load Balancers:
+        - Recommendations: "architecture/lb.md"
+        - Nginx: "architecture/nginx.md"
+
   - SC4S Lite (Experimental):
       - Intro: "lite.md"
       - Pluggable modules: "pluggable_modules.md"