Description
Hey @rueian, this is me again.
I was preparing Rueidis-based code for release and suddenly discovered an interesting thing. I ran quite a lot of Go benchmarks to make sure the new Rueidis-based implementation produces better operation latency and better throughput. And it does.
I also expected that migrating to Rueidis would give Centrifugo better CPU utilization, since Rueidis makes fewer memory allocations. And here be dragons.
Before making the release I decided to run macro-benchmarks and found that Centrifugo consumes more CPU than before under equal conditions. Moreover, the Rueidis-based implementation results in more CPU usage on the Redis instance than we had with the previous implementation. I did not expect that at all. To investigate, I made a repo: https://github.com/FZambia/pipelines.
In that repo I implemented three benchmarks: pipelined Redigo, pipelined Go-Redis, and Rueidis. A simplified sketch of the shape of the Rueidis benchmark is shown below.
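To make the setup concrete, here is a minimal sketch of what such a parallel Rueidis benchmark can look like (illustrative only – the function name and the exact command differ from the code in the repo):

```go
package pipelines

import (
	"context"
	"testing"

	"github.com/rueian/rueidis"
)

// BenchmarkRueidisSketch is a simplified stand-in for the Rueidis benchmark in
// the repo: many goroutines issue commands concurrently and Rueidis pipelines
// them automatically over a shared connection.
func BenchmarkRueidisSketch(b *testing.B) {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:  []string{":6379"},
		DisableCache: true,
	})
	if err != nil {
		b.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if err := client.Do(ctx, client.B().Set().Key("key").Value("value").Build()).Error(); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```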
After running benchmarks I observed the following:
(screen recording: input_1.mp4)
```
❯ go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8     12460600     959.6 ns/op    181 B/op    4 allocs/op
BenchmarkGoredis-8     8069197      1534 ns/op    343 B/op    5 allocs/op
BenchmarkRueidis-8    19451470     620.0 ns/op     80 B/op    1 allocs/op
```
Here we can see that CPU usage is:
|                    | Redigo | Goredis | Rueidis |
|--------------------|--------|---------|---------|
| Application CPU, % | 285    | 270     | 470     |
| Redis CPU, %       | 56     | 34      | 80      |
Nothing too special here – all numbers are more or less expected. Rueidis produced better throughput, so it loaded Redis more, and the price for that throughput is application CPU utilization.
But in the Centrifugo case I compared the CPU usage of Redigo and Rueidis under equal conditions. So I added a rate limiter to the benchmarks in the https://github.com/FZambia/pipelines repo to generate the same load in all cases, limiting the load to 100 commands per millisecond (100k per second); a sketch of the limiter is shown below.
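A minimal sketch of how such rate limiting can be wired into a benchmark (the helper below is illustrative, not the exact code from the repo; it uses golang.org/x/time/rate with bursts of 100 tokens refilled at 100k tokens per second, which is 100 commands per millisecond on average):

```go
package pipelines

import (
	"context"
	"testing"

	"golang.org/x/time/rate"
)

// benchmarkLimited is an illustrative helper: every operation first takes a
// token from the limiter, so all three clients are driven at the same rate
// regardless of how fast they could go unthrottled.
func benchmarkLimited(b *testing.B, issueCommand func(ctx context.Context) error) {
	limiter := rate.NewLimiter(rate.Limit(100000), 100) // 100k commands/sec, bursts of 100
	ctx := context.Background()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if err := limiter.Wait(ctx); err != nil {
				b.Error(err)
				return
			}
			if err := issueCommand(ctx); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```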
(screen recording: input_2.mp4)
```
❯ PIPE_LIMITED=1 go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8     1000000    10000 ns/op    198 B/op    5 allocs/op
BenchmarkGoredis-8    1000000    10000 ns/op    350 B/op    8 allocs/op
BenchmarkRueidis-8    1000000    10000 ns/op    113 B/op    2 allocs/op
PASS
ok      github.com/FZambia/pipelines    30.629s
```
|                    | Redigo | Goredis | Rueidis |
|--------------------|--------|---------|---------|
| Application CPU, % | 91     | 96      | 118     |
| Redis CPU, %       | 36     | 34      | 45      |
This is more interesting. We are generating the same load in all benchmarks, but both application and Redis CPU usage are worst in the Rueidis case.
It turned out the difference comes from the batch sizes we are sending to Redis. In the Redigo/Go-Redis case we have larger batches than in the Rueidis case; Rueidis sends smaller batches and thus makes more syscalls, both in the application and on the Redis side. As we can see, CPU usage is very sensitive to this.
There is a project called Twemproxy which acts as a proxy between applications and Redis and batches commands automatically, thus reducing the load on Redis; so in general pipelining is known not only to increase throughput but also to reduce Redis CPU usage. Since Redis is single-threaded, its capacity is actually quite limited.
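For reference, one way to confirm the difference in batch sizes is to wrap the connection with a counting writer and compare bytes per write between clients. The helper below is hypothetical (not part of the repo or of Rueidis), just to illustrate the idea:

```go
package pipelines

import (
	"io"
	"sync/atomic"
)

// countingWriter wraps an io.Writer (e.g. a net.Conn, where each Write is
// roughly one write syscall) and counts calls and bytes, so that the average
// batch size can be computed as bytes/writes.
type countingWriter struct {
	w      io.Writer
	writes int64
	bytes  int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	atomic.AddInt64(&c.writes, 1)
	atomic.AddInt64(&c.bytes, int64(len(p)))
	return c.w.Write(p)
}
```

Larger batches mean fewer syscalls for the same amount of traffic, which is exactly where the CPU difference comes from.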
I tried to find a simple way to improve the batching of Rueidis somehow. The simplest solution I found at this point is this one: main...FZambia:rueidis:GetWriterEachConn
I.e. introducing an option to provide a custom bufio.Writer. I used it like this:
```go
package pipelines

import (
	"bufio"
	"io"
	"log"
	"os"
	"sync"
	"time"

	"github.com/rueian/rueidis" // built against the GetWriterEachConn branch linked above
)

func rueidisClient() rueidis.Client {
	options := rueidis.ClientOption{
		InitAddress:  []string{":6379"},
		DisableCache: true,
	}
	if os.Getenv("PIPE_DELAYED") != "" {
		options.GetWriterEachConn = func(writer io.Writer) (*bufio.Writer, func()) {
			// Wrap the connection writer with a delayWriter that postpones
			// flushes by one millisecond, then give Rueidis a large buffered
			// writer on top of it.
			mlw := newDelayWriter(bufio.NewWriterSize(writer, 1<<19), time.Millisecond)
			w := bufio.NewWriterSize(mlw, 1<<19)
			return w, func() { mlw.close() }
		}
	}
	client, err := rueidis.NewClient(options)
	if err != nil {
		log.Fatal(err)
	}
	return client
}

type writeFlusher interface {
	io.Writer
	Flush() error
}

// delayWriter buffers writes and flushes the underlying writeFlusher at most
// once per delay interval instead of on every write.
type delayWriter struct {
	dst   writeFlusher
	delay time.Duration // zero means to flush immediately

	mu           sync.Mutex // protects tm, flushPending, and dst.Flush
	tm           *time.Timer
	err          error
	flushPending bool
}

func newDelayWriter(dst writeFlusher, delay time.Duration) *delayWriter {
	return &delayWriter{dst: dst, delay: delay}
}

func (m *delayWriter) Write(p []byte) (n int, err error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.err != nil {
		return 0, m.err // report the error of a previously failed flush
	}
	n, err = m.dst.Write(p)
	if m.delay <= 0 {
		err = m.dst.Flush()
		return
	}
	if m.flushPending {
		return
	}
	if m.tm == nil {
		m.tm = time.AfterFunc(m.delay, m.delayedFlush)
	} else {
		m.tm.Reset(m.delay)
	}
	m.flushPending = true
	return
}

func (m *delayWriter) delayedFlush() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.flushPending { // if stop was called but AfterFunc already started this goroutine
		return
	}
	err := m.dst.Flush()
	if err != nil {
		m.err = err
	}
	m.flushPending = false
}

func (m *delayWriter) close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.flushPending = false
	if m.tm != nil {
		m.tm.Stop()
	}
}
```
The code of the delayed writer is inspired by Caddy's code. It basically delays flushes of buffered data to the connection: we sacrifice up to one millisecond of latency for fewer syscalls.
(screen recording: input_3.mp4)
```
❯ PIPE_LIMITED=1 PIPE_DELAYED=1 go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8     1000000    10000 ns/op    198 B/op    5 allocs/op
BenchmarkGoredis-8    1000000    10000 ns/op    350 B/op    8 allocs/op
BenchmarkRueidis-8    1000000    10002 ns/op    114 B/op    2 allocs/op
PASS
ok      github.com/FZambia/pipelines    30.712s
```
|                    | Redigo | Goredis | Rueidis |
|--------------------|--------|---------|---------|
| Application CPU, % | 91     | 96      | 51      |
| Redis CPU, %       | 36     | 34      | 6       |
From these results we can see that with better batching we can reduce both application and Redis CPU usage, since we make fewer read/write syscalls. For Rueidis, the CPU usage of the benchmark process dropped from 118% to 51%, and that of the Redis process from 45% to 6%. An extra millisecond of latency seems tolerable for such a huge resource reduction.
It may be that I missed something, so it would be interesting to hear your opinion on whether you see potential issues with this approach. Note that under a different level of parallelism the results may differ, since batch sizes change, and all libraries in the test may perform better or worse.
I think a resource reduction like this is great to have. In the Centrifugo case, users tend to add more Centrifugo nodes working with a single Redis instance, so the ability to keep Redis CPU as low as possible seems very attractive. Perhaps you can suggest a better approach to achieve this.