
Write batching strategy #156

Open

@FZambia

Hey @rueian, this is me again.

I was preparing the Rueidis-based code for release and suddenly discovered an interesting thing. I did quite a lot of Go benchmarks to make sure the new Rueidis-based implementation gives better operation latency and better throughput. And it does.

I also expected that migrating to Rueidis would give Centrifugo better CPU utilization, since Rueidis makes fewer memory allocations. And here be dragons.

Before making the release I decided to run macro-benchmarks and found that Centrifugo consumes more CPU than before under equal conditions. Moreover, the Rueidis-based implementation results in more CPU usage on the Redis instance than the previous implementation did. I did not expect that at all. To investigate, I made a repo: https://github.com/FZambia/pipelines.

In that repo I implemented three benchmarks: pipelined Redigo, pipelined Go-Redis, and Rueidis.
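
To give an idea of the shape of these benchmarks, here is a minimal sketch of the Rueidis one (the command, key and import path are my assumptions, not the actual code from the repo):

import (
	"context"
	"testing"

	"github.com/rueian/rueidis"
)

func BenchmarkRueidis(b *testing.B) {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:  []string{":6379"},
		DisableCache: true,
	})
	if err != nil {
		b.Fatal(err)
	}
	defer client.Close()
	ctx := context.Background()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Commands issued concurrently from many goroutines are
			// pipelined by Rueidis onto the shared connection automatically.
			if err := client.Do(ctx, client.B().Set().Key("key").Value("value").Build()).Error(); err != nil {
				b.Error(err)
			}
		}
	})
}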

After running benchmarks I observed the following:

❯ go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8    	12460600	       959.6 ns/op	     181 B/op	       4 allocs/op
BenchmarkGoredis-8   	 8069197	      1534 ns/op	     343 B/op	       5 allocs/op
BenchmarkRueidis-8   	19451470	       620.0 ns/op	      80 B/op	       1 allocs/op

Here we can see that CPU usage is:

                    Redigo    Goredis    Rueidis
Application CPU, %  285       270        470
Redis CPU, %        56        34         80

Nothing too special here – all numbers are more or less expected. Rueidis produced better throughput, so it loaded Redis more, and the price for that better throughput is higher application CPU utilization.

But in Centrifugo's case I compared CPU usage of Redigo and Rueidis under equal conditions. So I added a rate limiter to the benchmarks in the https://github.com/FZambia/pipelines repo to generate the same load in all cases, limiting load to 100 commands per millisecond (100k per second).
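
One way to express such a limiter (a sketch assuming golang.org/x/time/rate; the repo may implement the limiting differently):

import (
	"context"

	"golang.org/x/time/rate"
)

// 100k commands per second overall (roughly 100 commands per millisecond),
// shared by all benchmark goroutines; bursts are capped at 100 commands.
var limiter = rate.NewLimiter(rate.Limit(100_000), 100)

func throttle(ctx context.Context) {
	// Block until the limiter allows the next command to be issued.
	_ = limiter.Wait(ctx)
}

Each benchmark loop then calls throttle before issuing a command, so all three clients push the same ~100k commands per second at Redis.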

❯ PIPE_LIMITED=1 go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8    	 1000000	     10000 ns/op	     198 B/op	       5 allocs/op
BenchmarkGoredis-8   	 1000000	     10000 ns/op	     350 B/op	       8 allocs/op
BenchmarkRueidis-8   	 1000000	     10000 ns/op	     113 B/op	       2 allocs/op
PASS
ok  	github.com/FZambia/pipelines	30.629s
                    Redigo    Goredis    Rueidis
Application CPU, %  91        96         118
Redis CPU, %        36        34         45

This is more interesting. We are generating the same load in all benchmarks, but both app and Redis CPU usage are the worst in the Rueidis case.

It turned out the difference here comes from the different batch sizes we are sending to Redis. In the Redigo/Goredis case we have larger batches than in the Rueidis case. In the Rueidis case batches are smaller, so there are more syscalls in the app and on the Redis side – roughly, at 100k commands per second, batches of 10 commands mean ~10k write syscalls per second, while batches of 100 mean only ~1k. As we can see, CPU is very sensitive to this.

There is a project called Twemproxy which acts as a proxy between applications and Redis and builds batches automatically, thus reducing load on Redis. So in general pipelining is known not only to increase throughput but also to reduce Redis CPU usage. As Redis is single-threaded, its capacity is actually quite limited.
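
For context, this is what explicit pipelining looks like with Redigo (an illustrative sketch, not the repo's actual benchmark code): many commands are buffered and sent with a single write, and the replies are read back afterwards, so the whole batch costs roughly one write and one read syscall on each side.

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

func pipelineBatch() {
	conn, err := redis.Dial("tcp", ":6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Send only queues the command in the connection's output buffer.
	for i := 0; i < 100; i++ {
		if err := conn.Send("SET", "key", "value"); err != nil {
			log.Fatal(err)
		}
	}
	// Flush writes the whole batch to the socket in one go...
	if err := conn.Flush(); err != nil {
		log.Fatal(err)
	}
	// ...and the replies for the batch are read back afterwards.
	for i := 0; i < 100; i++ {
		if _, err := conn.Receive(); err != nil {
			log.Fatal(err)
		}
	}
}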

I tried to find a simple way to somehow improve batching in Rueidis. The simplest solution I found at this point is this one: main...FZambia:rueidis:GetWriterEachConn

I.e. introducing an option to provide a custom bufio.Writer. I used it like this:

// Note: the snippets below assume the usual imports (bufio, io, log, os,
// sync, time) plus the rueidis package.
func rueidisClient() rueidis.Client {
	options := rueidis.ClientOption{
		InitAddress:  []string{":6379"},
		DisableCache: true,
	}
	if os.Getenv("PIPE_DELAYED") != "" {
		options.GetWriterEachConn = func(writer io.Writer) (*bufio.Writer, func()) {
			mlw := newDelayWriter(bufio.NewWriterSize(writer, 1<<19), time.Millisecond)
			w := bufio.NewWriterSize(mlw, 1<<19)
			return w, func() { mlw.close() }
		}
	}
	client, err := rueidis.NewClient(options)
	if err != nil {
		log.Fatal(err)
	}
	return client
}


type writeFlusher interface {
	io.Writer
	Flush() error
}

type delayWriter struct {
	dst   writeFlusher
	delay time.Duration // zero means to flush immediately

	mu           sync.Mutex // protects tm, flushPending, and dst.Flush
	tm           *time.Timer
	err          error
	flushPending bool
}

func newDelayWriter(dst writeFlusher, delay time.Duration) *delayWriter {
	return &delayWriter{dst: dst, delay: delay}
}

func (m *delayWriter) Write(p []byte) (n int, err error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.err != nil {
		return 0, m.err // return the sticky flush error, not the (nil) named return value
	}
	n, err = m.dst.Write(p)
	if m.delay <= 0 {
		err = m.dst.Flush()
		return
	}
	if m.flushPending {
		return
	}
	if m.tm == nil {
		m.tm = time.AfterFunc(m.delay, m.delayedFlush)
	} else {
		m.tm.Reset(m.delay)
	}
	m.flushPending = true
	return
}

func (m *delayWriter) delayedFlush() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.flushPending { // if stop was called but AfterFunc already started this goroutine
		return
	}
	err := m.dst.Flush()
	if err != nil {
		m.err = err
	}
	m.flushPending = false
}

func (m *delayWriter) close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.flushPending = false
	if m.tm != nil {
		m.tm.Stop()
	}
}

The code of the delayed writer is inspired by Caddy's code. It basically delays flushing the buffered writes to the connection.

We sacrifice a little latency for fewer syscalls.
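
As a quick illustration of the coalescing effect (a hypothetical example, not from the repo), many writes that land within the delay window result in a single flush to the underlying writer:

import (
	"bytes"
	"fmt"
	"time"
)

// countingFlusher records how many times Flush is called.
type countingFlusher struct {
	bytes.Buffer
	flushes int
}

func (c *countingFlusher) Flush() error { c.flushes++; return nil }

func main() {
	cf := &countingFlusher{}
	dw := newDelayWriter(cf, time.Millisecond)
	for i := 0; i < 100; i++ {
		dw.Write([]byte("PING\r\n"))
	}
	time.Sleep(5 * time.Millisecond) // let the delayed flush fire
	dw.close()
	fmt.Println(cf.flushes) // expected output: 1 – the 100 writes coalesced into one flush
}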

❯ PIPE_LIMITED=1 PIPE_DELAYED=1 go test -run xxx -bench . -benchtime 10s
goos: darwin
goarch: arm64
pkg: github.com/FZambia/pipelines
BenchmarkRedigo-8    	 1000000	     10000 ns/op	     198 B/op	       5 allocs/op
BenchmarkGoredis-8   	 1000000	     10000 ns/op	     350 B/op	       8 allocs/op
BenchmarkRueidis-8   	 1000000	     10002 ns/op	     114 B/op	       2 allocs/op
PASS
ok  	github.com/FZambia/pipelines	30.712s
                    Redigo    Goredis    Rueidis
Application CPU, %  91        96         51
Redis CPU, %        36        34         6

From these results we can see that with better batching we can reduce both application and Redis CPU usage, since we make fewer read/write syscalls. For Rueidis, the CPU usage of the benchmark process dropped from 118% to 51%, and that of the Redis process from 45% to 6%. An extra millisecond of latency seems tolerable for such a huge resource reduction.


Unfortunately, it may be that I missed something – so it would be interesting to hear your opinion on whether you see potential issues with this approach. Under a different level of parallelism the results may differ, since batch sizes change – all libraries in the test may perform better or worse.

I think a resource reduction like this is great to have. In Centrifugo's case users tend to add more Centrifugo nodes working with a single Redis instance, so the possibility to keep Redis CPU as low as possible seems nice. Perhaps you can suggest a better approach to achieve this.
