
Conversation

solsson commented Jun 25, 2017

The README.md used to say:

If you lose your zookeeper cluster, kafka will be unaware that persisted topics exist. The data is still there, but you need to re-create topics.

But that's a risky assumption if you have partitioning, which we aim to improve with #30.

We can do as suggested in #26

solsson commented Jun 25, 2017

I've tested this with GKE.

How well does a multi-zone cluster work with volumes? I get the feeling we lose the ability to move pods if one zone becomes entirely unavailable, because each volume is created in a single zone and its pod gets affinity to that zone.
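For context, a minimal sketch of where that affinity comes from, assuming GKE's gce-pd provisioner (the class name and type below are illustrative, not from this PR): each dynamically provisioned persistent disk is zonal, so the resulting PV can only be mounted by pods scheduled into that zone.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: broker-disk-example     # illustrative name, not from this repo
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard             # zonal disk; the PV inherits the disk's zone
# A pod bound to a claim from this class gets scheduled to the disk's zone
# on every restart, which is the affinity described above.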

solsson commented Jun 25, 2017

An issue with #32 seems to be that pod deletion is slow. Maybe signals don't reach the Java process.

solsson and others added 2 commits June 26, 2017 13:22

  • Suggest a mix of persistent and ephemeral data to improve reliability across zones, and with the mix of PV and emptyDir there's no reason to make PVs faster than host disks.
  • Use 10GB as it is the minimum for standard disks on GKE.
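A rough sketch of the mix these commits describe, combining a persistent claim and an emptyDir in one StatefulSet. The API version, names, mount paths and command are assumptions for illustration, and the actual commits may split persistent and ephemeral storage differently (for instance across separate StatefulSets).

kind: StatefulSet
apiVersion: apps/v1beta1          # the apps API version current at the time
metadata:
  name: zoo
spec:
  serviceName: zoo
  replicas: 3
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - name: zookeeper
        image: solsson/kafka:0.11.0.0-rc2
        command:                  # start script; the image's actual entrypoint may differ
        - ./bin/zookeeper-server-start.sh
        - config/zookeeper.properties
        volumeMounts:
        - name: data              # persistent: must survive pod rescheduling
          mountPath: /var/lib/zookeeper/data
        - name: log               # ephemeral: can be rebuilt, so emptyDir is enough
          mountPath: /var/lib/zookeeper/log
      volumes:
      - name: log
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi           # 10GB is the minimum for standard disks on GKE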
solsson commented Jun 26, 2017

How well does a multi-zone cluster work with volumes? I get the feeling we lose the ability to move pods if one zone becomes entirely unavailable, because each volume is created in a single zone and its pod gets affinity to that zone.

#34 was merged with a potential solution for that. Interesting tests remain: go ahead and kill nodes, etc.

Known remaining issues with this branch, none of them blockers:

  • Pod deletion is slow (maybe that's good for controlled scale-down, but it might indicate an image or command issue).
  • The Prometheus exporter shows no meaningful data.

solsson commented Jun 26, 2017

Tested kubectl logs -f -c zookeeper zoo-1 together with kubectl scale --replicas=1 statefulset zoo, and the logs show no sign that the pod is aware it is about to be terminated.

I also tested locally with docker kill zookeeper-test after

docker run --rm -d --name zookeeper-test --entrypoint ./bin/zookeeper-server-start.sh solsson/kafka:0.11.0.0-rc2@sha256:c1316e0131f4ec83bc645ca2141e4fda94e0d28f4fb5f836e15e37a5e054bdf1 config/zookeeper.properties
docker logs -f zookeeper-test

and no trace of termination handling there either, though shutdown is really fast. Will test on Kafka instead, after the merge, as it has documented graceful shutdown behavior. I will also postpone the metrics troubleshooting, as that too benefits from comparison with Kafka. It might simply be a matter of jmx_exporter config.
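If it does turn out to be the jmx_exporter config, a catch-all rule set like the one below (purely an illustrative debugging config, not what the repo ships) is a quick way to check whether the exporter can see any MBeans at all before writing stricter patterns:

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: ".*"     # export every MBean attribute; narrow this down once data shows up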

solsson commented Jun 27, 2017

Switched Kafka to dynamic provisioning in 10543bf
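In practice, dynamic provisioning means the broker StatefulSet only declares a claim template and the cluster's provisioner creates the disk. Roughly like the fragment below, where the size and class name are illustrative and the real values are in 10543bf:

volumeClaimTemplates:
- metadata:
    name: data
  spec:
    storageClassName: standard    # GKE's default class, backed by pd-standard disks
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 10Gi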

solsson commented Jun 27, 2017

Good resource on termination: https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html
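In pod-spec terms the article boils down to something like this fragment (values are illustrative): the kubelet runs any preStop hook, sends SIGTERM to the container's PID 1, and only sends SIGKILL once terminationGracePeriodSeconds has passed.

spec:
  terminationGracePeriodSeconds: 30   # time budget for a clean shutdown (30s is the default)
  containers:
  - name: broker
    image: solsson/kafka:0.11.0.0-rc2
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]     # placeholder; a real hook would drain or announce shutdown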

solsson commented Jun 27, 2017

After 411192d I get properly logged shutdown behavior in Kafka, taking around 15s with negligible load. So the Alpine shell not forwarding signals might have been the issue.
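A minimal sketch of that kind of fix, assuming the stock Kafka start scripts (which exec into the JVM when run in the foreground). This is not necessarily the exact change in 411192d, but it shows the idea of removing the shell wrapper so SIGTERM reaches Java directly:

containers:
- name: broker
  image: solsson/kafka:0.11.0.0-rc2
  command:                        # no "sh -c" wrapper in front of the script
  - ./bin/kafka-server-start.sh
  - config/server.properties
# kafka-server-start.sh execs into the JVM, so the Java process ends up as PID 1
# and receives SIGTERM itself instead of relying on a shell to forward it.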
