464 logs infra #478
Conversation
@domdomegg Can you please review?
Left a few small comments, but LGTM 👍 One small nit: if the deployment fails, do we have a job that cleans up leftover resources? I wonder if that could be the cause of the latest issues we are having in prod.
Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
},
Values: pulumi.Map{
	"server": pulumi.Map{
question: I know these values can be adjusted based on usage, but would we get alerts if we hit these thresholds? The same question applies to the persistence setting.
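For context, a minimal sketch of how such thresholds could be expressed in the Pulumi program, assuming a helm Release of the victoria-metrics-single chart; the release name, chart name, SDK import paths, and the specific limit/size values are illustrative assumptions, not the values in this PR:

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Illustrative VictoriaMetrics release; limits and volume size are placeholders.
		_, err := helmv3.NewRelease(ctx, "victoria-metrics", &helmv3.ReleaseArgs{
			Chart: pulumi.String("victoria-metrics-single"),
			RepositoryOpts: helmv3.RepositoryOptsArgs{
				Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
			},
			Values: pulumi.Map{
				"server": pulumi.Map{
					// Resource thresholds the review comment asks about.
					"resources": pulumi.Map{
						"requests": pulumi.Map{"cpu": pulumi.String("250m"), "memory": pulumi.String("512Mi")},
						"limits":   pulumi.Map{"cpu": pulumi.String("500m"), "memory": pulumi.String("1Gi")},
					},
					// The persistence setting mentioned in the same comment.
					"persistentVolume": pulumi.Map{
						"enabled": pulumi.Bool(true),
						"size":    pulumi.String("20Gi"),
					},
				},
			},
		})
		return err
	})
}
```

The values themselves do not trigger any alerts; alerting on them needs rules evaluated against scraped metrics, which is what the follow-up discussion below is about.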
},
},
},
"resources": pulumi.Map{
Similar to the other question above for VictoriaLogs
For alerts based on resource-limit hits we have to set up an exporter to collect the resource data. We currently have metric_server already running, which is one exporter that can help with pod-level resource-utilisation alerts; for the others we would have to add additional exporters. Can we create another issue for setting up alerts for resource limits?
Of course 👍 Would you like to do it, or I can do it if you prefer?
I will create the issue, no problem :)
Can you help me understand what you mean by leftover resources? Ideally Kubernetes should handle the replica lifecycle for us, unless it is something like short-term background jobs.
@rdimitrov One more question about the issues happening on prod: can you explain what is happening? That will help me understand what kind of monitoring data we need in order to have the right alerts for the issue. Btw, thanks a lot for reviewing the PR, appreciate it :)
It's mainly because I see a few failed deployment runs.
Of course, I don't have access to the cluster logs/stats so I cannot be certain, but my assumption based on these logs is that the deployment job times out. My guess is that it's caused by some resource limits we are hitting in the production environment (notice the same run passes for staging), but that's just a wild guess 👍
@rdimitrov Based on the logs, it could be one of two reasons.
Since it is passing on staging, it cannot be the first one; it has to be the second. This will be solved if we have a node exporter in place to give us alerts, and it will also help us come up with an autoscaling strategy.
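As a rough sketch of what adding such an exporter could look like in the same Pulumi program (reusing the helmv3/pulumi imports from the earlier sketch; the prometheus-node-exporter chart, release name, and namespace are assumptions for illustration, not something already in this PR):

```go
// deployNodeExporter is a sketch, not part of this PR: it installs the
// prometheus-node-exporter chart so node-level CPU/memory/disk metrics become
// scrapeable and can back resource-limit alerts and autoscaling decisions.
func deployNodeExporter(ctx *pulumi.Context) error {
	_, err := helmv3.NewRelease(ctx, "node-exporter", &helmv3.ReleaseArgs{
		Chart: pulumi.String("prometheus-node-exporter"),
		RepositoryOpts: helmv3.RepositoryOptsArgs{
			Repo: pulumi.String("https://prometheus-community.github.io/helm-charts"),
		},
		Namespace: pulumi.String("monitoring"), // assumed namespace
	})
	return err
}
```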
Interesting 👍 Could we expose the logs from the deployment status too, so we can debug such failures more easily? What I am referring to is access to logs for errors like failed image pulls, etc., which could help us while debugging.
I am assuming these issues only come up if there is a configuration mismatch between envs; otherwise they should not happen on production. Yes, we can get access to these logs depending on the kind of issue. For
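Failed image pulls specifically usually show up as Kubernetes Warning events rather than container logs. As a minimal sketch of pulling those out for debugging (assuming client-go with kubeconfig-based access; the namespace is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes ~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List Warning events (e.g. failed image pulls, scheduling failures)
	// in the target namespace; "default" is a placeholder.
	events, err := clientset.CoreV1().Events("default").List(context.Background(), metav1.ListOptions{
		FieldSelector: "type=Warning",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s/%s: %s\n", e.Reason, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
}
```

The same events could also be shipped into VictoriaLogs by a collector, which would make them visible without direct cluster access.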
Thank you for your replies! 🙏 I think it's fair to merge this one as it is and address the remaining items in follow-up PRs: logs (and potentially alerts?) for Kubernetes events and node-level metrics (we can skip system logs for now, I think).
PS: I don't want to leave this PR hanging open for too long, and I know you requested a review from @domdomegg initially, so if he has any additional comments we can address them as follow-ups 👍
Motivation and Context
Logs will help in fixing issues quickly, thereby reducing downtime and MTTD.
How Has This Been Tested?
The end-to-end local setup has been tested.
Breaking Changes
No
Types of changes
Checklist
Additional context