
Conversation

pree-dew
Contributor

Motivation and Context

Logs will help fix issues quickly, thereby reducing downtime and MTTD.

How Has This Been Tested?

Tested end to end with a local setup.

Breaking Changes

No

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • It handles container logs for mcp-registry pods.
  • It supports k8sattributes resource attributes, which help identify where the logs are coming from.
  • It supports VictoriaLogs and its plugin in Grafana.
Screenshots attached (2025-09-15, 1:43:51 AM, 1:44:34 AM, 1:46:10 AM).
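A minimal sketch of what deploying the VictoriaLogs chart via Pulumi could look like; the chart name, namespace, and resource/persistence values below are illustrative assumptions, and only the Helm repository URL matches the snippet quoted later in this review.

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Deploy a single-node VictoriaLogs instance from the VictoriaMetrics Helm repo.
		_, err := helmv3.NewRelease(ctx, "victoria-logs", &helmv3.ReleaseArgs{
			Chart:     pulumi.String("victoria-logs-single"), // assumed chart name
			Namespace: pulumi.String("monitoring"),           // assumed namespace
			RepositoryOpts: helmv3.RepositoryOptsArgs{
				Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
			},
			Values: pulumi.Map{
				"server": pulumi.Map{
					// Illustrative sizing only; the PR's actual values are in the diff.
					"resources": pulumi.Map{
						"limits": pulumi.Map{
							"cpu":    pulumi.String("500m"),
							"memory": pulumi.String("512Mi"),
						},
					},
					"persistentVolume": pulumi.Map{
						"enabled": pulumi.Bool(true),
						"size":    pulumi.String("10Gi"),
					},
				},
			},
		})
		return err
	})
}
```

The resource limits and persistence sizing here are exactly the values the review comments below ask about alerting on.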

@pree-dew
Contributor Author

@domdomegg Can you please review?

Collaborator

@rdimitrov rdimitrov left a comment


Left a few small comments but LGTM 👍 Small nit: if a deployment fails, do we have a job that cleans up leftover resources? I wonder if that could be the cause of the latest issues we are having in prod.

Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
},
Values: pulumi.Map{
"server": pulumi.Map{
Collaborator


Question: I know these values can be adjusted based on usage, but would we get alerts if we hit these thresholds? This also applies to the persistence setting.

},
},
},
"resources": pulumi.Map{
Collaborator


Similar to the other question above for VictoriaLogs

Contributor Author

For alerts based on hitting resource limits, we have to set up an exporter to get the resource data. We currently have metrics-server already running, which is one exporter that can help with pod-level resource utilisation alerts; for other alerts we would have to add more exporters. Can we create another issue for setting up alerts for resource limits?
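A rough sketch (not part of this PR) of how pod-level usage from metrics-server can be read through the metrics.k8s.io API with the Go metrics client; the kubeconfig path and namespace are illustrative.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Build a client from a local kubeconfig (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// metrics-server exposes point-in-time CPU/memory usage per pod.
	pods, err := mc.MetricsV1beta1().PodMetricses("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pm := range pods.Items {
		for _, c := range pm.Containers {
			fmt.Printf("%s/%s cpu=%s mem=%s\n", pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
}
```

An actual alerting setup would still need something to scrape and evaluate these numbers against the configured limits continuously, which is what the proposed follow-up issue would cover.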

Collaborator

Of course 👍 Would you like to do it, or I can do it if you want?

Contributor Author

I will create the issue, no problem :)

@pree-dew
Contributor Author

if a deployment fails, do we have a job that cleans up leftover resources? I wonder if that could be the cause of the latest issues we are having in prod

Can you help me understand better what you mean by leftover resources? Ideally Kubernetes should handle the replica lifecycle for us, unless it is something like a short-term background job.

@pree-dew
Contributor Author

@rdimitrov One more question about the issues happening on prod: can you explain what is happening? It will help me understand what kind of monitoring data we need to have correct alerts for the issue.

Btw, thanks a lot for reviewing the PR, appreciate it :)

@rdimitrov
Collaborator

Can you help me understand better what you mean by leftover resources? Ideally Kubernetes should handle the replica lifecycle for us, unless it is something like a short-term background job.

It's mainly because I see a few deployXXX methods and was wondering: if they fail for some reason, is there a risk of some leftover resources hanging around?

@rdimitrov One more question about the issues happening on prod: can you explain what is happening? It will help me understand what kind of monitoring data we need to have correct alerts for the issue.

Of course. I don't have access to the cluster logs/stats so I can't be certain, but my assumption based on these logs is that the deployment job times out. My guess is that it is caused by some resource limits we are hitting in the production environment (note that the same run passes for staging), but that's just a wild guess 👍

@pree-dew
Contributor Author

pree-dew commented Sep 17, 2025

@rdimitrov Based on the logs, it could be one of two reasons:

  1. The application is failing to deploy because of some issue with the image, which results in the minimum number of replicas not being reached.
  2. The nodes are running out of capacity and cannot schedule more pods.

Since the same run passes on staging, it is unlikely to be the first reason, so it has to be the second. This would be solved by having node exporter in place to alert us, and it would also help us come up with an autoscaling strategy (see the sketch below).
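A minimal sketch of how node-exporter could be added alongside the existing Pulumi releases; the chart, repository, release name, and namespace here are assumptions, not part of this PR.

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Hypothetical: deploy node-exporter so node-level capacity and utilisation
		// metrics become available for alerting and autoscaling decisions.
		_, err := helmv3.NewRelease(ctx, "node-exporter", &helmv3.ReleaseArgs{
			Chart:     pulumi.String("prometheus-node-exporter"),
			Namespace: pulumi.String("monitoring"), // assumed namespace
			RepositoryOpts: helmv3.RepositoryOptsArgs{
				Repo: pulumi.String("https://prometheus-community.github.io/helm-charts"),
			},
		})
		return err
	})
}
```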

@rdimitrov
Collaborator

rdimitrov commented Sep 17, 2025

  1. The application is failing to deploy because of some issue with the image, which results in the minimum number of replicas not being reached.

Interesting 👍 Could we also expose the logs from the deployment status, so we can debug such failures more easily? What I'm referring to is access to logs for errors like failed image pulls, etc., which could help us while debugging.

@pree-dew
Copy link
Contributor Author

Could we also expose the logs from the deployment status, so we can debug such failures more easily? What I'm referring to is access to logs for errors like failed image pulls, etc., which could help us while debugging.

Assuming these issues only come up when there is a configuration mismatch between environments; otherwise they should not happen on production.

Yes, we can get access to these logs depending on the kind of issue. For ImagePullBackOff, the logs can be found in Kubernetes events or in kubelet logs, while CrashLoopBackOff can be diagnosed from the container logs themselves (which are already part of this PR). We can define which types of issues we need logs for and then pull them from those sources; a rough sketch of reading the events is below.
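A rough sketch (assuming client-go; the kubeconfig path, namespace, and pod name are illustrative) of reading those Kubernetes events programmatically:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Events carry scheduling and image-pull failures (e.g. ErrImagePull,
	// ImagePullBackOff) that never show up in the container's own logs.
	events, err := clientset.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=mcp-registry-example-pod", // hypothetical pod name
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s: %s\n", e.Type, e.Reason, e.Message)
	}
}
```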

Collaborator

@rdimitrov rdimitrov left a comment


Thank you for your replies! 🙏 I think it's fair to merge this one as it is and address the remaining items for logs (and potentially alerts?) for Kubernetes events and node-level metrics as follow-up PRs (we can skip system logs for now, I think).

P.S. I don't want to leave this PR hanging open for too long, and I know you requested a review from @domdomegg initially, so if he has any additional comments we can address them as follow-ups 👍

@rdimitrov rdimitrov merged commit 6ed625f into modelcontextprotocol:main Sep 18, 2025
12 checks passed