464 logs infra #478
Conversation
@domdomegg Can you please review?
Left a few small comments, but LGTM 👍 One small nit: if the deployment fails, do we have a job that cleans up leftover resources? I wonder if that could be the cause of the latest issues we are having in prod.
Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
},
Values: pulumi.Map{
	"server": pulumi.Map{
question: I know these values can be adjusted based on usage, but would we get alerts if we hit these thresholds? The same question applies to the persistence setting.
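For context, a minimal sketch of how such thresholds could be expressed in the Pulumi program, assuming a helm Release of the victoria-metrics-single chart; the release name, chart name, SDK import paths, and the specific limit/size values are illustrative assumptions, not the values in this PR:

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Illustrative VictoriaMetrics release; limits and volume size are placeholders.
		_, err := helmv3.NewRelease(ctx, "victoria-metrics", &helmv3.ReleaseArgs{
			Chart: pulumi.String("victoria-metrics-single"),
			RepositoryOpts: helmv3.RepositoryOptsArgs{
				Repo: pulumi.String("https://victoriametrics.github.io/helm-charts/"),
			},
			Values: pulumi.Map{
				"server": pulumi.Map{
					// Resource thresholds the review comment asks about.
					"resources": pulumi.Map{
						"requests": pulumi.Map{"cpu": pulumi.String("250m"), "memory": pulumi.String("512Mi")},
						"limits":   pulumi.Map{"cpu": pulumi.String("500m"), "memory": pulumi.String("1Gi")},
					},
					// The persistence setting mentioned in the same comment.
					"persistentVolume": pulumi.Map{
						"enabled": pulumi.Bool(true),
						"size":    pulumi.String("20Gi"),
					},
				},
			},
		})
		return err
	})
}
```

The values themselves do not trigger any alerts; alerting on them needs rules evaluated against scraped metrics, which is what the follow-up discussion below is about.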
},
},
},
"resources": pulumi.Map{
Similar to the other question above for VictoriaLogs
For alerts based on resource-limit hits we have to set up an exporter to collect the resource data. We currently have metric_server already running, which is one exporter that can help with pod-level resource-utilisation alerts; for the others we would have to add additional exporters. Can we create another issue for setting up alerts for resource limits?
Of course 👍 Would you like to do it, or I can do it if you prefer?
I will create the issue, no problem :)
Can you help me understand what you mean by leftover resources? Ideally Kubernetes should handle the replica lifecycle for us, unless it is something like short-term background jobs.
@rdimitrov One more question about the issues happening on prod: can you explain what is happening? That will help me understand what kind of monitoring data we need in order to have the right alerts for the issue. Btw, thanks a lot for reviewing the PR, appreciate it :)
It's mainly because I see a few failed deployment runs.
Of course, I don't have access to the cluster logs/stats so I cannot be certain, but my assumption based on these logs is that the deployment job times out. My guess is that it's caused by some resource limits we are hitting in the production environment (notice the same run passes for staging), but that's just a wild guess 👍
@rdimitrov Based on the logs, it could be one of two reasons.
Since it is passing on staging, it cannot be the first one; it has to be the second. This will be solved if we have a node exporter in place to give us alerts, and it will also help us come up with an autoscaling strategy.
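As a rough sketch of what adding such an exporter could look like in the same Pulumi program (reusing the helmv3/pulumi imports from the earlier sketch; the prometheus-node-exporter chart, release name, and namespace are assumptions for illustration, not something already in this PR):

```go
// deployNodeExporter is a sketch, not part of this PR: it installs the
// prometheus-node-exporter chart so node-level CPU/memory/disk metrics become
// scrapeable and can back resource-limit alerts and autoscaling decisions.
func deployNodeExporter(ctx *pulumi.Context) error {
	_, err := helmv3.NewRelease(ctx, "node-exporter", &helmv3.ReleaseArgs{
		Chart: pulumi.String("prometheus-node-exporter"),
		RepositoryOpts: helmv3.RepositoryOptsArgs{
			Repo: pulumi.String("https://prometheus-community.github.io/helm-charts"),
		},
		Namespace: pulumi.String("monitoring"), // assumed namespace
	})
	return err
}
```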
Interesting 👍 Could we expose the logs from the deployment status too, so we can debug such failures more easily? What I am referring to is access to logs for errors like failed image pulls, etc., which could help us while debugging.
I am assuming these issues only come up if there is a configuration mismatch between envs; otherwise they should not happen on production. Yes, we can get access to these logs depending on the kind of issue. For
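Failed image pulls specifically usually show up as Kubernetes Warning events rather than container logs. As a minimal sketch of pulling those out for debugging (assuming client-go with kubeconfig-based access; the namespace is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes ~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List Warning events (e.g. failed image pulls, scheduling failures)
	// in the target namespace; "default" is a placeholder.
	events, err := clientset.CoreV1().Events("default").List(context.Background(), metav1.ListOptions{
		FieldSelector: "type=Warning",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s/%s: %s\n", e.Reason, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
}
```

The same events could also be shipped into VictoriaLogs by a collector, which would make them visible without direct cluster access.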
Thank you for your replies! 🙏 I think it's fair to merge this one as it is and address the remaining items in follow-up PRs: logs (and potentially alerts?) for Kubernetes events and node-level metrics (we can skip system logs for now, I think).
PS: I don't want to leave this PR hanging open for too long, and I know you requested a review from @domdomegg initially, so if he has any additional comments we can address them as follow-ups 👍
Motivation and Context
Logs will help in fixing issues quickly, thereby reducing downtime and MTTD.
How Has This Been Tested?
The end-to-end local setup has been tested.
Breaking Changes
No
Types of changes
Checklist
Additional context