feat: Support "hits" trace metric #827
Conversation
bottlecap/src/tags/lambda/tags.rs
Outdated
```diff
@@ -39,7 +39,7 @@ const SERVICE_KEY: &str = "service";
 // ComputeStatsKey is the tag key indicating whether trace stats should be computed
 const COMPUTE_STATS_KEY: &str = "_dd.compute_stats";
 // ComputeStatsValue is the tag value indicating trace stats should be computed
-const COMPUTE_STATS_VALUE: &str = "1";
+const COMPUTE_STATS_VALUE: &str = "0";
```
What is the reason for this change?
`1` means computing stats on the backend when it receives traces, instead of on our extension side. That is the current approach, and it doesn't work well. This PR changes trace stats to be computed on our extension side.
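As a minimal sketch of what the flipped constant implies (the helper function and tag map below are hypothetical, not from the PR):

```rust
use std::collections::HashMap;

// Tag key/value from the diff above: "0" tells the backend NOT to compute
// trace stats for these traces, because the extension now computes them itself.
const COMPUTE_STATS_KEY: &str = "_dd.compute_stats";
const COMPUTE_STATS_VALUE: &str = "0";

// Hypothetical helper illustrating how the tag would be attached to payloads.
fn compute_stats_tags() -> HashMap<&'static str, &'static str> {
    let mut tags = HashMap::new();
    tags.insert(COMPUTE_STATS_KEY, COMPUTE_STATS_VALUE);
    tags
}
```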
```rust
.expect("Failed to get current timestamp")
.as_nanos()
.try_into()
.expect("Failed to convert timestamp to u64");
```
Can we do something like the below to avoid a potential panic?

```rust
let current_timestamp: u64 = match SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .map_err(|e| e.to_string())
    .and_then(|d| u64::try_from(d.as_nanos()).map_err(|_| "timestamp overflow".to_string()))
{
    Ok(ts) => ts,
    Err(e) => {
        error!("Failed to get current timestamp: {e}, skipping stats flush");
        return Vec::new();
    }
};
```
Will change, though I don't think it matters a lot, because u64 can represent 300+ years of time.
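A self-contained sketch of the panic-free conversion discussed here (the function name is mine, not the PR's):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Returns the current Unix time in nanoseconds, or None instead of panicking
// when the clock is before the epoch or the value overflows u64.
fn current_timestamp_nanos() -> Option<u64> {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .ok()
        .and_then(|d| u64::try_from(d.as_nanos()).ok())
}
```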
```rust
for timestamp in to_remove {
    self.buckets.remove(&timestamp);
}
```
How about using retain() to save one round of scanning?
```rust
self.buckets.retain(|&timestamp, bucket| {
    if force_flush || Self::should_flush_bucket(current_timestamp, timestamp) {
        stats.push(self.construct_stats_payload(timestamp, bucket));
        false // remove this bucket
    } else {
        true // keep this bucket
    }
});
```
Good point! Will change.
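One borrow-checker-friendly way to apply the `retain()` idea as a free function; the container type, bucket contents, and flush predicate below are placeholders, not the PR's actual types:

```rust
use std::collections::BTreeMap;

// Flush buckets whose window has closed (or all buckets on force_flush),
// removing them from the map in the same pass via retain().
fn flush_buckets(
    buckets: &mut BTreeMap<u64, Vec<u64>>,
    current_timestamp: u64,
    bucket_duration: u64,
    force_flush: bool,
) -> Vec<(u64, Vec<u64>)> {
    let mut flushed = Vec::new();
    buckets.retain(|&timestamp, bucket| {
        if force_flush || timestamp + bucket_duration <= current_timestamp {
            // Take the bucket's contents so they can be moved out while
            // retain() still holds a mutable borrow of the map.
            flushed.push((timestamp, std::mem::take(bucket)));
            false // remove this bucket
        } else {
            true // keep this bucket
        }
    });
    flushed
}
```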
bottlecap/src/bin/bottlecap/main.rs
Outdated
```rust
let stats_concentrator = Arc::new(TokioMutex::new(StatsConcentrator::new(
    Arc::clone(config),
    Arc::clone(tags_provider),
)));
let stats_aggregator = Arc::new(TokioMutex::new(StatsAggregator::new_with_concentrator(
```
Is there a way to avoid locks entirely and just follow the same pattern as dogstatsd event handling?
Are you referring to this? DataDog/serverless-components#32
Correct
Sure, will implement
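The lock-free pattern under discussion can be sketched as a handle/worker pair over a channel: a single worker owns the state, callers only hold a sender. This sketch uses `std::sync::mpsc` and a thread for self-containment (the real code would use tokio channels and tasks; all names here are placeholders):

```rust
use std::sync::mpsc;
use std::thread;

// Commands the handle can send; the worker owns all state, so no Mutex.
enum StatsCommand {
    Add(u64),
    // A reply channel lets callers read aggregated state without sharing it.
    GetTotal(mpsc::Sender<u64>),
}

// Spawn the worker loop and return the sending half as the "handle".
fn spawn_concentrator() -> mpsc::Sender<StatsCommand> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut total: u64 = 0;
        for cmd in rx {
            match cmd {
                StatsCommand::Add(n) => total += n,
                StatsCommand::GetTotal(reply) => {
                    let _ = reply.send(total);
                }
            }
        }
    });
    tx
}
```

Because messages from one sender are processed in order, a `GetTotal` sent after two `Add`s observes both.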
```rust
}

#[allow(clippy::module_name_repetitions)]
pub struct StatsAgent {
```
Should this own the channel it creates and then return the tx through a public method? That way it's not created in the main binary.
Good point. Will do.
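The suggestion can be sketched like this, with the agent constructing and owning its channel and exposing only the sender (the message type and method names are placeholders, not the PR's API):

```rust
use std::sync::mpsc;

// The agent creates and owns its channel; the binary only asks for a sender.
pub struct StatsAgent {
    tx: mpsc::Sender<Vec<u8>>,
    rx: mpsc::Receiver<Vec<u8>>,
}

impl StatsAgent {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel();
        StatsAgent { tx, rx }
    }

    // Public accessor so main.rs never constructs the channel itself.
    pub fn get_sender(&self) -> mpsc::Sender<Vec<u8>> {
        self.tx.clone()
    }
}
```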
```rust
hostname: String::new(),
env: self.config.env.clone().unwrap_or_default(),
version: self.config.version.clone().unwrap_or_default(),
lang: "rust".to_string(),
```
Should this be the Lambda runtime language?
@raphaelgavache Do you know what the `lang` field means? I couldn't understand it from the protobuf definition: https://github.com/DataDog/datadog-agent/blob/main/pkg/proto/datadog/trace/stats.proto#L32
Reaching out to you as the author of DataDog/datadog-agent#7875, which added this field.
## This PR
- Add the skeleton of `StatsConcentrator`, with no implementation
- Add `StatsConcentratorHandle` and `StatsConcentratorService`, which send and process stats requests (`add()` and `get_stats()`) to/from a queue, so a mutex is not needed and lock contention can be avoided. (Thanks @duncanista for the suggestion and @astuyve for the example code in DataDog/serverless-components#32)

## Next steps
- Implement `StatsConcentrator`, which aggregates stats data into buckets and returns it in batches
- Add more fields to `AggregationKey` and `Stats`
- Move the processing of stats to after "obfuscation", as suggested by the APM team. This will involve lots of code changes, so I'll make it a separate PR. I'll mainly move code from this draft PR: #827

## Architecture
<img width="1296" height="674" alt="image" src="https://github.com/user-attachments/assets/2d4cb925-6cfc-4581-8ed6-6bd87cf0d87a" />

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593
## What
This PR enables accurate counting of the `trace.aws.lambda.hits` metric (hopefully).

## How
Adds a "stats concentrator", which is a simplified version of the stats concentrator in datadog-agent.

## Architecture

## Testing
Tested with a modified "busycaller" self-monitoring stack with Python 3.11.

### After
The "hits" graph in Datadog overall aligns with the graph on AWS. The counts are:
There's an undercounting of 1 invocation. To investigate later.

### Before
The "hits" graph in Datadog was very different from the graph on AWS. The timestamps for many invocations were wrong.

## Next steps
- `trace.aws.lambda.error`