
Conversation

lym953 (Contributor) commented Sep 10, 2025

What

This PR enables accurate counting of the `trace.aws.lambda.hits` metric (hopefully).

How

Adds a "stats concentrator", a simplified version of the stats concentrator in datadog-agent.
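
To make the idea concrete, here is a minimal, hypothetical sketch of timestamp-bucketed hit counting. The 10-second bucket size, the `Bucket` struct, the flush cutoff, and the hits-only aggregation are assumptions for illustration, not the PR's actual code:

```rust
use std::collections::BTreeMap;

// Assumed bucket width; the real concentrator's bucket size may differ.
const BUCKET_SIZE_NS: u64 = 10_000_000_000;

#[derive(Default)]
struct Bucket {
    hits: u64,
}

#[derive(Default)]
struct StatsConcentrator {
    buckets: BTreeMap<u64, Bucket>,
}

impl StatsConcentrator {
    // Count one invocation in the bucket containing the span's start time.
    fn add_hit(&mut self, span_start_ns: u64) {
        let bucket_start = span_start_ns - (span_start_ns % BUCKET_SIZE_NS);
        self.buckets.entry(bucket_start).or_default().hits += 1;
    }

    // Drain buckets old enough to flush (or everything on force flush),
    // returning (bucket start, hits) pairs for the stats payload.
    fn flush(&mut self, now_ns: u64, force: bool) -> Vec<(u64, u64)> {
        let mut flushed = Vec::new();
        self.buckets.retain(|&start, bucket| {
            if force || now_ns >= start + 2 * BUCKET_SIZE_NS {
                flushed.push((start, bucket.hits));
                false // remove the flushed bucket
            } else {
                true // keep accumulating in recent buckets
            }
        });
        flushed
    }
}
```

Usage would presumably amount to calling `add_hit(span_start)` for each invocation and `flush(now, false)` on the flush timer; the real implementation carries a richer aggregation key and serializes flushed buckets into the agent's stats payload, but counting each invocation in the bucket of its span start time is what fixes the misattributed timestamps described below.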

Architecture

[Architecture diagram]

Testing

Tested with a modified "busycaller" self-monitoring stack with Python 3.11.

After

The "hits" graph in Datadog overall aligns with the graph on AWS. The counts are:

  • Datadog: [44, 171, 171, 169, 172, 121]
  • AWS: [44, 171, 171, 170, 172, 121]

There's an undercount of 1 invocation; to be investigated later.

[Screenshots: Datadog hits graph and AWS invocations graph]

Before

The "hits" graph in Datadog was very different from the graph on AWS. The timestamp for many invocations were wrong.

[Screenshot: Datadog hits graph before the change]

Next steps:

  1. Support other AWS Lambda trace metrics such as trace.aws.lambda.error
  2. Maybe support trace stats for custom traces
  3. Extract the code into a component for reuse

@@ -39,7 +39,7 @@ const SERVICE_KEY: &str = "service";
 // ComputeStatsKey is the tag key indicating whether trace stats should be computed
 const COMPUTE_STATS_KEY: &str = "_dd.compute_stats";
 // ComputeStatsValue is the tag value indicating trace stats should be computed
-const COMPUTE_STATS_VALUE: &str = "1";
+const COMPUTE_STATS_VALUE: &str = "0";
Contributor:

What is the reason for this change?

Contributor Author:

"1" means stats are computed on the backend when it receives traces, instead of on our extension side. That is the current approach and it doesn't work well. This PR changes trace stats to be computed on our extension side.

.expect("Failed to get current timestamp")
.as_nanos()
.try_into()
.expect("Failed to convert timestamp to u64");
Contributor:

Can we do the following to avoid a potential panic?

      let current_timestamp: u64 = match SystemTime::now()
          .duration_since(UNIX_EPOCH)
          .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))
          .and_then(|d| {
              d.as_nanos()
                  .try_into()
                  .map_err(|_| std::io::Error::new(std::io::ErrorKind::Other, "Timestamp overflow"))
          })
      {
          Ok(ts) => ts,
          Err(e) => {
              error!("Failed to get current timestamp: {}, skipping stats flush", e);
              return Vec::new();
          }
      };

Contributor Author:

Will change, though I don't think it matters a lot because a u64 of nanoseconds can represent 300+ years of time.
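
For reference, a quick back-of-the-envelope check of that claim (a standalone snippet, not part of the PR):

```rust
// u64::MAX is about 1.8e19, and a (non-leap) year is about 3.15e16 nanoseconds,
// so a u64 nanosecond timestamp from the Unix epoch overflows only after ~584 years.
const NANOS_PER_YEAR: u64 = 365 * 24 * 60 * 60 * 1_000_000_000;

fn main() {
    println!("u64 nanoseconds cover about {} years", u64::MAX / NANOS_PER_YEAR);
}
```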


for timestamp in to_remove {
    self.buckets.remove(&timestamp);
}
Contributor:

How about using retain() to save one round of scanning?

      self.buckets.retain(|&timestamp, bucket| {
          if force_flush || Self::should_flush_bucket(current_timestamp, timestamp) {
              stats.push(self.construct_stats_payload(timestamp, bucket));
              false // remove this bucket
          } else {
              true // keep this bucket
          }
      });

Contributor Author:

Good point! Will change.

Comment on lines 1122 to 1126
let stats_concentrator = Arc::new(TokioMutex::new(StatsConcentrator::new(
    Arc::clone(config),
    Arc::clone(tags_provider),
)));
let stats_aggregator = Arc::new(TokioMutex::new(StatsAggregator::new_with_concentrator(
Contributor:

Is there a way to avoid the locks entirely and just follow the same pattern as dogstatsd event handling?

Contributor Author:

Are you referring to this? DataDog/serverless-components#32

Contributor:

Correct

Contributor Author:

Sure, will implement

}

#[allow(clippy::module_name_repetitions)]
pub struct StatsAgent {
Contributor:

Should this own the channel it creates and then return the tx through a public method? That way it wouldn't need to be created in the main binary.

Contributor Author:

Good point. Will do.

hostname: String::new(),
env: self.config.env.clone().unwrap_or_default(),
version: self.config.version.clone().unwrap_or_default(),
lang: "rust".to_string(),
Contributor:

Should this be the Lambda runtime lang?

Contributor Author:

@raphaelgavache Do you know what the `lang` field means? I couldn't understand it from the protobuf definition: https://github.com/DataDog/datadog-agent/blob/main/pkg/proto/datadog/trace/stats.proto#L32
Reaching out to you as the author of DataDog/datadog-agent#7875, which adds this field.

lym953 added a commit that referenced this pull request Sep 19, 2025
## This PR
- Add the skeleton of `StatsConcentrator`, with no implementation
- Add `StatsConcentratorHandle` and `StatsConcentratorService`, which
send and process stats requests (`add()` and `get_stats()`) through a
queue, so no mutex is needed and lock contention is avoided; a rough
sketch of this pattern is shown below the list.
(Thanks @duncanista for the suggestion and @astuyve for the example code
DataDog/serverless-components#32)
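
A hypothetical sketch of that handle/service pattern follows; the `StatsRequest` enum, the `StatsPayload` placeholder, the channel buffer size, and the `new_stats_channel` helper are illustrative assumptions, not the crate's actual types:

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative request type: either record a data point or ask for the current stats.
enum StatsRequest {
    Add { timestamp_ns: u64 },
    GetStats { reply: oneshot::Sender<Vec<StatsPayload>> },
}

// Placeholder for whatever a flushed bucket serializes into.
struct StatsPayload;

#[derive(Clone)]
struct StatsConcentratorHandle {
    tx: mpsc::Sender<StatsRequest>,
}

impl StatsConcentratorHandle {
    async fn add(&self, timestamp_ns: u64) {
        let _ = self.tx.send(StatsRequest::Add { timestamp_ns }).await;
    }

    async fn get_stats(&self) -> Vec<StatsPayload> {
        let (reply, rx) = oneshot::channel();
        let _ = self.tx.send(StatsRequest::GetStats { reply }).await;
        rx.await.unwrap_or_default()
    }
}

struct StatsConcentratorService {
    rx: mpsc::Receiver<StatsRequest>,
    // The real service would also own the concentrator state, so requests
    // are handled one at a time and no Mutex is needed.
}

impl StatsConcentratorService {
    async fn run(mut self) {
        while let Some(req) = self.rx.recv().await {
            match req {
                StatsRequest::Add { timestamp_ns } => {
                    // aggregate the data point into the owned buckets here
                    let _ = timestamp_ns;
                }
                StatsRequest::GetStats { reply } => {
                    // flush the eligible buckets and reply with them
                    let _ = reply.send(Vec::new());
                }
            }
        }
    }
}

// Creating the channel in one place and handing out only the handle keeps
// the channel out of the main binary, as suggested in the review above.
fn new_stats_channel(buffer: usize) -> (StatsConcentratorHandle, StatsConcentratorService) {
    let (tx, rx) = mpsc::channel(buffer);
    (StatsConcentratorHandle { tx }, StatsConcentratorService { rx })
}
```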

## Next steps
- Implement `StatsConcentrator`, which aggregates stats data into
buckets and returns them in batches
- Add more fields to `AggregationKey` and `Stats`
- Move the processing of stats to after "obfuscation", as suggested by the
APM team. This will involve lots of code changes, so I'll make it a separate
PR.

I'll mainly move code from this draft PR:
#827

## Architecture
![Architecture diagram](https://github.com/user-attachments/assets/2d4cb925-6cfc-4581-8ed6-6bd87cf0d87a)

Jira: https://datadoghq.atlassian.net/browse/SVLS-7593