
Conversation

@treadyaparna commented Sep 14, 2025

Issue:

#405

Description of changes:

This PR introduces a comprehensive Distributed Data Stream Aggregator workflow that demonstrates large-scale data aggregation from multiple third-party locations using AWS Step Functions' distributed processing capabilities.

A key highlight of this solution is its no-Lambda approach: a low-code (almost no-code) architecture, with coding required only in the AWS Glue job for final consolidation.

Key Features:

  • 3-Tier Architecture: Main orchestrator + Standard execution child + Express execution child
  • Distributed Processing: Parallel processing of multiple data sources with configurable concurrency
  • Scalable Pagination: Handles large datasets with limit/offset pagination strategy
  • Data Consolidation: AWS Glue integration for combining partial files into final output
  • Robust Error Handling: Comprehensive retry logic and graceful failure management
  • Low-Code Implementation: Minimal reliance on custom code; orchestration is entirely achieved with native Step Functions integrations

Technical Implementation:

  • JSONata Query Language: Modern state machine definitions with advanced data manipulation
  • HTTP Integration: Secure third-party API connections via EventBridge
  • Storage Strategy: Organized S3 temporary storage with task-based directory structure
  • Status Tracking: DynamoDB integration for task lifecycle management
  • No-Lambda Pattern: Achieves distributed data aggregation without requiring custom Lambda functions
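To make the no-Lambda HTTP pattern concrete, here is a rough sketch of a JSONata-based state calling a third-party API through an EventBridge connection. State names, the endpoint URL, and the connection ARN are illustrative only, not taken from this PR:

```json
{
  "Get Location Data": {
    "Type": "Task",
    "QueryLanguage": "JSONata",
    "Resource": "arn:aws:states:::http:invoke",
    "Arguments": {
      "ApiEndpoint": "https://api.example.com/locations",
      "Method": "GET",
      "Authentication": {
        "ConnectionArn": "arn:aws:events:us-east-1:123456789012:connection/third-party/abc"
      },
      "QueryParameters": {
        "limit": "{% $states.input.limit %}",
        "offset": "{% $states.input.offset %}"
      }
    },
    "Retry": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }
    ],
    "Next": "Store Partial Results"
  }
}
```

The `http:invoke` integration handles the API call natively, so no Lambda function sits between the state machine and the third-party endpoint.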

Use Cases:

  • Multi-source data aggregation
  • Third-party API data consolidation
  • Large-scale ETL operations
  • Distributed data processing workflows

Why this matters:

  • Reduced Operational Overhead: Eliminates the need to manage, patch, and scale Lambda functions.
  • Lower Costs: Step Functions native integrations avoid Lambda invocation charges.
  • Simplified Maintenance: A low-code architecture that is easier to evolve and extend.
  • Future-Ready: Leverages AWS-managed services for scalability, reliability, and modernization without custom infrastructure.

@treadyaparna marked this pull request as draft on September 14, 2025 16:37
@treadyaparna force-pushed the distributed-data-stream-aggregator branch from 4c402da to 91fcbdb on September 14, 2025 17:23
@treadyaparna marked this pull request as ready for review on September 14, 2025 17:24
@treadyaparna changed the title from "feat: Distributed Data Stream Aggregator workflow with 3-tier architecture" to "New Workflow Submit: Distributed Data Stream Aggregator workflow with 3-tier architecture" on Sep 15, 2025
@bfreiberg (Contributor) left a comment:

Hi, thanks for your submission. The workflow seems targeted to specific circumstances, e.g. a summary API endpoint and a Glue script. I'm not sure what the added benefit is over using something like Distributed Map directly?

```
--command Name=glueetl,ScriptLocation=s3://your-bucket/glue-script.py
```

6. Create HTTP connections for third-party API access:
@bfreiberg (Contributor):
I think it would make more sense to link to the documentation here instead, as you don't target a specific API. It would also be worth highlighting that there are other auth methods as well.

@treadyaparna (Author):
Thanks, I have included these details.

```
aws glue create-job \
  --name data-aggregation-job \
  --role arn:aws:iam::YOUR_ACCOUNT:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://your-bucket/glue-script.py
```
@bfreiberg (Contributor):
What script are you referring to here?

@treadyaparna (Author):
I have uploaded the Glue job file.

### Data Processing Workflow (Express Execution)
The express workflow handles the actual API calls to third-party endpoints. It receives location details, data type, and pagination parameters, makes HTTP calls with query parameters, formats the retrieved data into standardized JSON format, and returns results with count and pagination metadata.
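As a hedged sketch of the pagination hand-off described above (state and field names are hypothetical, not taken from this PR), a JSONata Choice state could keep requesting pages while full pages keep coming back:

```json
{
  "More Pages?": {
    "Type": "Choice",
    "QueryLanguage": "JSONata",
    "Choices": [
      {
        "Condition": "{% $states.input.count = $states.input.limit %}",
        "Next": "Get Next Page"
      }
    ],
    "Default": "Done"
  },
  "Get Next Page": {
    "Type": "Pass",
    "QueryLanguage": "JSONata",
    "Output": {
      "limit": "{% $states.input.limit %}",
      "offset": "{% $states.input.offset + $states.input.limit %}",
      "count": 0
    },
    "Next": "Fetch Page"
  }
}
```

A page shorter than `limit` signals the last page, so the loop exits without an extra empty request in most cases.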

### Data Consolidation
@bfreiberg (Contributor):
Wouldn't Step Functions Distributed Map be an alternative to this?

@treadyaparna (Author):
The aggregation phase is a reduce step across all items, which the Distributed Map isn’t designed to handle directly. I write per-item results as part files and use a Glue job to merge them. This also helps avoid Step Functions’ 256 KB state payload limit by keeping large intermediate data out of the state.
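A sketch of that reduce step, reusing the job name and `--output_file_name` argument that appear elsewhere in this PR; the S3 part-file prefix layout and `taskId` field are my assumptions:

```json
{
  "Combine Part Files": {
    "Type": "Task",
    "QueryLanguage": "JSONata",
    "Resource": "arn:aws:states:::glue:startJobRun",
    "Arguments": {
      "JobName": "data-aggregation-job",
      "Arguments": {
        "--source_prefix": "{% 's3://s3-bucket-name/tmp/' & $states.input.taskId & '/' %}",
        "--output_file_name": "{% $states.input.outputFileName %}"
      }
    },
    "Next": "Wait for the status"
  }
}
```

Only the small S3 prefix and file name cross the state boundary, which is what keeps the execution well under the 256 KB payload limit.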

```
"Type": "Map",
"ItemProcessor": {
  "ProcessorConfig": {
    "Mode": "DISTRIBUTED",
```
@bfreiberg (Contributor):
Why use a Distributed Map here? How many locations do you expect to query in parallel?

@treadyaparna (Author):
I chose a Distributed Map to keep the parent execution lightweight and suited for large workloads, with room to increase concurrency later through configuration rather than a redesign. In load tests with ~1,000 locations (MaxConcurrency=1000), it performed reliably, and this approach gives me headroom to scale further if needed.

```
"Label": "IterateContainers",
"MaxConcurrency": 5,
"ItemBatcher": {
  "MaxItemsPerBatch": 1,
```
@bfreiberg (Contributor):
Why enable batching but set it to 1 item per batch then?

@treadyaparna (Author) Sep 18, 2025:
The goal here is to handle data at large scale. I am using a Distributed Map to keep the parent execution's history small: an Inline Map records every per-item step in the parent, which can push a Standard workflow toward the 25,000-event history limit at scale. With a Distributed Map, each item runs as its own child execution, so the parent stays lightweight even with large fan-out.

```
},
"Next": "Combine Part Files",
"Label": "IterateContainers",
"MaxConcurrency": 5,
```
@bfreiberg (Contributor):
If you only need a concurrency of 5, why not use inline map mode?

@treadyaparna (Author):

I set MaxConcurrency to 5 here to respect downstream rate limits. However, I chose a Distributed Map to keep the parent execution light, to get per-item isolation for our multi-step logic, and to leave room to increase concurrency later without redesign. I’ve load-tested ~1,000 items in parallel successfully; this PR uses 5 simply as a starting point.

@bfreiberg (Contributor):

I think it would be really helpful to include this in the README.

7. Deploy the state machines by updating the placeholder values in each ASL file:
- Replace `'s3-bucket-name'` with your source bucket name
- Replace `'destination_bucket'` with your destination bucket name
- Replace `'api_endpoint'` and `'summary_api_endpoint'` with your API URLs
@bfreiberg (Contributor):

What is the purpose of this summary_api_endpoint?

@treadyaparna (Author):

I see your point. summary_api_endpoint is the URL the Get Location Summary state uses to retrieve each location summary. I previously switched ARNs to static names and missed updating this reference. I’ve corrected the endpoint now.

```
"--output_file_name": "{% $outputFileName %}"
  }
},
"Next": "Wait for the status",
```
@bfreiberg (Contributor):

Why do a wait loop instead of invoking Glue synchronously?

@treadyaparna (Author):

Glue can be invoked synchronously. I chose a small wait/poll loop to allow a tunable polling interval, a dedicated timeout/circuit breaker, and explicit handling of intermediate states with retries/backoff.

If you prefer the .sync pattern for simplicity and fewer states, I'm happy to switch. Both approaches work; this one just gave me finer control. 😄
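For comparison, the .sync variant mentioned in the review would collapse the poll loop into a single state. A minimal sketch, reusing the job name from the setup instructions (the success-state name is illustrative):

```json
{
  "Combine Part Files": {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
      "JobName": "data-aggregation-job"
    },
    "Next": "Aggregation Succeeded"
  }
}
```

With `.sync`, Step Functions itself waits for the job run to reach a terminal state, at the cost of the tunable polling interval and custom intermediate-state handling described above.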

@treadyaparna (Author):

Hi @bfreiberg Thank you for your review. I have addressed all the comments and look forward to your response.

@bfreiberg (Contributor):

Thanks for your quick updates. The PR looks good now. Thank you for your contribution. Your workflow will be added to Serverlessland soon.


@treadyaparna commented Sep 24, 2025:

@bfreiberg Thank you for the approval. I’ll update the README in the following PR and look forward to the workflow being added to Serverlessland.
