Add S3 integration for input/output data to job executor #26

dawidkurdyla · 2025-08-28T22:50:44Z

This pull request introduces first‑class support for pulling input data from S3/MinIO and pushing output data back, while preserving backward‑compatible behaviour for existing HyperFlow jobs.

Key points:

Data stager. New data‑stager.js and storage/s3Adapter.js modules implement:
- Concurrent downloads from S3 prefixes/keys into the executor’s input directory. Include/exclude glob patterns and optional recursion are supported.
- Concurrent uploads of output files back to S3, with configurable overwrite and layout (e.g. {stem}.ext to derive object names from input stems).
- Concurrency and retries are controlled via HF_S3_CONCURRENCY and HF_S3_RETRIES; both have built-in defaults, so neither needs to be set.
- Optional local cleanup (HF_TASK_CLEANUP_LOCAL=1) to remove downloaded inputs and uploaded outputs after job completion.
- Uses AWS SDK v3 (@aws-sdk/client-s3 and @aws-sdk/lib-storage). Endpoint, path-style and region settings are read from HF_S3_ENDPOINT, HF_S3_FORCE_PATH_STYLE, and AWS_REGION or AWS_DEFAULT_REGION, so the same code works against both MinIO and AWS.
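
The concurrency and retry controls described in the list above can be illustrated roughly as follows. This is a minimal sketch and not the code in data-stager.js: the helper names (`withRetries`, `mapWithConcurrency`, `stageAll`), the linear backoff, and the fallback values 4 and 3 are all assumptions made for illustration. `transfer` stands in for a single S3 download or upload.

```javascript
// Hypothetical helpers; names and defaults are not taken from the PR.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `fn`, retrying up to `retries` additional times with a linear backoff.
async function withRetries(fn, retries, delayMs = 100) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(delayMs * (attempt + 1));
    }
  }
  throw lastError;
}

// Process `items` with at most `concurrency` workers in flight.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, concurrency, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  // Fall back to a single lane if `concurrency` is 0 or NaN.
  const lanes = Math.max(1, Math.min(concurrency || 1, items.length));
  await Promise.all(Array.from({ length: lanes }, lane));
  return results;
}

// Assumed fallbacks (4 and 3); the PR does not state the real defaults.
const concurrency = Number.parseInt(process.env.HF_S3_CONCURRENCY ?? '4', 10);
const retries = Number.parseInt(process.env.HF_S3_RETRIES ?? '3', 10);

// Stage every key, retrying each transfer independently.
async function stageAll(keys, transfer) {
  return mapWithConcurrency(keys, concurrency, (key) =>
    withRetries(() => transfer(key), retries)
  );
}
```

Retrying per object rather than per batch means one flaky transfer does not force the files that already succeeded to be fetched again.
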
Connector tweaks. The RemoteJobConnector now uses a keys object to refer to wf::tasksPendingCompletionHandling and wf::completedTasks, which makes it easier to mark tasks as completed or as ready for completion handling.
Environment variables & defaults. New variables:
- HF_VAR_USE_S3_IO – enable S3 downloads/uploads; off by default to maintain old behaviour.
- HF_S3_ENDPOINT, HF_S3_FORCE_PATH_STYLE, AWS_REGION/AWS_DEFAULT_REGION – S3/MinIO config.
- HF_S3_CONCURRENCY, HF_S3_RETRIES – concurrency and retry controls.
- HF_TASK_CLEANUP_LOCAL – remove local data after successful upload.

Existing workflows without these variables continue to run as before.
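
As an illustration of how these variables might map onto client settings, here is a minimal sketch. The function name `s3OptionsFromEnv`, the set of values treated as "true", the precedence of AWS_REGION over AWS_DEFAULT_REGION, and the `us-east-1` fallback are assumptions, not taken from the PR. The `region`, `endpoint` and `forcePathStyle` fields are standard `S3Client` configuration options in AWS SDK v3.

```javascript
// Hypothetical helper: derive S3 settings from environment variables.
function s3OptionsFromEnv(env = process.env) {
  // Assumption: '1', 'true' and 'yes' (case-insensitive) count as enabled.
  const truthy = (v) => ['1', 'true', 'yes'].includes(String(v ?? '').toLowerCase());

  const options = {
    // Assumed precedence and fallback region; the PR does not specify either.
    region: env.AWS_REGION || env.AWS_DEFAULT_REGION || 'us-east-1',
    // MinIO typically needs path-style addressing (bucket in the URL path).
    forcePathStyle: truthy(env.HF_S3_FORCE_PATH_STYLE),
  };
  // Only set a custom endpoint when one is given, so AWS defaults still apply.
  if (env.HF_S3_ENDPOINT) options.endpoint = env.HF_S3_ENDPOINT;

  return {
    enabled: truthy(env.HF_VAR_USE_S3_IO), // off unless explicitly enabled
    cleanupLocal: truthy(env.HF_TASK_CLEANUP_LOCAL),
    options,
  };
}
```

With `@aws-sdk/client-s3` installed, the `options` object could be passed straight to `new S3Client(options)`. When `enabled` is false, the executor would skip staging entirely, which is what keeps existing workflows unaffected.
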

Dependencies. Adds @aws-sdk/client-s3, @aws-sdk/lib-storage and minimatch, and updates amqplib to the latest version while retaining the callback‑based AMQP API for backward compatibility.

dawidkurdyla added 15 commits August 13, 2025 19:44

refactor: rewrite hflow-job-listener from callback to async/await

58ec12b

fix: fix missmatch in task completedNotificationQueueKey

2560ec8

feature: add placeholders for input/output dirs as executor arguments

977175a

(wip) Add S3 file management support to executor

50c5e7a

update lock, update reddis queue keys

05fb71a

chore: change env variables to match hyperflow

97c34c7

chore(listener): revert to amqplib callback API, implement backwards compatibility for hyperflow

badc892

chore: minor cleanup

55b1b16

S3 file support refactor, improve info/error logging

bee193d

Preserve full file path as a key

6c4c896

Add missing files to npm package.json

1d68fee

Improve executable arguments templating

10e2efd

Fix broken minimatch import

3be9785

Add graceful SIGTERM handler

6340bf5

Sigterm, sigint handler

55bc005
