Queue-based Worker Manager #647

al-rigazzi · 2024-07-25T21:27:04Z

This PR adds the RequestDispatcher to the MLI. The RequestDispatcher batches inference requests together.
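As a rough illustration of what batching inference requests can look like, here is a minimal sketch using only the Python standard library. It is not the RequestDispatcher code from this PR: `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names, and the size-or-timeout policy is an assumption made for the example.

```python
import queue
import time
from typing import List


def collect_batch(
    requests: "queue.Queue[str]", max_batch_size: int, max_wait_s: float
) -> List[str]:
    """Collect up to max_batch_size requests, waiting at most max_wait_s overall.

    Hypothetical helper for illustration; not the SmartSim API.
    """
    batch: List[str] = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget exhausted: flush a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Five pending requests, batched three at a time.
pending: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    pending.put(f"request-{i}")

first = collect_batch(pending, max_batch_size=3, max_wait_s=0.2)
second = collect_batch(pending, max_batch_size=3, max_wait_s=0.2)
```

The first call returns as soon as the batch is full; the second returns a partial batch once the wait budget runs out, which keeps latency bounded when traffic is light.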

The implementation can be improved, especially by adding:

- An abstraction for memory, so that Dragon's MemoryPool can be wrapped in a SmartSim class and different types of memory can be injected (especially at unit-testing time)
- More parameters for Torch threads and intra-op threads
- A queue removal mechanism in RequestDispatcher
- A model removal mechanism for when an out-of-memory (OOM) error occurs while loading a model
- Tests for RequestDispatcher, BatchQueue, DeviceManager, and so on
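The memory abstraction suggested in the first item could take roughly the following shape. This is only a sketch under stated assumptions: `MemoryAllocator`, `InProcessMemory`, and `stage_tensor` are hypothetical names, not SmartSim or Dragon APIs, and a real wrapper around Dragon's MemoryPool would implement the same interface in place of the in-process test double shown here.

```python
from typing import Dict, Protocol


class MemoryAllocator(Protocol):
    """Hypothetical interface that callers depend on instead of a concrete pool."""

    def allocate(self, size: int) -> memoryview: ...

    def free(self, buffer: memoryview) -> None: ...


class InProcessMemory:
    """Test double backed by plain bytearrays (an assumption for unit tests)."""

    def __init__(self) -> None:
        self._live: Dict[int, bytearray] = {}

    def allocate(self, size: int) -> memoryview:
        backing = bytearray(size)
        self._live[id(backing)] = backing  # keep the backing store alive
        return memoryview(backing)

    def free(self, buffer: memoryview) -> None:
        # Look up the backing object before releasing the view.
        self._live.pop(id(buffer.obj), None)
        buffer.release()

    @property
    def live_allocations(self) -> int:
        return len(self._live)


def stage_tensor(memory: MemoryAllocator, payload: bytes) -> memoryview:
    """Copy payload into memory obtained from whichever allocator was injected."""
    buffer = memory.allocate(len(payload))
    buffer[:] = payload
    return buffer


memory = InProcessMemory()
staged = stage_tensor(memory, b"abc")
staged_bytes = bytes(staged)
live_before_free = memory.live_allocations
memory.free(staged)
live_after_free = memory.live_allocations
```

Because `stage_tensor` only relies on the protocol, a unit test can pass `InProcessMemory` while production code passes a pool-backed implementation.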

There is no mechanism to address model versions right now.

We can decide what to do now and what to put up a ticket for.


AlyssaCote

LGTM! One tiny comment about potentially removing a timing line, but not worth holding up the approval!

ex/high_throughput_inference/mock_app.py

codecov · 2024-08-27T19:17:59Z

Codecov Report

Attention: Patch coverage is 0% with 623 lines in your changes missing coverage. Please review.

Please upload report for BASE (mli-feature@6d5518b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...re/mli/infrastructure/control/requestdispatcher.py | 0.00% | 254 Missing ⚠️ |
| .../_core/mli/infrastructure/control/workermanager.py | 0.00% | 95 Missing ⚠️ |
| smartsim/_core/utils/timings.py | 0.00% | 86 Missing ⚠️ |
| smartsim/_core/mli/infrastructure/worker/worker.py | 0.00% | 66 Missing ⚠️ |
| ...im/_core/mli/infrastructure/worker/torch_worker.py | 0.00% | 63 Missing ⚠️ |
| .../_core/mli/infrastructure/control/devicemanager.py | 0.00% | 42 Missing ⚠️ |
| ..._core/mli/infrastructure/control/error_handling.py | 0.00% | 16 Missing ⚠️ |
| smartsim/_core/launcher/dragon/dragonBackend.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@              Coverage Diff               @@
##             mli-feature     #647   +/-   ##
==============================================
  Coverage               ?   71.34%           
==============================================
  Files                  ?      102           
  Lines                  ?     8525           
  Branches               ?        0           
==============================================
  Hits                   ?     6082           
  Misses                 ?     2443           
  Partials               ?        0

| Files with missing lines | Coverage Δ |
|---|---|
| smartsim/_core/launcher/dragon/dragonBackend.py | `1.96% <0.00%> (ø)` |
| ..._core/mli/infrastructure/control/error_handling.py | `0.00% <0.00%> (ø)` |
| .../_core/mli/infrastructure/control/devicemanager.py | `0.00% <0.00%> (ø)` |
| ...im/_core/mli/infrastructure/worker/torch_worker.py | `0.00% <0.00%> (ø)` |
| smartsim/_core/mli/infrastructure/worker/worker.py | `0.00% <0.00%> (ø)` |
| smartsim/_core/utils/timings.py | `0.00% <0.00%> (ø)` |
| .../_core/mli/infrastructure/control/workermanager.py | `0.00% <0.00%> (ø)` |
| ...re/mli/infrastructure/control/requestdispatcher.py | `0.00% <0.00%> (ø)` |

mellis13

Fantastic first implementation of the queue-based architecture. Thanks!

al-rigazzi added 30 commits June 25, 2024 12:21

Initial FLI-based implementation

e98e2fe

Add inference example stub

043f0e7

Lint, style, black magic

efc9e83

Merge branch 'mli-feature' of https://github.com/CrayLabs/SmartSim in…

35ec45e

…to fli-worker

Bring up to feature branch

ed3c42a

Update example

e5be26b

Change the changelog

a23010f

Make style

3c20f46

Attempt to mitigate import dragon error

b9ed5ba

Import dragon optional

0de06f3

isort

d051385

Fix imports in dragon backend tests

e77b1cd

Style

a90888d

Fix type

b431221

Rename examples dir

23efebc

Remove old dir

09b9d24

Add tests for torch worker

56d8e50

Switch to sender-supplied channels in app example

6cec83e

Add prototype client for mock app

3ad6d44

Update mock app

bd5f133

Changes to feature store

3e343ee

Merge upstream

a0525e5

Make style

a2bed26

Fix typing

36e92d9

Fix lint

59840a3

Remove duplicated/useless comments

b35b37d

Bring up to date with new schema

51e0b17

Add feature store prototype caching

1fcf17d

Add redis driver, fix FLI

d76f880

Merge branch 'mli-feature' of https://github.com/CrayLabs/SmartSim in…

0564d01

…to fli-worker

al-rigazzi added 8 commits August 26, 2024 11:17

Added tests for device manager

4a5185b

Fix tests

9d0ba30

Style and type

99da355

Fix mock app

c3646d7

Small change to app

c54e880

Merge branch 'mli-feature' into queue-wm

01c6fa9

Small change to app

093d706

Last fixes!

d9de5c1

al-rigazzi requested review from AlyssaCote, ankona and mellis13 and removed request for ankona August 27, 2024 18:06

Avoid using t.Self

eb03f08

AlyssaCote approved these changes Aug 27, 2024


ex/high_throughput_inference/mock_app.py (outdated, resolved)

al-rigazzi added 5 commits August 27, 2024 13:31

Remove unused timing

1e1b8c9

Split timing for request and tensors

be0b8e0

Pin watchdog to <5

bc11d92

Style

b04f4c1

Other styling fixes

47088f0

Move tests that require dragon.MemoryPool

0609eec

mellis13 approved these changes Aug 27, 2024


al-rigazzi added 6 commits August 27, 2024 17:51

Update tests

275e102

Style

b220d99

Import or skip dragon

d3ab796

Isort

14e627e

Fix pytest import

bbe97ff

Adapt syntax for python 3.9

eea793e

al-rigazzi merged commit 5d85995 into CrayLabs:mli-feature Aug 28, 2024

al-rigazzi deleted the queue-wm branch August 28, 2024 15:48
