Python's nature brings a lot of challenges when dealing with blocking IO. The HuggingFace SDK doesn't provide an out-of-the-box solution for running model inference in threads, although the lower-level frameworks (PyTorch and TensorFlow) provide the necessary tooling. The HF docs suggest using a multi-threaded web server, but my attempts to apply the same snippet didn't work out well.
As I needed an urgent PoC of a multi-tenant service (more than one user using LLM capabilities at once), I decided to build one that follows the workers concept, where any number of workers can be started alongside the backend to provide a multi-tenant API for LLM inference.
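To make the workers concept concrete, here is a minimal illustrative sketch, not the actual `backend.py`/`worker.py`: workers register their base URL with the backend, and the backend forwards each incoming request to the next registered worker in round-robin order. The class and function names and the payload shape below are assumptions made up for illustration.

```python
# Illustrative sketch of the workers concept; names and payload shape are
# assumptions, not the actual backend.py / worker.py implementation.
import itertools
import json
import urllib.request


class WorkerRegistry:
    """Tracks registered worker base URLs and hands them out round-robin."""

    def __init__(self):
        self._workers: list[str] = []
        self._cycle = None

    def register(self, base_url: str) -> None:
        self._workers.append(base_url)
        self._cycle = itertools.cycle(self._workers)

    def deregister(self, base_url: str) -> None:
        self._workers.remove(base_url)
        self._cycle = itertools.cycle(self._workers) if self._workers else None

    def next_worker(self) -> str:
        if self._cycle is None:
            raise RuntimeError("no workers registered")
        return next(self._cycle)


def forward_request(registry: WorkerRegistry, payload: dict) -> dict:
    """Forward a user's article/question payload to the next available worker."""
    request = urllib.request.Request(
        registry.next_worker(),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With multiple workers registered, requests from several users can be served concurrently, since each blocking inference call runs in a separate worker process.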
This specific demo runs the falcon-40b-instruct model in conversational mode and lets users provide a knowledge source (an article) and ask a question, which the LLM answers assuming the article is its only knowledge. To use this PoC:
- Create a `venv` and activate it:

  ```
  python3 -m venv venv   # or: python3.11 -m venv venv
  source venv/bin/activate
  ```

- Install runtime dependencies:

  ```
  pip install .
  ```

- Start `backend`:

  ```
  python backend.py
  ```

- Start one or multiple `worker` instances. This is done by opening a new shell, sourcing the earlier created `venv`, and starting the `worker` instance:

  ```
  # New shell, working directory is this project
  source venv/bin/activate
  python worker.py
  ```

- Make a request to `backend`:

  ```
  curl -X POST -H "Content-type: application/json" -d '{"article": "Today is Wed. 21st. Jun 2023. The weather is hot. I am currently not at home, but at office. I am working on implementing multi-threading for the LLM backend", "question":"What date is it?"}' 'http://127.0.0.1:8080'
  # The date mentioned in the article is 21st. June...
  ```
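The same request can also be made from Python; the snippet below is a minimal standard-library sketch using the address and payload from the curl example above.

```python
# Minimal Python equivalent of the curl request above (standard library only).
import json
import urllib.request

payload = {
    "article": "Today is Wed. 21st. Jun 2023. The weather is hot. I am currently "
               "not at home, but at office. I am working on implementing "
               "multi-threading for the LLM backend",
    "question": "What date is it?",
}
request = urllib.request.Request(
    "http://127.0.0.1:8080",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # e.g. the model's answer about the date
```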
To convert this into an MVP, the following points should be tackled:

- Add a health check to `backend`: upon making a request to a `worker` instance, if the `worker` instance is unreachable over a period of retries, it should be removed from the `backend`'s registered workers (see the sketch after this list).
- Add access control on `backend` endpoints: the `backend` endpoints for registering and de-registering `worker` instances should be scoped down to prevent misuse.
- Return the inference time with responses for analytics.
- Containerize the PoC to run with `docker-compose`.
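As a starting point for the first item, here is a minimal sketch of the health-check idea; the `/health` endpoint, retry count, and check interval are assumptions, not part of the current PoC.

```python
# Sketch of a backend-side health check; the /health endpoint, retry count and
# interval are assumptions, not part of the current PoC.
import time
import urllib.request

MAX_RETRIES = 3             # consecutive failures before a worker is dropped
CHECK_INTERVAL_SECONDS = 10


def is_alive(worker_url: str) -> bool:
    """Return True if the worker answers its (assumed) /health endpoint."""
    try:
        with urllib.request.urlopen(f"{worker_url}/health", timeout=2) as response:
            return response.status == 200
    except OSError:
        return False


def health_check_loop(registered_workers: list[str]) -> None:
    """Periodically probe workers and de-register the ones that keep failing."""
    failures = {url: 0 for url in registered_workers}
    while True:
        for url in list(registered_workers):
            if is_alive(url):
                failures[url] = 0
            else:
                failures[url] = failures.get(url, 0) + 1
                if failures[url] >= MAX_RETRIES:
                    registered_workers.remove(url)  # drop unreachable worker
        time.sleep(CHECK_INTERVAL_SECONDS)
```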