From 126d80c55e9c0189ecde4d6c15a33d79f65a447f Mon Sep 17 00:00:00 2001
From: XiaoYun Zhang
Date: Wed, 31 Jul 2024 12:31:51 -0700
Subject: [PATCH 1/7] add readme

---
 src/Microsoft.ML.GenAI.Phi/README.md | 119 +++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 src/Microsoft.ML.GenAI.Phi/README.md

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
new file mode 100644
index 0000000000..31f9f151bf
--- /dev/null
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -0,0 +1,119 @@
+# Microsoft.ML.GenAI.Phi
+TorchSharp implementation of Microsoft Phi-series models for GenAI
+
+## Supported list
+The following Phi models are supported and tested:
+- [x] [Phi-2](https://huggingface.co/microsoft/phi-2)
+- [x] [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
+- [x] [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
+- [ ] [Phi-3-small-8k-instruct](https://huggingface.co/microsoft/Phi-3-small-8k-instruct)
+- [ ] [Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)
+- [ ] [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct)
+- [ ] [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)
+- [ ] [Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
+
+## Getting Started with Semantic Kernel
+
+### Download the model weights (e.g. Phi-3-mini-4k-instruct) from Hugging Face
+```bash
+## make sure you have git-lfs installed
+git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
+```
+
+### Load model
+```csharp
+var weightFolder = "/path/to/Phi-3-mini-4k-instruct";
+var configName = "config.json";
+var config = JsonSerializer.Deserialize<Phi3Config>(File.ReadAllText(Path.Combine(weightFolder, configName)));
+var model = new Phi3ForCasualLM(config);
+
+// load tokenzier
+var tokenizerModelName = "tokenizer.model";
+var tokenizer = Phi3TokenizerHelper.FromPretrained(Path.Combine(weightFolder, tokenizerModelName));
+
+// load weights
+model.LoadSafeTensors(weightFolder);
+
+// initialize device
+var device = "cuda";
+if (device == "cuda")
+{
+    torch.InitializeDeviceType(DeviceType.CUDA);
+}
+
+// create causal language model pipeline
+var pipeline = new CausalLMPipeline<LlamaTokenizer, Phi3ForCasualLM>(tokenizer, model, device);
+```
+
+### Add pipeline as `IChatCompletionService` to Semantic Kernel
+```csharp
+var kernel = Kernel.CreateBuilder()
+    .AddGenAIChatCompletion(pipeline)
+    .Build();
+```
+
+### chat with the model
+```csharp
+var chatService = kernel.GetRequiredService<IChatCompletionService>();
+var chatHistory = new ChatHistory();
+chatHistory.AddSystemMessage("you are a helpful assistant");
+chatHistory.AddUserMessage("write a C# program to calculate the factorial of a number");
+await foreach (var response in chatService.GetStreamingChatMessageContentsAsync(chatHistory))
+{
+    Console.Write(response);
+}
+```
+
+## Getting Started with AutoGen.Net
+### Follow the same steps to download the model weights and load the model
+### Create `Phi3Agent` from pipeline
+```csharp
+var agent = new Phi3Agent(pipeline, name: "assistant")
+    .RegisterPrintMessage();
+```
+
+### Chat with the model
+```csharp
+var task = """
+write a C# program to calculate the factorial of a number
+""";
+
+await agent.SendAsync(task);
+```
+
+### More examples
+Please refer to [Microsoft.ML.GenAI.Samples](./../../docs/samples/Microsoft.ML.GenAI.Samples/) for more examples. A direct-pipeline sketch follows below.
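+
+## Using the pipeline directly
+You can also drive `CausalLMPipeline` without Semantic Kernel or AutoGen.Net. The snippet below is a minimal sketch rather than a definitive reference: it assumes a `Generate` overload that takes a raw prompt and a maximum output length, so check the `CausalLMPipeline` surface for the exact signature. The prompt string follows the Phi-3 instruct chat template.
+
+```csharp
+// Phi-3 instruct models expect the <|user|>/<|assistant|> chat template
+var prompt = """
+<|user|>
+Write a C# program to calculate the factorial of a number.<|end|>
+<|assistant|>
+""";
+
+// assumed overload: raw prompt in, generated text out
+var output = pipeline.Generate(prompt, maxLen: 512);
+Console.WriteLine(output);
+```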
+
+## Dynamic loading
+For the best of inference performance, it's recommended to run model inference on GPU, which requires at least 8GB of GPU memory for phi-3-mini-4k-instruct model if fully loaded.
+
+If your GPU memory is not enough, you can choose to dynamically load the model weights to GPU memory. Here is how it works behind the scenes:
+- When initializing the model, the size of each layer is calculated and stored in a dictionary.
+- When loading the model weights, each layer is assigned to a device (CPU or GPU) based on the size of the layer and the remaining memory of the device. If there is not enough memory on the device, the layer is loaded into CPU memory.
+- During inference, a layer that was loaded into CPU memory is moved to GPU memory before its forward pass and moved back to CPU memory afterwards.
+
+Here is how to enable dynamic loading of the model:
+### Step 1: infer the size of each layer
+You can infer the size of each layer using `InferDeviceMapForEachLayer` api. The `deviceMap` will be a key-value dictionary, where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu").
+
+```csharp
+// manually set up the available memory on each device
+var deviceSizeMap = new Dictionary<string, long>
+{
+    ["cuda"] = modelSizeOnCudaInGB * 1L * 1024 * 1024 * 1024,
+    ["cpu"] = modelSizeOnMemoryInGB * 1L * 1024 * 1024 * 1024,
+    ["disk"] = modelSizeOnDiskInGB * 1L * 1024 * 1024 * 1024,
+};
+
+var deviceMap = model.InferDeviceMapForEachLayer(
+    devices: ["cuda", "cpu", "disk"],
+    deviceSizeMapInByte: deviceSizeMap);
+```
+
+### Step 2: load model weight using `ToDynamicLoadingModel` api
+Once the `deviceMap` is calculated, you can pass it to the `ToDynamicLoadingModel` API to load the model weights.
+
+```csharp
+model = model.ToDynamicLoadingModel(deviceMap, "cuda");
+```
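+
+### Step 3: run inference as usual
+Once the weights are loaded with a device map, the model is used exactly like a fully loaded one; layers that live in CPU (or disk) memory are shuttled to the GPU on demand. Below is a minimal sketch, assuming the tokenizer was loaded as in the "Load model" section above:
+
+```csharp
+// the dynamically loaded model plugs into the same pipeline type as before
+var pipeline = new CausalLMPipeline<LlamaTokenizer, Phi3ForCasualLM>(tokenizer, model, "cuda");
+```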
From b45eb89343db077010e03da1deb53523d6dceb45 Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:33:54 -0700
Subject: [PATCH 2/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index 31f9f151bf..bc94d326aa 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -95,7 +95,7 @@ If your GPU memory is not enough, you can choose to dynamically load the model weights to GPU memory.
 
 Here is how to enable dynamic loading of the model:
 ### Step 1: infer the size of each layer
-You can infer the size of each layer using `InferDeviceMapForEachLayer` api. The `deviceMap` will be a key-value dictionary, where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu").
+You can infer the size of each layer using `InferDeviceMapForEachLayer` API. The `deviceMap` will be a key-value dictionary, where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu").
 
 ```csharp
 // manually set up the available memory on each device
 var deviceSizeMap = new Dictionary<string, long>
 {
     ["cuda"] = modelSizeOnCudaInGB * 1L * 1024 * 1024 * 1024,
     ["cpu"] = modelSizeOnMemoryInGB * 1L * 1024 * 1024 * 1024,
     ["disk"] = modelSizeOnDiskInGB * 1L * 1024 * 1024 * 1024,
 };
 
 var deviceMap = model.InferDeviceMapForEachLayer(
     devices: ["cuda", "cpu", "disk"],
     deviceSizeMapInByte: deviceSizeMap);
 ```

From d1019b85da428dfcf2eef037237705b4653f465f Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:34:01 -0700
Subject: [PATCH 3/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index bc94d326aa..d3caa2e928 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -94,7 +94,7 @@ If your GPU memory is not enough, you can choose to dynamically load the model weights to GPU memory.
 - During inference, a layer that was loaded into CPU memory is moved to GPU memory before its forward pass and moved back to CPU memory afterwards.
 
 Here is how to enable dynamic loading of the model:
-### Step 1: infer the size of each layer
+### Step 1: Infer the size of each layer
 You can infer the size of each layer using `InferDeviceMapForEachLayer` API. The `deviceMap` will be a key-value dictionary, where the key is the layer name and the value is the device name (e.g. "cuda" or "cpu").
 
 ```csharp
From 3800b1405f34dd003b204bad6d54ac4726ee560e Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:34:08 -0700
Subject: [PATCH 4/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index d3caa2e928..bb6d99de01 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -86,7 +86,7 @@ await agent.SendAsync(task);
 Please refer to [Microsoft.ML.GenAI.Samples](./../../docs/samples/Microsoft.ML.GenAI.Samples/) for more examples.
 
 ## Dynamic loading
-For the best of inference performance, it's recommended to run model inference on GPU, which requires at least 8GB of GPU memory for phi-3-mini-4k-instruct model if fully loaded.
+It's recommended to run model inference on GPU, which requires at least 8GB of GPU memory for phi-3-mini-4k-instruct model if fully loaded. If your GPU memory is not enough, you can choose to dynamically load the model weights to GPU memory.
 
 Here is how it works behind the scenes:
 - When initializing the model, the size of each layer is calculated and stored in a dictionary.

From e7a51c61225071b8bbd480dd010deda01d2bc1e8 Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:34:20 -0700
Subject: [PATCH 5/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index bb6d99de01..b51ac942c9 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -27,7 +27,7 @@ var configName = "config.json";
 var config = JsonSerializer.Deserialize<Phi3Config>(File.ReadAllText(Path.Combine(weightFolder, configName)));
 var model = new Phi3ForCasualLM(config);
 
-// load tokenzier
+// load tokenizer
 var tokenizerModelName = "tokenizer.model";
 var tokenizer = Phi3TokenizerHelper.FromPretrained(Path.Combine(weightFolder, tokenizerModelName));

From e2083aa5910051ec072dfb71fd440721ec361a86 Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:34:29 -0700
Subject: [PATCH 6/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index b51ac942c9..effa0dc3b5 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -111,7 +111,7 @@ var deviceMap = model.InferDeviceMapForEachLayer(
     deviceSizeMapInByte: deviceSizeMap);
 ```
 
-### Step 2: load model weight using `ToDynamicLoadingModel` api
+### Step 2: Load model weights using `ToDynamicLoadingModel` API
 Once the `deviceMap` is calculated, you can pass it to the `ToDynamicLoadingModel` API to load the model weights.
 
 ```csharp

From cc55546c9ad72ed5ad46c1ded11b34fbfb4764df Mon Sep 17 00:00:00 2001
From: Xiaoyun Zhang
Date: Mon, 5 Aug 2024 13:35:19 -0700
Subject: [PATCH 7/7] Update src/Microsoft.ML.GenAI.Phi/README.md

Co-authored-by: Luis Quintanilla <46974588+luisquintanilla@users.noreply.github.com>
---
 src/Microsoft.ML.GenAI.Phi/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Microsoft.ML.GenAI.Phi/README.md b/src/Microsoft.ML.GenAI.Phi/README.md
index effa0dc3b5..758a78ad47 100644
--- a/src/Microsoft.ML.GenAI.Phi/README.md
+++ b/src/Microsoft.ML.GenAI.Phi/README.md
@@ -53,7 +53,7 @@ var kernel = Kernel.CreateBuilder()
     .Build();
 ```
 
-### chat with the model
+### Chat with the model
 ```csharp
 var chatService = kernel.GetRequiredService<IChatCompletionService>();
 var chatHistory = new ChatHistory();