
Commit cf94d30

Add libtorchtext cpp example (#1817)

* First attempt at adding examples
* Working tokenizer example
* Fixes to readme
* Formatting fixes
* Added instructions to download artifacts
* Resolve PR comments

1 parent e1c7bc6 commit cf94d30

File tree

7 files changed: +148 -0 lines changed

examples/libtorchtext/.gitignore

Lines changed: 2 additions & 0 deletions

```
build
**/*.pt
```
examples/libtorchtext/CMakeLists.txt

Lines changed: 11 additions & 0 deletions

```cmake
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(libtorchtext_cpp_example)

SET(BUILD_TORCHTEXT_PYTHON_EXTENSION OFF CACHE BOOL "Build Python binding")

find_package(Torch REQUIRED)
message("libtorchtext CMakeLists: ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_subdirectory(../.. libtorchtext)
add_subdirectory(tokenizer)
```

examples/libtorchtext/README.md

Lines changed: 22 additions & 0 deletions

# Libtorchtext Examples

- [Tokenizer](./tokenizer)

## Build

The example applications in this directory depend on `libtorch` and `libtorchtext`. If you have a working `PyTorch`
installation, you already have `libtorch`. Please refer to
[this tutorial](https://pytorch.org/tutorials/advanced/torch_script_custom_classes.html) for the use of `libtorch` and
TorchScript.

`libtorchtext` is the library of torchtext's C++ components, without the Python components. It is not currently
distributed as a prebuilt binary, so it is built alongside the example applications.

To build `libtorchtext` and the example applications, run the following commands:

```bash
chmod +x build.sh  # give the script execute permission
./build.sh
```

For the usage of each application, refer to the corresponding application directory.
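The core `libtorch` pattern these examples rely on is loading a TorchScript archive and invoking it from C++. As a
minimal, hedged sketch (not part of this commit; `tokenizer.pt` is an assumed path, produced as in the tokenizer
example below):

```cpp
#include <torch/script.h>

#include <iostream>

int main() {
  // Load a serialized TorchScript module from disk.
  torch::jit::script::Module module = torch::jit::load("tokenizer.pt");
  // Call its forward method with a single string input and print the result.
  torch::jit::IValue out = module.forward({torch::jit::IValue("hello world")});
  std::cout << out << std::endl;
  return 0;
}
```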

examples/libtorchtext/build.sh

Lines changed: 18 additions & 0 deletions

```bash
#!/usr/bin/env bash

set -eux

# Resolve the directory containing this script and build under ./build.
this_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
build_dir="${this_dir}/build"

mkdir -p "${build_dir}"
cd "${build_dir}"

# Fetch third-party submodules, then configure against the libtorch that
# ships with the installed PyTorch package.
git submodule update
cmake \
    -DCMAKE_PREFIX_PATH="$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')" \
    -DRE2_BUILD_TESTING:BOOL=OFF \
    -DBUILD_TESTING:BOOL=OFF \
    -DSPM_ENABLE_SHARED=OFF \
    ..
cmake --build .
```
examples/libtorchtext/tokenizer/README.md

Lines changed: 42 additions & 0 deletions

# Tokenizer

This example demonstrates how you can use torchtext's `GPT2BPETokenizer` in a C++ environment.

## Steps

### 1. Download necessary artifacts

First we download the `gpt2_bpe_vocab.bpe` and `gpt2_bpe_encoder.json` artifacts, both of which are needed to construct
the `GPT2BPETokenizer` object.

```bash
curl -O https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe
curl -O https://download.pytorch.org/models/text/gpt2_bpe_encoder.json
```

### 2. Create tokenizer TorchScript file

Next we create our tokenizer object and save it as a TorchScript object. We also print the tokenizer's output on a
sample sentence and verify that it is the same before and after saving and re-loading the tokenizer. In the next steps
we load and execute the tokenizer in our C++ application. The C++ code can be found in [`main.cpp`](./main.cpp).

```bash
tokenizer_file="tokenizer.pt"
python create_tokenizer.py --tokenizer-file "${tokenizer_file}"
```

### 3. Build the application

Please refer to [the top-level README.md](../README.md).

### 4. Run the application

Now we run the C++ application `tokenizer` with the TorchScript object we created in Step 2. The tokenizer is run with
the following sentence as input, and we verify that the output is the same as that of Step 2.

From [the top-level directory](../), run:

```bash
./build/tokenizer/tokenize "tokenizer/${tokenizer_file}"
```
examples/libtorchtext/tokenizer/create_tokenizer.py

Lines changed: 29 additions & 0 deletions

```python
from argparse import ArgumentParser

import torch
from torchtext import transforms


def main(args):
    tokenizer_file = args.tokenizer_file
    sentence = "The green grasshopper jumped over the fence"

    # create tokenizer object
    encoder_json = "gpt2_bpe_encoder.json"
    bpe_vocab = "gpt2_bpe_vocab.bpe"
    tokenizer = transforms.GPT2BPETokenizer(encoder_json_path=encoder_json, vocab_bpe_path=bpe_vocab)

    # script and save tokenizer
    tokenizer = torch.jit.script(tokenizer)
    print(tokenizer(sentence))
    torch.jit.save(tokenizer, tokenizer_file)

    # load saved tokenizer and verify outputs match
    t = torch.jit.load(tokenizer_file)
    print(t(sentence))


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--tokenizer-file", default="tokenizer.pt", type=str)
    main(parser.parse_args())
```
examples/libtorchtext/tokenizer/main.cpp

Lines changed: 24 additions & 0 deletions

```cpp
#include <torch/nn/functional/activation.h>
#include <torch/script.h>

#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "Usage: tokenize <path-to-tokenizer.pt>\n";
    return -1;
  }

  std::cout << "Loading model...\n";

  // Load the TorchScript tokenizer created by create_tokenizer.py.
  torch::jit::script::Module module;
  try {
    module = torch::jit::load(argv[1]);
  } catch (const c10::Error& e) {
    std::cerr << "Error loading the tokenizer\n";
    return -1;
  }

  torch::NoGradGuard no_grad; // ensures that autograd is off
  // Run the tokenizer on the same sample sentence used in create_tokenizer.py.
  torch::jit::IValue tokens_ivalue = module.forward(
      std::vector<c10::IValue>(1, "The green grasshopper jumped over the fence"));
  std::cout << "Result: " << tokens_ivalue << std::endl;

  return 0;
}
```
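The tokenizer's `forward` returns its token list wrapped in a generic `IValue`. A hedged sketch of unpacking that into
a `std::vector<std::string>` (assuming, as with the Python transform, that the output is a TorchScript `List[str]`;
`to_tokens` is a hypothetical helper, not part of this commit):

```cpp
#include <torch/script.h>

#include <string>
#include <vector>

// Hypothetical helper: unpack an IValue holding a TorchScript List[str]
// into a std::vector<std::string>.
std::vector<std::string> to_tokens(const torch::jit::IValue& ivalue) {
  std::vector<std::string> tokens;
  for (const auto& item : ivalue.toListRef()) {
    tokens.push_back(item.toStringRef());
  }
  return tokens;
}
```

With such a helper, the `Result:` line above could print individual tokens instead of the raw `IValue`.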
