T5Transform text pre-processing for t5 model #1852
Conversation
```python
class TestTransforms(TorchtextTestCase):
    def _t5tokenizer(self, test_scripting):
        asset_name = "t5_tokenizer_base.model"
```
Instead of adding a new asset file, we should probably work with existing assets if available. In this case, shall we try working with spm_example.model?
@Nayef211 and I were actually debating how best to approach this, because if we used spm_example.model then we'd essentially be testing for functional correctness. But since T5Transform is so similar to SentencePieceTokenizer except that it includes additional transformations specific to T5, we thought it made more sense to tailor the test towards t5 specifically as opposed to a general spm model.
@parmeet if we use the existing spm_example.model, these tests do not add as much value as we already have specific tests for the SentencePiece tokenizer. As @pmabbo13 mentioned, if we want to test that the output of the T5Transform is equal to that of the T5Transform in HF, then it would make sense to make use of the spm model specific to T5. Also the asset is around 700 KB which is less than some of the existing assets we've checked in. Lmk what you think!
I understand the overall sentiment here, and it's a good argument for adding the actual asset file. But then this makes me wonder whether we are really unit-testing the functional correctness of the transform implementation or actually testing the asset file :).
That said, I think we would also need this for integration testing, since we need a real output there instead of dummy output from an arbitrary SPM model file. So I think I agree with you both: adding the actual asset file would make sense!
```python
self.padding_idx = padding_idx
self.pipeline = T.Sequential(T.Truncate(self.max_seq_len), T.AddToken(token=self.eos_idx, begin=False))

def forward(self, input: Any) -> Any:
```
Is there a reason we specify the input as `Any` instead of `Union[str, List[str]]`?
I initially had them typed, but noticed that SentencePieceTokenizer had them as Any so deferred to that. I will revert it back.
> Is there a reason we specify the input as `Any` instead of `Union[str, List[str]]`?
I guess @pmabbo13 might have followed what we did in our transform implementations, where we always use `Any` as the input type. The reason for this is to ensure that when transforms are combined in `SequentialTransform`, the overall transform is still scriptable. More details about this issue can be found here. As for `T5Transform`, if we do not expect it to be used in `SequentialTransform` and treat it as a standalone transform, I agree that we could just use the right annotation types as suggested above.
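To make the scriptability point concrete, here is a torch-free sketch of why a permissive `Any` annotation helps a generic sequential container: the container cannot know the intermediate types flowing between stages, so strict per-stage annotations would make a static compiler such as TorchScript reject many valid chains. All class names below are simplified stand-ins for the torchtext transforms, not the actual implementations.

```python
from typing import Any

# Simplified stand-ins for torchtext transforms (assumptions, not the real code).
class Truncate:
    def __init__(self, max_len: int):
        self.max_len = max_len

    def __call__(self, input: Any) -> Any:
        # Accepts any sliceable input: a token list, a string, a batch, ...
        return input[: self.max_len]

class AddToken:
    def __init__(self, token: Any):
        self.token = token

    def __call__(self, input: Any) -> Any:
        return input + [self.token]

class SequentialTransform:
    """Chains stages; because each stage is annotated with Any, the
    container places no constraint on what flows between stages."""
    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, input: Any) -> Any:
        for stage in self.stages:
            input = stage(input)
        return input

pipeline = SequentialTransform(Truncate(3), AddToken(1))
print(pipeline([10, 20, 30, 40]))  # [10, 20, 30, 1]
```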
parmeet
left a comment
LGTM! Thanks @pmabbo13 for adding the transform class.
Description
Add a transformation class that takes string inputs and prepares them to be passed into a T5 encoder. The transformation class should also have a decode method that translates token ids back into strings. This will be used to translate the sequences generated by the T5 decoder.
Process
T5Transform is instantiated by providing a path to a pre-trained SentencePiece model, the maximum sequence length (used for truncation), the padding index, and the end-of-sequence index.
Its forward method accepts a single string, or a batch of strings, and uses the pre-trained SentencePiece model to tokenize the string(s) and translate the tokens to their corresponding ids. Then the resulting sequences are truncated and an end-of-sequence token is added to each. Finally, the sequences are padded to the length of the longest sequence in the batch.
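The truncate / add-EOS / pad steps described above can be sketched in plain Python. The index values and helper name here are illustrative assumptions, not the actual torchtext code:

```python
from typing import List

# Assumed special-token indices for illustration only.
EOS_IDX = 1
PAD_IDX = 0
MAX_SEQ_LEN = 5

def process_batch(token_ids: List[List[int]]) -> List[List[int]]:
    # 1. Truncate each sequence to the maximum length.
    seqs = [ids[:MAX_SEQ_LEN] for ids in token_ids]
    # 2. Append the end-of-sequence token.
    seqs = [ids + [EOS_IDX] for ids in seqs]
    # 3. Pad every sequence to the length of the longest one in the batch.
    longest = max(len(ids) for ids in seqs)
    return [ids + [PAD_IDX] * (longest - len(ids)) for ids in seqs]

batch = [[11, 12, 13, 14, 15, 16, 17], [21, 22]]
print(process_batch(batch))
# [[11, 12, 13, 14, 15, 1], [21, 22, 1, 0, 0, 0]]
```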
Its decode method accepts a single list of token ids, or a batch of these lists to represent multiple sequences. The pre-trained SentencePiece model is then used to translate them back into tokens and merge the tokens into a single string per sequence.
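A hedged sketch of the decode path, with a toy id-to-token vocabulary standing in for the real SentencePiece model. SentencePiece marks word boundaries with "▁"; replacing that marker with a space is a simplified version of what detokenization does:

```python
from typing import List

# Toy vocabulary (an assumption for illustration; the real mapping comes
# from the pre-trained SentencePiece model).
VOCAB = {4: "\u2581Hello", 5: "\u2581world", 6: "!"}

def decode(ids: List[int]) -> str:
    # Skip ids not in the vocabulary (e.g. special tokens like EOS/pad).
    pieces = [VOCAB[i] for i in ids if i in VOCAB]
    # Merge pieces into one string, turning boundary markers into spaces.
    return "".join(pieces).replace("\u2581", " ").strip()

print(decode([4, 5, 6]))     # Hello world!
print(decode([4, 5, 6, 1]))  # same: the EOS id is skipped
```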
Test
We test that the forward method correctly translates an input string (batched and un-batched) into the appropriate token ids, with the special tokens added where necessary. We also test that the decode method correctly translates token ids (batched and un-batched) into the correct strings. The torch-scripted versions of these transforms are also tested.
```shell
pytest test/prototype/models/test_transforms.py
```