[MERGE 1/2] merge `main` branch to `fbsync` #1948

Nayef211 · 2022-10-18T21:01:49Z

Description

Refer to #1949 (Ensure main and fbsync are both in sync).

* Migrating penntreebank dataset to use torchdata * Update FileLoader to FileOpener * Resolved comments about return_path * Using strip() to remove leading/trailing spaces Co-authored-by: nayef211 <[email protected]>

* Migrating enwik9 dataset to use torchdata * Added typing to params * Fixed PR comments. Updated to data_dp * Added caching for extracted files * Moved FileOpener after ondiskcache datapipe Co-authored-by: nayef211 <[email protected]>

…ytorch#1530) * add double caching for yelp polarity to speed up extracted reading. * rename dps for consistency and simplify filepath_fn * add FileOpener within caching block for more consistency.

* Migrate IMDB to datapipes * add double cache for extracted reading * update cache name

…1528) * add double caching for yahoo to speed up extracted reading. * simplify filepath_fn * rename dps for consistency. * add FileOpener within caching block for more consistency.

* Migrate WikiText2 to datapipes * Address code review comments and add double caching

…rch#1529)

* First attempt at adding test for amazon review polarity * Updated dataset to take validate_hash param. Finalized tests * Created non empty tar file * Remove formatting. Patch _hash_check method from torchdata during testing * Added super().setUpClass() * Remove commented import Co-authored-by: nayef211 <[email protected]>

* Migrating SST2 from experimental to datasets folder * Added SST2 to docs and to init file * Removing empty line from docs Co-authored-by: nayef211 <[email protected]>

* Rename amazon review polarity test * Added renamed file to git Co-authored-by: nayef211 <[email protected]>

Co-authored-by: nayef211 <[email protected]>

* Added mock test for SST2 * Remove print line * Resolving PR comments * Updated comment to say zip * updated ordering of splits in parameterization * Using zip_equal for iteration in test_sst2 Co-authored-by: nayef211 <[email protected]>

Co-authored-by: nayef211 <[email protected]>

* migrate IWSLT2017 to datapipes. * refactor IWSLT2017 to use feedback from IWSLT2016. * remove unused import. * fix flake. * fix typo in comment. * add TODOs to IWSLT datasets. * refactor common code out of IWSLTs and convert single quotes to double. * fix typo.

…ch#1541)
* Implement CLIPEncoder in C++
  Add case insensitive flag to CLIP pre tokenization regex
  Add Python interface
  Bring back gpt2
  Add docstring
  Update docs
* Fix stylecheck

* mock up IWSLT2016 test for faster testing. * rename variable for consistency.

) * Fix Sphinx-gallery display and pin sphinx-related packages * Resolving PR comments * Resolving PR comments * Remove language = None from docs

* Resolve and remove TODOs * remove todo

…#1913) * avoid to loop through the whole counter in bleu_score method * fix bug when max_n > len(candidate) * add comment to explain L88

* add decoding capability to GPT2BPE tokenizer * use wstring_convert for all conversions * minor update to comment and string creation logic * move converter definition outside of for loop

…d avoid splitting on them (pytorch#1916) * add_special_tokens and never split features added * removed a comment and updated a type hint * added explanation and example for how this change works * move SPECIAL_TOKENS_ATTRIBUTES to utils * rebase and address latest nit comments

) * Add ability to load HF checkpoints into T5 model * Add HuggingFace to integrations tests * Remove duplicate code * Revert fix * Add setup * Remove ability to download from remote URL * Remove line break from docstring

…ch#1927) [ghstack-poisoned]

* Fix upload channel using correct flag * Fix version extraction

This reverts commit 0026773.

* Fixed on_disk_cache issues [ghstack-poisoned]
* Update on "Fixed on_disk_cache issues": Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810 [ghstack-poisoned]
* Update on "Fixed on_disk_cache issues": Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810 [ghstack-poisoned]
Co-authored-by: Vitaly Fedyunin <[email protected]>

* update decoding logic to handle special tokens * rebased and added example * minor refactor: moved boolean assignment outside of for loop

* Move relative_buckets Tensor to same device as relative_position * Update code pointer comments * Reference self.device from within MultiHeadedAttention private methods * Remove faulty call with device to t5 forward method * Add device to Attention obj

* Add Character Level BPE Tokenizer (pytorch#1936)
  Summary: Pull Request resolved: pytorch#1936
  This change adds a character-level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e., a vocab dict) and a merge list as input. It does not use C++ for encoding or decoding at this time.
  Reviewed By: langong347
  Differential Revision: D40186470
  fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896
* run linter
* add regex to requirements and CharBPETokenizer to transforms.rst
* fix docs and requirements
* try to fix docstring format
Co-authored-by: Roman Shraga <[email protected]>
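For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of character-level BPE driven by an encoder dict and a merge list, the two inputs the commit message mentions. It is not the torchtext `CharBPETokenizer` implementation: the function names `char_bpe_encode` and `char_bpe_decode`, the `unk_id` parameter, and the toy vocabulary are all assumptions made for illustration, and details such as end-of-word markers are deliberately left out.

```python
from typing import Dict, List, Tuple


def char_bpe_encode(
    word: str,
    encoder: Dict[str, int],
    merges: List[Tuple[str, str]],
    unk_id: int = 0,  # hypothetical fallback id for symbols missing from the vocab
) -> List[int]:
    """Split a word into characters, apply merges by priority, map symbols to ids."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is mergeable any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [encoder.get(symbol, unk_id) for symbol in symbols]


def char_bpe_decode(ids: List[int], encoder: Dict[str, int]) -> str:
    """Invert the vocab dict and concatenate the symbols."""
    decoder = {idx: symbol for symbol, idx in encoder.items()}
    return "".join(decoder[idx] for idx in ids)


# Toy vocabulary and merge list, invented for this example.
toy_encoder = {"<unk>": 0, "l": 1, "o": 2, "w": 3, "lo": 4, "low": 5, "e": 6, "r": 7}
toy_merges = [("l", "o"), ("lo", "w")]

ids = char_bpe_encode("lower", toy_encoder, toy_merges)
text = char_bpe_decode(ids, toy_encoder)
```

With the toy inputs above, "lower" is first split into characters, then `l`+`o` and `lo`+`w` are merged in that order, leaving the symbols `low`, `e`, `r`.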

facebook-github-bot · 2022-10-18T21:22:59Z

@Nayef211 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
## Description
- Refer to #1949

Pull Request resolved: #1948
Reviewed By: rshraga
Differential Revision: D40490112
Pulled By: Nayef211
fbshipit-source-id: 687c2eb0765f8caea4872c64522dd8085bc23c51

Nayef211 · 2022-10-19T18:29:13Z

Closing PR as corresponding diff was merged

parmeet and others added 30 commits January 19, 2022 21:12

Cache extraction for AmazonReviewPolarity (pytorch#1527)

0f7f859

Migrating PennTreebank to datapipes (pytorch#1511)

eb39945

* Migrating penntreebank dataset to use torchdata * Update FileLoader to FileOpener * Resolved comments about return_path * Using strip() to remove leading/trailing spaces Co-authored-by: nayef211 <[email protected]>

add double caching for yelp polarity to speed up extracted reading. (p…

83aebf4

…ytorch#1530) * add double caching for yelp polarity to speed up extracted reading. * rename dps for consistency and simplify filepath_fn * add FileOpener within caching block for more consistency.

Migrate IMDB to datapipes (pytorch#1531)

03afb7e

* Migrate IMDB to datapipes * add double cache for extracted reading * update cache name

add max_tokens kwarg to vocab factory. (pytorch#1525)

e1d66cf

add double caching for yahoo to speed up extracted reading. (pytorch#…

ff78e99

…1528) * add double caching for yahoo to speed up extracted reading. * simplify filepath_fn * rename dps for consistency. * add FileOpener within caching block for more consistency.

Migrate WikiText2 to datapipes (pytorch#1519)

437eea8

* Migrate WikiText2 to datapipes * Address code review comments and add double caching

add double caching for yelp full to speed up extracted reading. (pyto…

d19a77e

…rch#1529)

Migrate WikiText103 to datapipes (pytorch#1518)

042f12f

add initial pass at migrating UDPOS to datapipes. (pytorch#1535)

f685c55

migrate Multi30k to datapipes. (pytorch#1536)

627c71f

Migrate SST2 from experimental to datasets folder (pytorch#1538)

d72124c

* Migrating SST2 from experimental to datasets folder * Added SST2 to docs and to init file * Removing empty line from docs Co-authored-by: nayef211 <[email protected]>

Rename AmazonReviewPolarity test file (pytorch#1540)

e0c5528

* Rename amazon review polarity test * Added renamed file to git Co-authored-by: nayef211 <[email protected]>

Removing unused param args constant (pytorch#1544)

91dde7e

Co-authored-by: nayef211 <[email protected]>

Add SST2 Mocked Unit Test (pytorch#1542)

7f839b6

* Added mock test for SST2 * Remove print line * Resolving PR comments * Updated comment to say zip * updated ordering of splits in parameterization * Using zip_equal for iteration in test_sst2 Co-authored-by: nayef211 <[email protected]>

Convert _get_mock_dataset fn to be private (pytorch#1543)

169924b

Co-authored-by: nayef211 <[email protected]>

Updated test to be consistent with SST2 test (pytorch#1548)

fe09343

Co-authored-by: nayef211 <[email protected]>

fix yelp dataset (pytorch#1550)

1b2f12e

fix yahoo dataset (pytorch#1551)

5056218

fix penn dataset (pytorch#1552)

9561cde

mock up AG NEWS test for faster testing. (pytorch#1553)

15c4222

migrate IWSLT2016 to datapipes. (pytorch#1545)

c10d7ef

remove extra print (pytorch#1557)

f27047f

fix flake. (pytorch#1558)

2372682

Implement ClipTokenizer that builds on top of GPT2BPETokenizer (pytor…

448a791

…ch#1541)
* Implement CLIPEncoder in C++
  Add case insensitive flag to CLIP pre tokenization regex
  Add Python interface
  Bring back gpt2
  Add docstring
  Update docs
* Fix stylecheck

mock up IWSLT2016 test for faster testing. (pytorch#1563)

3ba62ca

* mock up IWSLT2016 test for faster testing. * rename variable for consistency.

Multi30k mocked testing (pytorch#1554)

69825a1

Nayef211 and others added 18 commits September 20, 2022 12:53

Add missing Cmake file for in tokenizer dir (pytorch#1908)

befea6e

Fix Sphinx-gallery display and pin sphinx-related packages (pytorch#1907

9b06d56

) * Fix Sphinx-gallery display and pin sphinx-related packages * Resolving PR comments * Resolving PR comments * Remove language = None from docs

Resolve and remove TODO comments (pytorch#1912)

766cf9d

* Resolve and remove TODOs * remove todo

Avoid looping through the whole counter in bleu_score method (pytorch…

5c48f4a

…#1913) * avoid to loop through the whole counter in bleu_score method * fix bug when max_n > len(candidate) * add comment to explain L88

Resolve inconsistency in IMDB label output (pytorch#1914)

52436c8

Add decoding capability to GPT2BPE tokenizer (pytorch#1919)

258a356

* add decoding capability to GPT2BPE tokenizer * use wstring_convert for all conversions * minor update to comment and string creation logic * move converter definition outside of for loop

Updating usage of torch.utils.data.graph.traverse in test case (pytor…

ff1fdfc

…ch#1927) [ghstack-poisoned]

[CI] Fix upload channel (pytorch#1932)

0026773

* Fix upload channel using correct flag * Fix version extraction

Avoid using std::regex and fix lint errors (pytorch#1930)

6ffe7be

Update dataset RTE information (pytorch#1934)

c776dc1

Revert "[CI] Fix upload channel (pytorch#1932)" (pytorch#1939)

4d88d4e

This reverts commit 0026773.

Update decoding logic to handle special tokens (pytorch#1925)

238b342

* update decoding logic to handle special tokens * rebased and added example * minor refactor: moved boolean assignment outside of for loop

Merge branch 'main' into merge_main_to_fbsync

db987ed

facebook-github-bot added the cla signed label Oct 18, 2022

Nayef211 requested review from abhinavarora, joecummings and rshraga October 18, 2022 21:02

Nayef211 mentioned this pull request Oct 18, 2022

Ensure main and fbsync are both in sync #1949

Open

Nayef211 marked this pull request as ready for review October 18, 2022 21:19

joecummings approved these changes Oct 19, 2022

Merge branch 'fbsync' into merge_main_to_fbsync

1aaaf3e

Nayef211 closed this Oct 19, 2022

28 participants
