This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@Nayef211
Contributor

@Nayef211 Nayef211 commented Oct 18, 2022

parmeet and others added 30 commits January 19, 2022 21:12
* Migrating penntreebank dataset to use torchdata

* Update FileLoader to FileOpener

* Resolved comments about return_path

* Using strip() to remove leading/trailing spaces

Co-authored-by: nayef211 <[email protected]>
* Migrating enwik9 dataset to use torchdata

* Added typing to params

* Fixed PR comments. Updated to data_dp

* Added caching for extracted files

* Moved FileOpener after ondiskcache datapipe

Co-authored-by: nayef211 <[email protected]>
…ytorch#1530)

* add double caching for yelp polarity to speed up extracted reading.

* rename dps for consistency and simplify filepath_fn

* add FileOpener within caching block for more consistency.
* Migrate IMDB to datapipes

* add double cache for extracted reading

* update cache name
…1528)

* add double caching for yahoo to speed up extracted reading.

* simplify filepath_fn

* rename dps for consistency.

* add FileOpener within caching block for more consistency.
* Migrate WikiText2 to datapipes

* Address code review comments and add double caching
* First attempt at adding test for amazon review polarity

* Updated dataset to take validate_hash param. Finalized tests

* Created non empty tar file

* Remove formatting. Patch _hash_check method from torchdata during testing

* Added super().setUpClass()

* Remove commented import

Co-authored-by: nayef211 <[email protected]>
* Migrating SST2 from experimental to datasets folder

* Added SST2 to docs and to init file

* Removing empty line from docs

Co-authored-by: nayef211 <[email protected]>
* Rename amazon review polarity test

* Added renamed file to git

Co-authored-by: nayef211 <[email protected]>
* Added mock test for SST2

* Remove print line

* Resolving PR comments

* Updated comment to say zip

* updated ordering of splits in parameterization

* Using zip_equal for iteration in test_sst2

Co-authored-by: nayef211 <[email protected]>
* migrate IWSLT2017 to datapipes.

* refactor IWSLT2017 to use feedback from IWSLT2016.

* remove unused import.

* fix flake.

* fix typo in comment.

* add TODOs to IWSLT datasets.

* refactor common code out of IWSLTs and convert single quotes to double.

* fix typo.
…ch#1541)

* Implement CLIPEncoder in C++

Add case insensitive flag to CLIP pre tokenization regex

Add Python interface

Bring back gpt2

Add docstring

Update docs

* Fix stylecheck
* mock up IWSLT2016 test for faster testing.

* rename variable for consistency.
Nayef211 and others added 18 commits September 20, 2022 12:53
)

* Fix Sphinx-gallery display and pin sphinx-related packages

* Resolving PR comments

* Resolving PR comments

* Remove language = None from docs
* Resolve and remove TODOs

* remove todo
…#1913)

* avoid looping through the whole counter in the bleu_score method

* fix bug when max_n > len(candidate)

* add comment to explain L88
* add decoding capability to GPT2BPE tokenizer

* use wstring_convert for all conversions

* minor update to comment and string creation logic

* move converter definition outside of for loop
…d avoid splitting on them (pytorch#1916)

* add_special_tokens and never split features added

* removed a comment and updated a type hint

* added explanation and example for how this change works

* move SPECIAL_TOKENS_ATTRIBUTES to utils

* rebase and address latest nit comments
)

* Add ability to load HF checkpoints into T5 model

* Add HuggingFace to integrations tests

* Remove duplicate code

* Revert fix

* Add setup

* Remove ability to download from remote URL

* Remove line break from docstring
* Fix upload channel using correct flag

* Fix version extraction
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]


Co-authored-by: Vitaly Fedyunin <[email protected]>
* update decoding logic to handle special tokens

* rebased and added example

* minor refactor: moved boolean assignment outside of for loop
* Move relative_buckets Tensor to same device as relative_position

* Update code pointer comments

* Reference self.device from within MultiHeadedAttention private methods

* Remove faulty call with device to t5 forward method

* Add device to Attention obj
* Add Character Level BPE Tokenizer (pytorch#1936)

Summary:
Pull Request resolved: pytorch#1936

This change adds a character-level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e., a vocab dict) and a merge list as input. It does not use C++ for encoding or decoding at this time.

Reviewed By: langong347

Differential Revision: D40186470

fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896

* run linter

* add regex to requirements and CharBPETokenizer to transforms.rst

* fix docs and requirements

* try to fix docstring format

Co-authored-by: Roman Shraga <[email protected]>
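
For readers unfamiliar with the approach, the character-level BPE commit above describes a tokenizer driven by an encoder (vocab) dict and a merge list. The following is a minimal pure-Python sketch of that general technique, not the torchtext implementation: the function name `char_bpe_encode`, the toy vocab, and the toy merges are all hypothetical, and it assumes every symbol left after merging is present in the encoder dict.

```python
from typing import Dict, List, Tuple


def char_bpe_encode(
    word: str,
    encoder: Dict[str, int],
    merges: List[Tuple[str, str]],
) -> List[int]:
    """Encode one word with greedy character-level BPE (illustrative only)."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters

    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merge list

        # Merge every non-overlapping occurrence of the best pair, left to right.
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    # Assumption: every remaining symbol exists in the vocab (no unknown-token handling).
    return [encoder[s] for s in symbols]


# Hypothetical toy vocab and merge list, for illustration only.
toy_encoder = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4, "e": 5, "r": 6}
toy_merges = [("l", "o"), ("lo", "w")]

print(char_bpe_encode("lower", toy_encoder, toy_merges))  # [4, 5, 6]
```

Here "lower" is first split into characters, `l`+`o` is merged into `lo`, then `lo`+`w` into `low`, leaving `low`, `e`, `r`. A real tokenizer would also need pre-tokenization, unknown-token handling, and decoding, which this sketch omits.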
@facebook-github-bot
Contributor

@Nayef211 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Oct 19, 2022
Summary:
## Description
- Refer to #1949

Pull Request resolved: #1948

Reviewed By: rshraga

Differential Revision: D40490112

Pulled By: Nayef211

fbshipit-source-id: 687c2eb0765f8caea4872c64522dd8085bc23c51
@Nayef211
Contributor Author

Closing PR as corresponding diff was merged

@Nayef211 Nayef211 closed this Oct 19, 2022