This repository was archived by the owner on Sep 10, 2025. It is now read-only.

@pmabbo13 pmabbo13 commented Jun 14, 2022

Description

Add CNNDM dataset to TorchText using TorchData datapipes

Process

  1. Download CNN and DailyMail data from here. These downloads are cached; however, the extractions from the tar files are not yet cached. See follow-up items.
  2. Download URL lists from here, which designate which stories belong to the train/dev/test splits. These downloads are not yet cached. See follow-up items.
  3. Use the URL lists to filter for stories that belong to the target split.
  4. Parse these stories to separate the article text from the abstract text and return a datapipe that yields (article, abstract) for each story.
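
Steps 3 and 4 can be sketched in plain Python. This is only an illustration, not the code in this PR, which builds the same logic out of TorchData datapipes. It assumes that each story file is named after the SHA-1 hex digest of its URL plus a `.story` suffix and that summary lines follow `@highlight` marker lines; neither detail is stated above. The names `story_name`, `parse_story`, and `iter_split` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: summary ("abstract") lines are introduced by this marker line.
HIGHLIGHT_MARKER = "@highlight"


def story_name(url: str) -> str:
    """Assumed naming scheme: SHA-1 hex digest of the URL plus '.story'."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".story"


def parse_story(text: str) -> Tuple[str, str]:
    """Split one raw story into (article, abstract)."""
    article_lines, abstract_lines = [], []
    next_is_highlight = False
    for line in (raw.strip() for raw in text.splitlines()):
        if not line:
            continue  # blank lines carry no content
        if line == HIGHLIGHT_MARKER:
            next_is_highlight = True
        elif next_is_highlight:
            abstract_lines.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    return " ".join(article_lines), " ".join(abstract_lines)


def iter_split(
    stories: Dict[str, str], split_urls: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (article, abstract) only for stories whose URL is in the split."""
    wanted = {story_name(url) for url in split_urls}
    for name, text in stories.items():
        if name in wanted:
            yield parse_story(text)
```

Here `stories` stands in for the extracted archive contents (file name mapped to file text), and `split_urls` stands in for one of the downloaded URL lists.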

Testing

Create a mock dataset that mimics the format of the raw CNNDM files, and ensure that the datapipe yields the correct output on this mock dataset:

pytest test/datasets/test_cnndm.py

Follow-Up Items

@Nayef211 Nayef211 left a comment

Overall the implementation looks great! Some followup items:

  • Update the PR description

  • Add a section in the description for followup items with a checklist of features that have yet to be implemented. We can then add a link to followup PRs that implement those features and check them off (similar to #1542)

  • Add unit tests for the dataset

  • Let's try to remove blank lines unless they add to the readability of the code. Here's what the PEP 8 style guide suggests:

    Extra blank lines may be used (sparingly) to separate groups of related functions. Blank lines may be omitted between a bunch of related one-liners (e.g. a set of dummy implementations).

    Use blank lines in functions, sparingly, to indicate logical sections.

@pmabbo13 pmabbo13 marked this pull request as ready for review June 17, 2022 22:08
@pmabbo13 pmabbo13 requested review from Nayef211 and parmeet June 17, 2022 22:09
@pmabbo13 pmabbo13 changed the title [WIP] Add CNN-DM dataset to torchtext Add CNN-DM dataset to torchtext Jun 17, 2022

@Nayef211 Nayef211 left a comment

LGTM! Thanks for addressing all the PR comments. Let's wait for a review from @parmeet before we merge this in. 😄

parmeet commented Jun 21, 2022

Thanks @pmabbo13 for adding this complex dataset to the repo. Overall it looks great. I just left a few minor comments, but in general I think we should be good to land as is.

@pmabbo13 pmabbo13 merged commit a6eb3b7 into pytorch:main Jun 22, 2022
@pmabbo13 pmabbo13 deleted the feature/add-cnndm branch June 22, 2022 18:10