Skip to content

Conversation

@pmeier
Copy link
Contributor

@pmeier pmeier commented May 11, 2022

This simplifies the implementation of OnlineResource.load by quite a bit. The only functional difference is that we now only execute the preprocessing after the download. For the admittedly niche situation that the user downloads the data manually, they now also need to preprocess them manually.

In addition, this PR adds a bunch of tests.

@pmeier
Copy link
Contributor Author

pmeier commented May 11, 2022

The only functional difference is that we now only execute the preprocessing after the download.

I was mistaken and that also has some implications for us. The current datasets tests are set up in way that they inject a mock of the "raw" downloaded file into the root directory of the dataset. Previously, the loading logic of the resources detected this at runtime and performed the preprocessing.

Without this we get the failures as seen in this CI run. The PCAM dataset expects gzip decompressed resources

GDriveResource(file_name=file_name, id=gdrive_id, sha256=sha256, preprocess="decompress")

but with the implementation in this PR, it will get the compressed ones.

I see two ways out here:

  1. Reinstate the functionality to always perform the preprocessing if available and we only detect the raw version of the file.
  2. Change our test suite to provide the "downloaded" and preprocessed data rather than only the "downloaded" one.

@NicolasHug
Copy link
Member

Can we mock the download logic instead, and replace it with dataset_mock.prepare()?

@pmeier
Copy link
Contributor Author

pmeier commented May 11, 2022

Yes, we can mock

def download(self, root: Union[str, pathlib.Path], *, skip_integrity_check: bool = False) -> pathlib.Path:

Still, that will require more changes in the test suite. Currently we use the following idiom

def test_smoke(self, test_home, dataset_mock, config):
dataset_mock.prepare(test_home, config)
dataset = datasets.load(dataset_mock.name, **config)

If we want to patch in prepare we have two options:

  1. Use the mocker fixture and pass it to prepare together with the test_home. The patch will automatically be reset for every run of the test.

  2. Use unittest.mock. The only way this automatically resets is to patch with a context manager. We could do something like

    with dataset_mock.prepare(test_home, config):
        dataset = datasets.load(dataset_mock.name, **config) 

Another option is to merge the two calls into one like

dataset, mock_info = dataset_mock.load(test_home, config)

I've checked, we are always using the two calls together.

Thoughts?

Copy link
Contributor Author

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicolasHug in d627479 I added a PoC for merging the mock data preparation as proposed in #5990 (comment).

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Philip, I left a few comments, LMK what you think

# test/test_prototype_builtin_datasets.py
root = pathlib.Path(datasets.home()) / self.name
root.mkdir(exist_ok=True)
mock_data_folder = root / "__mock__"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it's worth mentioning that it's a tmp dir in its name. Here's a suggestion below but feel free to change

Suggested change
mock_data_folder = root / "__mock__"
tmp_mock_data_folder = root / "__mock__"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed offline, we could also use a more global home() / cache / folder where each sub-folder would contain the data for a specfic dataset and config. Looking at our config, look like they should be hashable fairly easily. But we should definitely leave this for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed offline, we could also use a more global home() / cache / folder

We can, but it can't be within home(), since that changes with every call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we can make the test_home() fixture global for the test run?

This reverts commit 5ed6eedef74865e0baa746a375d5ec1f0ab1bde7.
@pmeier
Copy link
Contributor Author

pmeier commented May 13, 2022

The changes needed to the test suited were factored out in #6010. That will need to be merged first before we continue here.

Comment on lines +187 to +206
def test_preprocess_extract(self, tmp_path):
files = None

def download_fn(resource, root):
nonlocal files
archive, files = self._make_tar(root, name=resource.file_name)
return archive

resource = self.DummyResource(file_name="folder.tar", preprocess="extract", download_fn=download_fn)

dp = resource.load(tmp_path)
assert files is not None, "`download_fn()` was never called"
assert isinstance(dp, FileOpener)

actual = {path: buffer.read().decode() for path, buffer in dp}
expected = {
path.replace(resource.file_name, resource.file_name.split(".")[0]): content
for path, content in files.items()
}
assert actual == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fairly complex TBH. The nonlocal logic, the fact that _make_tar returns a dict of files, etc.

IIUC the goal of this check is to make sure the file gets properly extracted. Surely there are simpler ways to assert that?

Copy link
Contributor Author

@pmeier pmeier May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course there are other ways, but I'm not sure they are easier. We can't create the archive upfront in tmp_path, because in that case we would never trigger the preprocessing. One option would be to create the data upfront in a temporary directory and move it to tmp_path inside the download function similar to what we are doing with the actual resource loading in our dataset tests (#6010).

Given that you were ok with nonlocal in #5990 (comment), I don't think it will be much simpler. You choose.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pmeier , LGTM

file_name: str,
sha256: Optional[str] = None,
preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], None]]] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education, why did we need to specify None in the annotation here? I assume Optional[] would have been enough - is it just to be more explicit about it?

Copy link
Contributor Author

@pmeier pmeier May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It becomes clearer when we let black explode the annotation:

Suggested change
preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], None]]] = None,
preprocess: Optional[
Union[
Literal["decompress", "extract"],
Callable[[pathlib.Path], None],
]
] = None,

I only changed the return type of the callable from pathlib.Path to None since we refactored the loading logic that the return is no longer needed.

Still, in general you are right. Optional[Foo] is equivalent to Union[None, Foo]. Plus, Optional[Union[Foo, Bar]] can be flattened to Union[None, Foo, Bar]. It is just my personal preference to be explicit about Optional in case it actually means an optional value. In case None is just another valid value to pass, I prefer to merge it into a Union.

@pmeier pmeier merged commit b430ba6 into pytorch:main May 17, 2022
@pmeier pmeier deleted the simplify-resource-load branch May 17, 2022 13:45
facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2022
Summary:
* simplify OnlineResource.load

* [PoC] merge mock data preparation and loading

* Revert "cache mock data based on config"

This reverts commit 5ed6eedef74865e0baa746a375d5ec1f0ab1bde7.

* Revert "[PoC] merge mock data preparation and loading"

This reverts commit d627479.

* remove preprocess returning a new path in favor of querying twice

* address test comments

* clarify comment

* mypy

* use builtin decompress utility

Reviewed By: NicolasHug

Differential Revision: D36760923

fbshipit-source-id: 1d3d30be96c3226fc181c4654208b2d3c6fdf7cb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants