diff --git a/torchvision/prototype/datasets/_builtin/README.md b/torchvision/prototype/datasets/_builtin/README.md
index fbe84856aeb..b0280071027 100644
--- a/torchvision/prototype/datasets/_builtin/README.md
+++ b/torchvision/prototype/datasets/_builtin/README.md
@@ -1,22 +1,19 @@
 # How to add new built-in prototype datasets

-As the name implies, the datasets are still in a prototype state and thus
-subject to rapid change. This in turn means that this document will also change
-a lot.
+As the name implies, the datasets are still in a prototype state and thus subject to rapid change. This in turn means
+that this document will also change a lot.

-If you hit a blocker while adding a dataset, please have a look at another
-similar dataset to see how it is implemented there. If you can't resolve it
-yourself, feel free to send a draft PR in order for us to help you out.
+If you hit a blocker while adding a dataset, please have a look at another similar dataset to see how it is implemented
+there. If you can't resolve it yourself, feel free to send a draft PR in order for us to help you out.

 Finally, `from torchvision.prototype import datasets` is implied below.

 ## Implementation

-Before we start with the actual implementation, you should create a module in
-`torchvision/prototype/datasets/_builtin` that hints at the dataset you are
-going to add. For example `caltech.py` for `caltech101` and `caltech256`. In
-that module create a class that inherits from `datasets.utils.Dataset` and
-overwrites at minimum three methods that will be discussed in detail below:
+Before we start with the actual implementation, you should create a module in `torchvision/prototype/datasets/_builtin`
+that hints at the dataset you are going to add. For example `caltech.py` for `caltech101` and `caltech256`. In that
+module create a class that inherits from `datasets.utils.Dataset` and overrides at minimum three methods that will be
+discussed in detail below:

 ```python
 from typing import Any, Dict, List
@@ -39,50 +36,39 @@ class MyDataset(Dataset):
 ### `_make_info(self)`

-The `DatasetInfo` carries static information about the dataset. There are two
-required fields:
-- `name`: Name of the dataset. This will be used to load the dataset with
-  `datasets.load(name)`. Should only contain lowercase characters.
+The `DatasetInfo` carries static information about the dataset. There are two required fields:
+
+- `name`: Name of the dataset. This will be used to load the dataset with `datasets.load(name)`. Should only contain
+  lowercase characters.

 There are more optional parameters that can be passed:

-- `dependencies`: Collection of third-party dependencies that are needed to load
-  the dataset, e.g. `("scipy",)`. Their availability will be automatically
-  checked if a user tries to load the dataset. Within the implementation, import
+- `dependencies`: Collection of third-party dependencies that are needed to load the dataset, e.g. `("scipy",)`. Their
+  availability will be automatically checked if a user tries to load the dataset. Within the implementation, import
   these packages lazily to avoid missing dependencies at import time.
-- `categories`: Sequence of human-readable category names for each label. The
-  index of each category has to match the corresponding label returned in the
-  dataset samples. [See
-  below](#how-do-i-handle-a-dataset-that-defines-many-categories) how to handle
-  cases with many categories.
-- `valid_options`: Configures valid options that can be passed to the dataset.
-  It should be `Dict[str, Sequence[Any]]`. The options are accessible through
-  the `config` namespace in the other two functions. First value of the sequence
-  is taken as default if the user passes no option to
-  `torchvision.prototype.datasets.load()`.
+- `categories`: Sequence of human-readable category names for each label. The index of each category has to match the
+  corresponding label returned in the dataset samples.
+  [See below](#how-do-i-handle-a-dataset-that-defines-many-categories) for how to handle cases with many categories.
+- `valid_options`: Configures valid options that can be passed to the dataset. It should be `Dict[str, Sequence[Any]]`.
+  The options are accessible through the `config` namespace in the other two functions. The first value of the sequence
+  is taken as the default if the user passes no option to `torchvision.prototype.datasets.load()`.
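+
+For orientation, a minimal sketch of what this could look like for a hypothetical dataset with a single `split` option
+(the names and values are made up; `"train"` would be picked as the default):
+
+```py
+# inside your dataset class in torchvision/prototype/datasets/_builtin
+def _make_info(self) -> DatasetInfo:
+    return DatasetInfo(
+        "my-dataset",
+        dependencies=("scipy",),
+        categories=("cat", "dog"),
+        valid_options=dict(split=("train", "test")),
+    )
+```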

 ## `resources(self, config)`

-Returns `List[datasets.utils.OnlineResource]` of all the files that need to be
-present locally before the dataset with a specific `config` can be build. The
-download will happen automatically.
+Returns `List[datasets.utils.OnlineResource]` of all the files that need to be present locally before the dataset with a
+specific `config` can be built. The download will happen automatically.

 Currently, the following `OnlineResource`'s are supported:

-- `HttpResource`: Used for files that are directly exposed through HTTP(s) and
-  only requires the URL.
-- `GDriveResource`: Used for files that are hosted on GDrive and requires the
-  GDrive ID as well as the `file_name`.
-- `ManualDownloadResource`: Used files are not publicly accessible and requires
-  instructions how to download them manually. If the file does not exist, an
-  error will be raised with the supplied instructions.
-- `KaggleDownloadResource`: Used for files that are available on Kaggle. This
-  inherits from `ManualDownloadResource`.
-
-Although optional in general, all resources used in the built-in datasets should
-comprise [SHA256](https://en.wikipedia.org/wiki/SHA-2) checksum for security. It
-will be automatically checked after the download. You can compute the checksum
-with system utilities e.g `sha256-sum`, or this snippet:
+- `HttpResource`: Used for files that are directly exposed through HTTP(S) and only requires the URL.
+- `GDriveResource`: Used for files that are hosted on GDrive and requires the GDrive ID as well as the `file_name`.
+- `ManualDownloadResource`: Used for files that are not publicly accessible and requires instructions on how to
+  download them manually. If the file does not exist, an error will be raised with the supplied instructions.
+- `KaggleDownloadResource`: Used for files that are available on Kaggle. This inherits from `ManualDownloadResource`.
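+
+A sketch of a matching `resources()` implementation for a hypothetical dataset that ships a single archive over HTTP
+(the URL is a placeholder and the `sha256` keyword is an assumption, see the paragraph below and an existing built-in
+dataset for the exact usage):
+
+```py
+# inside your dataset class
+def resources(self, config) -> List[OnlineResource]:
+    return [
+        HttpResource(
+            "https://example.com/my-dataset/images.tar.gz",
+            sha256="...",  # placeholder, see the paragraph below
+        )
+    ]
+```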
+
+Although optional in general, all resources used in the built-in datasets should include a
+[SHA256](https://en.wikipedia.org/wiki/SHA-2) checksum for security. It will be automatically checked after the
+download. You can compute the checksum with system utilities, e.g. `sha256sum`, or with this snippet:

 ```python
 import hashlib
@@ -97,61 +83,123 @@ def sha256sum(path, chunk_size=1024 * 1024):
 ### `_make_datapipe(resource_dps, *, config)`

-This method is the heart of the dataset, where we transform the raw data into
-a usable form. A major difference compared to the current stable datasets is
-that everything is performed through `IterDataPipe`'s. From the perspective of
-someone that is working with them rather than on them, `IterDataPipe`'s behave
-just as generators, i.e. you can't do anything with them besides iterating.
+This method is the heart of the dataset, where we transform the raw data into a usable form. A major difference compared
+to the current stable datasets is that everything is performed through `IterDataPipe`'s. From the perspective of someone
+who is working with them rather than on them, `IterDataPipe`'s behave just like generators, i.e. you can't do anything
+with them besides iterating.

-Of course, there are some common building blocks that should suffice in 95% of
-the cases. The most used are:
+Of course, there are some common building blocks that should suffice in 95% of the cases. The most used are:

-- `Mapper`: Apply a callable to every item in the datapipe.
+- `Mapper`: Apply a callable to every item in the datapipe.
 - `Filter`: Keep only items that satisfy a condition.
 - `Demultiplexer`: Split a datapipe into multiple ones.
 - `IterKeyZipper`: Merge two datapipes into one.

-All of them can be imported `from torchdata.datapipes.iter`. In addition, use
-`functools.partial` in case a callable needs extra arguments. If the provided
-`IterDataPipe`'s are not sufficient for the use case, it is also not complicated
+All of them can be imported `from torchdata.datapipes.iter`. In addition, use `functools.partial` in case a callable
+needs extra arguments. If the provided `IterDataPipe`'s are not sufficient for the use case, it is also not complicated
 to add one. See the MNIST or CelebA datasets for example.

-`make_datapipe()` receives `resource_dps`, which is a list of datapipes that has
-a 1-to-1 correspondence with the return value of `resources()`. In case of
-archives with regular suffixes (`.tar`, `.zip`, ...), the datapipe will contain
-tuples comprised of the path and the handle for every file in the archive.
-Otherwise the datapipe will only contain one of such tuples for the file
-specified by the resource.
+`_make_datapipe()` receives `resource_dps`, which is a list of datapipes that has a 1-to-1 correspondence with the return
+value of `resources()`. In case of archives with regular suffixes (`.tar`, `.zip`, ...), the datapipe will contain
+tuples comprised of the path and the handle for every file in the archive. Otherwise the datapipe will only contain one
+such tuple for the file specified by the resource.
+
+Since the datapipes are iterable in nature, some datapipes feature an in-memory buffer, e.g. `IterKeyZipper` and
+`Grouper`. There are two issues with that: 1. If not used carefully, this can easily overflow the host memory, since
+most datasets will not fit into it completely. 2. This can lead to unnecessarily long warm-up times when data is
+buffered that is only needed at runtime.
+
+Thus, all buffered datapipes should be used as early as possible, e.g. zipping two datapipes of file handles rather than
+trying to zip already loaded images.
+
+There are two special datapipes that are not used through their class, but through the functions `hint_sharding` and
+`hint_shuffling`. As the name implies, they only hint at the part in the datapipe graph where sharding and shuffling
+should take place, but are no-ops by default. They can be imported from `torchvision.prototype.datasets.utils._internal`
+and are required in each dataset.
+
+Finally, each item in the final datapipe should be a dictionary with `str` keys. There is no standardization of the
+names (yet!).
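+
+Putting these pieces together, `_make_datapipe()` for a hypothetical dataset with a single archive resource could look
+roughly like this (`_is_data_file` and `_prepare_sample` are made-up helpers that would live on the dataset class):
+
+```py
+from torchdata.datapipes.iter import Filter, Mapper
+
+from torchvision.prototype.datasets.utils._internal import hint_sharding, hint_shuffling
+
+# inside your dataset class
+def _make_datapipe(self, resource_dps, *, config):
+    # every item is a (path, file handle) tuple for one file in the archive
+    dp = resource_dps[0]
+    # keep only the files we are interested in
+    dp = Filter(dp, self._is_data_file)
+    dp = hint_shuffling(dp)
+    dp = hint_sharding(dp)
+    # turn every (path, file handle) tuple into a dictionary with `str` keys
+    return Mapper(dp, self._prepare_sample)
+```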
+
+## Tests
+
+To test the dataset implementation, you usually don't need to add any tests, but you need to provide a mock-up of the
+data. This mock-up should resemble the original data as closely as necessary, while containing only a few examples.
+
+To do this, add a new function in [`test/builtin_dataset_mocks.py`](../../../../test/builtin_dataset_mocks.py) with the
+same name as you have defined in `_make_info()` (if the name includes hyphens `-`, replace them with underscores `_`)
+and decorate it with `@register_mock`:

-Since the datapipes are iterable in nature, some datapipes feature an in-memory
-buffer, e.g. `IterKeyZipper` and `Grouper`. There are two issues with that: 1.
-If not used carefully, this can easily overflow the host memory, since most
-datasets will not fit in completely. 2. This can lead to unnecessarily long
-warm-up times when data is buffered that is only needed at runtime.
+```py
+# this is defined in torchvision/prototype/datasets/_builtin
+class MyDataset(Dataset):
+    def _make_info(self) -> DatasetInfo:
+        return DatasetInfo(
+            "my-dataset",
+            ...
+        )
+
+@register_mock
+def my_dataset(info, root, config):
+    ...
+```

-Thus, all buffered datapipes should be used as early as possible, e.g. zipping
-two datapipes of file handles rather than trying to zip already loaded images.
+The function receives three arguments:

-There are two special datapipes that are not used through their class, but
-through the functions `hint_sharding` and `hint_shuffling`. As the name implies
-they only hint part in the datapipe graph where sharding and shuffling should
-take place, but are no-ops by default. They can be imported from
-`torchvision.prototype.datasets.utils._internal` and are required in each
-dataset.
+- `info`: The return value of `_make_info()`.
+- `root`: A [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) of a folder in which the data
+  needs to be placed.
+- `config`: The configuration to generate the data for. This is the same value that `_make_datapipe()` receives.

-Finally, each item in the final datapipe should be a dictionary with `str` keys.
-There is no standardization of the names (yet!).
+The function should generate all files that are needed for the current `config`. Each file should be complete, e.g. if
+the dataset only has a single archive that contains multiple splits, you need to generate all of its splits regardless
+of the current `config`. Although this seems odd at first, this is important. Consider the following original data
+setup:
+
+```
+root
+├── test
+│   ├── test_image0.jpg
+│   ...
+└── train
+    ├── train_image0.jpg
+    ...
+```
+
+For map-style datasets (like the ones currently in `torchvision.datasets`), you explicitly select the files you want to
+load. For example, something like `(root / split).iterdir()` works fine even if only the specific split folder is
+present. With iterable-style datasets though, we get something like `root.iterdir()` from `resource_dps` in
+`_make_datapipe()` and need to manually `Filter` it to only keep the files we want. If we only generated the data for
+the current `config`, the test would still pass if the dataset is missing the filtering, but it would fail on the real
+data.
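+
+As a sketch, a mock for the hypothetical layout above (assuming a single `split` option) could look like this. If the
+real dataset ships archives, you would additionally pack the generated folders with helpers like `make_tar` or
+`make_zip` (see below):
+
+```py
+import PIL.Image
+
+@register_mock
+def my_dataset(info, root, config):
+    # generate *both* splits, no matter which one `config` asks for
+    num_samples_map = {"train": 3, "test": 2}
+    for split, num_samples in num_samples_map.items():
+        split_folder = root / split
+        split_folder.mkdir()
+        for idx in range(num_samples):
+            PIL.Image.new("RGB", (32, 32)).save(split_folder / f"{split}_image{idx}.jpg")
+    # number of samples the dataset should yield for the requested config (see below)
+    return num_samples_map[config.split]
+```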
+
+For datasets that are ported from the old API, we already have some mock data in
+[`test/test_datasets.py`](../../../../test/test_datasets.py). You can find the corresponding test case there and have a
+look at the `inject_fake_data` function. There are a few differences though:
+
+- `tmp_dir` corresponds to `root`, but is a `str` rather than a
+  [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path). Thus, you often see something like
+  `folder = pathlib.Path(tmp_dir)`. This is not needed here, since `root` already is a `pathlib.Path`.
+- Although both parameters are called `config`, the value in the new tests is a namespace. Thus, please prefer
+  `config.foo` over `config["foo"]` to enhance readability.
+- The data generated by `inject_fake_data` was supposed to be in an extracted state. This is no longer the case for the
+  new mock-ups. Thus, you need to use helper functions like `make_zip` or `make_tar` to actually generate the archive
+  files the dataset expects.
+- As explained in the paragraph above, the data generated by `inject_fake_data` is often "incomplete" and only valid
+  for the given config. Make sure you follow the instructions above.
+
+The function should return an integer indicating the number of samples in the dataset for the current `config`.
+Preferably, this number should be different for different `config`'s to have more confidence in the dataset
+implementation.
+
+Finally, you can run the tests with `pytest test/test_prototype_builtin_datasets.py -k {name}`.

 ## FAQ

 ### How do I start?

-Get the skeleton of your dataset class ready with all 3 methods. For
-`_make_datapipe()`, you can just do `return resources_dp[0]` to get started.
-Then import the dataset class in
-`torchvision/prototype/datasets/_builtin/__init__.py`: this will automatically
-register the dataset and it will be instantiable via
-`datasets.load("mydataset")`. On a separate script, try something like
+Get the skeleton of your dataset class ready with all 3 methods. For `_make_datapipe()`, you can just do
+`return resources_dp[0]` to get started. Then import the dataset class in
+`torchvision/prototype/datasets/_builtin/__init__.py`: this will automatically register the dataset and it will be
+instantiable via `datasets.load("mydataset")`. In a separate script, try something like

 ```py
 from torchvision.prototype import datasets
@@ -163,35 +211,27 @@ for sample in dataset:
 # Or you can also inspect the sample in a debugger
 ```

-This will give you an idea of what the first datapipe in `resources_dp`
-contains. You can also do that with `resources_dp[1]` or `resources_dp[2]`
-(etc.) if they exist. Then follow the instructions above to manipulate these
+This will give you an idea of what the first datapipe in `resources_dp` contains. You can also do that with
+`resources_dp[1]` or `resources_dp[2]` (etc.) if they exist. Then follow the instructions above to manipulate these
 datapipes and return the appropriate dictionary format.

 ### How do I handle a dataset that defines many categories?

-As a rule of thumb, `datasets.utils.DatasetInfo(..., categories=)` should only
-be set directly for ten categories or fewer. If more categories are needed, you
-can add a `$NAME.categories` file to the `_builtin` folder in which each line
-specifies a category. If `$NAME` matches the name of the dataset (which it
-definitively should!) it will be automatically loaded if `categories=` is not
-set.
-
-In case the categories can be generated from the dataset files, e.g. the dataset
-follows an image folder approach where each folder denotes the name of the
-category, the dataset can overwrite the `_generate_categories` method. It gets
-passed the `root` path to the resources, but they have to be manually loaded,
-e.g. `self.resources(config)[0].load(root)`. The method should return a sequence
-of strings representing the category names. To generate the `$NAME.categories`
-file, run `python -m torchvision.prototype.datasets.generate_category_files
-$NAME`.
+As a rule of thumb, `datasets.utils.DatasetInfo(..., categories=)` should only be set directly for ten categories or
+fewer. If more categories are needed, you can add a `$NAME.categories` file to the `_builtin` folder in which each line
+specifies a category. If `$NAME` matches the name of the dataset (which it definitely should!) it will be
+automatically loaded if `categories=` is not set.
+
+In case the categories can be generated from the dataset files, e.g. the dataset follows an image folder approach where
+each folder denotes the name of the category, the dataset can override the `_generate_categories` method. It gets
+passed the `root` path to the resources, but they have to be manually loaded, e.g.
+`self.resources(config)[0].load(root)`. The method should return a sequence of strings representing the category names.
+To generate the `$NAME.categories` file, run `python -m torchvision.prototype.datasets.generate_category_files $NAME`.
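+
+For the image folder case, a sketch could look like this (`default_config` is assumed to provide any valid
+configuration; check an existing built-in dataset for a real implementation):
+
+```py
+import pathlib
+
+# inside your dataset class
+def _generate_categories(self, root):
+    # assumption: any valid config works here, we only need it to select and load the resource
+    dp = self.resources(self.default_config)[0].load(root)
+    # hypothetical image folder layout: the parent folder of every file is its category
+    return sorted({pathlib.Path(path).parent.name for path, _ in dp})
+```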

 ### What if a resource file forms an I/O bottleneck?

-In general, we are ok with small performance hits of iterating archives rather
-than their extracted content. However, if the performance hit becomes
-significant, the archives can still be decompressed or extracted. To do this,
-the `decompress: bool` and `extract: bool` flags can be used for every
-`OnlineResource` individually. For more complex cases, each resource also
-accepts a `preprocess` callable that gets passed a `pathlib.Path` of the raw
-file and should return `pathlib.Path` of the preprocessed file or folder.
+In general, we are OK with the small performance hit of iterating archives rather than their extracted content. However,
+if the performance hit becomes significant, the archives can still be decompressed or extracted. To do this, the
+`decompress: bool` and `extract: bool` flags can be used for every `OnlineResource` individually. For more complex
+cases, each resource also accepts a `preprocess` callable that gets passed a `pathlib.Path` of the raw file and should
+return a `pathlib.Path` of the preprocessed file or folder.
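+
+For example (hypothetical URLs and checksums, keyword names as described above):
+
+```py
+import pathlib
+
+# extract the archive once after the download instead of iterating over it again and again
+HttpResource("https://example.com/my-dataset/images.tar.gz", sha256="...", extract=True)
+
+# for more complex cases, hand in a callable that maps the path of the raw file to the preprocessed one
+def repack(path: pathlib.Path) -> pathlib.Path:
+    ...  # e.g. re-pack the data into a layout that is cheaper to iterate over
+    return path.with_suffix("")
+
+HttpResource("https://example.com/my-dataset/annotations.zip", sha256="...", preprocess=repack)
+```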