torchvision transforms for video and sequence data #7476

@JohannesTheo

Description

📚 The doc issue

Hello everyone,

I'd like to discuss how torchvision.transforms are applied to video and, more generally, to sequence data. My examples refer to torchvision.transforms.Resize, which I have tested, but the same issue might apply to other transforms as well.

Before I present the case, let me be clear that I am not looking for an immediate solution to my problem! I'm well aware of alternatives such as PyTorchVideo, custom transforms, etc. The idea of this post is to discuss the topic with the community before I work on a pull request, i.e. to make sure I'm not completely wrong about this (which I well might be :D).

So, let's start with the documentation of torchvision.transforms.Resize, which states:

Resize the input image to the given size. If the image is torch Tensor, it is expected to have […, H, W] shape, where … means an arbitrary number of leading dimensions

I think this approach is quite elegant for sequences, but I found that it does not work as expected in some cases. Consider the following data:

import torch

video = torch.rand(size=[25, 3, 64, 64])
masks = torch.rand(size=[25, 10, 1, 64, 64])

In this case, video is a normal sequence of 64x64 RGB images in TCHW format. Similarly, masks is a sequence of corresponding object masks for 10 objects, each represented as a 64x64 grayscale image, so TMCHW. Now, if we want to resize the video, we have to resize the masks sequence as well, and according to the docs, we can do something like this:

from torchvision import transforms

resize = transforms.Resize(size=32) # or size=(32,32) 

print(resize(video).shape)
# > torch.Size([25, 3, 32, 32])

print(resize(masks).shape)
# > ValueError: Input and output must have the same number of spatial dimensions, but got input with spatial dimensions of [1, 64, 64] and output size of [32, 32]. Please provide input tensor in (N, C, d1, d2, ...,dK) format and output size in (o1, o2, ...,oK) format.

As you can see, this works for the normal video sequence but not for the masks sequence, which has an extra leading dimension. This was a little surprising because the docs promise it will work with "an arbitrary number of leading dimensions".

So I did some digging, and the problem arises from the following call stack (a quick sanity check with the functional API follows the list):

  1. torchvision.transforms.Resize calls
  2. torchvision.transforms.functional.resize which first calculates
  3. torchvision.transforms.functional._compute_resized_output_size (this always returns [new_h, new_w], so from here on we are guaranteed to have a 2D size = list[int,int]), and then calls
  4. torchvision.transforms._functional_tensor.resize which finally calls
  5. torch.nn.functional.interpolate with size=[new_h, new_w] from step 3.
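As that sanity check, calling the functional API directly hits the same constraint, so the limitation lives in this shared path rather than in the transforms.Resize class itself (a small sketch, reusing the masks tensor from above):

import torch
from torchvision.transforms import functional as F

masks = torch.rand(size=[25, 10, 1, 64, 64])

# Step 2 invoked directly: the 2D size computed in step 3 is handed to
# interpolate in step 5, which raises the same ValueError as shown above.
F.resize(masks, size=[32, 32])
# > ValueError: Input and output must have the same number of spatial dimensions, ...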

Now, the problem I'd like to discuss (and the reason the docs are a little misleading) is that torch.nn.functional.interpolate assumes the input to be:

mini-batch x channels x [optional depth] x [optional height] x width

Because of this assumption (two mandatory leading dimensions), it calculates:

dim = input.dim() - 2  # Number of spatial dimensions.

in line 3865 and then checks in line 3877:

if len(size) != dim:
    raise ValueError(...)

which raises the error mentioned above. To fully understand what's going on, let's recreate the cases from above:

import torch
from torchvision import transforms

size  = (32, 32)
video = torch.rand(size=[25, 3, 64, 64])
masks = torch.rand(size=[25, 10, 1, 64, 64])

torch.nn.functional.interpolate(video, size=size)
# this works because: video.dim() - 2 == len(size) -> 2 == 2

torch.nn.functional.interpolate(masks, size=size)
# this doesn't work because: masks.dim() - 2 != len(size) -> 3 != 2

# a working solution could be:
print(torch.nn.functional.interpolate(masks, size=(1, 32, 32)).shape)
# > torch.Size([25, 10, 1, 32, 32])

# however:
resize = transforms.Resize(size=(1, 32, 32))
# > ValueError: If size is a sequence, it should have 1 or 2 values

To summarize, it works for 4D video sequences (by accident?) because interpolate interprets the sequence dimension as the batch dimension. For higher-dimensional sequences, like in my masks example, it breaks.

It could work, though; it is currently just limited by the size check in torchvision.transforms.Resize and/or by the dimension assumptions made in torch.nn.functional.interpolate.
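For completeness, the kind of custom-transform workaround I mentioned at the top looks roughly like this: flatten all leading dimensions into a single batch dimension, resize, then restore the original shape (a minimal sketch; resize_nd is my own helper name, not a torchvision API):

import torch
from torchvision import transforms

def resize_nd(x, size):
    # Collapse every leading dimension into one batch dimension (N', C, H, W),
    # resize as usual, then restore the original leading dimensions.
    *lead, c, h, w = x.shape
    out = transforms.Resize(size)(x.reshape(-1, c, h, w))
    return out.reshape(*lead, c, *out.shape[-2:])

masks = torch.rand(size=[25, 10, 1, 64, 64])
print(resize_nd(masks, size=(32, 32)).shape)
# > torch.Size([25, 10, 1, 32, 32])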

Suggest a potential alternative/fix

What to do?

  1. Update the documentation? - an easy fix, but I actually want this to work as described, which is the whole point of this lengthy post :D
  2. Update the dimension check in: torch.nn.functional.interpolate?
  3. Update how the torchvision.transforms stack handles video data?

What do you think?

From what I can see, _functional_video.py is deprecated, and v2/functional/_geometry.py has a function resize_video which just calls resize_image_tensor, which again calls torch.nn.functional.interpolate with a 2D size and will therefore suffer from the same problem. I'm also aware of https://pytorchvideo.org, but it requires CTHW format and adds OpenCV as a dependency.

Personally, I think the behavior described in the current documentation is the most elegant and least limiting. Also, a pull request to make it work as promised should not be too hard. On the other hand, I have only checked this for Resize, and it should probably be checked for other transforms as well...
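To make that concrete, the core of the change I have in mind is roughly the following (a sketch of the idea only, not the actual torchvision code): collapse the leading dimensions into the batch dimension that torch.nn.functional.interpolate expects, and restore them after the call.

import torch

def _resize_any_leading_dims(img, size, mode="bilinear"):
    # Sketch: reshape [..., C, H, W] to [N', C, H, W] so that interpolate always
    # sees the two leading dimensions it assumes, then restore the leading dims.
    shape = img.shape
    c, h, w = shape[-3:]
    out = torch.nn.functional.interpolate(
        img.reshape(-1, c, h, w), size=size, mode=mode
    )
    return out.reshape(*shape[:-3], c, *size)

masks = torch.rand(size=[25, 10, 1, 64, 64])
print(_resize_any_leading_dims(masks, size=(32, 32)).shape)
# > torch.Size([25, 10, 1, 32, 32])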

If you made it this far, thx for reading :) I'd really appreciate your input on this!
