
Reading is slow when requesting more bytes than available #691

@mxmlnkn

Description

This may be a questionable use case, but I noticed that read(1 << 30) (attempting to read 1 GiB from a ~60 MB file) can be ~3x slower than read(-1). I would have expected both to be equally fast.

Consider this benchmark:

import time
import fsspec.implementations.sftp
import sshfs

hostname = "127.0.0.1"
port = 22
file_path = 'silesia.tar.gz'

fsspec_fs = fsspec.implementations.sftp.SFTPFileSystem(hostname, port=port)
sshfs_fs = sshfs.SSHFileSystem(hostname, port=port)

file_size = len(sshfs_fs.open(file_path).read())
print(f"Test file size: {file_size} B")

# -1 means a single unbounded read(); all other values are chunk sizes in KiB.
for chunk_size_in_KiB in [-1, 4 << 20, 2 << 20, 1 << 20, 512 * 1024, 128 * 1024, 4 * 1024, 32]:
    chunk_size = chunk_size_in_KiB * 1024 if chunk_size_in_KiB >= 0 else chunk_size_in_KiB
    print(f"Try to read {chunk_size_in_KiB} KiB")
    for open_file_name in ['fsspec_fs', 'sshfs_fs']:
        file = globals()[open_file_name].open(file_path)
        t0 = time.time()
        size = 0
        # One read() call per chunk; a single read(-1) call for the unbounded case.
        for i in range((file_size + chunk_size - 1) // chunk_size if chunk_size > 0 else 1):
            size += len(file.read(chunk_size))
        t1 = time.time()
        assert size == file_size
        file.close()
        print(
            f"Read {size / 1e6:.2f} MB in {chunk_size_in_KiB} KiB chunks with {open_file_name} "
            f"in {t1 - t0:.2f} s -> {size / (t1 - t0) / 1e6:.2f} MB/s"
        )

Output:

Test file size: 68238807 B
Try to read -1 KiB
Read 68.24 MB in -1 KiB chunks with fsspec_fs in 16.92 s -> 4.03 MB/s
Read 68.24 MB in -1 KiB chunks with sshfs_fs in 2.08 s -> 32.74 MB/s
Try to read 4194304 KiB
Read 68.24 MB in 4194304 KiB chunks with fsspec_fs in 50.17 s -> 1.36 MB/s
Read 68.24 MB in 4194304 KiB chunks with sshfs_fs in 25.15 s -> 2.71 MB/s
Try to read 2097152 KiB
Read 68.24 MB in 2097152 KiB chunks with fsspec_fs in 42.09 s -> 1.62 MB/s
Read 68.24 MB in 2097152 KiB chunks with sshfs_fs in 13.55 s -> 5.04 MB/s
Try to read 1048576 KiB
Read 68.24 MB in 1048576 KiB chunks with fsspec_fs in 42.13 s -> 1.62 MB/s
Read 68.24 MB in 1048576 KiB chunks with sshfs_fs in 7.37 s -> 9.26 MB/s
Try to read 524288 KiB
Read 68.24 MB in 524288 KiB chunks with fsspec_fs in 43.73 s -> 1.56 MB/s
Read 68.24 MB in 524288 KiB chunks with sshfs_fs in 4.67 s -> 14.62 MB/s
Try to read 131072 KiB
Read 68.24 MB in 131072 KiB chunks with fsspec_fs in 42.39 s -> 1.61 MB/s
Read 68.24 MB in 131072 KiB chunks with sshfs_fs in 2.37 s -> 28.78 MB/s
Try to read 4096 KiB
Read 68.24 MB in 4096 KiB chunks with fsspec_fs in 14.38 s -> 4.74 MB/s
Read 68.24 MB in 4096 KiB chunks with sshfs_fs in 2.03 s -> 33.63 MB/s
Try to read 32 KiB
Read 68.24 MB in 32 KiB chunks with fsspec_fs in 14.33 s -> 4.76 MB/s
Read 68.24 MB in 32 KiB chunks with sshfs_fs in 4.35 s -> 15.69 MB/s

So, trying to read in 4 GiB chunks, but only getting ~68 MB, takes ~25 s with sshfs, while it only takes ~2 s with read(-1).
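Assuming the slowdown scales with the requested size rather than the bytes actually available, clamping the request to the bytes remaining should sidestep it. A minimal sketch of that idea, with io.BytesIO standing in for the remote file (the helper name read_clamped is mine, not part of either library):

```python
import io


def read_clamped(f, requested, file_size):
    """Read at most `requested` bytes, but never ask for more than remain.

    Clamping avoids any per-requested-byte work a backend might do
    when asked for far more data than the file actually contains.
    """
    remaining = file_size - f.tell()
    return f.read(min(requested, max(remaining, 0)))


# Stand-in for a remote file; a real fsspec file also exposes .tell(),
# and the total size is available via fs.info(path)['size'].
buf = io.BytesIO(b"x" * 1000)
data = read_clamped(buf, 1 << 30, 1000)  # ask for 1 GiB, get the 1000 bytes
```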

I do not know whether this is an issue with this wrapper or with asyncssh directly, because I was unable to adapt my (synchronous) benchmark to use asyncssh directly.

My guess is that some code does O(requested size) work when it should instead iterate in chunks bounded by the buffer size. Memory usage does not spike, so at least nothing that large seems to get allocated.
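The suspected behavior can be illustrated with a local stand-in (a hypothetical sketch, not sshfs's or asyncssh's actual code): a reader that issues one request per requested block, even past EOF, versus one that stops at the first short read:

```python
import time

BLOCKSIZE = 32 * 1024


class NaiveReader:
    """Hypothetical reader that issues one request per *requested* block,
    even past EOF -- i.e. O(requested size) round trips."""

    def __init__(self, data):
        self.data, self.pos = data, 0

    def _request(self, n):
        # Stands in for one network round trip; returns b"" past EOF.
        chunk = self.data[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk

    def read(self, size):
        parts = []
        # Loop count depends on `size`, not on the bytes available.
        for _ in range((size + BLOCKSIZE - 1) // BLOCKSIZE):
            parts.append(self._request(BLOCKSIZE))
        return b"".join(parts)


class EOFAwareReader(NaiveReader):
    """Stops requesting as soon as a block comes back short."""

    def read(self, size):
        parts, got = [], 0
        while got < size:
            chunk = self._request(min(BLOCKSIZE, size - got))
            if not chunk:
                break
            parts.append(chunk)
            got += len(chunk)
        return b"".join(parts)


data = b"x" * (1 << 20)  # 1 MiB file, but we request 1 GiB
for cls in (NaiveReader, EOFAwareReader):
    t0 = time.time()
    out = cls(data).read(1 << 30)
    print(cls.__name__, len(out), f"{time.time() - t0:.3f} s")
```

Both return the same 1 MiB, but the naive variant performs 32768 requests (mostly empty) where the EOF-aware one performs 33.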
