Closed
This may be a questionable use case, but I noticed that read(1 << 30) (trying to read 1 GiB from a ~60 MB file) can be ~3x slower than read(-1). I would have expected both to be equally fast.
Consider this benchmark:
```python
import time

import fsspec.implementations.sftp
import sshfs

hostname = "127.0.0.1"
port = 22
file_path = 'silesia.tar.gz'

fsspec_fs = fsspec.implementations.sftp.SFTPFileSystem(hostname, port=port)
sshfs_fs = sshfs.SSHFileSystem(hostname, port=port)

file_size = len(sshfs_fs.open(file_path).read())
print(f"Test file sized: {file_size} B")

for chunk_size_in_KiB in [-1, 4 << 20, 2 << 20, 1 << 20, 512 * 1024, 128 * 1024, 4 * 1024, 32]:
    chunk_size = chunk_size_in_KiB * 1024 if chunk_size_in_KiB >= 0 else chunk_size_in_KiB
    print(f"Try to read {chunk_size_in_KiB} KiB")
    for open_file_name in ['fsspec_fs', 'sshfs_fs']:
        file = globals()[open_file_name].open(file_path)
        t0 = time.time()
        size = 0
        for i in range((file_size + chunk_size - 1) // chunk_size if chunk_size > 0 else 1):
            size += len(file.read(chunk_size))
        t1 = time.time()
        assert size == file_size
        file.close()
        print(
            f"Read {size / 1e6:.2f} MB in {chunk_size_in_KiB} KiB chunks with {open_file_name} "
            f"in {t1-t0:.2f} s -> {size/(t1-t0)/1e6:.2f} MB/s"
        )
```

Output:
```
Test file sized: 68238807 B
Try to read -1 KiB
Read 68.24 MB in -1 KiB chunks with fsspec_fs in 16.92 s -> 4.03 MB/s
Read 68.24 MB in -1 KiB chunks with sshfs_fs in 2.08 s -> 32.74 MB/s
Try to read 4194304 KiB
Read 68.24 MB in 4194304 KiB chunks with fsspec_fs in 50.17 s -> 1.36 MB/s
Read 68.24 MB in 4194304 KiB chunks with sshfs_fs in 25.15 s -> 2.71 MB/s
Try to read 2097152 KiB
Read 68.24 MB in 2097152 KiB chunks with fsspec_fs in 42.09 s -> 1.62 MB/s
Read 68.24 MB in 2097152 KiB chunks with sshfs_fs in 13.55 s -> 5.04 MB/s
Try to read 1048576 KiB
Read 68.24 MB in 1048576 KiB chunks with fsspec_fs in 42.13 s -> 1.62 MB/s
Read 68.24 MB in 1048576 KiB chunks with sshfs_fs in 7.37 s -> 9.26 MB/s
Try to read 524288 KiB
Read 68.24 MB in 524288 KiB chunks with fsspec_fs in 43.73 s -> 1.56 MB/s
Read 68.24 MB in 524288 KiB chunks with sshfs_fs in 4.67 s -> 14.62 MB/s
Try to read 131072 KiB
Read 68.24 MB in 131072 KiB chunks with fsspec_fs in 42.39 s -> 1.61 MB/s
Read 68.24 MB in 131072 KiB chunks with sshfs_fs in 2.37 s -> 28.78 MB/s
Try to read 4096 KiB
Read 68.24 MB in 4096 KiB chunks with fsspec_fs in 14.38 s -> 4.74 MB/s
Read 68.24 MB in 4096 KiB chunks with sshfs_fs in 2.03 s -> 33.63 MB/s
Try to read 32 KiB
Read 68.24 MB in 32 KiB chunks with fsspec_fs in 14.33 s -> 4.76 MB/s
Read 68.24 MB in 32 KiB chunks with sshfs_fs in 4.35 s -> 15.69 MB/s
```
So with sshfs, requesting GiB-sized reads that only return ~60 MB takes on the order of 13–25 s, while read(-1) takes ~2 s.
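For comparison, plain local Python file objects do not show this behavior: asking for far more bytes than the file contains costs the same as read(-1). A minimal local sanity check (sketch only; the 8 MiB file size is an arbitrary choice, not from the benchmark above):

```python
import os
import tempfile
import time

# Create a small local test file (8 MiB, arbitrary size).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

# Compare read(-1) against a 1 GiB request on the 8 MiB file.
for request in [-1, 1 << 30]:
    with open(path, 'rb') as f:
        t0 = time.time()
        data = f.read(request)  # both return exactly the 8 MiB of content
        t1 = time.time()
    print(f"read({request}): {len(data)} B in {t1 - t0:.4f} s")

os.remove(path)
```

Both requests return the full 8 MiB in roughly the same time, which is the behavior I expected from the SFTP file objects as well.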
I do not know whether this is an issue in this wrapper or in asyncssh itself, because I was unable to adapt my (synchronous) benchmark to use asyncssh directly.
My guess is that some code does O(requested size) work when it should instead iterate in chunks according to the buffer size. I do not see memory usage spiking, so at least nothing that large seems to get allocated.
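The kind of pattern I am guessing at can be sketched like this (purely hypothetical illustration, not the actual sshfs/asyncssh code; BLOCK and the function names are made up): if a reader schedules ceil(requested_size / block_size) block requests up front instead of stopping at EOF, its work scales with the requested size rather than the file size.

```python
import math

BLOCK = 32 * 1024        # assumed transfer block size (hypothetical)
FILE_SIZE = 60 * 10**6   # ~60 MB file, as in the benchmark

def requests_capped(requested):
    """Cap the request at the known file size before chunking."""
    effective = min(requested, FILE_SIZE)
    return math.ceil(effective / BLOCK)

def requests_naive(requested):
    """Schedule block requests for the full requested size, ignoring EOF."""
    return math.ceil(requested / BLOCK)

# A 1 GiB request should cost the same as a 60 MB request on a 60 MB
# file, but the naive strategy issues ~18x more block requests.
for requested in [FILE_SIZE, 1 << 30]:
    print(requested, requests_capped(requested), requests_naive(requested))
```

If something like the naive strategy is at work, it would explain why the times above grow with the requested size even though the returned data is always the same ~60 MB.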