Skip to content

Conversation

@twoertwein
Copy link
Member

@twoertwein twoertwein commented Aug 9, 2020

to_csv let's the user set all keyword arguments for gzip. Depending on whether the user provides a filename or a file object different keyword arguments can be set (gzip.open vs gzip.GzipFile).

This PR always uses gzip.GzipFile. The additional keyword arguments valid for gzip.open but not valid for gzip.GzipFile (encoding, errors, and newline) are still accessible:

g = TextIOWrapper(f, encoding=encoding, errors=errors, newline="")

Using gzip.GzipFile, also allows us to set mtime to create reproducible gzip archives.

Copy link
Contributor

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

might we need to update the docstring or do you think it's good as is?

@twoertwein
Copy link
Member Author

updating the doc string is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy instead of copy pasting the same docstring multiple times?

@arw2019
Copy link
Contributor

arw2019 commented Aug 11, 2020

updating the doc string is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy instead of copy pasting the same docstring multiple times?

You could maybe add the more explicit explanation to doc/source/user_guide/io.rst, under Quoting, compression, and file format, and add brief Changed in version 1.2.0 notes in docstrings of affected methods

@twoertwein
Copy link
Member Author

The PR adding arguments for bz2/gzip #33398 mentioned that it affects to_csv, to_pickle, and to_json in the whatsnew but only updated the docstring for to_csv.

I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Aug 12, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. sligthly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

@jreback jreback added this to the 1.2 milestone Aug 12, 2020
@jreback
Copy link
Contributor

jreback commented Aug 12, 2020

cc @gfyoung @WillAyd @TomAugspurger if comments.

@twoertwein
Copy link
Member Author

twoertwein commented Aug 12, 2020

looks good. sligthly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

I will look into that, I assume it is going to be:

class CompressionArgs(TypedDict, total=False):
    method: str
    compresslevel: Optional[int]
    mtime:Optional[int]
    compression:int
    allowZip64:bool
    strict_timestamps:bool

technically, there are a few more but users should not pass them (filename, fileobj, buffer (deprecated since python 3.0), mode).

I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

Do you have opinions about that? compression does not only affect to_csv.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - nice PR

@pep8speaks
Copy link

pep8speaks commented Aug 12, 2020

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-13 02:56:33 UTC

@twoertwein
Copy link
Member Author

twoertwein commented Aug 12, 2020

oh, I didn't know that TypedDict requires python 3.8. I will simply use Mapping[str, Optional[Union[str, int, bool]]].

except (IOError, AttributeError):
pass
for file_handle in self.file_handles:
file_handle.close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably unrelated to the recent CI issues, but we should definitely close those handles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, is there a ResoucceWarning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any when reading/writing json files

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Aug 13, 2020
@jreback jreback merged commit 59febbd into pandas-dev:master Aug 13, 2020
@jreback
Copy link
Contributor

jreback commented Aug 13, 2020

thanks @twoertwein very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

IO Data IO issues that don't fit into a more specific label Typing type annotations, mypy/pyright type checking

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deterministic gzip compressed outputs

6 participants