Skip to content

Conversation

@drew-heenan
Copy link
Contributor

@WillAyd
Copy link
Member

WillAyd commented Apr 8, 2019

I am -1 on adding another parameter here. What makes Zip extraction different than the other compressions methods?

@drew-heenan
Copy link
Contributor Author

drew-heenan commented Apr 8, 2019

Other compression methods apply to a single file and add an extension to the name (i.e. data.csv.gz extracted results in data.csv), but ZIP archives are meant to have each file within have a name. As is, calling df.to_csv('data.zip') creates an archive containing a CSV file named data.zip as well.

An alternative would be to infer the file name, but that would be a breaking change.

@drew-heenan drew-heenan marked this pull request as ready for review April 8, 2019 02:59
@gfyoung gfyoung added Enhancement IO CSV read_csv, to_csv labels Apr 8, 2019
@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

So I understand the use-case. However, I do agree that adding another parameter is not necessarily the best way to do this. This parameter is really related to our compression parameter.

This leads me to think that a dict might make more sense. It allows us to configure the compression without the bloat. What do you think?

cc @jreback (related to #25990 (comment))

@codecov
Copy link

codecov bot commented Apr 8, 2019

Codecov Report

Merging #26024 into master will decrease coverage by <.01%.
The diff coverage is 92.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26024      +/-   ##
==========================================
- Coverage   91.96%   91.95%   -0.01%     
==========================================
  Files         175      175              
  Lines       52405    52425      +20     
==========================================
+ Hits        48193    48207      +14     
- Misses       4212     4218       +6
Flag Coverage Δ
#multiple 90.51% <92.59%> (-0.01%) ⬇️
#single 40.72% <37.03%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 93.54% <100%> (ø) ⬆️
pandas/io/formats/csvs.py 98.23% <100%> (+0.03%) ⬆️
pandas/io/common.py 91.5% <90.9%> (-0.33%) ⬇️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b90f9db...5853a28. Read the comment docs.

@drew-heenan
Copy link
Contributor Author

@gfyoung I agree that this approach makes more sense - I'll modify the functions to optionally take a dict for compression, i.e. {'compression': 'zip', 'arcname': 'data.csv'}. Thanks for the feedback!

@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

@drew-heenan : Sounds good! One minor thing: let's use method as the key instead of compression. Otherwise, we'll be extracting compression["compression"], which is a little redundant.

.. versionchanged:: 0.25.0
May now be a dict with key 'method' as compression mode
and 'arcname' as CSV file name if mode is 'zip'
Copy link
Member

@gfyoung gfyoung Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this whatsnew note a little more generic. In reality, we should be just accepting any keyword arguments to BytesZipFile. Also, we should have an example.

if path_or_buf is None:
path_or_buf = StringIO()

self._compression_arg = compression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't you change get_filepath_or_buffer to already handle this? why is this special cased here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not modify get_filepath_or_buffer, though I suppose I certainly could to support taking a dict as compression. The self._compression_arg is there to avoid changing self.compression from only ever holding the inferred compression method, while self._compression_arg would include any additional arguments if compression was passed as a dict. I do now think that I should have instead had a self.compression_args hold this dict with the method key popped.

Copy link
Contributor Author

@drew-heenan drew-heenan Apr 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback With regard to get_filepath_or_buffer (and similarly _infer_compression), I've added a function _get_compression_method which handles the case where compression is given as a dict to to_csv, CSVFormatter or _get_handle. It extracts the compression method string before passing to get_filepath_buffer or _infer_compression.
Would it be preferable then to keep both functions' original functionality where compression may only be a string/None, as neither need the additional arguments, or include a call to _get_compression_method in both to handle dicts?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add type annotations for the changed / added parameters?

elif compression == 'zip':
zf = BytesZipFile(path_or_buf, mode)
elif compression_method == 'zip':
arcname = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to archive_name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd The specific instance of arcname above is no longer there; Though, do you mean change arcname to archive_name in general, including the dict key?

@drew-heenan drew-heenan force-pushed the issue-26023 branch 3 times, most recently from 6058bbb to b1889ef Compare April 12, 2019 06:48
@drew-heenan
Copy link
Contributor Author

@WillAyd I've added type annotations to parameters which I've changed, but it seems that doing so caused the typing validation check to fail on other parts of the files which I have not modified.

@drew-heenan drew-heenan changed the title ENH: Add arcname to to_csv for ZIP compressed CSV filename (#26023) ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) Apr 14, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 14, 2019 via email

@gfyoung
Copy link
Member

gfyoung commented Jul 15, 2019

@WillAyd : Do you plan on annotating on affected methods in this PR? I see that some of the inputs to some of the affected methods are annotated, but not all of them.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

I can take a look

@WillAyd
Copy link
Member

WillAyd commented Jul 16, 2019

OK @gfyoung added annotations where I think feasible. Two blockers prevented full annotations of modified funcs, particularly _get_handle:

  1. An actual bug in type shed (see TextIOWrapper.encoding missing Optional Argument python/typeshed#3124)
  2. Variable reuse (see f in _get_handle); this requires a refactor that I think expands the diff a little much so better done as follow up


# GH 17778
def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs):
def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be clear in the diff but note that I removed the compression argument here. The reason for this was that it is never actually called in code.

It makes annotations more complex, because the keyword argument unpacking of compression_args in this PR only ever has str as keys and therefore mypy complains that there is a type mismatch because int values are never unpacked. Figured easier to just remove than muck around types since it is not ever used

@WillAyd WillAyd added this to the 1.0 milestone Aug 24, 2019
@WillAyd
Copy link
Member

WillAyd commented Aug 24, 2019

Merged again to keep fresh. There's a Mypy failure I'll have to look at later.

@TomAugspurger @gfyoung would you still have time to review this one coming up? Have a lot in the queue so just want to prioritize workflow with you

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2019

It looks pretty good overall!

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think good for review again @gfyoung

path_or_buf, mode, encoding=None, compression=None, memory_map=False, is_text=True
path_or_buf,
mode: str,
encoding=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't annotate this particular argument due to a minor bug in typeshed. Fixed on master so maybe something we can come back to soon (typeshed updates are pretty quick)

see python/typeshed#3125

@WillAyd WillAyd merged commit 0d0daa8 into pandas-dev:master Aug 26, 2019
@WillAyd
Copy link
Member

WillAyd commented Aug 26, 2019

Thanks @drew-heenan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement IO CSV read_csv, to_csv

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension

6 participants