ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) #26024

drew-heenan · 2019-04-08T01:51:16Z

closes pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension #26023
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd · 2019-04-08T02:00:19Z

I am -1 on adding another parameter here. What makes ZIP extraction different from the other compression methods?

drew-heenan · 2019-04-08T02:08:18Z

Other compression methods apply to a single file and add an extension to its name (e.g. extracting data.csv.gz yields data.csv), but each file within a ZIP archive has its own name. As is, calling df.to_csv('data.zip') creates an archive containing a CSV file that is also named data.zip.

An alternative would be to infer the file name, but that would be a breaking change.
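
To make the naming difference concrete, here is a minimal sketch that uses only Python's standard zipfile module, not pandas itself. The helper name zip_member_names and the CSV text are illustrative assumptions; the point is only that a ZIP member carries its own name, independent of the archive's file name.

```python
import io
import zipfile

CSV_TEXT = "a,b\n1,2\n"  # illustrative data, not taken from the PR


def zip_member_names(member_name: str) -> list:
    """Write CSV_TEXT into an in-memory ZIP under member_name, then list members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, CSV_TEXT)
    with zipfile.ZipFile(buf) as zf:
        return zf.namelist()


# Reusing the archive's own name for the member mirrors the behavior described above.
print(zip_member_names("data.zip"))
# Naming the member explicitly yields a file that extracts with a .csv extension.
print(zip_member_names("data.csv"))
```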

gfyoung · 2019-04-08T03:20:06Z

So I understand the use-case. However, I do agree that adding another parameter is not necessarily the best way to do this. This parameter is really related to our compression parameter.

This leads me to think that a dict might make more sense. It allows us to configure the compression without the bloat. What do you think?

cc @jreback (related to #25990 (comment))

codecov · 2019-04-08T03:34:39Z

Codecov Report

Merging #26024 into master will decrease coverage by <.01%.
The diff coverage is 92.59%.

@@            Coverage Diff             @@
##           master   #26024      +/-   ##
==========================================
- Coverage   91.96%   91.95%   -0.01%     
==========================================
  Files         175      175              
  Lines       52405    52425      +20     
==========================================
+ Hits        48193    48207      +14     
- Misses       4212     4218       +6

| Flag | Coverage Δ | |
| --- | --- | --- |
| #multiple | `90.51% <92.59%> (-0.01%)` | ⬇️ |
| #single | `40.72% <37.03%> (-0.14%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| pandas/core/generic.py | `93.54% <100%> (ø)` | ⬆️ |
| pandas/io/formats/csvs.py | `98.23% <100%> (+0.03%)` | ⬆️ |
| pandas/io/common.py | `91.5% <90.9%> (-0.33%)` | ⬇️ |
| pandas/io/gbq.py | `75% <0%> (-12.5%)` | ⬇️ |
| pandas/core/frame.py | `96.9% <0%> (-0.12%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b90f9db...5853a28.

drew-heenan · 2019-04-08T16:15:30Z

@gfyoung I agree that this approach makes more sense - I'll modify the functions to optionally take a dict for compression, i.e. {'compression': 'zip', 'arcname': 'data.csv'}. Thanks for the feedback!

gfyoung · 2019-04-08T18:24:51Z

@drew-heenan : Sounds good! One minor thing: let's use method as the key instead of compression. Otherwise, we'll be extracting compression["compression"], which is a little redundant.

gfyoung · 2019-04-09T07:16:03Z

pandas/io/common.py

+        .. versionchanged:: 0.25.0
+
+           May now be a dict with key 'method' as compression mode
+           and 'arcname' as CSV file name if mode is 'zip'


I would make this whatsnew note a little more generic. In reality, we should be just accepting any keyword arguments to BytesZipFile. Also, we should have an example.
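
As a rough illustration of forwarding arbitrary keyword arguments, the sketch below wraps the standard library's zipfile.ZipFile. It is not pandas' BytesZipFile: the class name NamedZipFile, the method write_text, and the fallback member name are hypothetical, chosen only to show the pattern of taking one extra option and passing everything else through.

```python
import io
import zipfile
from typing import Any, Optional


class NamedZipFile(zipfile.ZipFile):
    """Hypothetical ZipFile wrapper that stores text under a chosen member name."""

    def __init__(
        self, file: Any, mode: str = "w", archive_name: Optional[str] = None, **kwargs: Any
    ) -> None:
        # Default to DEFLATE, but let callers override it like any other ZipFile option.
        kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
        super().__init__(file, mode, **kwargs)
        self.archive_name = archive_name

    def write_text(self, data: str) -> None:
        # Fall back to the archive's own file name, then to a fixed placeholder.
        name = self.archive_name or self.filename or "data.csv"
        self.writestr(name, data)


buf = io.BytesIO()
# compresslevel is a real ZipFile keyword and is simply forwarded.
with NamedZipFile(buf, "w", archive_name="data.csv", compresslevel=9) as zf:
    zf.write_text("a,b\n1,2\n")

with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())
```

Keeping the extra option separate from **kwargs means new ZipFile options need no changes in the wrapper.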

jreback · 2019-04-09T12:27:14Z

pandas/io/formats/csvs.py

        if path_or_buf is None:
            path_or_buf = StringIO()

+        self._compression_arg = compression


didn't you change get_filepath_or_buffer to already handle this? why is this special cased here?

I did not modify get_filepath_or_buffer, though I certainly could so that it supports taking a dict as compression. The self._compression_arg attribute is there so that self.compression still only ever holds the inferred compression method, while self._compression_arg includes any additional arguments when compression is passed as a dict. I now think I should instead have had a self.compression_args attribute hold this dict with the method key popped.

@jreback With regard to get_filepath_or_buffer (and similarly _infer_compression), I've added a function _get_compression_method which handles the case where compression is given as a dict to to_csv, CSVFormatter or _get_handle. It extracts the compression method string before passing it to get_filepath_or_buffer or _infer_compression.
Would it be preferable to keep both functions' original functionality, where compression may only be a string or None since neither needs the additional arguments, or to include a call to _get_compression_method in both to handle dicts?
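
The splitting step described here could look roughly like the sketch below. It is an assumption about the shape of such a helper, not the code in this PR: the name split_compression, the CompressionArg alias, and the error message are hypothetical.

```python
from typing import Dict, Optional, Tuple, Union

# Hypothetical alias: a method string, a dict of options, or None.
CompressionArg = Union[str, Dict[str, str], None]


def split_compression(compression: CompressionArg) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (method, extra_args) for either a string or a dict input."""
    if isinstance(compression, dict):
        args = dict(compression)  # copy so the caller's dict is not mutated
        try:
            method = args.pop("method")
        except KeyError as err:
            raise ValueError("a compression dict must contain the key 'method'") from err
        return method, args
    # Strings and None pass through unchanged, with no extra arguments.
    return compression, {}


print(split_compression({"method": "zip", "archive_name": "data.csv"}))
print(split_compression("gzip"))
```

Functions that only need the method string can then be called with the first element of the tuple and keep their original signatures.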

WillAyd

Could you also add type annotations for the changed / added parameters?

WillAyd · 2019-04-09T19:19:19Z

pandas/io/common.py

-        elif compression == 'zip':
-            zf = BytesZipFile(path_or_buf, mode)
+        elif compression_method == 'zip':
+            arcname = None


Can you rename this to archive_name?

@WillAyd The specific instance of arcname above is no longer there. Do you mean changing arcname to archive_name in general, including the dict key?

drew-heenan · 2019-04-12T10:35:43Z

@WillAyd I've added type annotations to parameters which I've changed, but it seems that doing so caused the typing validation check to fail on other parts of the files which I have not modified.

WillAyd · 2019-04-14T13:47:32Z

Yea I think a general rename would make things clearer

gfyoung · 2019-07-15T23:04:27Z

@WillAyd : Do you plan on annotating the affected methods in this PR? I see that some of the inputs to the affected methods are annotated, but not all of them.

WillAyd · 2019-07-15T23:19:42Z

I can take a look

WillAyd · 2019-07-16T21:40:38Z

OK @gfyoung added annotations where I think feasible. Two blockers prevented full annotations of modified funcs, particularly _get_handle:

An actual bug in typeshed (see TextIOWrapper.encoding missing Optional Argument python/typeshed#3124)
Variable reuse (see f in _get_handle); this requires a refactor that I think expands the diff a little much so better done as follow up

WillAyd · 2019-07-16T21:43:12Z

pandas/io/common.py


    # GH 17778
-    def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs):
+    def __init__(


It might not be clear in the diff but note that I removed the compression argument here. The reason for this was that it is never actually called in code.

It makes annotations more complex, because the keyword argument unpacking of compression_args in this PR only ever has str as keys and therefore mypy complains that there is a type mismatch because int values are never unpacked. Figured easier to just remove than muck around types since it is not ever used
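
A small sketch of the typing situation described above, under the assumption that the extra arguments are typed as Dict[str, str]. The function names are hypothetical and the bodies are reduced to the minimum; both calls run fine, and the difference only shows up under a static type checker.

```python
import io
import zipfile
from typing import IO, Dict


def open_zip_with_param(
    file: IO[bytes], mode: str, compression: int = zipfile.ZIP_DEFLATED, **kwargs: str
) -> zipfile.ZipFile:
    # kwargs are accepted but unused; this sketch is only about the signature.
    return zipfile.ZipFile(file, mode, compression)


def open_zip_without_param(file: IO[bytes], mode: str, **kwargs: str) -> zipfile.ZipFile:
    # With the int-typed parameter removed, DEFLATE is simply hard-coded.
    return zipfile.ZipFile(file, mode, zipfile.ZIP_DEFLATED)


compression_args: Dict[str, str] = {"archive_name": "data.csv"}

# A static checker cannot see which keys the dict holds, so it may compare the
# dict's str values against the int-typed `compression` parameter and complain.
zf_old = open_zip_with_param(io.BytesIO(), "w", **compression_args)
# Here every keyword parameter that the unpacking could reach accepts str.
zf_new = open_zip_without_param(io.BytesIO(), "w", **compression_args)

zf_old.close()
zf_new.close()
```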

WillAyd · 2019-08-24T08:13:34Z

Merged again to keep fresh. There's a Mypy failure I'll have to look at later.

@TomAugspurger @gfyoung would you still have time to review this one coming up? Have a lot in the queue so just want to prioritize workflow with you

gfyoung · 2019-08-24T09:13:58Z

It looks pretty good overall!

WillAyd

OK I think good for review again @gfyoung

WillAyd · 2019-08-25T20:19:38Z

pandas/io/common.py

-    path_or_buf, mode, encoding=None, compression=None, memory_map=False, is_text=True
+    path_or_buf,
+    mode: str,
+    encoding=None,


Couldn't annotate this particular argument due to a minor bug in typeshed. Fixed on master so maybe something we can come back to soon (typeshed updates are pretty quick)

see python/typeshed#3125

WillAyd · 2019-08-26T14:27:48Z

Thanks @drew-heenan

drew-heenan added 3 commits April 7, 2019 21:08

ENH/BUG: Add arcname to to_csv for ZIP compressed csv filename (pandas-dev#26023)

4e73dc4

DOC: Updated docs for arcname in NDFrame.to_csv (pandas-dev#26023)

ab7620d

conform to line length limit

2e782f9

Fixed test_to_csv_zip_arcname for Windows paths

83e8834

drew-heenan marked this pull request as ready for review April 8, 2019 02:59

gfyoung added the Enhancement and IO CSV (read_csv, to_csv) labels Apr 8, 2019

Merge remote-tracking branch 'upstream/master' into issue-26023

d238878

to_csv compression may now be dict with possible keys 'method' and 'arcname'

b41be54

gfyoung reviewed Apr 9, 2019

gfyoung reviewed Apr 9, 2019

jreback requested changes Apr 9, 2019

WillAyd requested changes Apr 9, 2019

drew-heenan added 2 commits April 9, 2019 18:55

test_to_csv_compression_dict uses compression_only fixture

60ea58c

delegate dict handling to _get_compression_method, type annotations

8ba9082

drew-heenan force-pushed the issue-26023 branch 3 times, most recently from 6058bbb to b1889ef Compare April 12, 2019 06:48

fix import order, None type annotations

0a3a9fd

compression args passed as kwargs, update relevant docs

a1cb3f7

drew-heenan force-pushed the issue-26023 branch from b1889ef to a1cb3f7 Compare April 14, 2019 00:19

drew-heenan changed the title ~~ENH: Add arcname to to_csv for ZIP compressed CSV filename (#26023)~~ ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) Apr 14, 2019

WillAyd added 3 commits July 16, 2019 13:52

Merge remote-tracking branch 'upstream/master' into issue-26023

780eb04

Added annotations where feasible

6c4e679

Black and lint

1b567c9

WillAyd reviewed Jul 16, 2019

gfyoung reviewed Jul 17, 2019

WillAyd added 3 commits July 17, 2019 08:10

Merge remote-tracking branch 'upstream/master' into issue-26023

9324b63

isort fixup

7cf65ee

Docstring fixup and more annotations

29374f3

WillAyd added this to the 1.0 milestone Aug 24, 2019

Merge remote-tracking branch 'upstream/master' into issue-26023

6701aa4

WillAyd added 5 commits August 24, 2019 16:44

lint fixup

0f5489d

mypy fixup

e04138e

whatsnew fixup

6f2bf00

Annotation and doc fixups

865aa81

mypy typeshed bug fix

8d1deee

WillAyd reviewed Aug 25, 2019

gfyoung approved these changes Aug 26, 2019

WillAyd approved these changes Aug 26, 2019

WillAyd merged commit 0d0daa8 into pandas-dev:master Aug 26, 2019

jreback mentioned this pull request Sep 5, 2019

EHN: Add encoding_errors option in pandas.DataFrame.to_csv (#27750) #27899

Closed

5 tasks

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (pandas-dev#26023) (pandas-dev#26024)

892233e

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (pandas-dev#26023) (pandas-dev#26024)

ba39c48

flutefreak7 mentioned this pull request Feb 12, 2020

to_csv compression dict option 'archive_name' should accept os.PathLike #31934

Open

MarcoGorelli mentioned this pull request Mar 2, 2021

TYP ensure bool_t is always used in pandas/core/generic.py #40175

Merged

1 task
