Skip to content

Conversation

charlesdong1991
Copy link
Contributor

@pep8speaks
Copy link

pep8speaks commented Jan 17, 2019

Hello @charlesdong1991! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 04, 2019 at 17:16 Hours UTC

@codecov
Copy link

codecov bot commented Jan 17, 2019

Codecov Report

Merging #24819 into master will decrease coverage by 49.47%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24819       +/-   ##
===========================================
- Coverage   92.38%    42.9%   -49.48%     
===========================================
  Files         166      166               
  Lines       52379    52385        +6     
===========================================
- Hits        48392    22478    -25914     
- Misses       3987    29907    +25920
Flag Coverage Δ
#multiple ?
#single 42.9% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 9.43% <0%> (-84.84%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512830b...cbb36b2. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 17, 2019

Codecov Report

Merging #24819 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24819      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52420    52406      -14     
==========================================
- Hits        48423    48409      -14     
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.79% <100%> (-0.01%) ⬇️
#single 42.87% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.48% <ø> (ø) ⬆️
pandas/core/internals/managers.py 96.06% <100%> (-0.01%) ⬇️
pandas/core/common.py 98.37% <0%> (-0.04%) ⬇️
pandas/core/frame.py 96.81% <0%> (-0.02%) ⬇️
pandas/io/formats/html.py 99.34% <0%> (-0.01%) ⬇️
pandas/core/accessor.py 98.79% <0%> (ø) ⬆️
pandas/core/ops.py 94.28% <0%> (ø) ⬆️
pandas/core/internals/blocks.py 94.17% <0%> (ø) ⬆️
pandas/core/sparse/scipy_sparse.py 100% <0%> (ø) ⬆️
pandas/core/generic.py 96.63% <0%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb43726...71729b2. Read the comment docs.

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Can you add a whatsnew note?

@jschendel jschendel added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 17, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bug here, where None in a tuple is coerced to a string, however, I am a bit iffy about automagically coercing None to empty string. Yes its convient, but if you have non-string labels then it is unexpected.

b = pd.DataFrame({col2: [4, 5, 6]})

df = a.merge(b, left_index=True, right_index=True, suffixes=suffixes)
assert df.columns.tolist() == expected_cols
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fully construct the result frame here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, will change

self.right_on = com.maybe_make_list(right_on)

self.copy = copy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not break if suffixes=None is passed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will indeed break, but do we expect to get None for suffixes? the default is ("_x", "_y"). I could understand that people may want to use suffixes like: (None, "_y") if they want to keep the name of one of original columns... but use case for suffixes=None seems rare...

- :meth:`Series.unstack` and :meth:`DataFrame.unstack` no longer convert extension arrays to object-dtype ndarrays. Each column in the output ``DataFrame`` will now have the same dtype as the input (:issue:`23077`).
- Bug when grouping :meth:`Dataframe.groupby()` and aggregating on ``ExtensionArray`` it was not returning the actual ``ExtensionArray`` dtype (:issue:`23227`).
- Bug in :func:`pandas.merge` when merging on an extension array-backed column (:issue:`23020`).
- Bug in :func:`pandas.merge` when setting None in suffixes (:issue: `24782`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be more explicit on what is changing here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I will change later once collecting more feedbacks from other committers. @jreback

@jreback
Copy link
Contributor

jreback commented Jan 18, 2019

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring needs to be updated. Currently it's

suffixes : tuple of (str, str), default ('_x', '_y')
    Suffix to apply to overlapping column names in the left and right
    side, respectively. To raise an exception on overlapping columns use
    (False, False).

That'll need to be something like tuple of (str or None, str or None).

Even that type isn't quite right... Since (False, False) is apparently valid.


@pytest.mark.parametrize("col1, col2, suffixes, expected_cols", [
(0, 0, ("", "_dup"), ["0", "0_dup"]),
(0, 0, (None, "_dup"), ["0", "0_dup"]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this get's the expected output of @simonjayhawkins in #24782, does it?

IIUC, @simonjayhawkins expected for (None, '_dup') to not cast the original input to a string. So the expected columns would be [0, "0_dup"]. Is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even that type isn't quite right... Since (False, False) is apparently valid.

Yeah, it looks like pretty much anything is actually valid right now, e.g.

In [8]: df.merge(df, left_index=True, right_index=True, suffixes=(np.sum, pd.DataFrame))
Out[8]:
   0<function sum at 0x000002550602C840>  0<class 'pandas.core.frame.DataFrame'>
0                                      1                                       1

In [9]: df.merge(df, left_index=True, right_index=True, suffixes=(list('abc'), pd.Series(list('xyz'))))
Out[9]:
   0['a', 'b', 'c']  00    x\n1    y\n2    z\ndtype: object
0                 1                                       1                                   1                                       1

Looks to be caused by items_overlap_with_suffix just hitting whatever is passed with a format:

def items_overlap_with_suffix(left, lsuffix, right, rsuffix):
"""
If two indices overlap, add suffixes to overlapping entries.
If corresponding suffix is empty, the entry is simply converted to string.
"""
to_rename = left.intersection(right)
if len(to_rename) == 0:
return left, right
else:
if not lsuffix and not rsuffix:
raise ValueError('columns overlap but no suffix specified: '
'{rename}'.format(rename=to_rename))
def lrenamer(x):
if x in to_rename:
return '{x}{lsuffix}'.format(x=x, lsuffix=lsuffix)
return x
def rrenamer(x):
if x in to_rename:
return '{x}{rsuffix}'.format(x=x, rsuffix=rsuffix)
return x
return (_transform_index(left, lrenamer),
_transform_index(right, rrenamer))

IIUC, @simonjayhawkins expected for (None, '_dup') to not cast the original input to a string.

Yeah, I overlooked this earlier but agree; it makes sense to me that None should not cast to string and instead retain the original type of the column label in question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! @TomAugspurger @jschendel i overlooked it... i changed the code a bit, and added several more test cases
And @jschendel I also feel like we should add some restrictions to the suffix type, apparently, it looks wired if any type is given as suffixes... is it worth it if i open another PR to address it?

expected = pd.DataFrame([[1, 4], [2, 5], [3, 6]],
columns=expected_cols)

result = a.merge(b, left_index=True, right_index=True, suffixes=suffixes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many tests in test_merge.py repeat using both the left.merge(right, **kwargs) and the pd.merge(left, right, **kwargs) syntax.

i'm not sure how important that is, so unless one of the maintainers comments otherwise this lgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your feedback, @simonjayhawkins ! if one of maintainers brings this up and wants it to be standardised in this PR, i will then fix it! otherwise, as you said, I will open another PR to tackle it!

@simonjayhawkins
Copy link
Member

does this not break if suffixes=None is passed?

i guess that since duplicate labels are allowed in an index then suffixes=None should be a valid option. future PR though.

suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
side, respectively. Except for tuple of (str, str), it also allows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. Is suffixes=[None, None] valid?

Would the following be correct?

Length-2 sequence. Each element should be a str or None.

    * str : the string is appended to the overlapping column label.
    * None : the column label is left as-is.

Ah, I see now that it's invalid, because we won't return a frame with duplicate labels. But presumably, a [None, 'a'] is valid (could you add a test for that?). So perhaps in the docstring note that at least one of the values must not be None.

Copy link
Contributor Author

@charlesdong1991 charlesdong1991 Jan 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Tom for your comment! @TomAugspurger
you are right, a (None, None) will raise an error which was in the code already. And a (None, 'a') will be valid, and None refers to the column label is left as it is... I think i already had this (None, 'a') in the pytest file. Shall I add another one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a test where suffixes was a list like [None, 'a'], just tuples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, sorry, i misunderstood... my bad...
Thx for your review! tests added! And also docstring is updated! @TomAugspurger

Copy link
Member

@toobaz toobaz Feb 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now that it's invalid, because we won't return a frame with duplicate labels

So perhaps in the docstring note that at least one of the values must not be None.

@TomAugspurger Not following you there... we do return frames with duplicate labels if 1) we don't pass "suffixes", or 2) we pass "suffixes" such that they result duplicated labels. So I don't see anything offensive in having suffixes=(None, None) result in duplicated labels.

EDIT: had missed @simonjayhawkins 's comment below, stating basically the same thing.

Suffix to apply to overlapping column names in the left and right
side, respectively. Except for tuple of (str, str), it also allows
tuple of (None, str) or (str, None)
side, respectively. And each element should be either a str or None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is still a bit confusing, and I don't think listing each combination really helps. How about.



suffixes : Sequence
    A length-2 sequence where each element is a optionally
    a string indicating the suffix to add to overlapping column
    names in `left` and `right` respectively. Pass a value of
    `None` instead of a string to indicate that the column names
    from `left` or `right` should be left as-is, with no suffix.
    At least one of the values must not be None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Tom! and sorry for my bad docstring! and yours does look much clearer! I just removed a redundant 'a' ^^ @TomAugspurger

suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
suffixes : Sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring will also need the default values included

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yeah, added! thx for the carefulness!! @simonjayhawkins

("a", "a", ("_x", None), ["a_x", "a"]),
("a", "b", ("_x", None), ["a", "b"]),
("a", "a", [None, "_x"], ["a", "a_x"]),
(0, 0, ["_a", None], ["0_a", 0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for added assurance can you also add a couple of regression tests here

("a", "a", None, ["a_x", "a_y"]),
(0, 0, None, ["0_x", "0_y"])

should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean ("a", "a", ("_x", "_y"), ["a_x", "a_y"]) probably? suffix cannot accept None for now. @simonjayhawkins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! just added a small new test function to cover it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not break if suffixes=None is passed?

i guess that since duplicate labels are allowed in an index then suffixes=None should be a valid option. future PR though.

suffixes=None currently raises TypeError: 'NoneType' object is not iterable.

Is suffixes=[None, None] valid?

suffixes=[None, None] currently raises ValueError: columns overlap but no suffix specified:

so i guess suffixes=None should be equivalent to suffixes=[None, None]. future PR thoiugh.


@pytest.mark.parametrize("suffixes", [
(None, None),
('', None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the changes you've made the left label would be cast to a string and the right label unchanged. so technically this shouldn't fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, they are indeed different types, however, before applying the renamer function, there is a suffix check, which means None and '' cannot be assigned at the same time, so ('', None) will raise an error.

if not lsuffix and not rsuffix:
     raise ValueError('columns overlap but no suffix specified: '
             '{rename}'.format(rename=to_rename))

Do you want it to be changed? @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring reads At least one of the values must not be None. and ('',None) satisfies that.

Do you want it to be changed?

what would be the extent of the changes to make this work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added already!
allow (None, '') for number column, and if column name is string, (None, '') or (None, None) will raise error

`right` should be left as-is, with no suffix. At least one of the
values must not be None.
values must not be None. A combination of `''` and `None` is will
raise error for columns which type is string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my opinion is that adding A combination of `''` and `None` is will raise error for columns which type is string to the docstring is probably unnecessary.

it makes me wonder whether the (None,None) combination should be addressed within this PR after all. if the 'columns overlap but no suffix specified' check is removed, what breaks?

Copy link
Contributor Author

@charlesdong1991 charlesdong1991 Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean (None, None) combination, i am a bit confused? we don't expect column names to be overlapped, do we?
I think now the suffixes takes in length-2 sequence correctly given different types of suffix, the only thing now is suffixes=None will raise an error, since now it requires a length-2 sequence, the code should slightly be changed in _MergeOperation class. But looks like it's out of scope of this PR although it's easy to fix...
I would advise to create another PR to address this issue, and i can work on this new PR to tackle it. and thanks for your review, look forward to hearing your opinion! @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean (None, None) combination

i mean the ability to return a DataFrame with duplicate column labels if so desired. i'm not sure why this isn't allowed. maybe it breaks things or maybe that check is not necessary.

if it was allowed would it make for a simpler api description in the docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could leave it for maintainer's discussion, if they want it, I will open a follow-up PR to address it.
Thanks for your feedback, and I just shortened the docstring! @simonjayhawkins

@charlesdong1991
Copy link
Contributor Author

charlesdong1991 commented Jan 27, 2019

@jreback @TomAugspurger @jschendel @simonjayhawkins I just moved whatsnew from 0.24.0 to 0.24.1 because the new version of Pandas had been released. Please take a look at this PR to see what else I should change. Thanks!

@jreback
Copy link
Contributor

jreback commented Feb 1, 2019

@charlesdong1991 this would be for 0.25.0 anyhow, pls merge master

@charlesdong1991
Copy link
Contributor Author

charlesdong1991 commented Feb 3, 2019

After discussing with @jorisvandenbossche , and as he suggested, this PR will only solve the case when at most one element is None, like (None, str) or (str, None) will remain column as-is when suffix is None. And for cases like (None, None) or None or False, (False, False), I will open a follow-up issue and PR to discuss and fix.

@jorisvandenbossche
Copy link
Member

To be clear, the reason to leave out the (None, None) for now is that it already does something else: it does the same as (False, False) (raising an error if there will be duplicates). This is not documented or tested, so we could maybe just change it (or otherwise first deprecate it), but let's leave that for the follow-up issue for discussion.

@charlesdong1991
Copy link
Contributor Author

charlesdong1991 commented Feb 3, 2019

I just did another change based on what we discussed before i left the sprint @jorisvandenbossche , and @toobaz please also take a look to see if we could merge and close this, and then, I will open another issue and PR addressing the potential issue and change for suffixes.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some small comments.
Can you also update the docstring of pd.merge ?

a = pd.DataFrame({col1: [1, 2, 3]})
b = pd.DataFrame({col2: [3, 4, 5]})

msg = "columns overlap but no suffix specified"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here a comment like # TODO reconsider current behaviour of raising, see #.... (with a link to the issue).
Just in case later on when we want to change this, we won't think like "oh no, we cannot change it, because it is tested behaviour")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, added!

suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
suffixes : Sequence or None, default is ("_x", "_y")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The or None can be removed I think (we removed that change from this PR)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments, ping on green.

Reshaping
^^^^^^^^^

- Bug in :func:`pandas.merge` doesn't work correctly if None is in suffixes (:issue: `24782`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be a bit more clear on what the previous sympton was, instead of 'doesn't work correctly'

double backticks on None

no space after the colon (:issue:`24782`)

return '{x}{lsuffix}'.format(x=x, lsuffix=lsuffix)
return x
def renamer(x, suffix):
"""Rename the left and right indices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a proper doc-string (Parameters / Returns)

(0, 0, [None, None]),
(0, 0, (None, ""))
])
def test_merge_error(col1, col2, suffixes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_merge_suffix_error

`left` and `right` respectively. Pass a value of `None` instead
of a string to indicate that the column name from `left` or
`right` should be left as-is, with no suffix. At least one of the
values must not be None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionchanged 0.25.0 here

@jreback
Copy link
Contributor

jreback commented Feb 4, 2019

suffixes=None is still invalid right (e.g. passing an actual None value)? this has a test?

@charlesdong1991
Copy link
Contributor Author

thanks @jreback for your review... all changed based on review, pls let me know if you have other questions.

@jreback jreback added this to the 0.25.0 milestone Feb 6, 2019
@jreback jreback merged commit 09633b8 into pandas-dev:master Feb 6, 2019
@jreback
Copy link
Contributor

jreback commented Feb 6, 2019

thanks @charlesdong1991

@jorisvandenbossche
Copy link
Member

@charlesdong1991 do you want open a new issue for changing the (None, None) case and allowing None and False ?

@charlesdong1991
Copy link
Contributor Author

@jorisvandenbossche yes! I will open a new issue addressing None related case later today or tomorrow!!

@simonjayhawkins
Copy link
Member

@charlesdong1991 : the Panel.join tests have now been removed, xref #25191. so there should no longer be issues with the code shared between Panel.join and DataFrame.merge with regard to the suffixes=(False, False) behaviour that was causing us problems at europandas.

@charlesdong1991
Copy link
Contributor Author

@simonjayhawkins thanks a lot!! I also noticed it when merging my branch to master!!

@jorisvandenbossche sorry that recently i was quite busy so didn't have time to work on new PR, i just created a new PR to address None/False behaviour.

@simonjayhawkins simonjayhawkins mentioned this pull request May 16, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.merge(suffixes=) does not respect None
8 participants