Skip to content

Conversation

@scottgigante
Copy link
Contributor

@scottgigante scottgigante commented May 13, 2020

@scottgigante scottgigante marked this pull request as ready for review May 13, 2020 15:56
@scottgigante
Copy link
Contributor Author

I don't have authority to request reviews, so @TomAugspurger @mroeschke @jorisvandenbossche

@jreback jreback added the Sparse Sparse Data Type label May 14, 2020
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from a4744ea to 3c96b60 Compare May 14, 2020 15:15
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label May 14, 2020
@jreback jreback added this to the 1.1 milestone May 14, 2020
@jorisvandenbossche
Copy link
Member

I am not sure this fix is entirely correct. While it certainly fixed the original buggy case, like:

In [7]: df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 

In [8]: df1.loc[df1['B'] != 2]       
Out[8]: 
   A  B
0  0  1
2  0  3

(before, this gave NaNs in the first column).

But it also seems to introduce a new bug, eg when doing a reindex:

In [9]: df1.reindex([0,1,3]) 
Out[9]: 
   A    B
0  0  1.0
1  0  2.0
3  0  NaN

This should give a NaN and not 0 at row 3, I think?

@scottgigante
Copy link
Contributor Author

So here's the problem as far as I understand: in https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1219 , blk.fill_value is used to mean the value which is used when you take an index that doesn't exist. In sparse, dtype.fill_value is used to mean the value which is used for filling sparse entries. I don't see an easy way of disentangling these.

@jorisvandenbossche
Copy link
Member

The naming is potentially a bit confusing here (fill_value in take is generally pointing to what is called "na_value" elsewhere).
But so the question is: for SparseArray, should we be using the "SparseDtype.fill_value" when doing a reindex, or rather the "SparseDtype.na_value". It's not directly clear to me which of the two is correct / desired.

Using "na_value" feels more correct (that's what reindexing does for all other array types), but using "fill_value" can avoid densifying the sparse array in a reindexing operation.

Looking at SparseArray.take itself:

In [15]: a = pd.arrays.SparseArray([1.0, 0.0, np.nan, 0.0], fill_value=0.0)    

In [16]: a       
Out[16]: 
[1.0, 0.0, nan, 0.0]
Fill: 0.0
IntIndex
Indices: array([0, 2], dtype=int32)

In [18]: a.take([0, 1, -1, -1], allow_fill=True)                                                                                                                                                                   
Out[18]: 
[1.0, 0.0, nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 2, 3], dtype=int32)

Here it is clearly "na_value" that is used by default.

@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from 3c96b60 to 0e8c204 Compare May 17, 2020 21:39
@scottgigante
Copy link
Contributor Author

I'm fundamentally unsure how to make this happen, if this is the desired functionality:

>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 
>>> df1.reindex([0,1,3]) 
   A    B
0  0  1.0
1  0  2.0
3  NaN  NaN

@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch 2 times, most recently from e8d7648 to c161ae9 Compare May 18, 2020 01:24
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from c161ae9 to 5188e03 Compare May 20, 2020 14:21
@jreback
Copy link
Contributor

jreback commented May 25, 2020

this is fine, can merge on green.

@jreback jreback merged commit a94b13a into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @scottgigante

@jorisvandenbossche
Copy link
Member

@jreback as shown by the discussion above (and @scottgigante's "I'm fundamentally unsure how to make this happen"), this PR was clearly not ready to be merged.

@scottgigante sorry for the slow follow-up. It's however a complicated topic with not directly a clear easy solution (as you experienced yourself), so it requires some time to delve in, which I didn't find yet

@jorisvandenbossche
Copy link
Member

@jreback can you react here? Otherwise I am planning to revert this change. It's not clear to me that this is the right fix, as it possibly causes another regression

@jreback
Copy link
Contributor

jreback commented Jun 3, 2020

@jorisvandenbossche I am unclear what you think is the problem. This closed a few issues w/o causing others. If you want to revert ok by me, but the soln looked fine.

@scottgigante
Copy link
Contributor Author

scottgigante commented Jun 3, 2020

@jreback @jorisvandenbossche it's not clear to me that the (unintended) consequence of this PR is a regression but it's worth discussing: what should be the outcome of this action?

df1 = pd.DataFrame({"A": pd.arrays.SparseArray([1, 2, 0]), 'B': [1,2,3]})
df1.reindex([0,1,3])

Current behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 NaN  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
1.0

New behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 0.0  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
0.6666666666666666

I don't know that filling a sparse array with non-sparse values when reindexing by a nonexistent index is the right thing to do, but if that is desirable then this is a regression (and I don't know how to fix it).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 4, 2020

Sparse aside: in principle, reindexing always introduces missing values when new labels are present in the indexer.
(as the non sparse column B does in the example above)

So the question for sparse is: do we follow that rule, or do we deviate by using the "fill value" instead of a "missing value" (na_value).

And as I showed in #34158 (comment), SparseArray.take uses na_value and not fill_value. If we think this behaviour is correct, then reindex should also do that IMO

@TomAugspurger
Copy link
Contributor

Reindex should always introduce missing values. This behavior on master looks incorrect:

In [6]: pd.Series([0, 1, 2], dtype=pd.SparseDtype(int)).reindex([0, 1, 3])
Out[6]:
0    0.0
1    1.0
3    NaN
dtype: Sparse[float64, 0]

In [7]: pd.DataFrame({"A": pd.SparseArray([0, 1, 2], dtype=pd.SparseDtype(int))}).reindex([0, 1, 3])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: The pandas.SparseArray class is deprecated and will be removed from pandas in a future version. Use pandas.arrays.SparseArray instead.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[7]:
   A
0  0
1  1
3  0

In [8]: _.dtypes
Out[8]:
A    Sparse[int64, 0]
dtype: object

Series is correct, but DataFrame is filling with fill_value rather than the na value.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Sparse Sparse Data Type

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sparse DataFrame with all-zeros column returns NA for fancy index Filtering dataframe with sparse column leads to NAs in sparse column

4 participants