
Conversation

@DPeterK
Member

DPeterK commented Mar 13, 2015

Added a warning when subsetting a cube will cast a masked array to a numpy array.

This will happen when a cube's data attribute is a masked array with no masked points: on subsetting, the mask is dropped (i.e. the array is filled). This currently happens silently, which is undesirable if not expected.
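The behaviour can be reproduced in plain NumPy terms. This is a minimal sketch of the pattern being described, not Iris's actual implementation (`subset` is a hypothetical name):

```python
import numpy as np
import numpy.ma as ma

def subset(data, key):
    """Sketch (not Iris's actual code) of the behaviour described above:
    if the subsetted result has no masked points, the mask is silently
    dropped and a plain ndarray comes back."""
    result = data[key]
    if ma.isMaskedArray(result) and ma.count_masked(result) == 0:
        return ma.filled(result)  # plain np.ndarray
    return result

arr = ma.masked_array(np.arange(6.0), mask=[False] * 6)
print(type(subset(arr, slice(0, 3))))  # mask is lost: <class 'numpy.ndarray'>
```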

@pelson
Member

pelson commented Mar 16, 2015

This currently happens silently, which is undesirable if not expected.

Out of interest, why is this undesirable? If it is so undesirable, I'd back not having a warning and just not doing it. I'm not too fussed which way it goes, but I'm not a fan of the middle ground (aka warnings).

@ajdawson
Member

I don't like changing the type of the array when slicing; I think we just shouldn't do it. Does anyone know why we did it like this in the first place? Was it for efficiency?

@rhattersley
Member

Was it for efficiency?

That's what I remember. Switching to masked-arrays made everything slower, so the workaround was to avoid masked arrays where possible. A lot has changed since then though, including Iris optimisation of masked array creation, so it's quite possible the workaround is now unnecessary.

@DPeterK
Member Author

DPeterK commented Mar 17, 2015

In the interests of experimentation I tried running the tests with the if block that fills the mask (i.e. cube L1936-L1939) removed. This caused no test failures other than the test I added in this PR.

@rhattersley
Member

In the interests of experimentation I tried running the tests with the if block that fills the mask (i.e. cube L1936-L1939) removed.

It would be interesting to check the performance impact. For example, the execution time/memory load of running the tests, the examples, specific performance metrics, etc.
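As a first-order check, a hypothetical micro-benchmark along these lines (plain NumPy, illustrative sizes; not a substitute for running the full test suite) would give a rough feel for the cost of operating on a masked array versus an equivalent ndarray:

```python
import timeit
import numpy as np
import numpy.ma as ma

n = 1_000_000
plain = np.random.rand(n)
masked = ma.masked_array(plain, mask=np.zeros(n, dtype=bool))

# Time the same elementwise expression on both representations.
t_plain = timeit.timeit(lambda: plain * 2.0 + 1.0, number=20)
t_masked = timeit.timeit(lambda: masked * 2.0 + 1.0, number=20)

print(f"plain ndarray: {t_plain:.3f}s  masked array: {t_masked:.3f}s")
```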

@ajdawson
Member

ajdawson commented Jun 8, 2016

@dkillick - In light of #2046 + several threads on the Google Group referring to the loss of mask when slicing/extracting, is there a chance you could revive this work? If we can determine any performance impact of removing the fill of masked arrays we could make a decision and sort this out.

@DPeterK
Member Author

DPeterK commented Jun 8, 2016

@ajdawson we've a bit of internal pressure at the moment, so I won't be able to jump on it immediately. I'll add it to the v2.0 milestone though so that this isn't forgotten when time does become available.

@DPeterK added this to the v2.0 milestone Jun 8, 2016
@ajdawson
Member

ajdawson commented Jun 8, 2016

Great, thanks @dkillick

@DPeterK
Member Author

DPeterK commented Aug 1, 2017

It appears that this is still a live issue, and we still need to reach a decision on whether to return a masked array or a filled array (i.e. an ndarray).

@lbdreyer and I just tested for this behaviour on the dask masked array branch. We found that:

  • realised masked data that is subsetted will be filled, with the result being an ndarray (that is, the existing behaviour), but
  • dask lazy masked data that is subsetted and then realised will not be filled, with the result remaining a masked array with no masked points.

The difference in behaviour between lazy and real data is highly undesirable and drives home the point that a consistent solution to this issue is still required.

@ajdawson
Member

ajdawson commented Aug 1, 2017

The obvious way to reach consistency is to leave masked arrays alone and never fill them. I think we came to the conclusion that we did the filling for performance reasons, but we don't really know exactly what performance hit we might expect... Are there situations where we need to do this on slice, as opposed to allowing the user to convert to a normal array if desired?
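Leaving the conversion to the user would look something like this sketch in plain NumPy (not an Iris API): slicing preserves the masked type, and the user fills explicitly only when a plain array is actually wanted.

```python
import numpy as np
import numpy.ma as ma

data = ma.masked_array([1.0, 2.0, 3.0], mask=[False, False, False])

# Slicing a masked array keeps the type -- no silent conversion.
subset = data[:2]
assert ma.isMaskedArray(subset)

# A user who actually wants a plain ndarray converts explicitly:
plain = ma.filled(subset, fill_value=np.nan)
assert type(plain) is np.ndarray
```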

@pelson
Member

pelson commented Oct 19, 2017

@djkirkham - could you take a look at where we stand on this now that dask-mask has been merged into master. Still a live issue? As far as I understand, we are now at the mercy of dask/numpy.ma, right?

@djkirkham
Contributor

Hmm... it looks like we're currently in a "worst of both worlds" situation. Calling .data on a cube constructed from a masked array, whether lazy or not, will return a masked array even if there are no masked points. But slicing that cube and calling .data returns an ndarray if there are no masked points. There are a few other places where a no-mask masked array is converted to an ndarray - most notably CubeList.merge().

@djkirkham
Contributor

According to @pp-mo the above has always been the case unless the data was lazy, so it's not such a big issue. Still, we need a resolution. The easiest thing to do would be to just do what we did before: unmask when slicing and when realising lazy data. In the absence of any performance tests indicating that there's little or no difference performing operations on masked arrays I think it's safest to try to ensure we're operating on unmasked data.
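The fill-on-no-mask check being discussed amounts to something like this sketch (illustrative only; `ensure_unmasked` is a hypothetical name, not the actual Iris code at the lines cited earlier):

```python
import numpy as np
import numpy.ma as ma

def ensure_unmasked(array):
    """If `array` is a masked array with nothing actually masked,
    return the underlying plain ndarray; otherwise leave it alone."""
    if ma.isMaskedArray(array) and not ma.is_masked(array):
        return array.data  # view of the underlying ndarray
    return array

no_mask = ma.masked_array([1.0, 2.0], mask=[False, False])
has_mask = ma.masked_array([1.0, 2.0], mask=[True, False])
assert type(ensure_unmasked(no_mask)) is np.ndarray
assert ma.isMaskedArray(ensure_unmasked(has_mask))
```

The trade-off djkirkham describes is that this check has to be applied consistently at every slicing/realising site, or the lazy and real code paths diverge again.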

@DPeterK
Member Author

DPeterK commented Oct 23, 2017

@djkirkham I still think that Iris should not be changing the type of an object...

@djkirkham
Contributor

@dkillick I don't see it as such a big issue; users shouldn't be relying on the type (they can't currently, anyway). But I still don't like having to do the check every time, especially since we're not consistent about applying it throughout the code base. If there really is a performance hit from using masked arrays, maybe it would be better to unmask them in the computationally heavy parts of the code. But that seems like a lot of extra work for little gain.

@djkirkham
Contributor

Replaced by #2856

@djkirkham closed this Oct 26, 2017