Description
I propose adding a MultiIndex._data attribute of type List[Categorical], where all the underlying data of a MultiIndex would be stored. A MultiIndex.array property would also be added that accesses _data.
This has the advantage of collecting the data underlying a MultiIndex into one human-readable data structure, and it also makes zero-copy access to the data very easy. For example, mi.array[1] would return the data of the second level as a Categorical, in an easy-to-read form.
With the above changes, a MultiIndex could be explained as just "a container over a list of Categoricals", which is easier to explain than the current model. The MultiIndex could also be related to CategoricalIndex, which is "a container over a single Categorical".
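To make the idea concrete, here is a rough sketch of what one Categorical per level looks like, built with today's public API. The helper name `as_categoricals` is hypothetical, and it rebuilds the Categoricals from `levels` and `codes` rather than reading a stored `_data`, so it only illustrates the shape of the result, not the proposed storage or its zero-copy behaviour.

```python
import pandas as pd


def as_categoricals(mi):
    """Hypothetical helper: build one Categorical per level of a MultiIndex.

    Uses only the existing public attributes `levels` and `codes`.
    """
    return [
        pd.Categorical.from_codes(codes, categories=level)
        for level, codes in zip(mi.levels, mi.codes)
    ]


mi = pd.MultiIndex.from_arrays(
    [["a", "a", "b"], [1, 2, 1]], names=["letter", "number"]
)

cats = as_categoricals(mi)
second = cats[1]  # roughly what the proposed mi.array[1] would hand back

print(second)             # the second level's values as a Categorical
print(second.categories)  # what MultiIndex.levels[1] would be derived from
print(second.codes)       # what MultiIndex.codes[1] would be derived from
```

Under the proposal the direction would be reversed: the Categoricals would be the stored data, and `levels` and `codes` would be derived from them.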
This change means that MultiIndex.levels will become a property that returns a FrozenList(cat.categories for cat in self._data), and MultiIndex.codes will be a property that returns FrozenList(cat.codes for cat in self._data).
MultiIndex.array will be added and will simply be a property that returns a FrozenList of self._data.
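A minimal pure-Python sketch of how these three properties would hang off `_data`. `ToyCategorical` and `ToyMultiIndex` are hypothetical stand-ins, not pandas classes, and plain tuples stand in for `FrozenList`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyCategorical:
    """Hypothetical stand-in for a Categorical: categories plus integer codes."""

    categories: Tuple
    codes: Tuple[int, ...]

    def values(self):
        # Decode the codes back into the values they represent.
        return [self.categories[code] for code in self.codes]


class ToyMultiIndex:
    """Hypothetical stand-in for the proposed MultiIndex layout."""

    def __init__(self, data):
        # The only stored state: one categorical per level.
        self._data = list(data)

    @property
    def levels(self):
        return tuple(cat.categories for cat in self._data)

    @property
    def codes(self):
        return tuple(cat.codes for cat in self._data)

    @property
    def array(self):
        return tuple(self._data)


mi = ToyMultiIndex(
    [
        ToyCategorical(("a", "b"), (0, 0, 1)),
        ToyCategorical((1, 2), (0, 1, 0)),
    ]
)

print(mi.levels)
print(mi.codes)
print(mi.array[1].values())
```

The point of the sketch is that `levels` and `codes` hold no state of their own; both are views computed from the single `_data` list.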
Performance will not be affected, as most operations would still go through MultiIndex.codes and MultiIndex.levels.
Moving names from MultiIndex.levels to MultiIndex._names
Currently, each level's name is stored in that level's name attribute. This is not very compatible with extracting the categories from _data: .categories is actually part of the dtype, which ideally should be immutable, so we shouldn't set or change its name attribute.
To make my suggestion practically possible, the level names should be stored in MultiIndex._names instead, and MultiIndex.names will become a property that reads from and writes to MultiIndex._names. I think this change simplifies the MultiIndex a bit, as data and names are dealt with separately. This is a small backward-incompatible change, though.
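A small sketch of the names separation, again with a hypothetical toy class rather than pandas code. The names live in `_names` on the index itself, and the `names` property is the only way to read or change them, so the level data is never touched.

```python
class ToyNamedIndex:
    """Hypothetical toy index that keeps level names apart from level data."""

    def __init__(self, levels, names=None):
        # Level data is stored as immutable tuples and never carries a name.
        self._levels = tuple(tuple(level) for level in levels)
        self._names = [None] * len(self._levels)
        if names is not None:
            self.names = names  # goes through the setter below

    @property
    def names(self):
        return tuple(self._names)

    @names.setter
    def names(self, values):
        values = list(values)
        if len(values) != len(self._levels):
            raise ValueError("expected one name per level")
        self._names = values


idx = ToyNamedIndex([("a", "b"), (1, 2)], names=["letter", "number"])
levels_before = idx._levels

idx.names = ["x", "y"]  # renaming only rewrites _names

print(idx.names)
print(idx._levels is levels_before)
```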
So, I suggest making two PRs:
- Separating the names from the levels (to be included in 0.25)
- Add _data and array, and change levels and codes into properties.