Closed
Labels
C API
Description
It seems that requiring metadata columns to be < 4G is an unwelcome limitation in forward simulation applications, so we should consider upgrading the metadata_offset columns (and other offset columns) to 64 bit integers. Here is what it would entail:
- Create a `tsk_offset_t` typedef and go through the C tables API making sure this is used for all offset columns (metadata_offset, ancestral_state_offset, etc). Typedef this to uint64_t.
- Increment the file-format minor version. In the file writing code, look at the last value in each offset array. If it's < UINT32_MAX, store it as a 32 bit value; if not, store it as a 64 bit value. In the reading code, check which type is being used for storage and update accordingly. Note that this means we won't be able to use the current zero-copy behaviour where we use the memory in the kastore to back the arrays. Figure out the best approach here (note, using the memory in the kastore was motivated by using mmap for io, which we've dropped now as it's inherently dangerous and hard to do properly cross-platform).
- Clean up _tskitmodule.c to use the correct new sizes for numpy arrays (possibly also needing to backport this over to msprime where we use the LightweightTableCollection for interchange).
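To make the read/write step concrete, here is a minimal sketch of the width selection described above. Only `tsk_offset_t` comes from this proposal; the helper names `offsets_fit_in_32_bits`, `narrow_offsets` and `widen_offsets` are hypothetical and are not existing tskit or kastore functions. It assumes offset arrays are non-decreasing, so the last entry is the largest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Proposed typedef: every offset column uses this in memory. */
typedef uint64_t tsk_offset_t;

/* Hypothetical helper: decide the on-disk width from the last offset,
 * using the "< UINT32_MAX" rule from the proposal. */
bool
offsets_fit_in_32_bits(const tsk_offset_t *offsets, size_t len)
{
    return len == 0 || offsets[len - 1] < UINT32_MAX;
}

/* Hypothetical write path: copy into a 32 bit buffer for storage.
 * Returns NULL on allocation failure; the caller frees the result. */
uint32_t *
narrow_offsets(const tsk_offset_t *offsets, size_t len)
{
    size_t j;
    uint32_t *out = malloc((len > 0 ? len : 1) * sizeof(*out));

    if (out == NULL) {
        return NULL;
    }
    for (j = 0; j < len; j++) {
        /* The caller must have checked offsets_fit_in_32_bits first. */
        assert(offsets[j] < UINT32_MAX);
        out[j] = (uint32_t) offsets[j];
    }
    return out;
}

/* Hypothetical read path: widen stored 32 bit offsets to tsk_offset_t.
 * This copy is why the stored memory can no longer back the column
 * directly when the file uses the 32 bit representation. */
tsk_offset_t *
widen_offsets(const uint32_t *stored, size_t len)
{
    size_t j;
    tsk_offset_t *out = malloc((len > 0 ? len : 1) * sizeof(*out));

    if (out == NULL) {
        return NULL;
    }
    for (j = 0; j < len; j++) {
        out[j] = (tsk_offset_t) stored[j];
    }
    return out;
}
```

In this sketch a file written with 64 bit offsets could still be used without a copy, so the extra allocation would only be paid for files that use the 32 bit representation.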
From a user perspective, this will mean that old versions of tskit won't be able to read newer files. New versions of tskit will continue to read older files without issues, as we're making the code more flexible in terms of expected types.
pinging @petrelharp and @molpopgen for opinions.