@lwasser (Member) commented Sep 19, 2025

This PR is a rework of #110. Let's plan to run a sprint on this PR in the next few weeks to see if we can get it into good shape.

NOTE: I added some code examples that I found online and DID NOT TEST, so we will definitely want to test them before merging.

ALSO: because this topic is outside my expertise, in some cases I drafted a section after a Google search and tried to flesh it out, so it could be off. Any feedback on this is welcome!

@lwasser added the `help wanted` and `ready-for-review` labels on Sep 19, 2025
* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)

Would Hugging Face fit here, or is this list meant more for academic repositories? Perhaps Source Cooperative (https://docs.source.coop) as well?

* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools

<!-- I don't understand how these platforms are different from things like figshare - can we clarify that? and how / why someone would pick these vs figshare / dryad? -->

This section is moving into data versioning, which is important for scientific studies and reproducibility but may be out of scope for a package. Including a small subset of the data, or other testable data, with a package makes a lot of sense.

- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control
```

<!-- I am not sure how this section relates to data stored in a package. I understand it's important, but does it belong in a page focused on how and where to store data for your Python package? It might be that I just don't understand as written!-->

+1 to removing


### Use Pytest fixtures for data access

Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.
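
As a rough illustration of the fixture approach described above, here is a minimal sketch. It assumes a hypothetical package function, `mean_temperature`, that reads a small CSV file; the file name, column names, and helper `write_sample_csv` are invented for this example and are not part of the PR. The fixture relies on pytest's built-in `tmp_path_factory` so the sample file is created once per test session and shared across tests.

```python
import csv
from pathlib import Path

import pytest


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny, made-up dataset to ``directory`` and return its path."""
    path = Path(directory) / "sample_temperatures.csv"
    rows = [("site", "temp_c"), ("A", "12.5"), ("B", "14.0"), ("C", "9.5")]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def mean_temperature(csv_path: Path) -> float:
    """Hypothetical stand-in for a package function that reads a data file."""
    with Path(csv_path).open(newline="") as f:
        temps = [float(row["temp_c"]) for row in csv.DictReader(f)]
    return sum(temps) / len(temps)


@pytest.fixture(scope="session")
def sample_csv(tmp_path_factory) -> Path:
    """Create the sample file once per session and share it across tests."""
    return write_sample_csv(tmp_path_factory.mktemp("data"))


def test_mean_temperature(sample_csv):
    # Any test that names ``sample_csv`` as an argument receives the same path.
    assert mean_temperature(sample_csv) == pytest.approx(12.0)
```

In a real package the fixture would more likely live in `conftest.py` and point at a small data file shipped with the tests. Generating the file inside the fixture, as done here, just keeps the sketch self-contained.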

Maybe this is just personal preference, but I think we should move pytest up and Pooch down.
Pytest fits better with the whole "testing your data" theme, alongside testing your code. Pooch seems like an additional tool or service for downloading data from an external source, and I hadn't heard of it before.

<!-- I am not sure about this statement in terms of what it means and whether we have tools that consider standards or not we might also want to link to FAIR-->
:::

```{admonition} Field specific standards + metadata

A link to FAIR would be good. I haven't heard of the ones mentioned 😓
