enh(data): add page on data to the tests section #580
base: main
* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)
Would Hugging Face fit here, or is this list more for academic repositories? Perhaps https://docs.source.coop as well?
* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools

<!-- I don't understand how these platforms differ from things like Figshare. Can we clarify that, and explain how and why someone would pick these over Figshare or Dryad? -->
This section is moving into data versioning which is important for scientific studies and reproducibility, but may be out of scope for a package. Having a small subset or testable data for a package makes a lot of sense.
- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control
```

<!-- I am not sure how this section relates to data stored in a package. I understand it's important, but does it belong on a page focused on how and where to store data for your Python package? It might be that I just don't understand it as written! -->
+1 to removing

### Use Pytest fixtures for data access

Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.
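
A minimal sketch of the fixture pattern described above, not taken from the PR itself. The dataset, the file name `sample.csv`, and the helpers `write_sample_csv` and `load_temperatures` are hypothetical and exist only for illustration; `tmp_path` is pytest's built-in temporary-directory fixture.

```python
import csv
from pathlib import Path

import pytest


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny, hypothetical dataset and return its path."""
    path = directory / "sample.csv"
    rows = [("site", "temperature_c"), ("A", "12.5"), ("B", "14.0")]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def load_temperatures(path: Path) -> dict[str, float]:
    """Hypothetical package function under test: map site name to temperature."""
    with path.open(newline="") as f:
        return {row["site"]: float(row["temperature_c"]) for row in csv.DictReader(f)}


@pytest.fixture
def sample_csv(tmp_path: Path) -> Path:
    """Give each test a fresh copy of the sample data in a temporary directory."""
    return write_sample_csv(tmp_path)


def test_load_temperatures(sample_csv: Path) -> None:
    # pytest injects the fixture's return value by matching the argument name.
    assert load_temperatures(sample_csv) == {"A": 12.5, "B": 14.0}
```

Keeping the file-writing logic in a plain helper means the same data can be reused outside pytest, while the fixture stays a thin wrapper. Placing the fixture in a `conftest.py` file would make it available to every test module in that directory.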
Maybe this is just personal preference, but I think we should move pytest up and pooch down.
Pytest fits better with the whole 'testing your data thing' along with testing your code. Pooch seems like an additional tool or service to download data from an external source, I've not heard of it before.
<!-- I am not sure what this statement means, or whether we have tools that take these standards into account. We might also want to link to FAIR. -->
:::

```{admonition} Field specific standards + metadata
A link to FAIR would be good. I haven't heard of the ones mentioned 😓
This PR is a rework of #110. Let's plan to run a sprint on it in the next few weeks to see if we can get it to a good place.
NOTE: I added some code examples that I found online and DID NOT TEST, so we will definitely want to test what is there before merging.
ALSO: because this topic is not my expertise, in some cases I ran with a section after a Google search and tried to flesh it out, so it could be off. Any feedback is welcome!