enh(data): add page on data to the tests section #580

lwasser · 2025-09-19T01:17:45Z

This PR is a rework of #110. Let's plan to run a sprint on it in the next few weeks to see if we can get it to a good place.

NOTE: I added some code examples that I found online and DID NOT TEST, so we will definitely want to test what is there before merging.

ALSO: because this topic is outside my expertise, in some cases I drafted a section after a Google search and tried to flesh it out, so parts of it could be off. Any feedback is welcome on this!!

yeelauren · 2025-09-25T01:20:52Z

tests/package-data.md

+* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
+* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
+* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)
+


Would Hugging Face fit here, or is this list more for academic repositories? Perhaps https://docs.source.coop?

yeelauren · 2025-09-25T01:24:02Z

tests/package-data.md

+* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
+* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools
+
+<!-- I don't understand how these platforms are different from things like figshare - can we clarify that? and how / why someone would pick these vs figshare / dryad? -->


This section is moving into data versioning which is important for scientific studies and reproducibility, but may be out of scope for a package. Having a small subset or testable data for a package makes a lot of sense.
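
To make the "small subset" idea concrete, here is a rough sketch of trimming a larger CSV down to a few rows that could ship with a package as test data. The function name `take_subset` and the sample columns are hypothetical and not taken from the PR; it only assumes the source is CSV with a header row.

```python
import csv
import io
from itertools import islice


def take_subset(source, n_rows):
    """Return the header plus the first `n_rows` data rows of a CSV.

    `source` is any open text file-like object. The name and signature
    are illustrative assumptions, not part of the guide.
    """
    reader = csv.reader(source)
    header = next(reader)  # keep the column names
    rows = list(islice(reader, n_rows))  # stop early; never read the whole file
    return [header] + rows


# Example: a stand-in for a large file, reduced to two data rows.
full = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
small = take_subset(full, 2)
```

The resulting rows could then be written to a small file under the package's test directory, which keeps the repository light while still exercising the real file format.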

yeelauren · 2025-09-25T01:24:25Z

tests/package-data.md

+- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control
+```
+
+<!-- I am not sure how this section relates to data stored in a package. I understand it's important, but does it belong in a page focused on how and where to store data for your Python package? It might be that I just don't understand as written!-->


+1 to removing

yeelauren · 2025-09-25T01:27:52Z

tests/package-data.md

+
+### Use Pytest fixtures for data access
+
+Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.


Maybe this is just personal preference, but I think we should move pytest up and pooch down.
Pytest fits better with the theme of testing your data alongside your code. Pooch seems like an additional tool or service for downloading data from an external source, and I hadn't heard of it before.
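
For reference, a minimal sketch of the fixture pattern that section describes, assuming the test data is a tiny CSV. The helper names (`write_sample_csv`, `load_sample_csv`), the fixture name `sample_data`, and the columns are hypothetical; only `pytest.fixture` and the built-in `tmp_path` fixture are real pytest features.

```python
import csv
from pathlib import Path

import pytest


def write_sample_csv(path: Path) -> Path:
    """Write a tiny, made-up dataset to `path` and return the path."""
    rows = [("site", "temp_c"), ("A", "12.5"), ("B", "14.0")]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def load_sample_csv(path: Path) -> list[dict]:
    """Read the CSV back as a list of row dictionaries."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


@pytest.fixture
def sample_data(tmp_path):
    """Give each test a fresh copy of the small dataset."""
    return load_sample_csv(write_sample_csv(tmp_path / "sample.csv"))


def test_sample_has_two_rows(sample_data):
    # pytest injects the fixture by matching the argument name.
    assert len(sample_data) == 2
    assert sample_data[0]["site"] == "A"
```

Keeping the loading logic in plain functions and wrapping it in a fixture means the same helpers can be reused outside of pytest, and every test gets an isolated temporary copy of the data.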

yeelauren · 2025-09-25T01:31:08Z

tests/package-data.md

+<!-- I am not sure about this statement in terms of what it means and whether we have tools that consider standards or not  we might also want to link to FAIR-->
+:::
+
+```{admonition} Field specific standards + metadata


A link to FAIR would be good. I haven't heard of the ones mentioned 😓

NickleDave and others added 3 commits October 27, 2023 09:17

WIP: Add data/intro.md

277c576

Merge branch 'main' into add-data-section-to-guide

22c275f

enh: add data page to tests section

ec668a4

lwasser added the "help wanted" and "ready-for-review" labels Sep 19, 2025

github-project-automation bot added this to pyOpenSci Help Wanted Project Board Sep 19, 2025

yeelauren reviewed Sep 25, 2025
