@lunebellec
Member

I tried to build my first reproducible analysis with datalad repos for input/output, using data hosted on elm via ssh. And I struggled :) so I put together this tutorial with the help of ChatGPT. I think I got it to work, but it's very possible I made mistakes. I also really struggled with datalad's docs. So hopefully more knowledgeable people can review and confirm this is in order, and the tutorial can save time for others (and myself) in the future.
```bash
datalad create-sibling \
--name elm \
--site datalad \
```
Member

Not sure what the `--site` option is; it is not in the docs. Maybe a GPT hallucination.

Member Author

Indeed, I used ChatGPT to create a summary of the steps, and it got this one wrong. Apologies, I should have double-checked. I believe this is what I actually used, from my bash history: `datalad create-sibling -s elm ssh://elm/data/simexp/pbellec/image10k-zooniverse --existing=skip`
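
For reference, here is a minimal sketch of that corrected step as a script. The `create-sibling` line is the one quoted from the bash history above; the follow-up `datalad push --to elm` and the `run` dry-run helper are assumptions added for illustration, not something confirmed in this thread.

```bash
#!/usr/bin/env bash
set -eu

# Values taken from the thread; adjust to your own host and path.
SIBLING="elm"
SSH_URL="ssh://elm/data/simexp/pbellec/image10k-zooniverse"

# Hypothetical helper: prints each command unless DRY_RUN=0 is set,
# so the script can be read and checked without touching the server.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the sibling on elm; --existing=skip leaves an existing target alone.
run datalad create-sibling -s "$SIBLING" "$SSH_URL" --existing=skip

# Assumed next step: publish the git history and annexed content to the sibling.
run datalad push --to "$SIBLING"
```

By default this only prints the two commands; running it with `DRY_RUN=0` from inside the dataset would execute them.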

--name elm \
--site datalad \
--sshurl ssh://elm/data/simexp/pbellec/image10k-zooniverse \
--shared all
Member

Do we want datasets to be writable by the group and readable by all?

Member Author

Good point. I was thinking of letting people deal with permissions in their own space. Then, if we want to publish the dataset, we could either upload a version of it on zooniverse (for open data) or create a new sibling on S3. Are you suggesting we have a single folder for the lab hosting all datalad datasets?

Comment on lines 34 to 36
datalad create-sibling-github courtois-neuromod image10k-zooniverse \
--github-organization courtois-neuromod \
--access-protocol ssh
Member

Suggested change
datalad create-sibling-github courtois-neuromod image10k-zooniverse \
--github-organization courtois-neuromod \
--access-protocol ssh
datalad create-sibling-github courtois-neuromod/image10k-zooniverse \
--access-protocol ssh

fix deprecation.

This requires a personal access token with adequate permissions to create repos for that org/user.

Member Author

indeed. I'm going to describe the method where the repo is created manually on github then added as a sibling.
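
A rough sketch of what that manual route could look like, assuming the empty repository `courtois-neuromod/image10k-zooniverse` has already been created through the GitHub web interface. The sibling name `github`, the SSH URL form, and the `run` dry-run helper are assumptions for illustration.

```bash
#!/usr/bin/env bash
set -eu

# Assumed SSH URL of the manually created (empty) GitHub repository.
GH_URL="git@github.com:courtois-neuromod/image10k-zooniverse.git"
GH_SIBLING="github"   # hypothetical sibling name

# Hypothetical helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Register the existing GitHub repository as a sibling of the local dataset.
run datalad siblings add --name "$GH_SIBLING" --url "$GH_URL"

# Push the git history; GitHub stores no annexed file content.
run datalad push --to "$GH_SIBLING"
```

This avoids the need for a personal access token, since no repository is created through the API.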

Member Author

Also, I tried the syntax you suggested for the org name, but with the datalad version I get through pip on my machine it does not seem to work, and I had to use the soon-obsolete flag `--github-organization`.


### ⚠️ Tips & Troubleshooting

* If `datalad get` fails with `annex-ignore`, you likely cloned from GitHub only. Clone once from `elm` to propagate sibling config.
Member

Using `--as-common-datasrc NAME` (see above) would fix that. Or set the created sibling as autoenabled afterward: `git-annex configremote elm autoenable=true`.

Member Author

So this is a point I'm still struggling with! I could not get it to work such that installing from GitHub would download from elm. So if I add `--as-common-datasrc` when I create the elm sibling, it should fix it? Or does that configuration stay local?

Member Author

OK, I experimented a bit and could not get it to work. I tried removing the elm sibling, then adding it back with: `datalad siblings add --name elm --url ssh://elm/data/simexp/pbellec/image10k-zooniverse --as-common-datasrc origin`. I got this error:
add-sibling(impossible): . (sibling) [cannot configure as a common data source, URL protocol is not http or https] .: elm(+) [ssh://elm/data/simexp/pbellec/image10k-zooniverse (git)]
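
Since the error indicates that `--as-common-datasrc` only accepts http(s) URLs, one possible workaround is to configure the elm sibling by hand in each fresh clone. The sketch below is untested against this setup and rests on assumptions: the HTTPS clone URL form, the local directory name, and the `run` dry-run helper are illustrative only.

```bash
#!/usr/bin/env bash
set -eu

# Assumed HTTPS clone URL for the GitHub sibling.
GH_URL="https://github.com/courtois-neuromod/image10k-zooniverse.git"
# SSH URL of the elm sibling, as quoted in the thread.
ELM_URL="ssh://elm/data/simexp/pbellec/image10k-zooniverse"
DS="image10k-zooniverse"   # hypothetical local directory name

# Hypothetical helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Clone the dataset (git history only) from GitHub.
run datalad clone "$GH_URL" "$DS"

# 2. Add elm as a sibling of the fresh clone and fetch its branches,
#    so the clone learns where the annexed content lives.
run datalad siblings -d "$DS" add --name elm --url "$ELM_URL" --fetch

# 3. Retrieve file content, naming elm explicitly as the source.
run datalad get -d "$DS" -s elm "$DS"
```

The sibling configuration added in step 2 stays local to that clone, so every new clone from GitHub would need to repeat it.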
