In this tutorial we introduce
git-annex and its use within a
research laboratory to help share data and code among lab members,
external collaborators and anonymous users. git-annex is a software
tool that extends the more famous software
git in convenient ways when dealing with
large files and large repositories. In the following, we introduce
some basic concepts and then describe the scenario and the workflow
that we implemented in our Lab which, we believe, can be useful to
other people in a similar setting. git and git-annex require some
dedication before reaching fruitful use. During that process, it is
common to make mistakes, as it was for us. For this reason, in this
tutorial, we also describe common errors and how to recover from
them. Note that basic familiarity with git is assumed as a
prerequisite for this tutorial.
The tutorial is structured as follows. First we describe the scenario
in which git-annex is used. Then, we provide some preliminary
information about what git-annex is and some additional technical
details. After that, we describe how to set up a centralized
repository that will host a copy of all data. This is the main part of
the tutorial, in which we describe how to make the repository easily
accessible from a web server and from Github. The last part of the
tutorial describes the use of the repository from the point of view of
standard users that need just to access the data and get updates, as
well as of content creators, i.e. those having the rights to add new
content to the repository from remote.
In our lab, we have large datasets - terabytes of data - which comprise many large files as well as the code that generated part of the data. Such datasets are kept on a storage server. Small portions of the data are frequently needed on the local desktop and laptop computers of lab members and collaborators, for processing and analysis. Moreover, new data are frequently generated by lab members, on local computers, by further processing of available data. One main aim is to share such new data and the code with others. Additionally, the data already shared are not static: from time to time, code is updated or bugs are fixed, so some of the preprocessed data are re-generated and shared, replacing the previous version. In such a setting, it is important for lab members and collaborators to get updates of data and code in a simple way.
Simply put, git-annex is an
extension of git that provides some extra functionalities:
- Large files in the repository are not locally copied, when cloning or fetching/pulling. Of course, they can be retrieved on request. Additionally, local copies of large files can be removed to free some space.
- git-annex keeps track of how many copies of each file exist and where they are.
- TODO
Here, we do not describe git beyond noting that it is, to our
knowledge, the most used version control system. There is an enormous
amount of information already available about git. git helps keep
track of updates to files and supports collaborative work among
multiple users. Unfortunately, git does not provide native support for
handling large files in a convenient way. That is what git-annex adds
to it.
In this tutorial we refer to git-annex version 6.20171211, on
GNU/Linux machines using Ubuntu 16.04. As of this version of
git-annex, the default format of the repository is v5. In
the future, we plan to upgrade to
v6, following the default
settings of git-annex. At that time, we plan to update the parts of
this tutorial that are affected by this change.
When issuing git-annex from the command line, two alternative ways
can be used, either git-annex or git annex. To our knowledge,
there is no difference between them.
Until v5 of the repository format, git-annex uses certain filesystem
features that may not be available on all filesystems, like symbolic
links and FIFOs. For example, the FAT filesystem does not provide
them. When initializing the repository with git annex init (see
below for further details), a clear warning will appear on the screen,
in case you are using such crippled filesystems. Nevertheless,
git-annex has ways to (partially) address such problems. In this
tutorial, we do not discuss such issues and we assume that a
non-crippled filesystem is available, like the EXT4 filesystem,
default on GNU/Linux systems.
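Before initializing a repository, it can be useful to check by hand whether a directory supports the two features mentioned above. The following is a rough sketch and not the test that git-annex itself performs: it simply tries to create a symbolic link and a FIFO in a directory. Here the directory is a temporary one created with mktemp, which is an assumption made for illustration; in practice it would be the directory meant to host the repository.

```shell
# Rough probe (NOT git-annex's own check): can this directory hold
# symbolic links and FIFOs?
probe_dir=$(mktemp -d)   # assumption: replace with the directory to be tested

if ln -s target "$probe_dir/link" 2>/dev/null; then
    symlinks=yes
else
    symlinks=no
fi

if mkfifo "$probe_dir/fifo" 2>/dev/null; then
    fifos=yes
else
    fifos=no
fi

echo "symbolic links: $symlinks, FIFOs: $fifos"

# Clean up the temporary directory.
rm -rf "$probe_dir"
```

If either answer is "no", the warning about crippled filesystems is to be expected when running git annex init there.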
TODO
In the following example, we create a directory /labdata on a
storage-server, where we store a copy of all the data with git and
git-annex, so that they can be shared with lab members and external
collaborators. The repository hosts both the git database, in
/labdata/.git/, and a copy of the actual files and directories, the
working tree, in /labdata/, for easy browsing.
Permission to add or modify the data in the repository is enforced
through filesystem permissions by creating a group of users, named
dataowners. Everyone else can (only) read the data in the
repository.
Here we describe the step-by-step procedure to create the repository from scratch, with example commands followed by their detailed explanation:
cd /
mkdir labdata
addgroup dataowners
adduser contributor dataowners
chgrp dataowners labdata
chmod g+rwx labdata
chmod o+rx-w labdata
chmod g+s labdata
cd labdata
This first group of commands creates the directory /labdata to host
the repository, creates a new system group dataowners, adds the user
contributor to that group (others may be added in the same way) and
assigns the group to /labdata. Additionally, read (r), write (w) and
access (x) permissions are granted to the group (g+rwx) and read and
access (but not write) permissions are granted to everyone else
(o+rx-w). Finally, the setgid
permission
is enabled for the group (g+s), so that all future files and
directories created inside /labdata will automatically inherit the
group dataowners and, for directories, the setgid bit.
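The effect of these permissions can be tried out without root privileges. The sketch below applies the same chmod settings to a temporary directory (a hypothetical stand-in for /labdata, using the current user's own group rather than dataowners) and then inspects a newly created subdirectory. It assumes GNU coreutils on GNU/Linux, as in the rest of this tutorial.

```shell
set -eu

# Hypothetical stand-in for /labdata, owned by the current user.
demo=$(mktemp -d)

# Same permission settings as above, applied in one call.
chmod g+rwx,o+rx-w,g+s "$demo"

# Create new content inside the directory.
mkdir "$demo/subdir"
touch "$demo/subdir/file"

# Symbolic permission strings; an "s" in the group triplet is the setgid bit.
top_mode=$(stat -c %A "$demo")
sub_mode=$(stat -c %A "$demo/subdir")
file_group=$(stat -c %G "$demo/subdir/file")
top_group=$(stat -c %G "$demo")

echo "top directory: $top_mode (group $top_group)"
echo "subdirectory:  $sub_mode"
echo "new file group: $file_group"
```

The subdirectory should show the setgid bit without any further chmod, and the new file should carry the group of the top directory.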
git init --shared=group
git annex init storage-server
This second group of commands creates the git repository and the
additional git-annex part of it. Note that the git-annex part
of the repository can only be initialized within an existing git
repository. In order to let the repository be group-writable and
accessible to everyone, the initialization of the git repository
requires --shared=group. This
will properly set permissions within /labdata/.git/. The
initialization of git-annex creates a /labdata/.git/annex/
directory, called the annex, where git-annex stores all its
information. To conclude, we added the optional storage-server
description when initializing the git-annex part of the
repository. This is convenient to set a desired human-readable label
to the repository.
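To see what --shared=group records, the following sketch initializes a throwaway git repository in a temporary directory (an assumption for illustration, instead of /labdata) and reads back the corresponding configuration value. It only needs git, not git-annex.

```shell
set -eu

# Throwaway repository; /labdata would be used in the real setup.
repo=$(mktemp -d)
git init -q --shared=group "$repo"

# --shared=group is stored in the repository configuration.
shared=$(git -C "$repo" config core.sharedRepository)
echo "core.sharedRepository = $shared"

# Show the resulting permissions of the git database.
stat -c '%A %n' "$repo/.git" "$repo/.git/objects"
```

The same check can be run on an existing repository to find out whether it was initialized as group-shared.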
At this point, content/changes can be added to the repository in two main ways:
- Directly on the storage-server, by copying files and directories in /labdata and then:
  - either via git annex add <file> and git commit -m <message>. In this case, the file is added to the annex, i.e. moved to /labdata/.git/annex/objects/, set read-only, renamed according to its checksum and a symbolic link pointing to it is created in the original location of the file. Only the symbolic link is added to the git repository, while git-annex keeps track of the content. From the user perspective, the initial file is still accessible, through the link, in read-only mode. Notice that, when cloning this repository, only the symbolic link of this file will be present and not its content, unless explicitly requested.
  - or via git add <file> and git commit -m <message>. In this case the file is added to the git repository and not to the annex. Notice that, when cloning this repository, a copy of this file will be present, as always with git.
- From remote repositories, through git push or git annex sync. In this second case, the repository must be configured properly, as explained below.
Using git annex add <file> instead of git add <file> can be
decided for each file, individually, and depends on the purpose of the
file and of the repository. Typically, code should be added via git add <file> and data via git annex add <file>. Nevertheless, it is
possible to use git annex add <file> for everything. If, at a later
stage, a file needs to be moved from the git repository to the
annex, or vice versa, this can still be done afterwards.
Here follows an example transcript of what happens when executing git annex add <file> on a file foo present in the repository:
> ls -al
total 16
drwxrwsr-x  3 ele  dataowners 4096 dic 26 16:19 .
drwxr-xr-x 26 root root       4096 dic 26 16:13 ..
-rw-rw-r--  1 ele  dataowners    4 dic 26 16:19 foo
drwxrwsr-x  9 ele  dataowners 4096 dic 26 16:18 .git
> git annex add foo
> ls -al
total 16
drwxrwsr-x  3 ele  dataowners 4096 dic 26 16:21 .
drwxr-xr-x 26 root root       4096 dic 26 16:13 ..
lrwxrwxrwx  1 ele  dataowners  178 dic 26 16:19 foo -> .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
drwxrwsr-x  9 ele  dataowners 4096 dic 26 16:21 .git
> git commit -m "added foo"
[master (root-commit) 3e461c6] added foo
 1 file changed, 1 insertion(+)
 create mode 120000 foo
> ls -al .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
-rw-rw-rw- 1 ele dataowners 4 dic 26 16:19 .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
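The mechanism visible in the transcript can be imitated by hand to make it less mysterious. The sketch below is a deliberate simplification and not what git-annex actually runs: it builds a key from the size and SHA256 checksum of a 4-byte file, moves the content into a hypothetical objects/ directory (the real annex lives under .git/annex/objects/ with extra hashed subdirectories), write-protects it and leaves a symbolic link in its place.

```shell
set -eu

demo=$(mktemp -d)
cd "$demo"

# A 4-byte file, the same size as foo in the transcript.
printf 'bar\n' > foo

# Key in the spirit of the transcript: backend name, size, checksum.
size=$(stat -c %s foo)
sum=$(sha256sum foo | cut -d' ' -f1)
key="SHA256E-s${size}--${sum}"

# Move the content into the (hypothetical, simplified) object store,
# write-protect it and replace the original file with a symbolic link.
mkdir -p "objects/$key"
mv foo "objects/$key/$key"
chmod a-w "objects/$key/$key"
ln -s "objects/$key/$key" foo

# The file is still readable through the link.
ls -l foo
cat foo
```

Because the name of the stored object depends only on the content, two files with identical content end up pointing to the same object.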
TODO: explain the branches below
> git branch -a
  git-annex
* master
  synced/master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master
  remotes/origin/synced/git-annex
  remotes/origin/synced/master
Content and changes can be created on remote clones of the repository, i.e. local computers of lab members and collaborators. Such content and changes need to be pushed to the storage-server in order to be shared. For this reason, the storage-server needs to be properly configured to allow that, in two steps. The first is:
git config receive.denyCurrentBranch updateInstead
With this command, we allow remote users to push to the
repository. Normally, this is not permitted, because the repository is
non-bare, i.e. it has a working tree of files and directories,
besides the .git/ database. If you do not plan to push changes from
remote, then you do not need this configuration. Notice that, if you
push changes to the repository after enabling the previous
configuration, the working tree of the repository will not be
updated. See below how to enable the automatic update
of the working tree. The second step is:
cd /labdata
git annex wanted . standard
git annex group . backup
The storage-server is meant to keep copies of all files in the
repository. When content is created remotely, it is very important to
tell the storage-server to enforce this requirement when operations
like git annex sync are performed (see TODO). To enforce this and
similar behaviors, git annex provides rich preferred content
expressions, but it also offers standard groups of preferences. The
commands above tell the repository to use the standard group of
preferences called backup, which means "All content is wanted. Even
content of old/deleted files".
If you want to check whether standard groups are enabled in the
repository, you just need to use the commands above, without
specifying standard and backup. The following transcript shows an
example:
> git annex wanted .
standard
> git annex group .
backup
Notice that you can set multiple standard groups, whose effect is left as an exercise to the reader. Continuing the previous example:
> git annex group . client
group . ok
(recording state in git...)
> git annex group .
client backup
If you added a standard group by mistake and want to remove it, you
need to use git annex ungroup, as here:
> git annex ungroup . client
> git annex group .
backup
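Going back to the first configuration step, the role of receive.denyCurrentBranch can be observed with plain git. The sketch below uses two throwaway repositories in a temporary directory, named server and client here only for illustration: a push to the checked-out branch of the non-bare repository is attempted first with the default settings and then again after enabling updateInstead. The identity set through environment variables is a placeholder.

```shell
set -eu

# Placeholder identity, so that commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

work=$(mktemp -d)

# Non-bare "server" repository with one commit, and a clone of it.
git init -q "$work/server"
git -C "$work/server" commit -q --allow-empty -m "initial commit"
git clone -q "$work/server" "$work/client"
git -C "$work/client" commit -q --allow-empty -m "change made on the client"

# With the default configuration, pushing to the checked-out branch
# of a non-bare repository is refused.
if git -C "$work/client" push -q origin HEAD 2>/dev/null; then
    first_push=accepted
else
    first_push=refused
fi
echo "push with default settings: $first_push"

# After the configuration step described above, the same push goes through.
git -C "$work/server" config receive.denyCurrentBranch updateInstead
git -C "$work/client" push -q origin HEAD
echo "push with updateInstead: accepted"
```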
This part describes how to access the repository when the storage-server directory holding the data is exposed via a web server.
Basically, git update-server-info should be executed whenever the
repository is remotely updated, e.g. via push, or after a local
commit. In order to do that automatically, git hooks must be enabled:
cd /labdata
mv .git/hooks/post-update.sample .git/hooks/post-update
cp -a .git/hooks/post-update .git/hooks/post-commit
Warning: hooks need to be executable.
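The hook setup can be rehearsed on a throwaway repository before touching /labdata. The sketch below assumes that the sample hook shipped with git is present and, if it is not, writes a minimal hook that runs git update-server-info, which is what the sample does. It then makes a commit and checks that the file .git/info/refs, used by clients cloning over plain HTTP, has been produced.

```shell
set -eu

# Placeholder identity, so that the commit works in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)            # throwaway stand-in for /labdata
git init -q "$repo"
hooks="$repo/.git/hooks"
mkdir -p "$hooks"

# Enable the post-update hook; fall back to a minimal one if the
# sample is not installed on this system.
if [ -f "$hooks/post-update.sample" ]; then
    mv "$hooks/post-update.sample" "$hooks/post-update"
else
    printf '#!/bin/sh\nexec git update-server-info\n' > "$hooks/post-update"
fi

# Hooks are ignored unless they are executable.
chmod +x "$hooks/post-update"
cp -a "$hooks/post-update" "$hooks/post-commit"

# A local commit triggers post-commit, which refreshes the server info.
git -C "$repo" commit -q --allow-empty -m "trigger the post-commit hook"
ls -l "$repo/.git/info/refs"
```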
Add new special remote via http.
git remote add httpdata HTTPURL/.git
git annex initremote datasrc type=git location=HTTPURL/.git autoenable=true
git annex merge  # necessary?
git remote rm httpdata  # not needed anymore
Note: after pulling/syncing in remote clones, git annex init should
be re-run, according to the man page. Maybe it is necessary to run
git annex enableremote datasrc on the user computer. TODO.
TODO
The idea is to keep a copy of the repository on github, without the
contents of the annex, so that it is more visible and can be easily
cloned by anonymous users. Moreover, it can be set up so that content
can be retrieved via git annex get <file> leveraging the access to
the storage-server and/or the public access for the web.
....create repository on github....
git remote add github <github-URL-to-repository>
git push -u github master
git push -u github git-annex
After populating and using the repository, it is common to realize
that it may not be smart to have all files stored with git-annex
and that it would be better to have some of them simply stored in
git. The following commands migrate files from git-annex to git:
git annex unannex <file>
git add <file>
git commit -m <message>
Notice that git annex unannex <file> does not need a commit.
Viceversa: TODO.
TODO
In this section, we describe the use of git-annex from the point of
view of users, when the centralized repository is already
available. We make a distinction between users that just access the
repository to obtain the data and, from time to time, the updates,
from users that contribute to the repository, by creating new content
or code to be sent to the central repository.
As a user, the first step is to clone the repository hosted on the storage server. Notice that the repository may be reached in several ways: via SSH, if you have an account on the storage server, via HTTP, if the repository has been published with a web server, or via Github, if this option has been set up. In this last case, the content of the files in the repository is not available from Github itself, and at least one of the other means must be available to reach the content.
git clone user@storage-server:/labdata
The directory labdata/ is then created, with all the tree of
directories and symbolic links to the (missing) content of the files,
if they had been added with git annex add <file>, or the actual
files, if added with git add <file>. Additionally, as in every git
repository, the labdata/.git directory is present, hosting all the
git history and internal files. Notice that the directory
labdata/.git/annex, created by git annex, is not present yet. Still,
the information that git-annex needs to retrieve the content of
the files in the annex is already available because it is stored in
the git-annex branch. The list of all available branches shows it:
> git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master
  remotes/origin/synced/git-annex
  remotes/origin/synced/master
For this reason, the content of the files currently appearing just as broken links can be easily retrieved with:
git annex get <file>
where <file> is a filename, a directory, or an expression with
wildcards that address the content we require.
From time to time, the user can retrieve updates of the repository by executing:
git pull
The user can also ask git-annex information on where to find the
content of a given file:
git annex whereis <file>
If a user is also a contributor to the repository, then they can
create new content and push it to the repository on the storage
server. In order to do that, some additional steps should be done on
the local clone of the repository. For clarity, the following
instructions start from cloning the repository as contributor:
git clone contributor@storage-server:/labdata
cd labdata/
git annex init contributor1-desktop
Here, git annex init <label> is not mandatory but it is good
practice for a collaborator to add a human-readable label to describe
the local repository, because it will show up in the information
stored by git-annex and shared with others.
A second important step is to inform git-annex that the local
repository should only get the content explicitly requested by the
collaborator. This is important when, later, the contributor will send
new content to the main repository on the storage server, with git annex sync. git annex provides a rich and flexible set of
expressions to set the preferences of content automatically retrieved
during certain operations. See [allow-remote-content] for a more
detailed explanation. Here, the main step is to set the preferences of
the content for the local repository to a standard group, called
manual, meaning that content will only be manually retrieved by the
contributor via git annex get <file> and manually removed when
needed with git annex drop <file>:
cd labdata/
git annex wanted . standard
git annex group . manual
At this point, the contributor can create new files and add them to
the annex, via git annex add <file> and commit that:
...creating new files...
git annex add <newfiles>
git commit -m "created <newfiles>"
At this point, local changes can be sent to the repository on the
storage server with git push but the content will not be sent, only
the symbolic link and some metadata. In order to copy the content of
the file to the annex on the storage server, git-annex provides the
command git annex copy <newfiles> --to=origin. Notice that, instead
of indicating the specific filename, it is sufficient to indicate the
name of the directory with the new files, when copying, and git annex will figure out what content will need to be copied, e.g.:
git annex copy . --to=origin
Since pushing content to a repository often requires pulling first
and merging changes, git-annex provides a more convenient way to
perform all these operations, through the sync command:
git annex sync origin --content
Internally, git annex sync --content performs the following steps:
- git commit
- git pull
- git merge
- git push
- git annex copy . --to=origin
- git annex get .
Notice that the last two steps will be skipped if --content is
omitted. Moreover, had the standard group manual not been set in
the local repository, then all files available on the storage server
would have been copied locally. Anyway, if that happens, interrupting
the retrieval with CTRL-C is safe.
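The whole contributor workflow can be rehearsed on a single machine before trying it against the real storage server. The sketch below is only an illustration under several assumptions: both repositories live in a temporary directory, the "storage server" is a local path instead of an SSH remote, the file names existing.dat and new.dat are made up, and the identity is a placeholder. If git-annex is not installed, the script does nothing. Behavior may differ across git-annex versions.

```shell
set -eu

if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex is not installed; skipping the demonstration"
    demo_status=skipped
else
    # Placeholder identity, so that commits work in a clean environment.
    export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
    export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
    work=$(mktemp -d)

    # Local stand-in for the repository on the storage server.
    git init -q "$work/labdata"
    cd "$work/labdata"
    git annex init storage-server
    git annex wanted . standard
    git annex group . backup
    echo "existing data" > existing.dat
    git annex add existing.dat
    git commit -q -m "added existing.dat"

    # Contributor side: clone, label the clone, keep content manual.
    git clone -q "$work/labdata" "$work/clone"
    cd "$work/clone"
    git annex init contributor1-desktop
    git annex wanted . standard
    git annex group . manual

    # Create new content and send it, together with the history.
    echo "new data" > new.dat
    git annex add new.dat
    git commit -q -m "created new.dat"
    git annex sync origin --content

    # Ask git-annex where the content of the new file is now stored.
    git annex whereis new.dat
    in_origin=$(git annex find --in=origin new.dat)
    demo_status=done
fi
```

After the sync, git annex whereis should list both the local clone and the storage-server stand-in for new.dat, while existing.dat should still be without local content in the clone, because of the manual group.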
If a <file> is stored in the annex and changes need to be made to it,
then the file must be unlocked first:
git annex unlock <file>
...edit...edit...edit...
git annex add <file>
git commit -m "updated <file>"
Notice that git annex unlock <file> removes the symbolic link and
copies the content of the file in its place, with write
permission. This is a second copy of the file, because the original
one remains in the annex. After changing the file, git annex add and
git commit can be performed as usual. Notice that, if you need to
frequently change a file, it may be more convenient to store it with
git instead of git-annex.
What happens if you attempt to edit a file without unlocking first?
Files added with git-annex appear as symbolic links in the
filesystem. An application, such as an editor, should warn that you
are opening a link and not a file. Secondly, the content of the file,
pointed to by the link, is stored in .git/annex/objects/ and set as
write-protected. This is the only copy of the content of the file in
the local repository, which is why it is protected. The application
attempting to write to this file should either fail, with
permission denied, or clearly ask for confirmation before writing to a
write-protected file, as e.g. Sublime Text 3 does. If the user insists
on writing to the file and the application allows that, the internal
copy kept by git-annex is damaged. With git annex fsck <file>,
git-annex will first report that the local copy of the file is no
longer good and will move it to .git/annex/bad/. To recover from such
a situation, it is necessary to retrieve a pristine copy of the file
with git annex get <file>, then unlock it, and either redo the edits
or copy the file from .git/annex/bad/ over the unlocked file, then
add and commit.
git clone http://storage-server.mydomain.com/labdata
cd labdata
git annex get <files>
[...]
git pull
git annex get <files>
Thanks to Michael Hanke's post, for inspiring parts of this tutorial and showing interesting solutions.
Thanks to Yaroslav Halchenko and Michael Hanke for their continuous
effort in improving and maintaining
NeuroDebian which, among many other
things, provides Debian/Ubuntu repositories with the latest
git-annex, within the package git-annex-standalone.
Special thanks to Joey Hess, author of git-annex, for the beautiful
and intriguing piece of software that sometimes teases us like a
puzzle, like git does.