Version Control System – Managing Your Projects

Note This part of the lecture note has been partially extracted and modified from Prof. Randy LeVeque’s class website on HPC.

In this class we will use git for

  • Homework submission

  • Final project submission

  • Giving feedback to you on your submissions

See below for more information on using git and the repositories required for this class. There are many other version control systems that you may come across, such as CVS, Subversion, Mercurial, and Bazaar. However, in modern computing and software development, git is by far the most common.

Version control systems were originally developed to aid in the development of large software projects with many authors working simultaneously on inter-related pieces.

The essential idea is to maintain a detailed history of a code base, which allows you to see when something was changed, who changed it, and what it looked like before the change. The container that holds the code base, and all of the history associated with it, is called a repository.

git in particular is special in that it is a distributed version control system, in contrast to the older style centralized version control systems. In centralized systems, working on the code required checking it out from some central server. This typically required a persistent connection to that server, and could also lock other people out of parts of the repository. The distributed nature of git means that many local copies of the repositories can exist on different machines with no knowledge of each other. Many people can work simultaneously. As a consequence, git must be able to synchronize seemingly incompatible repositories with each other.

Remote hosting services (e.g. GitHub, Bitbucket, GitLab, etc.) are not the same as git the version control system, though the two are often conflated. These services provide a place for another copy of a repository to exist. Pedantically, this copy of the repository is not special compared to any other copy on other machines. Its significance is that it exists in a location accessible to anyone working on a project, and thus serves as a way for people to synchronize with each other. Many more elaborate tools have sprouted out of this service, and each remote host does slightly different things.

It may sound like a hassle to be tracking all of this, but there are a number of advantages to this system that make version control an extremely useful tool even for use with your own solo projects. Once you get comfortable with it you may wonder how you ever lived without it.

Advantages

  • You can revert to a previous version of a file if you decide the changes you made were incorrect. You can also easily compare different versions to see what changes you made, e.g. to locate when/where a bug was introduced.

  • If you use a computer program and some set of data to produce results for a publication, you can check in exactly the code (and possibly the data) used. If you later want to modify the code to produce new results, you will still have access to the version that existed at the time of publication without having to keep elaborate archives. Working in this manner is crucial if you want to be able to later reproduce earlier results, as is often necessary if you need to tweak the plots for some journal’s specifications or if a reader of your paper wants to know exactly what parameter choices you made to get a certain set of results. This is an important aspect of doing ‘reproducible research’, as should be required in science. If nothing else you can save yourself hours of headaches down the road trying to figure out how you got your own results.

    • Note Tracking data with git is generally a bad idea. It can be done, but you should really keep it separate from your code, lest your repositories become multi-gigabyte behemoths that are troublesome to clone and manage.

  • If you work on more than one machine, e.g. a desktop and laptop, version control systems are one easy way to keep your projects synchronized between machines (as mediated through one of the remote hosts discussed above).

Two Types of Version Control Systems: SVN vs. Git

Centralized client-server systems (e.g., CVS, SVN)

The original version control systems all used a client-server model, in which there is one computer that contains “the repository” and everyone else checks code into and out of that repository.

Systems such as CVS and Subversion (svn) have this form. An important feature of these systems is that only the central repository has the full history of all changes made.

For those interested, consider this article comparing the two.

Distributed systems (e.g., Git)

Git, and other systems such as Mercurial and Bazaar, use a distributed system in which there is not necessarily a “master repository”. Any working copy contains the full history of changes made to this copy.

The best way to get a feel for how git works is to use it, for example by following the instructions in the next section.

Remark Please also check out this git commands cheat sheet:

Git for the class, and repository hosting on UCSC-GitLab

Creating your own GitLab repository

For this class we will use the UCSC GitLab instance as the primary choice for remote hosting. It is entirely reasonable to use git for your own work without hosting a repository remotely on a site such as GitHub, Bitbucket, or the UCSC servers. There are several reasons you may want to host a remote repo: sharing a project between different people or multiple machines, project recovery in the event of local data loss, or providing public access to your code.

  • You should learn how to use GitLab for more than just pulling changes.

  • You will use this repository to “submit” your solutions to homework. You will give the instructor and TA permission to clone your repository so that we can grade the homework.

    • Do not give these permissions to any other students (or anyone else for that matter).

  • By the end of the quarter you should be comfortable using git to track and manage more of your own projects.

Below are the instructions for creating your own repository. Note that this should be a private repository so nobody can view or clone it unless you grant permission.

Creating your course repository

  1. On the machine you’re working on run the following to configure your user with git

    $ git config --global user.name "Your Name"
    $ git config --global user.email youremail@ucsc.edu
    

    These will be used when you commit changes. If you don’t do this, you will get a warning message each time you try to commit.

    • If you’ve already used git then you will have already set these configurations. If you have these set globally to a different username/email than your UCSC one then you can set up a local configuration just for this class. After creating your AM129 repository cd into it and run:

      $ git config user.name "Your Name"
      $ git config user.email youremail@ucsc.edu
      
  2. Go to http://git.ucsc.edu/ and either sign in, or click register now if you don’t have an account yet. When signing up you must use your UCSC email address (CruzID).

  3. You should then be taken to your account. Click on the plus symbol at the top of the sidebar, then click on “New project/repository”, and select “Create blank project”.

  4. You should now see a form where you can specify the name of the project (repository) and a description. The repository name need not (and should not) be the same as your user name (a single user might have several repositories). Name your project following the naming convention Lastname Firstname AM129 Fall24, and note that the repository URL will sensibly change the spaces into hyphens. For example, the project for your peer Albert Einstein would be named Einstein Albert AM129 Fall24, which will yield the repository einstein-albert-am129-fall24. To avoid confusion, please follow this naming convention.

    Note that the box labeled Project slug holds the actual name of the repository. Note also that this gets modified from what you type into the Project name box, namely spaces get hyphenated and everything is made lowercase. On other hosting services you might specify the repository name directly. In those cases you should avoid spaces and varying cases, as well as any special symbols.

  5. Set the visibility level to Private, and tick the box to initialize the project with a README. Leave the static testing box unticked, it is not relevant to us.

  6. Click on “Create project”.

  7. You should now see the home page for this repository. Click the Clone button and copy the SSH URL (make sure you have already done the steps in Setting up SSH keys for GitLab before cloning). Open a terminal and navigate to where you want this repository to live. Important: Do not put this repository inside another one, git does weird stuff with nested repositories (git submodules are the correct way to do such a thing). Clone the repository to your chosen location by running:

    $ git clone git@git.ucsc.edu:aeinstein/einstein-albert-am129-fall24.git
    

    Of course you should use the URL associated to your repository.

  8. You should now be able to cd into the directory this created.
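Before committing anything, it can help to confirm which identity git will record. The following is a minimal sketch that uses a throwaway repository in a temporary directory rather than your course repository; the name and email are placeholders, and the -b option to git init assumes git 2.28 or newer.

```shell
# Throwaway repository, used only to try the configuration commands.
demo=$(mktemp -d)
cd "$demo"
git init -q -b main

# Local (per-repository) settings; the values are placeholders.
git config user.name "Your Name"
git config user.email "youremail@ucsc.edu"

# Read the values back. Local settings take priority over --global ones.
name=$(git config --get user.name)
email=$(git config --get user.email)
echo "Commits here will be recorded as: $name <$email>"

# --show-origin also reports which configuration file the value came from.
git config --show-origin --get user.email
```

Running the same two read-back commands inside your cloned course repository shows the identity that will appear in your submissions.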

Getting something into your repository

The directory you are now in will contain the auto-generated README

$ ls
README.md

But it will look slightly different if you try

$ ls -a
./  ../  .git/   README.md

Recall that the -a option causes ls to also list files starting with a dot, which are normally suppressed. See Basic Unix/Linux Commands for a discussion of ./ and ../. The directory .git is the directory that stores all the information about the contents of this repository and a complete history of every file and every change ever committed. You shouldn’t touch or modify the files in this directory because they are used by git to control versions, commit changes and their history, etc.

Create a new file called testfile.txt in your directory, containing two lines

$ cat > testfile.txt
This is a new file
with only two lines so far.
^D

Here the Unix cat command copies everything you type on the following lines to its output, and the shell’s > redirects that output into a file called testfile.txt. This goes on until you type <ctrl>-d (shown as ^D on the 4th line in the example above), which signals the end of the input; you should then get the shell prompt back. Alternatively, you can use any of your favorite text editors (see Items for the Class).

To see a shortened status of your folder, type

$ git status -s

The response should be

?? testfile.txt

The ?? means that this file is not being tracked by git. The -s flag results in this short status list. Leave it off for more information.

To indicate to git that this file should be tracked use git add

$ git add testfile.txt
$ git status -s
A  testfile.txt

The A means it has been added. However, at this point the file has not been recorded by git. To do so you will commit the file:

$ git commit -m "My first commit of a test file."

The string following the -m is a comment about this commit that may help you in general remember why you committed new or changed files. You can also type git commit with no options. This will bring up your system editor for you to write your commit message into. Every commit must have a commit message.

You should get a response like

[main 31cb6ed] My first commit of a test file.
1 file changed, 2 insertions(+)
create mode 100644 testfile.txt

We can now see the status of our directory via

$ git status
# On branch main
nothing to commit (working directory clean)

Alternatively, you can check the status of a single file with

$ git status testfile.txt

You can get a list of all the commits you have made (only one so far) using

$ git log

commit 31cb6ed38310eed36f47d3d3aed769e03da540c9
Author: bananaslug <bananaslug@ucsc.edu>
Date:   Sun Oct 01 00:04:14 20xx -0700

    My first commit of a test file.

The number 31cb6ed38310eed36f47d3d3aed769e03da540c9 above is the commit hash for this commit, and you can always get back to the state of your files as they existed when you made it by using this number. You don’t have to remember it, you can use commands like git log to find it later.

Yes, this is a number: a 40-digit hexadecimal number, meaning it is written in base 16, so in addition to the digits 0, 1, 2, …, 9 there are six more digits a, b, c, d, e, f representing 10 through 15. For all practical purposes this number is unique among all commits you will ever make (or anyone has ever made, for that matter). It is computed by applying the SHA-1 cryptographic hash function to the contents of the commit, that is, the state of all the files in this snapshot together with metadata such as the author, date, commit message, and parent commit. It is called a SHA-1 hash for short. In practice, you can refer to a single commit by using just the first few digits. Generally 6 digits are already enough to uniquely identify a single commit within a project.
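To make the discussion of hashes concrete, here is a minimal sketch that builds a throwaway repository in a temporary directory and asks git for the full and abbreviated hash of its only commit. The directory, the identity "Demo User", and the email are placeholders, and the -b option assumes git 2.28 or newer. Your hashes will differ from anyone else's because the author and date are part of what gets hashed.

```shell
# Throwaway repository with a single commit (placeholder names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q -b main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "This is a new file" > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."

full=$(git rev-parse HEAD)            # full 40-digit hash of the latest commit
short=$(git rev-parse --short HEAD)   # abbreviated hash, typically 7 digits
echo "full:  $full"
echo "short: $short"

# The abbreviated hash resolves to the same commit as the full one.
resolved=$(git rev-parse "$short")
echo "resolved: $resolved"

# Print the commit object itself: snapshot (tree), author, committer, message.
git cat-file -p HEAD
```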

Modifying a file, and adding another

Now let’s modify this file

$ cat >> testfile.txt
Adding a third line
^D

Here the >> tells cat that we want to add on to the end of an existing file rather than creating a new one. (Or you can edit the file with your favorite editor and add this third line.)

Now try the following

$ git status -s
 M testfile.txt

The M indicates this file has been modified relative to the most recently committed version.

To see what changes have been made, try

$ git diff testfile.txt

This will produce something like

diff --git a/testfile.txt b/testfile.txt
index d80ef00..fe42584 100644
--- a/testfile.txt
+++ b/testfile.txt
@@ -1,2 +1,3 @@
 This is a new file
 with only two lines so far.
+Adding a third line

The + in front of the last line shows that it was added. The two lines before it are printed to show the context. If the file were longer, git diff would only print a few lines around any change to indicate the context.

Now let’s add another file, but this time one that actually serves a purpose. Notice how git status reports information about all files anywhere inside the repository, even in sub-directories. This also affects things like git add . where the period says to add all files recursively under the current location.

There are a lot of files that we’ll want to ignore, so let’s tell git about them. This way we won’t have to be so wary about adding and committing (though still be a little wary). Create the file .gitignore using your favorite text editor. For me this is

$ code .gitignore

and edit it to look like this

*.o
*.d
*.ex
*.mod
*.dat

# ignore autogenerated macOS file explorer files
# (don't need this if not on macOS)
*.DS_Store

This tells git to ignore all files that end with .o, .d, .ex, .mod, or .dat. Files that end with .o, .d, .ex, or .mod are usually generated during compilation and are specific to the computer you are compiling on, so we don’t want git to track them. Files that end with .dat typically hold data, which is best kept out of the repository as noted earlier. A .DS_Store file is automatically generated by the macOS Finder application; these can get annoying, so if you have a Mac I recommend that you ignore them as well. As the quarter progresses, if there are other types of files that you want git to ignore you should add them to your .gitignore file as well!
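If you are unsure whether a pattern is doing what you expect, git can tell you. The sketch below uses a throwaway repository and two placeholder files, solver.o and main.f90 (hypothetical names, not files from this class), with a shortened version of the ignore list above. The -b option assumes git 2.28 or newer.

```shell
# Throwaway repository with a short ignore list.
demo=$(mktemp -d)
cd "$demo"
git init -q -b main
printf '%s\n' '*.o' '*.mod' '*.dat' > .gitignore

# One file that matches a pattern and one that does not.
touch solver.o main.f90

# check-ignore prints only the paths that are ignored.
ignored=$(git check-ignore solver.o main.f90)
echo "ignored: $ignored"

# With -v it also reports which file, line, and pattern caused the match.
git check-ignore -v solver.o

# The short status lists the untracked files git would still consider adding.
status=$(git status -s)
echo "$status"
```

Here the status should list .gitignore and main.f90 as untracked, with no mention of solver.o.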

Now let’s add the .gitignore and the changes made to testfile.txt and commit everything.

$ git add .
$ git commit

Since we didn’t use the -m flag in the commit, this will instead spawn our text editor. Yours might default to vim. Write a commit message, then save and exit your editor to finalize the commit.

Remark: You can change the editor used for commit messages by running:

$ git config --global core.editor emacs

Try doing

$ git log

or

$ git log --graph

now and you should see something like:

commit 271bd14e5b8d68840e7e6481ad7e99e5708e50e7
Author: bananaslug <bananaslug@ucsc.edu>
Date:   Sun Oct 01 00:04:24 20xx -0700

    Added a third line to the test file, and added a gitignore file

commit 31cb6ed38310eed36f47d3d3aed769e03da540c9
Author: bananaslug <bananaslug@ucsc.edu>
Date:   Sun Oct 01 00:04:14 20xx -0700

    My first commit of a test file.

If you want to revert your working directory back to the first snapshot you could do

$ git checkout 31cb6ed
Note: switching to '31cb6ed'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

Take a look at the file, it should be back to the state with only two lines. You are now in a situation called a detached HEAD state. To learn more about it and how to fix the situation, take a look at the following StackOverflow post:

Note that you don’t need the full SHA-1 hash, the first few digits are enough to uniquely identify it.

You can go back to the most recent version with

$ git checkout main
Switched to branch 'main'

We won’t discuss branches much here, but unless you create a new branch, the default name for your main branch is main and this checkout command just goes back to the most recent commit.
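The round trip described above can be tried safely in a throwaway repository. This is a minimal sketch with placeholder identity settings; the -b option assumes git 2.28 or newer, and HEAD~1 is git's shorthand for "the commit just before the latest one".

```shell
# Throwaway repository with two snapshots of testfile.txt.
demo=$(mktemp -d)
cd "$demo"
git init -q -b main
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot: two lines.
printf 'This is a new file\nwith only two lines so far.\n' > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."

# Second snapshot: a third line.
echo "Adding a third line" >> testfile.txt
git commit -q -am "Added a third line to the test file"

first=$(git rev-parse --short HEAD~1)   # abbreviated hash of the first commit

git checkout -q "$first"                # detached HEAD at the first snapshot
old_lines=$(grep -c "" testfile.txt)    # count the lines in the file

git checkout -q main                    # back to the most recent commit
new_lines=$(grep -c "" testfile.txt)

echo "first snapshot: $old_lines lines, main: $new_lines lines"
```

The file has two lines while the first snapshot is checked out and three again after returning to main, so nothing is lost by looking around in the detached HEAD state.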

Working with remotes

So far you have been using git to keep track of changes in your own directory, on your computer. None of these changes have been seen by GitLab, so if someone else cloned your repository from there, they would not see testfile.txt, or .gitignore.

Now let’s push these changes back to the GitLab remote. First do

$ git status

to make sure there are no changes that have not been committed. This should report that there is nothing to commit.

Now do

$ git push

This will prompt for your blue password and should then print something indicating that it has uploaded all of your local commits to the remote GitLab repository.

Not only has it copied the 2 files over, it has added all of the change sets, so the entire history of your commits is now stored in the repository. If someone else clones the repository, they get the entire commit history and could revert to any previous commit, for example.

If your repository started out as a local one to which you later added a remote host, you will need to first run

$ git push -u origin main

to associate the main branch with the remote host origin. You only need to do this once, and most remote hosts will give you the needed commands when you first set them up.

Try running

$ git remote -v

to see a list of all remotes. By default there is only one, the place you cloned the repository from. (Or none if you had created a new repository using git init rather than cloning an existing one.)

You should check that the files are in your remote repository. Go back to the web page for your repository and refresh it. You should now see both of the files you worked on locally, as well as a time stamp from when the commit(s) were pushed.

Now click on the “Commits” tab at the top. It should show that you made two commits and display the messages you wrote for each one.

If you click on a particular commit, it will show the change set for this commit. You should see something similar to the git diff output from before.

Finally, you can transfer changes from the remote repository into your local one. To do this type:

$ git pull

This actually does two things at once: it fetches the changes from the remote (a network operation) and then merges those changes into your current local branch (a local operation). You can use git fetch --all if you just want to download the changes from the remote, but do not want to actually merge them in.
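To see the two halves of git pull separately without touching GitLab, the sketch below uses a bare repository on the local disk as a stand-in for the remote. Everything here is a placeholder setup: the directory names remote.git, first, and second, and the "Demo User" identity. The -b option assumes git 2.28 or newer.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the remote host.
git init -q --bare -b main remote.git

# First working copy: starts out local, then gets a remote and pushes to it.
git init -q -b main first
cd first
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "This is a new file" > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."
git remote add origin ../remote.git
git push -q -u origin main
cd ..

# Second working copy, cloned before the next change exists.
git clone -q remote.git second

# The first copy makes another commit and pushes it.
cd first
echo "Adding a second line" >> testfile.txt
git commit -q -am "Added a second line"
git push -q
cd ../second

# Step 1 of a pull: fetch downloads the new commit but leaves the files alone.
git fetch -q origin
before=$(grep -c "" testfile.txt)

# Step 2 of a pull: merge brings the fetched commit into the current branch.
git merge -q origin/main
after=$(grep -c "" testfile.txt)

echo "lines before merge: $before, after merge: $after"
```

After the fetch the file in the second copy still has one line; only the merge updates it to two.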

Setting up SSH keys for GitLab

You may notice that typing your username and password every time you want to push/pull from a remote repository is tedious. Fortunately there is a way around this that simultaneously makes your connection more secure, which is really a win-win. You can associate SSH keys with your account. In fact, many remote hosts (like GitHub) strongly recommend going this route.

We are going to talk more about SSH in general soon, so for now let us just look at how to set up keys for GitLab. These keys come in pairs, one half of the pair is a private key, and the other half is the public key. After generating a key pair you will upload the public key to GitLab.

Important: Do not share your private key with anyone, or upload it anywhere. In fact, there is little reason to ever even open the private key. Simply leave it in place and ignore it.

Okay, with that warning in place these are the steps to generate a key pair:

  1. Run ls ~/.ssh and if you see the file id_ed25519.pub skip ahead to step 6

  2. Run ssh -V and verify that it is at least version 6.5 (assuming that OpenSSH is being used).

  3. Run ssh-keygen -t ed25519 -C "<youremail@ucsc.edu>"

  4. Press enter to accept the default storage location

  5. Press enter again to use a blank passphrase

    • A blank passphrase is perfectly fine for this. What this really means is that the security is inherited from your computer’s log in process.

  6. You should now have the file ~/.ssh/id_ed25519.pub. cat this file out and copy its contents. Make sure you are copying the public key!

  7. Sign into UCSC GitLab, click your avatar in the upper left, and select Preferences

  8. On the left bar select SSH Keys, and click Add new key.

  9. Paste the entire contents of your public key into the key box

  10. Give the key a useful name, and click Add key

  11. Finally, run ssh -T git@git.ucsc.edu to test that everything worked.

Note: Each key pair should correspond to a single user on a single computer. If you have a laptop and a desktop you like to work from, each should have its own key pair. You can associate multiple public keys with your GitLab account for this reason. Obviously, each user should have their own keys. Again, do not copy, upload, or share the private key.

Now that you have SSH keys set up go to your repository and run git pull. Hmmm, it still wants a username and password doesn’t it? We need to change the remote to use the SSH protocol. Recall what we saw when running git remote -v

$ git remote -v
origin https://git.ucsc.edu/aeinstein/einstein-albert-am129-fall24.git (fetch)
origin https://git.ucsc.edu/aeinstein/einstein-albert-am129-fall24.git (push)

We can change origin to use SSH by running:

$ git remote set-url origin git@git.ucsc.edu:aeinstein/einstein-albert-am129-fall24.git

You can get the correct URL for your repository by going back to GitLab and clicking clone again. This time you’ll select the SSH option. If we re-run git remote -v we should see:

$ git remote -v
origin git@git.ucsc.edu:aeinstein/einstein-albert-am129-fall24.git (fetch)
origin git@git.ucsc.edu:aeinstein/einstein-albert-am129-fall24.git (push)

Finally, try running git pull and you should see that everything works!

Rolling back to a previous state

Let’s take a look at the case where you do not like the last change you made to your repo, and you want to revert it to a previous state. Say,

  • commit 1b82c2168 is the current unsatisfactory revision

  • commit c27d1bdf0 is the previous revision you wish to roll back to

You can roll back to the previous commit using the git reset command, which comes in three flavors. First you can use

$ git reset --soft c27d1bdf0

This will wind you back to the state of things before you made commit 1b82c2168, and in particular it will leave all changes you made present and staged. Alternatively, you can use

$ git reset c27d1bdf0

which is equivalent to git reset --mixed <commit hash>. This rolls back the state to the older commit and unstages the changes you’ve made, but leaves the individual files untouched. Finally, and most drastically, you can use

$ git reset --hard c27d1bdf0

This will send the whole repository back to the state it was in at the time of the older commit. All changes are lost. Be careful!! It is unlikely that you’ll need this version.
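The difference between the three flavors is easiest to see by comparing git status -s after each one. The following is a minimal sketch in a throwaway repository; the identity and file contents are placeholders, and the -b option assumes git 2.28 or newer. Between trials it jumps back to the unsatisfactory commit so that each flavor starts from the same place.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "good line" > testfile.txt
git add testfile.txt
git commit -q -m "Good revision"
good=$(git rev-parse HEAD)          # the revision we want to roll back to

echo "bad line" >> testfile.txt
git commit -q -am "Unsatisfactory revision"
bad=$(git rev-parse HEAD)           # the current unsatisfactory revision

# --soft: the change stays in the file and stays staged.
git reset -q --soft "$good"
soft=$(git status -s)

# --mixed (the default): the change stays in the file but is unstaged.
git reset -q --hard "$bad"          # start over from the bad commit
git reset -q --mixed "$good"
mixed=$(git status -s)

# --hard: the change is removed from the file as well.
git reset -q --hard "$bad"          # start over from the bad commit
git reset -q --hard "$good"
hard=$(git status -s)

printf 'soft:  [%s]\nmixed: [%s]\nhard:  [%s]\n' "$soft" "$mixed" "$hard"
```

The M appears in the first column after --soft (staged), in the second column after --mixed (modified but not staged), and the status is empty after --hard.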

In case you want to recover files that are deleted locally, you can do

$ git ls-files -d | xargs git checkout --

Similarly, to recover modified files back to the previous states

$ git ls-files -m | xargs git checkout --

See more examples at https://git-scm.com/docs/git-ls-files.

Remark Wait a minute… what is the command xargs above? It is a particularly powerful command that allows you to convert output from one command into arguments for another command.

  • Pop quiz: How is this different from piping?

  • Take a look at this article for more examples of how to use it.
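As a small illustration of what xargs does, the sketch below sends the same two placeholder file names through a plain pipe and through xargs. It uses echo as the receiving command because echo ignores its standard input and only prints its arguments.

```shell
# Two placeholder file names, one per line.
names='a.txt
b.txt'

# Plain pipe: the names arrive on echo's standard input, which echo ignores,
# so only an empty line is printed.
piped=$(printf '%s\n' "$names" | echo)

# With xargs: the names become arguments, so this runs `echo a.txt b.txt`.
with_xargs=$(printf '%s\n' "$names" | xargs echo)

echo "piped:      [$piped]"
echo "with xargs: [$with_xargs]"
```

git checkout behaves like echo in this respect: it expects file names as arguments, which is why the recovery commands above pass the output of git ls-files through xargs.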

In some cases, you may wish to forget about all your local changes and have git overwrite all of your local files. In general, if you have changes in your local files that git sees as potential conflicts, git pull will not allow you to bring in the most recent updates committed to the repository by others. Git will give you errors such as:

error: Your local changes to the following files would be
overwritten by merge:

or:

error: The following untracked working tree files would be
overwritten by merge:

In this case, if you don’t mind overwriting your local changes with whatever is available in the main branch on the remote repository, you can do the following

$ git fetch --all
$ git reset --hard origin/main

or you can combine the two in a single line command using &&

$ git fetch --all && git reset --hard origin/main

Again, be careful with this command: the --hard option discards all uncommitted changes to tracked files, and with or without --hard, any local commits that haven’t been pushed are dropped from your branch. So, you should only do this if you know what you’re doing and trust the recent updates on the remote repository.

Understanding Git Workflows

Please read this nice article.

Summary

The commands we have discussed so far will give you a good start with git. As you get used to git you will learn that only a handful of git commands are needed in many cases. This is particularly true when you are not sharing the project with many other contributors. In our class it will primarily be you alone checking changes in to and out of your remote repo hosted on GitLab. One exception will be pulling the grades and remarks about each assignment from the TA.

In this simple project environment, you will most likely only need to use the following commands

$ git status
$ git add
$ git commit
$ git push
$ git pull

Remark You can pull up the man-page for most git commands by replacing the space with a hyphen, e.g. man git-commit will give the man page for the git commit command.

Remark: Git is used extensively. Pretty much any issue you may have can be answered by a little Googling. It sometimes feels like you already need to know the answer to know what to search, but you can get pretty far by pasting in whatever error message you get. Of course, you can also ask the instructor or TA!