September 16th, 2009
11:39 am - Fun with updating the cached contents in the index by staging
Yuck, what a title.
From time to time I've been writing this "Fun with..." series on git, and I think it is time to talk about the cache, the index and staging.
When Linus started writing git, his aim was to allow him to reproduce each and every intermediate state produced by his original "tarball and patches" workflow he used before the BitKeeper days. Starting from a 2.6.12 tarball, he queues patch-1, patch-2,... so 2.6.12 itself, 2.6.12 with patch-1 applied, 2.6.12 with both patch-1 and patch-2 applied, become three versions.
But this obviously won't scale if you have to shuffle hundreds of patches a day. So he invented the "directory cache"; as a concept, this roughly corresponds to "tree" objects in today's git: a collection of records, each of which is a compact representation of what a whole directory structure contains. The way to build it was to "add the contents to the cache, or update the contents in the cache".
The control directory to host the collection of such version control records was named ".dircache" (this was renamed to ".git" after some time). There was a file called ".dircache/index", and the contents of this file were read into and manipulated in a set of C variables named after a noun, "cache". Back then, the concept of what we today call the index, a buffer area to build up the collection of contents you intend to write out as a tree object, was called "cache". Everybody talked about "cache" and "index" interchangeably, as the file that records what is in the "cache" was named "index". It was (and it still is) an index to allow you to find the contents in the cache by giving it a pathname.
As more and more people started using git without having to read its code at all, the use of the word "index" has become more prevalent for obvious reasons. As something that is on the filesystem, it is much more visible than the variable name in the C source code. Eventually, we stopped using "cache" as a noun to name what we call "the index" today when explaining the use of git as the end-user. The word "cache" however is still used as a noun when we want to talk about the internal data structure in the context of discussing git implementation (e.g. "Let's make it possible for programs to work with more than one cache at the same time").
At the end user level, "cache" is only used as an adjective these days; "cached", meaning "contents cached in the index, not the contents in the work tree". We could have called it "indexed", but "cached contents" was an already established phrase from very early days to mean that exact concept, and we did not need another word that meant the same thing.
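To make "cached contents" concrete, here is a minimal sketch you can run in a throwaway repository. The file name greeting.txt and the variable names are made up for the illustration; ":greeting.txt" is git's regular revision syntax for "the copy of this path recorded in the index".

```shell
# Scratch repository in a temporary directory (all names are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q

# Add one version of the file to the index...
printf 'first draft\n' > greeting.txt
git add greeting.txt

# ...then keep editing the file in the work tree without adding it again.
printf 'second draft\n' > greeting.txt

# ":path" names the contents cached in the index for that path.
cached_contents=$(git show :greeting.txt)
worktree_contents=$(cat greeting.txt)

echo "cached:    $cached_contents"     # cached:    first draft
echo "work tree: $worktree_contents"   # work tree: second draft
```

The index still holds what was added, no matter what has happened to the file in the work tree since; that is all "cached" means at the end user level.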
There are some commands that take --index and --cached options, and even some that can take both (but not at the same time). Many people find this confusing, but there is a pair of simple rules:
- "--cached" always means "work only on contents cached in the index, ignoring the work tree";
- "--index" makes a command that usually works on files in the work tree also pay attention to the index.
Here are a handful of examples.
- "git apply" usually patches the files in the work tree without touching the index.
- "git apply --cached" only updates the contents in the index without modifying the file in the work tree.
- "git apply --index" patches both the contents in the work tree and in the index.
- "git diff HEAD" shows a patch to update the contents in the HEAD commit to contents in the work tree.
- "git diff --cached HEAD" shows a patch to update the contents in the HEAD commit to the contents that are cached in the index. "git diff --cached" is a short-hand for "git diff --cached HEAD" only because the HEAD commit is what you most often would want to compare the cached contents with.
- There is no "git diff --index HEAD" (yet); it would imply showing a three-way diff between HEAD, the index and the work tree.
- "git grep" finds matches in the work tree.
- "git grep --cached" finds matches in the contents in the index.
- "git rm" removes both the file in the work tree and the corresponding path in the index.
- "git rm --cached" removes the path from the index, leaving the file in the work tree untracked.
In the earlier days, there was a distinction between "adding a new file to the index" and "updating a file that is already in the index with new contents". The former was done with "git update-index --add", while the latter was done with "git update-index" (the command by default did not add new files, to prevent "add *" from adding object files; this dates back to the days before .gitignore), but that is prehistoric.
Modern (and medieval) versions of git use "git add" for both. We could have been just honest and called the act of updating-or-adding-to-the-index "add", but some people in the "git training" industry started teaching the index as "the staging area for the next commit", and as an inevitable consequence, a verb "to stage" started to appear in much of the documentation to mean "the act of adding contents to the index". I sometimes use this verb myself, but that is only when I suspect that the audience might have learned git first from these new people. Strictly speaking this is a redundant and fairly recent word in git vocabulary.
Its adjective form, "--staged", is also supported in "git diff" as a synonym for "--cached", but it is not advertised. The distinction between "--cached" and "--index" is meaningful, but you do not need "--staged", which is purely a synonym for "--cached".
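If you want to see the pair of rules in action, the "git apply" examples above can be tried in a scratch repository along these lines. This is only a sketch: the file names (note.txt, change.patch), the throwaway commit identity and the has_two helper are invented for the illustration; the git commands and options are the ones discussed above.

```shell
# Scratch repository in a temporary directory (all names are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'one\n' > note.txt
git add note.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m initial

# Record a change to note.txt as a patch, then discard the change.
printf 'one\ntwo\n' > note.txt
git diff > change.patch
git checkout -- note.txt

# Hypothetical helper: prints "yes" if its input contains the patched line.
has_two() { if grep -q two; then echo yes; else echo no; fi; }

# Plain "git apply": work tree only.
git apply change.patch
plain="worktree=$(has_two < note.txt) index=$(git show :note.txt | has_two)"
git reset -q --hard

# "--cached": index only, the work tree file is left alone.
git apply --cached change.patch
cached="worktree=$(has_two < note.txt) index=$(git show :note.txt | has_two)"
git reset -q --hard

# "--index": both the work tree and the index.
git apply --index change.patch
both="worktree=$(has_two < note.txt) index=$(git show :note.txt | has_two)"

echo "git apply:          $plain"    # worktree=yes index=no
echo "git apply --cached: $cached"   # worktree=no index=yes
echo "git apply --index:  $both"     # worktree=yes index=yes

# "--staged" and "--cached" ask "git diff" the same question.
if [ "$(git diff --staged)" = "$(git diff --cached)" ]; then
    synonym=same
else
    synonym=different
fi
echo "--staged vs --cached: $synonym"   # same
```

The "git reset --hard" between the steps simply puts both the index and the work tree back to the committed state, so that each variant starts from the same place.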
Date: October 9th, 2009 07:38 pm (UTC)
Usage of "stage"
First, I want to thank you for a really nice article explaining the difference between "cached" and "index" in command line options. However, I still find the usage of these two words confusing. Maybe using two forms of the same word (like "cached" and "cached-only" or something like this) would be clearer?
Second, what do you think of using "to stage"? Is git going to replace "index" and "cache" with "stage" someday?