Log in

June 29th, 2009

Previous Entry Share Next Entry
11:27 pm - Linus's ultimate content tracking tool
I have kept saying that message number 217 in gmane archive is the most important message in the entire life of the git mailing list. I still think it is, and more importantly, it certainly was one of the most influential messages that shaped the various design decisions in git. Understanding of the ideas described in it may reveal the secrets behind different parts of git.

Or it may not. But it is fun thinking and writing about it, so here it is.

In the message, Linus outlined how an ideal content tracking system may let you find how a block of code came into the current shape. You'd start from the current block of code in a file, go back in the history to find the commit that changed the file. Then you inspect the change of the commit to see if the block of code you are interested in is modified by it, as a commit that changes the file may not touch the block of code you are interested in, but only some other parts of the file.

When you find that before the commit the block of code did not exist in the file, you inspect the commit deeper. You may find that it is one of the many possible situations, including:
  1. The commit truly introduced the block of code. The author of the commit was the inventor of that cool feature you were hunting its origin for (or the guilty party who introduced the bug); or
  2. The block of code did not exist in the file, but five identical copies of it existed in different files, all of which disappeared after the commit. The author of the commit refactored duplicated code by introducing a single helper function; or
  3. (as a special case) Before the commit, the file that currently contains the block of the code you are interested in itself did not exist, but another file with nearly identical contents did exist, and the block of the code you are interested in, together with all the other contents in the file existed back then, did exist in that other file. It went away after the commit. The author of the commit renamed the file while giving it a minor modification.
In git, Linus's ultimate content tracking tool does not yet exist in a fully automated fashion. But most of the important ingredients are available already.

Perhaps you can help us complete the dream.

1. Fast path-limited revision traversal

In git, the contents recorded by a commit is represented as a tree object that itself is a recursive structure. In the Linux kernel project, for example, there are 29,000 files stored under 21 top-level directories:
arch/           drivers/   ipc/         mm/             security/
block/          firmware/  Kbuild       net/            sound/
COPYING         fs/        kernel/      README          tools/
CREDITS                    lib/         REPORTING-BUGS  usr/
crypto/         include/   MAINTAINERS  samples/        virt/
Documentation/  init/      Makefile     scripts/
The top-level tree object has 30 entries, many of which record tree objects that represent these subdirectories, and others record blob objects.

Suppose you are interested in the history of a block of code in the file init.c in the arch/x86/mm directory. You would do this:
$ git log v2.6.30 -- arch/x86/mm/init.c
Because the project has nearly 30,000 files, only a small fraction of commits affect this particular file. Even if you count commits that touch other files in the same directory as this file is in, they are still very small portion of the whole.

For example, between v2.6.29 (Mar 23, 2009) and v2.6.30 (Jun 20, 2009), there are 12,000 individual commits, but only 10 commits touch the init.c file we are looking at. 144 commits touch the directory it is in (arch/x86/mm/), 1120 commits touch the directory one level higher (arch/x86/), and 2761 commits touch the top-level directory this file is found in (arch/)---which finally reaches only 1/4 of the whole.

Which means that the path-limited revision traversal can skip 3/4 of the commits without reading much. We need to read the commit object itself to learn the top-level tree object, its parent commit object, and the top-level tree object of the parent commit object. Then by comparing the entry for arch/ in the two top-level tree objects, we know that the commit does not touch the arch/ hierarchy without reading anything further, and move to that parent commit (whose tree object we have already read--so we can amortize the cost even more).

This optimization to read only the necessary part of the tree object works recursively, and we only need to read the full depth of the tree for a little more than 1% of the commits (144 commits among 12,000 touch arch/x86/mm directory). Note that even for these commits, we do not need to open other parts of the tree (e.g. sound/ directory) at all. Literally, we need to only read at most four tree objects from each commit to run the above "git log" command.

2. Merge Simplification

Linus's ultimate content tracking algorithm starts from a commit, and goes back the history, looking for commits that change the parts of a file we are interested in. How does a merge affect this?

For example, suppose we have a history that looks like this:
          *   *
     /           /

and only commits C and D touch the path we are interested in. Since the branches forked, between I, H, G and F, there is no change to the path. Between I and E there is no change, but D changes the path and C changes it again.

When we create a merge B, the lower branch did not do anything while the upper branch did some modification. A simple 3-way merge rule dictates that we take the contents of C as the result of this merge.
In other words, B has the same contents of the path we are interested in as it appeared in C.

When we have a merge (B in this case) whose content matches one of the parent (C in this case), by default we discard other parents while following the history. In other words, while we traverse the history with path limitation, we would pretend as if the history is like this:
          *   *

At a first glance, it may seem unintuitive that we take a parent that is the same as the merge result. But if you think of the lower branch the "mainline", and the upper branch a side topic that was created for the sole purpose of updating the path we are interested in, it makes perfect sense. There is a vast difference between F and B because that is a merge of the finished topic from a side branch (i.e. the upper one) that integrates the entire change in a single step. By following the upper branch commit by commit, however, we can get a much finer grained view of what happened to the path we are interested in.

Together with the optimization of path-limited traversal, this technique cuts down the number of commits we need to look at even further.

3. Pickaxe

The second ingredient in Linus's ultimate content tracking tool is to find if a given commit changes the block of the code in a file we are interested in. Suppose that we are interested in the following block of code in arch/x86/mm/init.c:
struct map_range {
        unsigned long start;
        unsigned long end;
        unsigned page_size_mask;

We would want to find the commit that changed the contents of the file to make this block of code into its final shape. For that, we do this:
$ git log --pretty=short -S'struct map_range {
        unsigned long start;
        unsigned long end;
        unsigned page_size_mask;
};' v2.6.30 -- arch/x86/mm/init.c

Note that we are not interested in commits for which "git show" output contains the given string. We are only interested in a commit whose tree has this string in the file literally, but whose parent's tree does not. In other words, we do not have to (nor want to) run textual diff and grep in the result. We count the number of occurences of the given string in the file in the tree of the commit, and the same for the commit's parent. If we get different number, the commit chnages the string, which is what we wanted to find. Counting occurrences of substring is much cheaper than first generating textual diff and grepping in it (which is not what we want to do anyway).

4. Blame

I actually consider this both a mere "checkbox" item in the sense that we have it only because other SCMs have it (often under the name annotate), and a "cop-out" because doing the ideal "refactoring identification" is harder, even though I believe our blame works much better than in other SCMs.

We start by treating the whole contents of the file as a single block of text we are interested in, and while following Linus's ultimate content tracking ideal to identify the commits that touch the block of text, iteratively break the block down into smaller blocks (until each line becomes its own block) and keep digging the history.

As the result:
  1. Content movement from other files is identified naturally, albeit that we identify only one single source (i.e. we do not notice that the block of text is a result of refactoring five identical copies); and
  2. Wholesale file rename falls out as a trivial and very narrow special case of (1) without recording any rename tracking information at the commit time.

5. GUI

This part of Linus's ultimate content tracking tool is sorely lacking from current git. The overall flow should look like this:
  1. Show the contents of any file in the tree, and let you highlight with mouse/keyboard a block of text in it;
  2. Run the path-limited pickaxe search to find the commit that touches the block of text you showed interest in step (1). We already have both of the two necessary ingredients needed to perform this step efficiently.
  3. Show the change of the commit found in step (2), and find blocks of text in the files in the tree of the parents' of the commit that match or are very similar to what we are interested in. We do have GUI that shows the change of the commit; we do not have a good tool (yet) to show similar blocks of text that are potential pre-refactored duplications to identify refactoring. Perhaps we could run (possibly fuzzy) grep in the tree of the parent commits. We have already a capability to run grep on a tree object.
  4. Let you go back to step (1), starting from the commit we inspected in step (3) and continue.
Cf. My previous message on this issue.

git-gui's blame frontend does some of this, but not the third step. Same for gitweb's blame interface.


(6 comments | Leave a comment)


Date:June 30th, 2009 09:29 am (UTC)

Petr Baudis pickaxe GUI: giddy

What about a "secret pickaxe project" Petr 'Pasky' Baudis was talking about at Git Together '08 (http://git.or.cz/gitwiki/GitTogether08), i.e. giddy (gitweb (http://repo.or.cz/w/giddy.git))?
Date:March 1st, 2010 12:38 pm (UTC)

Pickaxe by counting occurrences

Doesn't the approach of counting occurrences of the search string in the file have the shortcoming that _moving_ the string within the file is not detected?
[User Picture]
Date:March 1st, 2010 04:58 pm (UTC)

Re: Pickaxe by counting occurrences

Quite the contrary, it is one of its design goals to ignore such a case.

The pickaxe was invented primarily as a step (in the larger picture of "Linus's ultimate content tracking tool") to find the changeset that actually _modified_ the given block of text. If a changeset moved the block of text in question without touching what is inside that block, we would want to actively ignore it and continue digging further.

Please go back and read the introductory part of the article where I rephrased what steps Linus envisioned was necessary.
Date:March 2nd, 2010 03:50 pm (UTC)

Re: Pickaxe by counting occurrences

But isn't that scenario, while perfectly valid, a bit restrictive (and perhaps conveniently tailored to the efficient implementation of pickaxe)?

What about this scenario: I notice some odd UI behaviour, and when I look at the relevant source code I notice that two function calls appear to have been transposed. I'm certain that I had them in the other order when I wrote the module, and so I want to see when they were changed, by whom, and (hopefully) why. Unfortunately the file is frequently modified for other reasons - so a pickaxe search to only find modifications involving that function call seems a good way to eliminate irrelevant commits.

But pickaxe ignores such modifications, because nothing is being added or removed from the file.

Is pickaxe simply the wrong tool here?
[User Picture]
Date:March 2nd, 2010 05:36 pm (UTC)

Re: Pickaxe by counting occurrences

I think I know where your confusion comes from. Re-read what I wrote as a response, with an added strees to the phrase "block of text".

If the bug you are seeing is that you are sure you used to have:

but now you see:

then you don't ask "git log -Sgoodbye()".

You have to realize that the block of text you are interested in is not a single "goodbye()" (nor "hello()" for that matter).

The block of text you are interested in is both of these two lines, i.e. "The current code calls goodbye and then hello---what idiot decided to call them in that order?" is the question you are asking. So you would feed these two lines as the block of text you are interested in, like so:
  $ git log -S'  goodbye();
  >   hello();'

in order to find where that came from. You will see who swapped the lines in the history.
Date:March 2nd, 2010 04:00 pm (UTC)

Re: Pickaxe by counting occurrences

Hmm. Perhaps 'blame' is the tool for the specific scenario I've described, because we're looking at a specific line.

But I've been in situations where I've wanted to search for modifications that include a given string, without knowing where those modifications might be, and a pickaxe search returns (somewhat counter-intuitively) no results, because the net number of occurrences of the string is unchanged despite it having moved around. So I think the question is still valid.
Linus's ultimate content tracking tool - gitster's journal

> Recent Entries
> Archive
> Friends
> Profile

Pages at the k.org
The latest blog by Gitster

> Go to Top