
July 14th, 2009


06:34 pm - Fun with git user's survey 2009
Jakub announced this year's survey, so I am relaying it here.  He's done a good job of summarizing the differences from last year's version, which was the only real point on which I found the draft survey a bit problematic.

The survey will be open from 15 July to 15 September 2009.

Please devote a few minutes of your time to fill in this simple
questionnaire. It will help the git community a lot to understand your
needs, what you like about Git, and of course what you don't like about it.

The survey can be found here:

  http://tinyurl.com/GitSurvey2009





July 9th, 2009


11:25 pm - Unfair Advantage
I've been working on a git book for a Japanese publisher since the beginning of this year.  Partly because I had to take a break of about 10 weeks from the book in the middle, progress has been much slower than I had hoped.  But it is coming along.  Hopefully I can finish the initial draft by the middle of this month.  I have even already secured a page or two of foreword by Linus.  Very nice.

In an early part of the book, I wrote that the description is based on v1.6.1, but now we are about to go into the pre-release feature freeze for v1.6.4, so I'll need to revisit these early chapters and update external references.

I was writing the section on the gitattributes last night, and had the pleasure of saying "In the latest version of git, git grep -p would also use the diff.*.xfuncname patterns", knowing that René's patches will be included in the upcoming release.  I'd say this is an unfair advantage I have over other git-book authors ;-)

I have to finish the chapter on common mistakes and recovery procedure now.

 

June 29th, 2009


11:27 pm - Linus's ultimate content tracking tool
I have kept saying that message number 217 in the gmane archive is the most important message in the entire life of the git mailing list. I still think it is, and more importantly, it certainly was one of the most influential messages that shaped the various design decisions in git. Understanding the ideas described in it may reveal the secrets behind different parts of git.

Or it may not. But it is fun thinking and writing about it, so here it is.

In the message, Linus outlined how an ideal content tracking system may let you find how a block of code came into the current shape. You'd start from the current block of code in a file, go back in the history to find the commit that changed the file. Then you inspect the change of the commit to see if the block of code you are interested in is modified by it, as a commit that changes the file may not touch the block of code you are interested in, but only some other parts of the file.

When you find that the block of code did not exist in the file before the commit, you inspect the commit more deeply. You may find one of many possible situations, including:
  1. The commit truly introduced the block of code. The author of the commit was the inventor of the cool feature whose origin you were hunting for (or the guilty party who introduced the bug); or
  2. The block of code did not exist in the file, but five identical copies of it existed in different files, all of which disappeared after the commit. The author of the commit refactored duplicated code by introducing a single helper function; or
  3. (as a special case) Before the commit, the file that currently contains the block of code you are interested in did not exist, but another file with nearly identical contents did, and it held the block of code you are interested in together with everything else the file contained back then. That other file went away after the commit. The author of the commit renamed the file while giving it a minor modification.
In git, Linus's ultimate content tracking tool does not yet exist in a fully automated fashion. But most of the important ingredients are available already.

Perhaps you can help us complete the dream.

1. Fast path-limited revision traversal

In git, the contents recorded by a commit are represented as a tree object, which is itself a recursive structure. In the Linux kernel project, for example, there are 29,000 files stored under 21 top-level directories:
arch/           drivers/   ipc/         mm/             security/
block/          firmware/  Kbuild       net/            sound/
COPYING         fs/        kernel/      README          tools/
CREDITS                    lib/         REPORTING-BUGS  usr/
crypto/         include/   MAINTAINERS  samples/        virt/
Documentation/  init/      Makefile     scripts/
The top-level tree object has 30 entries, many of which record tree objects that represent these subdirectories, and others record blob objects.

Suppose you are interested in the history of a block of code in the file init.c in the arch/x86/mm directory. You would do this:
$ git log v2.6.30 -- arch/x86/mm/init.c
Because the project has nearly 30,000 files, only a small fraction of commits affect this particular file. Even if you count commits that touch other files in the same directory as this file, they are still a very small portion of the whole.

For example, between v2.6.29 (Mar 23, 2009) and v2.6.30 (Jun 20, 2009), there are 12,000 individual commits, but only 10 commits touch the init.c file we are looking at. 144 commits touch the directory it is in (arch/x86/mm/), 1120 commits touch the directory one level higher (arch/x86/), and 2761 commits touch the top-level directory this file is found in (arch/)---which finally reaches only 1/4 of the whole.

Which means that the path-limited revision traversal can skip 3/4 of the commits without reading much. We need to read the commit object itself to learn the top-level tree object, its parent commit object, and the top-level tree object of the parent commit object. Then by comparing the entry for arch/ in the two top-level tree objects, we know that the commit does not touch the arch/ hierarchy without reading anything further, and move to that parent commit (whose tree object we have already read--so we can amortize the cost even more).

This optimization to read only the necessary part of the tree object works recursively, and we only need to read the full depth of the tree for a little more than 1% of the commits (144 commits among 12,000 touch arch/x86/mm directory). Note that even for these commits, we do not need to open other parts of the tree (e.g. sound/ directory) at all. Literally, we need to only read at most four tree objects from each commit to run the above "git log" command.
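The recursive skip described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not git's implementation: the `store` dictionary, object names such as `root1`, and the function `path_changed` are all made up for illustration, and a tree is modelled as a plain mapping from entry name to object name.

```python
def path_changed(store, tree_a, tree_b, path, reads):
    """Tell whether `path` differs between two trees, reading as little as possible.

    `store` maps a (hypothetical) tree object name to its entries;
    `reads` collects the names of the tree objects we had to open.
    """
    a, b = tree_a, tree_b
    for part in path.split("/"):
        if a == b:
            # Same object name on both sides: nothing underneath can differ.
            return False
        if a is None or b is None:
            # The path exists on one side only.
            return True
        reads.extend([a, b])
        a = store.get(a, {}).get(part)
        b = store.get(b, {}).get(part)
    return a != b


# Three toy commits: root1 -> root2 only touches sound/,
# root2 -> root3 touches arch/x86/mm/init.c.
store = {
    "root1": {"arch": "arch1", "sound": "sound1"},
    "root2": {"arch": "arch1", "sound": "sound2"},
    "root3": {"arch": "arch2", "sound": "sound2"},
    "arch1": {"x86": "x86_1"}, "arch2": {"x86": "x86_2"},
    "x86_1": {"mm": "mm1"}, "x86_2": {"mm": "mm2"},
    "mm1": {"init.c": "blob1"}, "mm2": {"init.c": "blob2"},
    "sound1": {}, "sound2": {},
}

cheap, deep = [], []
untouched = path_changed(store, "root1", "root2", "arch/x86/mm/init.c", cheap)
touched = path_changed(store, "root2", "root3", "arch/x86/mm/init.c", deep)
print(untouched, len(cheap))
print(touched, len(deep))
```

In this toy history the first comparison opens only the two top-level trees, while the second opens four trees per commit, and neither ever looks inside sound/.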

2. Merge Simplification

Linus's ultimate content tracking algorithm starts from a commit and goes back through the history, looking for commits that change the parts of a file we are interested in. How does a merge affect this?

For example, suppose we have a history that looks like this:
          *   *
      E---D---C---B---A
     /           /
 ---I---H---G---F

and only commits C and D touch the path we are interested in. Since the branches forked, between I, H, G and F, there is no change to the path. Between I and E there is no change, but D changes the path and C changes it again.

When we create a merge B, the lower branch did not do anything while the upper branch did some modification. A simple 3-way merge rule dictates that we take the contents of C as the result of this merge.
In other words, B has the same contents of the path we are interested in as it appeared in C.

When we have a merge (B in this case) whose content matches that of one of its parents (C in this case), by default we discard the other parents while following the history. In other words, while we traverse the history with path limitation, we pretend that the history looks like this:
          *   *
      E---D---C---B---A
     /
 ---I

At first glance, it may seem unintuitive that we take a parent that is the same as the merge result. But if you think of the lower branch as the "mainline", and the upper branch as a side topic that was created for the sole purpose of updating the path we are interested in, it makes perfect sense. There is a vast difference between F and B because B is a merge of the finished topic from a side branch (i.e. the upper one) that integrates the entire change in a single step. By following the upper branch commit by commit, however, we can get a much finer grained view of what happened to the path we are interested in.

Together with the optimization of path-limited traversal, this technique cuts down the number of commits we need to look at even further.
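The simplification rule can be tried out on the example history with a small Python sketch. It is a toy under stated assumptions and not how git implements it: `history` and `simplified_walk` are hypothetical names, each commit records the content of the one path we care about as a plain string, B is assumed to list F as its first parent, and I is treated as a root.

```python
def simplified_walk(history, start):
    """Walk back from `start`, following only the matching parent at a merge.

    `history` maps a commit name to (list of parents, content of the path).
    Returns the commits visited and the commits that changed the path.
    """
    visited, changed = [], []
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        visited.append(name)
        parents, content = history[name]
        if not parents:
            continue  # toy root commit: nothing to compare against
        same = [p for p in parents if history[p][1] == content]
        if len(parents) > 1 and same:
            # Merge whose result matches a parent: drop the other parents.
            stack.append(same[0])
            continue
        if not same:
            changed.append(name)
        stack.extend(reversed(parents))
    return visited, changed


history = {
    "A": (["B"], "v2"),
    "B": (["F", "C"], "v2"),  # the merge
    "C": (["D"], "v2"),
    "D": (["E"], "v1"),
    "E": (["I"], "v0"),
    "F": (["G"], "v0"),
    "G": (["H"], "v0"),
    "H": (["I"], "v0"),
    "I": ([], "v0"),
}

visited, changed = simplified_walk(history, "A")
print(visited)
print(changed)
```

The walk never visits F, G or H, and it reports exactly the two commits marked with an asterisk in the picture.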

3. Pickaxe

The second ingredient in Linus's ultimate content tracking tool is to find out whether a given commit changes the block of code in a file we are interested in. Suppose that we are interested in the following block of code in arch/x86/mm/init.c:
struct map_range {
        unsigned long start;
        unsigned long end;
        unsigned page_size_mask;
};

We would want to find the commit that changed the contents of the file to make this block of code into its final shape. For that, we do this:
$ git log --pretty=short -S'struct map_range {
        unsigned long start;
        unsigned long end;
        unsigned page_size_mask;
};' v2.6.30 -- arch/x86/mm/init.c

Note that we are not interested in commits for which "git show" output contains the given string. We are only interested in a commit whose tree has this string in the file literally, but whose parent's tree does not. In other words, we do not have to (nor want to) run a textual diff and grep in the result. We count the number of occurrences of the given string in the file in the tree of the commit, and do the same for the commit's parent. If we get a different number, the commit changes the string, which is what we wanted to find. Counting occurrences of a substring is much cheaper than first generating a textual diff and grepping in it (which is not what we want to do anyway).
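The counting rule is simple enough to show as a short Python sketch. It is only an illustration: `pickaxe` and `versions` are made-up names, the history is assumed to be linear, and each commit is reduced to the text of a single file.

```python
def pickaxe(versions, needle):
    """Return the commits where the number of occurrences of `needle` changes.

    `versions` is a list of (commit name, file text), oldest first.
    """
    hits = []
    previous = ""  # the file does not exist before the first commit
    for name, text in versions:
        if text.count(needle) != previous.count(needle):
            hits.append(name)
        previous = text
    return hits


versions = [
    ("c1", "int a;\n"),
    ("c2", "int a;\nstruct map_range {\n};\n"),
    ("c3", "int a;\nint b;\nstruct map_range {\n};\n"),  # touches the file only
]
hits = pickaxe(versions, "struct map_range {")
print(hits)
```

Commit c3 changes the file but leaves the count alone, so it is not reported. No diff is ever generated.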

4. Blame

I actually consider this both a mere "checkbox" item in the sense that we have it only because other SCMs have it (often under the name annotate), and a "cop-out" because doing the ideal "refactoring identification" is harder, even though I believe our blame works much better than in other SCMs.

We start by treating the whole contents of the file as a single block of text we are interested in, and while following Linus's ultimate content tracking ideal to identify the commits that touch the block of text, iteratively break the block down into smaller blocks (until each line becomes its own block) and keep digging the history.

As the result:
  1. Content movement from other files is identified naturally, albeit that we identify only one single source (i.e. we do not notice that the block of text is a result of refactoring five identical copies); and
  2. Wholesale file rename falls out as a trivial and very narrow special case of (1) without recording any rename tracking information at the commit time.
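The "start with one big block and keep breaking it down" idea can be sketched in Python for the simplest possible case. This is a heavily reduced toy and not git's blame: it assumes a linear history of a single file, uses `difflib` from the standard library to decide which lines survive from the parent, and does not look for content moved in from other files. The names `blame` and `versions` are hypothetical.

```python
import difflib


def blame(versions):
    """Assign each line of the newest version to the commit that introduced it.

    `versions` is a list of (commit name, list of lines), oldest first.
    """
    final_lines = versions[-1][1]
    result = [None] * len(final_lines)
    # Maps a line index in the final version to its index in the version
    # currently being examined; lines drop out once they are blamed.
    tracked = {i: i for i in range(len(final_lines))}
    for k in range(len(versions) - 1, 0, -1):
        name, lines = versions[k]
        parent_lines = versions[k - 1][1]
        matcher = difflib.SequenceMatcher(a=parent_lines, b=lines, autojunk=False)
        to_parent = {}
        for block in matcher.get_matching_blocks():
            for offset in range(block.size):
                to_parent[block.b + offset] = block.a + offset
        still_tracked = {}
        for final_idx, cur_idx in tracked.items():
            if cur_idx in to_parent:
                # The line came from the parent unchanged: keep digging.
                still_tracked[final_idx] = to_parent[cur_idx]
            else:
                result[final_idx] = name
        tracked = still_tracked
    for final_idx in tracked:
        result[final_idx] = versions[0][0]  # survived all the way back
    return result


versions = [
    ("c1", ["a", "b"]),
    ("c2", ["a", "x", "b"]),
    ("c3", ["a", "x", "b", "z"]),
]
owners = blame(versions)
print(owners)
```

Each pass hands the surviving lines to the parent and pins the rest on the current commit, which is the single-file special case of the iterative breakdown described above.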

5. GUI

This part of Linus's ultimate content tracking tool is sorely lacking from current git. The overall flow should look like this:
  1. Show the contents of any file in the tree, and let you highlight with mouse/keyboard a block of text in it;
  2. Run the path-limited pickaxe search to find the commit that touches the block of text you showed interest in in step (1). We already have both of the ingredients needed to perform this step efficiently.
  3. Show the change made by the commit found in step (2), and find blocks of text in the files in the trees of the commit's parents that match or are very similar to what we are interested in. We do have a GUI that shows the change made by a commit; we do not (yet) have a good tool to show similar blocks of text that are potential pre-refactoring duplicates, which is what we need to identify refactoring. Perhaps we could run a (possibly fuzzy) grep in the trees of the parent commits. We already have the capability to run grep on a tree object.
  4. Let you go back to step (1), starting from the commit we inspected in step (3) and continue.
Cf. My previous message on this issue.

git-gui's blame frontend does some of this, but not the third step. Same for gitweb's blame interface.
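As a thought experiment for the missing third step, here is a rough Python sketch of a fuzzy search for a block of text across the files of a parent tree. Everything in it is an assumption made for illustration: the function name `find_similar_blocks`, the `files` mapping from path to text, the whitespace normalization, and the 0.9 similarity threshold. It uses `difflib` from the standard library and makes no attempt to be fast.

```python
import difflib


def find_similar_blocks(block, files, threshold=0.9):
    """Report (path, first line number, score) for windows that resemble `block`.

    `files` maps a path to the text of that file in the parent tree.
    """
    block_lines = block.splitlines()
    size = len(block_lines)
    target = "\n".join(line.strip() for line in block_lines)
    hits = []
    for path, text in files.items():
        lines = text.splitlines()
        for start in range(max(len(lines) - size + 1, 0)):
            window = "\n".join(line.strip() for line in lines[start:start + size])
            score = difflib.SequenceMatcher(a=target, b=window, autojunk=False).ratio()
            if score >= threshold:
                hits.append((path, start + 1, round(score, 2)))
    return hits


block = "static int helper(void)\n{\n\treturn 42;\n}"
files = {
    "a.c": "int unrelated;\nstatic int helper(void)\n{\n    return 42;\n}\n",
    "b.c": "int x;\nint y;\n",
}
hits = find_similar_blocks(block, files)
print(hits)
```

A real tool would want to present such candidates next to the commit's diff so that a human can judge whether they are pre-refactoring duplicates.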


June 14th, 2009


02:08 pm - Well, I'm ... sorry...
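While wandering around the net, I came across this.

Here is what it says.

When I asked how it differs from Git, he cut the question down in one stroke: "Git!? Only students use that." Places that do serious large-scale development use Perforce, he said. Hearing this, I felt I understood a little of the nuance when Linus Torvalds, who gave a talk about Git at Google headquarters, learned that Google uses Perforce and said with a wry smile "Well... my apologies" (I'm....sorry...). Perhaps Linus, who up to that point had been heaping abuse on other SCMs (mainly Subversion), meant that he was sorry, that he was no match for Perforce.

A Japanese like me, who tends to leave the ends of his sentences as vague as possible, might say such a thing, but there is no way the blunt, sharp-tongued Linus would. The above is simply an off-the-mark mistranslation (the article rendered it as 「それは、、、悪かった」, an apology); that "I'm... sorry..." means "Well... I feel sorry for you" (「それは、、、お気の毒に」). Otherwise it would not have led to the laughter from the audience that followed.

Well, mistranslations and the misunderstandings based on them are not all that surprising in articles on the net, even on sites run by publishers, so it is not worth making a fuss over each one. But there must be many readers who are made to read an article that starts from such a misunderstanding and come away convinced, so I suppose what I should say is: Well, I'm .... sorry...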

Current Mood: amused

June 11th, 2009


10:17 pm - Fun with dreaming of a new machine
Somebody asked about the spec of new box that will hopefully become the primary development box for git.
Here is what is planned, hardware-wise:

Part        | Description                                                                               | Cost   | Procured?
Case        | Antec Super Mid Tower Case NSK6580B                                                       | $114   | Yes
Fan         | Antec 3Pin Ball Bearing Fan 92MM x 2                                                      | $10    | Yes
CPU         | Intel Core 2 Quad Q9550 Quad-Core Processor, 2.83 GHz, 12M L2 Cache, 1333MHz FSB, LGA775  | $242   | Yes
Motherboard | GIGABYTE GA-X48-DS5 LGA 775 Intel X48 ATX Intel Motherboard                               | $180   | Yes
RAM         | Kingston 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) ECC Unbuffered              | (gift) | Yes
Graphics    | Sapphire Radeon HD3450 512MB DDR2 PCI-Express                                             | $38    | Yes
Optical     | Sony Optiarc DVD/CD Rewritable Drive Black SATA Model AD-7240S-0B OEM                     | $28    | Yes
HDD         | Western Digital Caviar Green 1 TB, SATA II, WD10EADS x 2                                  | (gift) | Yes

This is loosely based on the spec recommended by DJB, with some parts replaced.
The motherboard has two Ethernet interfaces. One will be attached to the primary LAN, and the other will go to the local side of a wireless router (whose Internet side is unconnected) so that I can SSH in from another room over the air.  I do not do multimedia and won't be sitting in front of the machine most of the time to begin with, so I won't need speakers (the sound comes on-board, though), a keyboard and mouse (I'll borrow the ones from my wife's machine while installing) or a monitor (same).

Software-wise, the box will run Debian x86-64, and will host a few others on VMs.

The disks will be partitioned as follows (this is still a plan as I do not have all the hardware yet):

Partitions      | Size | Role             | Array    | Use
/dev/sda1, sdb1 | 0.2G | RAID 1 component | /dev/md0 | /boot
/dev/sda5, sdb5 | 0.5G | swap             |          |
/dev/sda6, sdb6 | 200G | RAID 0 component | /dev/md2 | LVM PV (VG scratch)
/dev/sda7, sdb7 | 800G | RAID 1 component | /dev/md1 | LVM PV (VG stable)

The two LVM volume groups serve the purposes their respective names suggest.

"scratch" holds anything that can be reproduced.
  • Virtual disks for VM instances
  • "git new-workdir" build area for each branch
  • temporary (downloaded ISO images, torrents, etc.)
"stable" is meant to survive one disk crash.
  • /, /usr, /var, /home, ...
  • Project repositories and work trees
  • Daily snapshots of the above, including VM virtual disks.
Hopefully I can maintain at least the following virtual machines to build-check git on from time to time.
  • Debian 5.0 (i386)
  • Fedora 9 (i386); I need to keep one to cut RPMs for k.org
  • Fedora 11 (i386 and possibly x86_64); in preparation for the day k.org upgrades, perhaps
  • FreeBSD 7.2 (i386 and possibly x86_64)
  • OpenBSD 4.5 (i386)
  • OpenSolaris 0811 (i386)
  • Some vintage of Windows to try msysgit out (undecided, as I am not a Windows person)
As I do not have all the parts yet, I have to live with the current machine a bit longer. But waiting and dreaming of things to come is often more fun than what follows once they have actually happened, so I would not exactly say "have to suffer living with the current machine" ;-).





June 9th, 2009


03:12 pm - The new machine getting closer to reality
A fellow git user, Marius, sent me 4GB of memory from my Amazon wishlist.  Thanks!

I've ordered the motherboard, case and CPU myself, and I can recycle peripherals from the current machine, so the new development box for git work is getting closer to reality.  Hopefully I can survive with the current set-up until then.




May 20th, 2009


05:04 pm - Happy git Wednesday
For the past few weeks the day-job managed to swamp me and worse yet it has started eating into my git Wednesdays. But finally I managed to spend the whole day on git today, and I actually did something useful. It felt good to get back into the excitement of a full development cycle for a change.

You think of something, code it up, find issues and go back to the drawing board while getting more and more frustrated in each iteration. After the iteration, passing the preliminary tests is the happiest moment. You start thinking of benchmarks and new tests and the anticipation of demonstrating the improvement successfully grows.


Current Mood: happy

May 14th, 2009


08:56 pm - Happy with http://sourceforge.net/projects/git-core/
Earlier I wrote that for me it was cumbersome to access git service at sourceforge.net, but it seems that I was mistaken.  They do have a page to attach SSH keys to an account.  Perhaps I didn't look deeply enough.

Or perhaps they added it very recently.  But I somehow doubt it.

It really does not matter.  As long as it works, I am happy.





May 9th, 2009


12:55 am - Fun with gpodder, unfun with binary package
Gpodder has been my favourite podcatcher for some time, and I've been running its development version from the git repository.  But somehow my notebook decided to install an older version of the binary package from Ubuntu.

A small disaster.  Between the packaged one (version 0.14 or so) and the latest release (post 0.15) the on-disk format has changed.  The new one is supposed to read and convert the old format when it is first run, but there is no way an older one can convert the new library back to the format it understands.

Luckily, I did not have anything I hadn't already copied to my Sansa Clip/Fuze. I usually do not listen to the same thing twice, nor do I download back issues just in case before I have a reasonable chance of finding enough time to listen to them, so nothing was lost.

While figuring out the cause of the above issue (I didn't know about the on-disk format change, even though I apparently did cross that version boundary while following the development version---it seems the auto-conversion in the forward direction did work well), I noticed a link to the Amazon (de) wishlist page of its author (Thomas Perl), and found that he has Travis Swicegood's git book listed there.  I couldn't resist giving it to him, so I ended up navigating the site semi-blindly (I do not read German, but I can make a guess by comparing the pages with the US Amazon site).





12:43 am - Post 1.6.3 status
Now a feature release is out.

I've been ignoring patches to add new features for the past few weeks, except for saving some of them (the more promising ones) in a separate mailbox to look at later. The official excuse was that we were already deep in the freeze period in the -rc cycle, but there were two real reasons.  One was that the day-job was unusually busy, and the other, which actually was bigger, was that I've been having issues with my development environment (I briefly mentioned running out of disk space and having to shuffle hardware earlier), and obviously the period just before the release is not a good time to swap machines (if I had enough material to build a new machine, that is).

The day-job situation seems to be stabilizing somewhat.  I still have to resolve the development environment situation somehow.

I guess I'll be looking at the backlog of patches this weekend, after rewinding the next branch.




gitster's journal
