Official Kindle subscription to The Guardian

It seems that finally The Guardian have an official Kindle version, that you can subscribe to here:

The Guardian and The Observer [Kindle Edition]

A subscription costs a very reasonable £9.99 per month, and like all the official Kindle periodicals, it appears on your device automatically if wireless is enabled.  It looks as if they’ve done a very nice job – I’d encourage anyone who was regularly using my unofficial Guardian for Kindle project to subscribe to the official version.  As well as the convenience of automatic delivery, you get more complete content (since a number of articles aren’t available for free via the Guardian’s API) and bigger photos for each story.  In addition, with newspapers in general struggling these days, it’s good to be able to support them by paying for this service.

Where does this leave my unofficial version?  My plan is to leave in place the system that generates the web-hosted version each day, but prominently link to the official version on that page.  The version generated from the Guardian’s API might be occasionally useful for people, and I think the project is of interest in its own right – figuring out how to generate Kindle books and then periodicals was hard work, and it’s still a satisfying project to have got working so well in the end.  Even masthead images now work, thanks to a tip from Marco Arment!  (Again I should thank Dominic Evans for figuring out how to switch from generating books to periodicals, and his other contributions.)

I’ve summarized what I’ve learned so far about generating periodicals using Kindlegen in this Stack Overflow answer, and I’ll update that if I find out anything new.

Making an encrypted partition on a USB drive

On Ubuntu or Debian, it’s really simple to create an encrypted partition on a newly-purchased USB mass storage device.  In my case, I had bought a 1TB hard drive which had very mixed reviews, some people saying their drives had failed very early.  I wanted to be able to return the drive under warranty if it broke without worrying about personal data.

It turns out that if you want to reformat a partition on an external USB drive so that it’s encrypted, this is just a matter of doing the following:

sudo luksformat -t ext4 /dev/partitiondevice

…. where /dev/partitiondevice is the device for the drive partition you want to overwrite.  Obviously, this will destroy everything that was previously on that partition.

I like to use a proper filesystem for USB mass storage devices, but if you leave out the -t ext4 then the default is to use VFAT.

When you next plug in that drive, you’ll be prompted to enter the password that you picked when creating the partition – if you type that correctly, the drive will be mounted and usable.  (If you mistype it, you’re not given another chance to enter the password, so you’ll need to go to the command line and do: gvfs-mount -d /dev/partitiondevice to try again.)

One small thing is that the mount point in /media will be based on a UUID by default, but if you set the ext4 partition label, it’ll be mounted under that name in /media/ instead.  To do this, starting from when your disk is mounted, you can run mount without parameters to find the unencrypted device name and then unmount it and change the label:

$ umount /dev/mapper/udisks-luks-uuid-b7bbb2c8-etc
$ e2label /dev/mapper/udisks-luks-uuid-b7bbb2c8-etc topsekrit

If you unplug and plug in the disk again, it should be mounted on /media/topsekrit

Missing git hooks documentation

One part of git’s documentation that is particularly lacking is that on the subject of hooks.  In particular, that page doesn’t explain:

  • What the current working directory is when the hooks run.
  • Which helpful environment variables are set in the environment when the hooks are run.

These omissions are particularly irritating since the current directory is not consistent across the different hooks, and the setting of the GIT_DIR environment variable can cause some very surprising results in some situations.

So, I did some quick tests to find what the behaviour of git 1.7.1 is for each of these hooks, for both bare and non-bare repositories where appropriate.  (If I were more confident about the details of these, I would try to contribute a documentation patch, but that’s probably best done by someone who knows the code well.)

applypatch-msg
post-applypatch
pre-applypatch

(Tested with “git am” in a working tree – it cannot be used with a bare repository for obvious reasons.)

The current directory is the top level of the working tree.  The following environment variables are set:

  • GIT_AUTHOR_DATE, e.g. set to ‘Sat, 9 Apr 2011 10:13:24 +0200’
  • GIT_AUTHOR_NAME, e.g. set to ‘Ada Lovelace’
  • GIT_AUTHOR_EMAIL, e.g. set to ‘whoever@whereever’
  • GIT_REFLOG_ACTION is set to ‘am’

Note that GIT_DIR is not set.

pre-commit
prepare-commit-msg
commit-msg
post-commit

(Tested with “git commit” in a working tree – it cannot be used with a bare repository.)

The current directory is the top level of the working tree.  The following environment variables are set:

  • GIT_DIR is set to ‘.git’
  • GIT_INDEX_FILE is set to ‘.git/index’

post-checkout

(Tested with “git checkout” in a working tree – it cannot be used with a bare repository.)

The current directory is the top level of the working tree.  The following environment variables are set:

  • GIT_DIR is set to ‘.git’

pre-receive
update
post-receive
post-update

These hooks can be run either in a bare or a non-bare repository.  In both cases, the current working directory will be the git directory.  So, if this is a bare repository called “/src/git/test.git/”, that will be the current working directory – if this is a non-bare repository and the top level of the working tree is “/home/mark/test/” then the current working directory will be “/home/mark/test/.git/”.

In both cases, the following environment variable is set:

  • GIT_DIR is set to ‘.’

With a working tree, this is unexpectedly awkward, as described in Chris Johnsen’s answer that I linked to earlier.  If only GIT_DIR is set then this comment from the git man page applies:

Note: If --git-dir or GIT_DIR are specified but none of --work-tree, GIT_WORK_TREE and core.worktree is specified, the current working directory is regarded as the top directory of your working tree.

In other words, your working tree will also be the current directory (the “.git” directory), which almost certainly isn’t what you want.

pre-auto-gc

(Not tested yet.)

Summary

I think the (quite obvious) lesson from this is just:

  • Always test your hooks carefully, probably starting with a script that just echos the current working directory and GIT_DIR.

A related tip is that since the rules about GIT_DIR / --git-dir and GIT_WORK_TREE / --work-tree / core.worktree are so complex, I follow the rule of thumb that if you need to set either one, set both, and make sure you set them to an absolute path.
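As a rough illustration of that rule of thumb, here is a minimal sketch for a receive-type hook in a non-bare repository. The function name set_absolute_git_paths is made up for this example, and it assumes the usual layout where the git directory is the “.git” directory directly inside the working tree; it would not be right for a bare repository.

```shell
# Sketch only: derive absolute values for GIT_DIR and GIT_WORK_TREE from the
# directory a receive-type hook starts in (the git directory), then export both.
# Assumption: the working tree is the parent of the git directory.
set_absolute_git_paths() {
    # $1: the git directory; defaults to the current directory
    GIT_DIR=$(cd "${1:-.}" && pwd -P) || return 1
    GIT_WORK_TREE=$(dirname "$GIT_DIR")
    export GIT_DIR GIT_WORK_TREE
}

# Typical use at the top of a hook, before any other git command:
#   set_absolute_git_paths && cd "$GIT_WORK_TREE"
```

With both variables set to absolute paths, later git commands in the hook no longer depend on which directory they happen to run from.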

In case you’re interested, the test hook I used for this was just:

#!/bin/bash
# Print which hook is running, every variable mentioning GIT, and the working directory.
echo "Running $BASH_SOURCE"
set | grep GIT
echo "PWD is $PWD"

git: Too Many Topic Branches

Another couple of git tips, that might conceivably be useful to someone somewhere :)

git makes it so easy to create topic branches, that it’s easy to lose track of which branches were for what. Here are a couple of recipes that might help with this:

Order branches by last commit date

I often want to order my branches according to how recently I was working on them. We can approximate that by saying we’d like to order the branches by the commit date of each branch tip, and you can do that with git for-each-ref --sort=committerdate. For example, the shell script:


for C in $(git for-each-ref --sort=committerdate refs/heads --format='%(refname)')
do
    git show -s --format="%ci $C" "$C"
done

… produces output like this:

...
2011-02-23 18:57:01 -0500 refs/heads/trainable-seg-gui
2011-03-09 11:25:38 +0100 refs/heads/snt-swing-menus3
2011-03-11 00:44:37 +0100 refs/heads/master

Show branches that introduced changes to particular paths

If that doesn’t turn up the branch I’m looking for, it might be useful to just list those branches that made changes to particular paths (with respect to master). Here’s a similar example of how to do this, in this case looking for all those branches that introduced changes to the path src-plugins/Simple_Neurite_Tracer:


P="src-plugins/Simple_Neurite_Tracer"
for C in $(git for-each-ref --sort=committerdate refs/heads/ --format='%(refname)')
do
    git diff master..."$C" --quiet -- "$P" || echo $C
done

… which produces output like this:

...
refs/heads/sholl-analysis
refs/heads/for-rebasing
refs/heads/sholl-analysis-wip
refs/heads/sholl-analysis-wip2
...

Ordering branches by the last time the branch was changed

This came up in a Stack Overflow question – sometimes you might want to know when the branch pointer was last changed, not the date of the last commit. Jefromi’s answer provides a nice recipe that you could alter to order the branches in that way.

An asymmetry between git pull and push

Although git is an excellent system, which has certainly changed my way of working for the better, occasionally one comes across an inconsistency that seems bizarre. In case you don’t want to read the whole of this post, the one sentence summary would be, “By default, git push origin will update branches on the destination with one with the same name on the source, instead of using the association defined by git branch --track, which git pull origin would use — the config option push.default can change this behaviour.” However, for a more detailed explanation, read on…

Suppose someone has told you that they’ve pushed a topic branch to GitHub that they’d like you to work on. Let’s say that you’ve set up a remote called github for that repository, and the branch there is called new-feature2.  With a recent git (>= 1.6.1) you can just do git fetch and then:

git checkout -t github/new-feature2

… which will create a branch in your repository called new-feature2 based on github/new-feature2, and set various config options to associate your new-feature2 branch with github/new-feature2.  It will also checkout that new branch so that you can start working on it.  However, let’s suppose that you want to give your branch a more helpful name – let’s say that’s “add-menu”.  Then you might instead do:

git checkout -t -b add-menu github/new-feature2

… which has the same effect as the previous command, except for giving the branch a different name locally.  The config options that will have been set by that command are:

branch.add-menu.remote=github
branch.add-menu.merge=refs/heads/new-feature2

The detailed semantics of these config options are given in the branch.<name>.remote and branch.<name>.merge sections of git config’s documentation, but, for the moment, just understand that this sets up an association between your local add-menu branch, and the new-feature2 branch on GitHub.

This association makes various helpful features of git possible – for example, this is how you get this nice information from git status:

$ git status
# On branch add-menu
# Your branch is ahead of 'github/new-feature2' by 5 commits.

It’s also the mechanism by which, when you’re on the add-menu branch, typing:

$ git pull github

… will cause git to run a git fetch, and then merge github/new-feature2 into your add-menu branch.  That’s all very helpful.

So, what happens when you want to push your changes back to the upstream branch?  You might hope that because this association exists in your config, then typing any of the following four commands while you’re on the add-menu branch would work:

  1. git push github add-menu
  2. git push github
  3. git push
  4. git push github HEAD

However, with the default git setup, none of these commands will result in new-feature2 being updated with your new commits on add-menu.  What does happen instead?

1. [wrong] git push github add-menu

In this case git push parses add-menu as a refspec.  “refspecs” are usually of the form <src>:<dst>, telling you which local branch (src) you’d like to update the remote branch (dst) with.  However, the default behaviour if you don’t add :<dst>, as in this example, is explained here:

If :<dst> is omitted, the same ref as <src> will be updated.

So the command is equivalent to git push github add-menu:add-menu, which will create a new branch called add-menu on GitHub rather than updating new-feature2.

2. [wrong] git push github

In this case, the refspec is omitted.  The documentation for git push again explains what happens in this case:

The special refspec : (or +: to allow non-fast-forward updates) directs git to push “matching” branches: for every branch that exists on the local side, the remote side is updated if a branch of the same name already exists on the remote side. This is the default operation mode if no explicit refspec is found (that is neither on the command line nor in any Push line of the corresponding remotes file—see below).

… so the new commits on your add-menu branch won’t be pushed.  However, the changes for every other branch for which there’s a matching name in your repository on GitHub will be!

3. [wrong] git push

Again, we can find in the documentation for git push what happens if we miss out the remote as well:

git push: Works like git push <remote>, where <remote> is the current branch’s remote (or origin, if no remote is configured for the current branch).

In our example case, branch.add-menu.remote is set to github, so the behaviour in this case will be the same as in the previous one, i.e. probably not what you want.

4. [wrong] git push github HEAD

Thanks to David Ongaro for suggesting adding this fourth wrong command. The git push documentation explains that this is:

A handy way to push the current branch to the same name on the remote.

In other words, in this example, that will end up being the same as git push github add-menu:add-menu, again creating an unwanted add-menu branch in the remote repository.

So how should you push?

The simplest option, which will work everywhere, is just to specify both the source and destination parts of the refspec, i.e.:

git push github add-menu:new-feature2

That means that you have to remember what the remote name should be, but it’s the least ambiguous way to push a branch, and in any case it’s a good idea to understand how to use refspecs more generally.
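If remembering the remote branch name is the sticking point, the explicit command can be assembled from the branch.<name>.remote and branch.<name>.merge settings shown earlier. The following is a minimal sketch under that assumption; the function name explicit_push_command is made up for this example, and it only prints the command rather than running it.

```shell
# Sketch only: print the unambiguous "git push <remote> <src>:<dst>" command
# for a local branch, using the association stored in the repository config.
# Fails (prints nothing) if the branch has no upstream configured.
explicit_push_command() {
    branch=$1
    remote=$(git config --get "branch.$branch.remote") || return 1
    merge=$(git config --get "branch.$branch.merge") || return 1
    # branch.<name>.merge holds a full ref such as refs/heads/new-feature2
    printf 'git push %s %s:%s\n' "$remote" "$branch" "${merge#refs/heads/}"
}
```

For the example in this post, explicit_push_command add-menu would print git push github add-menu:new-feature2.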

However, another alternative (available since git version 1.6.3) is to set the push.default config variable.  The documentation for this in the git config man page is:

push.default: Defines the action git push should take if no refspec is given on the command line, no refspec is configured in the remote, and no refspec is implied by any of the options given on the command line. Possible values are:

  • nothing – do not push anything.
  • matching – push all matching branches. All branches having the same name in both ends are considered to be matching. This is the default.
  • tracking – push the current branch to its upstream branch.
  • current – push the current branch to a branch of the same name.

So if you set push.default to tracking with one of:

$ git config push.default tracking # just for the current repository
$ git config --global push.default tracking # globally for your account

… then when you’re on the add-menu branch, git push github will update new-feature2 on GitHub with your changes in add-menu, and no other branches will be affected.

The commit message that introduced this change suggests that the reason that this option was introduced was exactly to avoid the kind of confusion I’ve described above:

When “git push” is not told what refspecs to push, it pushes all matching branches to the current remote. For some workflows this default is not useful, and surprises new users. Some have even found that this default behaviour is too easy to trigger by accident with unwanted consequences.

Personally, I don’t actually use this option, since I use git on so many different systems it would be more confusing to have different settings for push.default on some of them.  However, I hope it’s useful for some people, and it’s a shame that this behaviour couldn’t reasonably be made the default at this stage.

Update: Thanks to David Ongaro, who points out below that since git 1.7.4.2, the recommended value for the push.default option is upstream rather than tracking, although tracking can still be used as a deprecated synonym. The commit message that describes that change is nice, since it suggests that there is an effort underway to deprecate the term “track” in the context of setting this association with the upstream branch in a remote repository. (The totally different meanings of “track” in git branch --track and “remote-tracking branches” has long irritated me when trying to introduce git to people.)

Update (2012-07-20) There has been an ongoing discussion in the git world about what the default behaviour for git push should be, given that the default behaviour is so surprising to newcomers. It seems that the decision is to introduce a new value for push.default, called simple and ultimately make that the default. This decision is described in a commit message as follows:

push: introduce new push.default mode “simple”

When calling “git push” without argument, we want to allow Git to do
something simple to explain and safe. push.default=matching is unsafe
when used to push to shared repositories, and hard to explain to
beginners in some contexts. It is debatable whether ‘upstream’ or
‘current’ is the safest or the easiest to explain, so introduce a new
mode called ‘simple’ that is the intersection of them: push to the
upstream branch, but only if it has the same name remotely. If not, give
an error that suggests the right command to push explicitely to
‘upstream’ or ‘current’.

A question is whether to allow pushing when no upstream is configured. An
argument in favor of allowing the push is that it makes the new mode work
in more cases. On the other hand, refusing to push when no upstream is
configured encourages the user to set the upstream, which will be
beneficial on the next pull. Lacking better argument, we chose to deny
the push, because it will be easier to change in the future if someone
shows us wrong.

Original-patch-by: Jeff King

Signed-off-by: Matthieu Moy

This new possible value for push.default is available in 1.7.11, and will be made the default behaviour in the future (but it isn’t in any released version so far).

The Guardian on your Kindle

Update: It is now possible to buy an official subscription to The Guardian and The Observer. The rest of this post is now largely of historical interest if you just want The Guardian on your Kindle, but I’ve left the rest of the content unchanged for people who are interested in how I generated my unofficial version.


If you just want today’s copy of The Guardian or The Observer for your Kindle, you can download one from this automatically generated page.  This post describes the script that generates the file and the motivation for it.

Since moving to Switzerland, I’ve found that I really miss being able to get The Guardian in the morning on my way into work.  Unfortunately, reading the website on a phone (or any other device) is no substitute if you’re relying on data over the mobile phone networks – one really wants all the articles cached for fast navigation through the paper.  The solution for this should be my shiny new Kindle, but sadly subscriptions to The Guardian aren’t available in the Kindle store.  (There are many other papers available.)  Fortunately, The Guardian has an excellent API for accessing its content, and the lovely interface produced by Phil Gyford for reading the paper in a cleaner interface suggested that I could similarly generate a bare bones version of the paper for my Kindle.  I believe that this is permitted under the terms and conditions of the Guardian Open Platform, since I’m (a) including the advertisement linked to from each article, (b) linking back to the original article and (c) acknowledging that the data is supplied by that service.  If I’ve misunderstood, and in fact this is not allowed, please let me know.

To generate a book in the Kindle’s preferred format, you have to generate a .opf file describing the contents of the book, which refers to other files describing its text, images, structure, etc.  Then you can run a binary called “kindlegen” to generate a .mobi archive from those files that will work on your Kindle.  (The samples in the kindlegen archive and the Amazon Kindle Publishing Guidelines are quite sufficient to figure out how to do this.)  My script to generate the .opf and supporting files is far from elegant, but I’m very happy with the results that it produces – it’s a really lovely reading experience.  You can use the normal page forward / back buttons to go from page to page, while the left and right buttons on the five-way skip to the next article’s headline.  This means you can skip quickly through the articles that you aren’t interested in, but each article you do want to read is presented very clearly on the amazing eInk display:

There are a few articles for which the API won’t return the text, saying that rights for redistribution are not available – I’m still including the other metadata for these articles and the link to the original article, so that you know what’s missing:

At the end of each article is the advertisement image that’s included – this is to comply with the requirements of the Guardian Open Platform:

You can download the generated .mobi file for today’s Guardian (or The Observer on Sunday) from this page:

A Kindle version of today’s Guardian or Observer

You can bookmark that page in your Kindle’s web browser. Then, whenever you select the bookmark and then “Reload”, the page will be refreshed with a link to that day’s generated edition of the newspaper for your Kindle, which you can download straight from that page.

If you’re interested in this project, or have any comments or suggestions, you can contact me by email at:

"mark" followed by a dash, then "guardian" then "kindle" then an "at" sign then longair dot net

… or leave a comment below.  The script for generating this version of The Guardian is available at github.

The Canon PIXMA MP560 on Ubuntu

I’ve resisted getting an all-in-one printer / scanner / copier device in the past, largely due to worrying about the driver situation on Linux, but when I found out that my scanner hadn’t survived the trip to Zürich and we were also without a printer, we risked it and bought a Canon PIXMA MP560.

I say “risked”, because the noise levels in the search results when trying to find out whether this printer would actually work were very high.  In the hope that it will be of use to anyone similarly searching in the future, I thought it might be useful to add the following notes about things that definitely work for us.  (There are lots of things that we simply haven’t tried yet, e.g. printing via USB or scanning via USB, so I can’t comment on what’s involved in getting those to work.)

Update: with these drivers the printer does seem to have trouble printing more complex pages – it’ll only print out the top eighth of the page and then give up.  For example, try printing out the English rules for Tobago (PDF linked from that page).  Rather annoying, and I haven’t had time to report it to Canon yet.

Obviously I can’t provide support if you’re having problems with this printer, so this is just to describe what worked for us on Ubuntu 9.04 (Jaunty Jackalope), 10.04 (Lucid Lynx) and 11.04 (Natty Narwhal).  I’m pretty happy with this printer, and it’s great to see that Canon are supporting Linux users by providing drivers.

Printing over Wireless

The instructions supplied with the printer explain how to get it to connect to your wireless network, which worked fine with WPA2 (AES).

Then you need to go to the drivers page for the PIXMA MP560 on Canon’s website, select “Linux” and “English”, and download “Debian Linux Print Drivers (3.2)”.  That’ll give you a tar file, which you should unpack in a new directory.  This in turn contains a tar.gz archive called cnijfilter-mp560series-3.20-1-i386-deb.tar.gz.  Unpack that and you’ll get the following:

cnijfilter-mp560series-3.20-1-i386-deb/
cnijfilter-mp560series-3.20-1-i386-deb/packages/
cnijfilter-mp560series-3.20-1-i386-deb/packages/cnijfilter-common_3.20-1_i386.deb
cnijfilter-mp560series-3.20-1-i386-deb/packages/cnijfilter-mp560series_3.20-1_i386.deb
cnijfilter-mp560series-3.20-1-i386-deb/install.sh

The “install.sh” script will fail if your setup is like either of ours, with the error message:

==================================================
Canon Inkjet Printer Driver Ver.3.20-1 for Linux
Copyright CANON INC. 2001-2009
All Rights Reserved.
==================================================
Error! Cannot specify package management system.

This error arises when both “dpkg” and “rpm” exist in your path, so you need to edit install.sh to cause the test for “rpm” to fail – find these lines:

## Judge is the distribution supporting rpm? ##                                                                                                                
 rpm --version 1> /dev/null 2>&1                                                                                                                                
 c_system_rpm=$?

… and change “rpm --version” to “rpm-no-thanks --version”, or something.  If you re-run ./install.sh then it should all work OK.  (I’ve put in bold the bits where you need user interaction.)

 : mark@cava:~/Desktop/canon-drivers/cnijfilter-mp560series-3.20-1-i386-deb (master)
 ./install.sh
==================================================
Canon Inkjet Printer Driver Ver.3.20-1 for Linux
Copyright CANON INC. 2001-2009
All Rights Reserved.
==================================================
Execution command = sudo dpkg -iG ./packages/cnijfilter-common_3.20-1_i386.deb
[sudo] password for mark: 
(Reading database ... 490522 files and directories currently installed.)
Preparing to replace cnijfilter-common 3.20-1 (using .../cnijfilter-common_3.20-1_i386.deb) ...
Unpacking replacement cnijfilter-common ...
Setting up cnijfilter-common (3.20-1) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place
/sbin/ldconfig.real: /usr/local/lib/ is not a symbolic link
Execution command = sudo dpkg -iG ./packages/cnijfilter-mp560series_3.20-1_i386.deb
(Reading database ... 490522 files and directories currently installed.)
Preparing to replace cnijfilter-mp560series 3.20-1 (using .../cnijfilter-mp560series_3.20-1_i386.deb) ...
Unpacking replacement cnijfilter-mp560series ...
Setting up cnijfilter-mp560series (3.20-1) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place
/sbin/ldconfig.real: /usr/local/lib/ is not a symbolic link

## Driver packages installed. ##
#=========================================================#
#  Register Printer
#=========================================================#
Next, register the printer to the computer.
Connect the printer, and then turn on the power.
To use the printer on the network, connect the printer to the network.
When the printer is ready, press the Enter key.
> #=========================================================#
#  Connection Method
#=========================================================#
 1) USB
 2) Network
Select the connection method.[1]2
Searching for printers...

#=========================================================#
#  Select Printer
#=========================================================#
Select the printer.
If the printer you want to use is not listed, select Update [0] to search again.
To cancel the process, enter [Q].
-----------------------------------------------------------
 0) Update
-----------------------------------------------------------
Could not detect the target printer.
-----------------------------------------------------------
(Currently selected:[0](Update) )
[At this point I remembered to turn on the printer...]
Enter the value. [0]0
Connect the printer, and then turn on the power.
To use the printer on the network, connect the printer to the network.
When the printer is ready, press the Enter key. > Searching for printers..

#=========================================================#
#  Select Printer
#=========================================================#
Select the printer.
If the printer you want to use is not listed, select Update [0] to search again.
To cancel the process, enter [Q].
-----------------------------------------------------------
 0) Update
-----------------------------------------------------------
Target printers detected (MAC address  IP address)
1) Canon MP560 series (00-1E-8F-64-60-96 192.168.1.5)
-----------------------------------------------------------
(Currently selected:[1]Canon MP560 series (00-1E-8F-64-60-96 192.168.1.5)) 
Enter the value. [1]
#=========================================================#
#  Printer Name
#=========================================================# 
Enter the printer name.[MP560LAN]mp560-again
Execution command = sudo /usr/sbin/lpadmin -p mp560-again -m canonmp560.ppd -v cnijnet:/00-1E-8F-64-60-96 -E
#=========================================================#
#  Set as Default Printer
#=========================================================# 
Do you want to set this printer as the default printer? (yes/no) [yes]no
#=========================================================#
Installation has been completed.
Printer Name : mp560-again
Select this printer name for printing.
#=========================================================#

Then the printer should be set up correctly, and you can test it by going to System -> Administration -> Printing, right clicking on the new printer, selecting “Properties” and clicking on “Print Test Page”.
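The manual edit to install.sh described above could also be scripted. This is a minimal sketch, assuming the probe line looks exactly like the one quoted earlier (optional leading whitespace followed by “rpm --version”); the function name disable_rpm_probe is made up for this example.

```shell
# Sketch only: make the installer's rpm probe fail by renaming the command
# it runs. The original script is kept alongside as install.sh.bak.
disable_rpm_probe() {
    # $1: path to install.sh
    sed -i.bak 's/^\([[:space:]]*\)rpm --version/\1rpm-no-thanks --version/' "$1"
}

# Usage, from the unpacked driver directory:
#   disable_rpm_probe install.sh && ./install.sh
```

If the installer misbehaves afterwards, the untouched copy in install.sh.bak makes it easy to start again.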

Scanning over Wireless

Thanks to Fergus in the comments below for pointing out that this is easy to get working.  First, make sure that you have the package libgimp2.0 installed with:

sudo apt-get install libgimp2.0

Now, from the Canon MP560 drivers page, download “ScanGear for Linux (1.4)”, which gives you a file called MP560_Linux_Scangear.tar.  Unpack that tar file, and you’ll find within it another archive called scangearmp-mp560series-1.40-1-i386-deb.tar.gz.  Unpack that, and change into the newly created scangearmp-mp560series-1.40-1-i386-deb directory.  Then run ./install.sh – the installation should work fine.

Then you can scan with Simple Scan.
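The unpacking steps for the ScanGear download (and for the printer drivers earlier) follow the same pattern: a tar file containing a tar.gz archive. A small sketch of that pattern follows; the function name unpack_nested and the destination directory are made up for this example, and where the inner archive sits inside the outer one is not assumed, so it is located with find.

```shell
# Sketch only: unpack an outer .tar into a directory, then unpack every
# .tar.gz found inside it, each next to where it was found.
unpack_nested() {
    outer=$1
    dest=$2
    mkdir -p "$dest" || return 1
    tar -xf "$outer" -C "$dest" || return 1
    find "$dest" -name '*.tar.gz' | while IFS= read -r inner; do
        tar -xzf "$inner" -C "$(dirname "$inner")" || exit 1
    done
}

# Usage:
#   unpack_nested MP560_Linux_Scangear.tar scangear
#   then change into the unpacked ...-deb directory and run ./install.sh
```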

Scanning to a USB stick

As an alternative to scanning over wireless, you can plug a USB stick into the printer and scan directly to that – just follow the prompts on the display.  It only offers scanning to PDF or JPEG, unfortunately – it would be nice if PNG was another option available when scanning to a USB mass storage device.

git Submodules Explained

I haven’t actually finished the FAQ bit of this post yet, but since I’m not sure when I’ll have time to do so, I’ll just publish it anyway – please let me know in the comments if this is useful for you, or there’s something else you’d like to see included.

Submodules in git are commonly misunderstood in various ways, and although the explanation in the official manual is clear and pretty easy to understand, I thought that a different treatment here might be useful to someone.

What are submodules?

A submodule in a git repository is like a sub-directory which is really a separate git repository in its own right.  This is a useful feature when you have a project in git which depends on particular versions of other projects.  For example, if you’re developing a new Ruby-on-Rails application, you could add a clearly specified version of the Rails repository as a submodule at the path vendor/rails.  The example I’m going to use in this post however, called whatdotheyknow, is one of the various mySociety projects that depend on a repository called commonlib, which contains useful code common to at least one project.  In each project the commonlib repository has been added as a submodule.  (I’ll sometimes refer to the whatdotheyknow repository as the super-project, which I hope is clear.)

It’s important to understand that the repository which contains a submodule knows very little about it except for which version it should be and various bits of information about how to update it.  (More on that below.)  If you change directory into the submodule then you’ll find that it doesn’t know anything about the parent project at all, and you can carry out operations in that repository as if it were standalone.

Before you proceed…

… it’s worth checking what version of git you have.  Many actions that you might perform that relate to submodules are done with the git submodule command, but in older versions of git this has two problems that make it very easy to get confused – I think these are important enough that everyone who uses submodules should be aware of them, and ideally upgrade their copy of git to a version that doesn’t have these problems: at least version 1.6.2.

The first of these is that if you had a typo in the name of a submodule listed on the command line, that would be silently ignored.  The second problem, which compounded this, is that if you spelled the submodule name with a trailing slash (as is common with tab-completion) then that did not refer to the submodule, and due to the previous problem would be ignored.  These were fixed in f3670a5749d70 and 496917b721ada.  (As a small point of interest, to find out which tagged releases had these fixes, I cloned git.git and did git tag --contains 496917b.)

Note also that version 1.7.0 and later versions of git have some annoying differences in behaviour, which are noted below.
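
If you want a script to check this for you, one way is to compare version strings.  This is only a minimal sketch: version_at_least is a made-up helper name, and it assumes a sort that supports -V (version sort), as the one in GNU coreutils does.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Assumes a sort(1) that supports -V (version sort).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# "git --version" prints something like "git version 1.7.0.4";
# the third field is the version number.
installed=$(git --version 2>/dev/null | awk '{print $3}') || installed=""

if [ -n "$installed" ] && version_at_least "$installed" 1.6.2; then
    echo "git $installed includes the submodule name fixes"
else
    echo "git ${installed:-(not found)} is missing or older than 1.6.2"
fi
```

The same helper can be reused to test for 1.7.0, where the behaviour described later in this post changes.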

How are submodules stored?

To answer this you need to understand a little bit about how git stores objects.  If you just want recipes for how to do particular things, then you can skip to “Things You Might Need To Do” below, but I think this section is useful for figuring out problems that might arise.

git’s model of the world is based around objects which are identified by their “object name”, which is the correct term for the SHA1sum hashes you see all over the place.  These objects can be of various types, such as “commit”, “tag”, “blob” (file), “tree” (directory), etc.  Each commit object points to a tree object which represents the state of your source code at that commit.  A tree object in turn consists of a list of objects with some metadata, e.g. as in this example for whatdotheyknow:

$ git ls-tree HEAD^{tree}
100644 blob 1e38e022c1c7d27f6dd9b765793087b59d147ef8    .cvsignore
100644 blob aa5036394edfea0a5dff64e0c53b4e9a026f1beb    .gitignore
100644 blob 4ef4ae8268dcad9b0de371f1aa63bb3ebbeb436a    .gitmodules
100755 blob 44c881fe25b8dc1413d9195677f492121a3789f0    INSTALL.txt
100644 blob 37312d9a1bcc80ac334547f047a2cece38dd24dc    README
100644 blob 3bb0e8592a41ae3185ee32266c860714980dbed7    Rakefile
040000 tree e326ffb3d697e7ac83fa19d93a8a3305120c719e    app
160000 commit fd91ab69279f1e0cfed53353e64811d5aa9c4b5f    commonlib
040000 tree ae93b14ec7ab01ee33053c32eca340a31ce6449f    config
040000 tree 8a7eb4d1552cc2a59fc0528c02fe0fb686d7f562    db
040000 tree 84fae00002a0e834140e2f806978748d50d60c4b    doc
040000 tree eb4089c7989ee846bbd66c97069aeff7853d0064    lib
040000 tree e7bcca0f6d561188730125b228a22a4d7bd68782    public
040000 tree f4e46de68199afa382d53583d83430c691aeb473    script
040000 tree e5772463cfed62ba63cfaf4e0eacecd1dc3895e5    spec
100644 blob bfc265e33e47ffa9796fe7bb7ae7d1fe7e633593    todo.txt
040000 tree 2999c0a790c0033ad93e312c0bc62ecdc9a18f81    vendor

As you can see, typically the types of objects listed in a tree are either blobs or trees, indicating files or subdirectories.  However, if an object of type “commit” is listed (with the mode 160000) that represents a submodule.  The object name (in this case fd91ab69…) is the commit that the submodule’s HEAD should be at.  One implication of this is that that object name usually won’t be known outside the submodule.  This sometimes causes confusion when people do git diff in the super-project and find a difference in the submodule entry, e.g.:

$ git diff
diff --git a/commonlib b/commonlib
index fd91ab6..d6593c6 160000
--- a/commonlib
+++ b/commonlib
@@ -1 +1 @@
-Subproject commit fd91ab69279f1e0cfed53353e64811d5aa9c4b5f
+Subproject commit d6593c6741b29680665b8ae7470e2f80ab9a5977

This output means that the submodule version which is committed in the whatdotheyknow repository is fd91ab69279, but if you change into the commonlib subdirectory, you will find that the HEAD of that repository is at d6593c6741.  Hopefully both of these commits will be known in the commonlib submodule, but neither will be in the whatdotheyknow repository.
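
Since a submodule is just a tree entry with mode 160000, you can pick the submodules out of git ls-tree output with a small filter.  The following is a sketch rather than anything official: list_gitlinks is a made-up helper name, and it reads ls-tree output on standard input.

```shell
# Hypothetical helper: read "git ls-tree" output on stdin and print
# "<path> <object name>" for each submodule entry (mode 160000).
# ls-tree separates the object name from the path with a tab.
list_gitlinks() {
    awk -F '\t' '{
        split($1, meta, " ")            # meta = mode, type, object name
        if (meta[1] == "160000") print $2, meta[3]
    }'
}

# In a real super-project you would run:
#   git ls-tree -r HEAD | list_gitlinks
# Here two entries from the whatdotheyknow listing stand in for that output.
printf '%s\t%s\n' \
    '040000 tree e326ffb3d697e7ac83fa19d93a8a3305120c719e' 'app' \
    '160000 commit fd91ab69279f1e0cfed53353e64811d5aa9c4b5f' 'commonlib' \
    | list_gitlinks
```

With the two sample entries, only the commonlib line is printed, together with the commit that the super-project expects.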

The other information that the super-project holds about the submodule is stored in the .gitmodules file and in config options.

A submodule which is “initialized” will have a config option set to indicate the URL that the submodule should be cloned from if it is missing.  These config options are of the form submodule.<SUBMODULE-NAME>.url, so having initialized the commonlib submodule in whatdotheyknow, I can see the following:

$ git config --list|egrep ^submodule
submodule.commonlib.url=git://git.mysociety.org/commonlib

The .gitmodules file provides sensible default URLs for each submodule, and is committed in the repository like any other versioned file:

$ cat .gitmodules
[submodule "commonlib"]
 path = commonlib
 url = git://git.mysociety.org/commonlib

If you’re publishing a repository with the intention that anyone should be able to clone and use it, you should make sure that the URLs specified in .gitmodules are ones that can be publicly accessed – so don’t, for example, use an SSH URL with your user name in it.  Since these URLs are only used when initializing a submodule, which you typically do only rarely, it’s not a great inconvenience that you may have to change them in order to push changes you’ve made in the submodule.
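
Because .gitmodules uses the same syntax as git’s config files, you can also read it with git config rather than cat, which is handy for checking the default URLs of every submodule at once.  A small sketch, using a scratch copy of the file shown above (the temporary directory is only there to make the example self-contained):

```shell
# Work in a scratch directory with a copy of the .gitmodules shown above.
workdir=$(mktemp -d)
cat > "$workdir/.gitmodules" <<'EOF'
[submodule "commonlib"]
    path = commonlib
    url = git://git.mysociety.org/commonlib
EOF

# -f reads the named file instead of the repository's own config;
# --get-regexp prints "<key> <value>" for every key matching the pattern.
urls=$(git config -f "$workdir/.gitmodules" --get-regexp '^submodule\..*\.url$')
echo "$urls"
```

In a real super-project you would just point -f at the .gitmodules file at the top level of the working tree.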

Things You Might Need To Do

This section lists some simple recipes for doing all kinds of things with submodules.  If you think there’s something I should add, please let me know.  For the sake of simplicity, in the examples below, I’m not listing submodule paths explicitly at the end of git submodule commands, which generally means that the action applies to all of the submodules.  (The exception is git submodule add, which of course only applies to a single submodule.)

Get a working submodule version after cloning

If you’ve just cloned a repository which contains submodules, you can initialize and clone all of them with:

git submodule update --init

This is the equivalent of running:

git submodule init
git submodule update

With version 1.6.5 of git and later, you can do this automatically by cloning the super-project with the --recursive option:

git clone --recursive git://github.com/mysociety/whatdotheyknow.git
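
If you want to try these commands without touching a real project, the following sketch builds a throwaway library and super-project under a temporary directory, and then brings up the submodule in a fresh clone.  The names lib, super and clone are made up for the illustration, git -C assumes git 1.8.5 or later, and the protocol.file.allow setting is only there because recent versions of git refuse to clone submodules from local paths by default.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# Identity for the throwaway commits, so no global git config is needed.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

# Only needed because this demo clones the submodule from a local path.
allow="-c protocol.file.allow=always"

# 1. A library repository with one commit.
git init -q lib
echo "shared code" > lib/common.txt
git -C lib add common.txt
git -C lib commit -q -m "Add common.txt"

# 2. A super-project that uses the library as a submodule.
git init -q super
git -C super $allow submodule --quiet add "$sandbox/lib" lib
git -C super commit -q -m "Add lib as a submodule"

# 3. A fresh clone: the submodule directory exists but is empty ...
git clone -q super clone
# ... until the submodule is initialized and updated.
git -C clone $allow submodule --quiet update --init
ls clone/lib    # should now list common.txt
```

Everything lives under the temporary directory, so you can delete it afterwards without affecting anything else.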

See the status of all the submodules

Running git submodule without arguments defaults to running git submodule status, which produces a helpful summary of the status of all your submodules.  Each line begins with a space, a ‘+’ or a ‘-’ which indicate the following things:

+
The version checked out in the submodule is different from that specified in the super-project. The object name shown is that of the commit that the submodule is currently at.  (The meaning of this symbol changed in 1.7.0.)
-
The submodule hasn’t been initialized or there’s no repository at the submodule path (e.g. if you’ve run git submodule init but not git submodule update, or you’ve later deleted the submodule directory from the working tree). The object name shown is the commit that’s specified in the super-project.
[space]
The submodule’s HEAD is at the correct version – the object name shown is that version.

In projects with many submodules this can be a helpful way to see at a glance where all your submodules are at.  For example, here’s some output from a version of the Fiji project that I’m working on:

 bbff1fd4545b3a614b14eb0770ac6028b648746d AutoComplete (bbff1fd)
+16dcf52ef2106cc92ba89c90b6b5f457bc7619ea ImageJA (heads/current-147-g16dcf52)
 5bfc9eb779d39e38c23ce1c3b01b49953ebd8463 RSyntaxTextArea (5bfc9eb)
 b9f11849599d536528c26bc599dbec4609d77dc4 Retrotranslator (remotes/origin/master)
 90287f0250542be256f67ade4e29a618bf6e688f TrakEM2 (0.7m-227-g90287f0)
+f25db2a43b95480c780d865323fce659a1135c2d VIB (tracer-1.4.0-candidate-849-gf25db2a)
 e4d3eb47a8f9d4e62d1f356636652c3ecc739d92 batik (remotes/origin/svn/git-svn@216063-588-ge4d3eb4)
 79de599df2550f2813fd449505b6fa55ca08cbb3 bio-formats (remotes/origin/contrib-380-g79de599)
 e73abece1ebf3a4aba22104ae9452b2b816ab0d7 clojure (remotes/origin/HEAD)
 39618b6d881fb0c3b52de4929aa34134bb32ffdb clojure-contrib (remotes/origin/master)
 9fa7f4d993f57e27e3134b016c7d36fbfd33e34c ij-plugins (9fa7f4d)
 7ffa48359cdbf7a47735b719a605ea322c58d694 java/linux (heads/master)
-cc218f05fdc0bb55f40f904d5d1f804e8751d0d2 java/linux-amd64
-4f3964234f4e6fd78247e5e7fad9c8becad53e8f java/macosx-java3d
-e79c51473df06f00d4ba9c913afe27e675f71d64 java/win32
-54e735c6c9bac65fcc889bc9e833213f19c7458a java/win64
 b362c662f79763c7927a2ba486243ccefa9222a1 junit (obsolete-cvsimport)
-9ae38d4bde196fa6a4595aebed9f218d4ec591bc jython
 c6e929a15d77545f03ea4883bf033e13c632ef12 live-helper (1.0.4-1-43-gc6e929a)
 79d369af87c4412a47f7065938fe18befc0a183e mpicbg (remotes/origin/trakem2-30-g79d369a)
 20ab0539cc248c642982fdf1330325636d8c55c0 tcljava (tcljava-141-2007-06-06-6-g20ab053)
 a7bfed6752ea1aeac73db386411329486e339f94 weka (a7bfed6)
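
With this many submodules it can be useful to reduce the listing to a few counts.  Here is a rough sketch that does that from the first character of each line; summarize_status is a made-up helper name, and anything other than a space, ‘+’ or ‘-’ is simply counted as “other”.

```shell
# Hypothetical helper: read "git submodule status" output on stdin and
# count the submodules in each state, judged by the first character of
# each line (space, "+" or "-").
summarize_status() {
    awk '{
        flag = substr($0, 1, 1)
        if (flag == "+") changed++
        else if (flag == "-") missing++
        else if (flag == " ") ok++
        else other++
    }
    END {
        printf "ok=%d changed=%d uninitialized=%d other=%d\n",
               ok, changed, missing, other
    }'
}

# Real use: git submodule status | summarize_status
# Three lines from the Fiji output above serve as sample input here.
printf '%s\n' \
    ' bbff1fd4545b3a614b14eb0770ac6028b648746d AutoComplete (bbff1fd)' \
    '+16dcf52ef2106cc92ba89c90b6b5f457bc7619ea ImageJA (heads/current-147-g16dcf52)' \
    '-9ae38d4bde196fa6a4595aebed9f218d4ec591bc jython' \
    | summarize_status
```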

Update submodules to the versions specified in HEAD

If you change the HEAD of your super-project (e.g. with git pull, or by checking out a new branch) you may find that your submodules are now at the wrong versions.  (You can check with git submodule status as shown above.)  If you’re not actively working on the submodules, then the simplest way to move them to the right versions is with:

git submodule update

If any initialized submodules are missing, this will clone them.  For each other submodule whose repository already exists, this will change into its subdirectory, run git fetch (to make sure all the most recent updates are present) and then git checkout the correct version.  This has the effect of “detaching HEAD” in each submodule, so if you want to work on a branch in any of those subdirectories, you’ll have to git checkout to a branch.

The most frequent errors that you’ll find when running git submodule update are likely to be due to someone having created a commit in the super-project that references a commit in the submodule that they’ve forgotten to push, so check that whenever you get errors about not being able to find particular versions.
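
One way to check for this from the super-project is to look up the commit recorded for the submodule and ask the submodule’s repository whether it has that object.  The sketch below does that in a throwaway pair of repositories; submodule_commit_present, lib and super are made-up names, git -C assumes git 1.8.5 or later, and the protocol.file.allow setting is only needed because the demonstration clones from a local path.

```shell
set -e

# Hypothetical helper: run from the top level of a super-project.  Succeeds
# if the commit recorded in HEAD for the submodule at path $1 is present
# in that submodule's own repository.
submodule_commit_present() {
    wanted=$(git ls-tree HEAD "$1" | awk '$1 == "160000" { print $3 }')
    [ -n "$wanted" ] && git -C "$1" cat-file -e "${wanted}^{commit}" 2>/dev/null
}

# Throwaway demonstration: a library and a super-project that uses it.
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
git init -q lib
echo "shared code" > lib/common.txt
git -C lib add common.txt
git -C lib commit -q -m "Add common.txt"
git init -q super
cd super
git -c protocol.file.allow=always submodule --quiet add "$demo/lib" lib
git commit -q -m "Add lib as a submodule"

if submodule_commit_present lib; then
    echo "lib: recorded commit is available"
else
    echo "lib: recorded commit is missing; was it ever pushed?"
fi
```

If the helper fails in a real project even after running git fetch in the submodule, the most likely explanation is the unpushed commit described above.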

Versions of git after 1.6.4 add the --merge and --rebase options to git submodule update to allow more flexible ways of updating your submodules while you’re working on them.

Add a new submodule to a repository

This is nice and easy to do from a URL.  For example, if you wanted to create a new mySociety project called “create robot MP”, and add commonlib to it, you would just use git submodule add:

$ mkdir createrobotmp
$ cd createrobotmp
$ git init
Initialized empty Git repository in /home/mark/tmp/createrobotmp/.git/
$ git submodule add git://git.mysociety.org/commonlib commonlib
Initialized empty Git repository in /home/mark/tmp/createrobotmp/commonlib/.git/
remote: Counting objects: 5240, done.
remote: Compressing objects: 100% (1974/1974), done.
remote: Total 5240 (delta 3311), reused 5038 (delta 3197)
Receiving objects: 100% (5240/5240), 1020.36 KiB | 377 KiB/s, done.
Resolving deltas: 100% (3311/3311), done.
$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#    new file:   .gitmodules
#    new file:   commonlib
#

Then you need to stage and commit .gitmodules and commonlib as with any other new files.  Since this puts the URL in the .gitmodules file, you should make this a publicly clonable URL, as mentioned above.

Change the remote for a submodule

If you frequently work in a submodule you might want to change the default remote “origin” to refer to a URL that you can push to, just so you can use one remote for everything.  You can do this by deleting origin and adding it back with a new URL, with e.g.:

$ cd commonlib
$ git remote rm origin
$ git remote add origin ssh://mark@git.mysociety.org/data/git/public/commonlib.git
$ git remote -v
origin    ssh://mark@git.mysociety.org/data/git/public/commonlib.git

However, you’ll find that two helpful config options will have been deleted when removing and adding back origin, so you’ll want to add these back.

$ git config branch.master.remote origin
$ git config branch.master.merge refs/heads/master

These config options set up the helpful defaults for git pull when you’re on master.
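
Since version 1.7.0, git remote has also had a set-url subcommand, which changes the URL in place and so leaves the branch.master.* options alone.  A minimal sketch in a scratch repository (the URLs are the ones from the example above, and nothing is actually fetched):

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Recreate the starting point: an origin remote plus the two branch options.
git remote add origin git://git.mysociety.org/commonlib
git config branch.master.remote origin
git config branch.master.merge refs/heads/master

# Change the URL in place rather than removing and re-adding the remote.
git remote set-url origin ssh://mark@git.mysociety.org/data/git/public/commonlib.git

git config remote.origin.url        # prints the new URL
git config branch.master.remote     # still prints "origin"
```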

If you’re in the habit of deleting whole submodules, and then recreating them with git submodule update then you should also make sure that you change the URL in the super-project’s config settings, e.g.:

$ git config --list|egrep submodule
submodule.commonlib.url=git://git.mysociety.org/commonlib
$ git config submodule.commonlib.url ssh://mark@git.mysociety.org/data/git/public/commonlib.git
$ git config --list|egrep submodule
submodule.commonlib.url=ssh://mark@git.mysociety.org/data/git/public/commonlib.git

Initialize a submodule with a non-standard URL

If you know in advance that you want to clone your submodules from a URL different from that specified in .gitmodules (e.g. with a private SSH URL that you can push to) then after cloning the super-project you can set the appropriate config by hand before running git submodule update.  This takes the place of the git submodule init command, for example:

$ git clone ssh://mark@git.mysociety.org/data/git/public/whatdotheyknow.git
[...]
$ cd whatdotheyknow
$ git config submodule.commonlib.url ssh://mark@git.mysociety.org/data/git/public/commonlib.git
$ git submodule update
Initialized empty Git repository in /home/mark/tmp/whatdotheyknow/commonlib/.git/
remote: Counting objects: 5240, done.
remote: Compressing objects: 100% (1974/1974), done.
remote: Total 5240 (delta 3311), reused 5038 (delta 3197)
Receiving objects: 100% (5240/5240), 1020.36 KiB | 533 KiB/s, done.
Resolving deltas: 100% (3311/3311), done.
Submodule path 'commonlib': checked out 'a901c2a431f7869f5c2eaee5808f8590ca78544e'
$ cd commonlib/
$ git remote show origin
* remote origin
 URL: ssh://mark@git.mysociety.org/data/git/public/commonlib.git
 HEAD branch: master
 Remote branch:
 master tracked
 Local branch configured for 'git pull':
 master merges with remote master
 Local ref configured for 'git push':
 master pushes to master (up to date)

Modified submodules in 1.7.0 and later

Versions 1.7.0 and later of git contain an annoying change in the behaviour of git submodule.  Submodules are now regarded as dirty if they have any modified files or untracked files, whereas previously it would only be the case if HEAD in the submodule pointed to the wrong commit. Why is this annoying?  The following reasons:

  • Firstly, the meaning of the plus sign (+) in the output of git submodule has changed, and the first time that you come across this it takes a little while to figure out what’s going wrong, for example by looking through changelogs or using git bisect on git.git to find the change.  It would have been much kinder to users to introduce a different symbol for “at the specified version, but dirty”.
  • git status is now very slow in projects with several large submodules.  (git status used to be nearly instant in a clone of fiji.git but trying just now with 1.7.0.4 took an incredible 45 seconds.)

This seems like a change that was introduced without considering the surprise and impact that it would have on users.  In any case, I’ve added this note here since if you work with submodules, you may need to be aware of this change in behaviour.

The output of git diff has changed as well, to add “-dirty” to the object name if the working tree of that submodule is dirty:

$ git diff imglib
diff --git a/imglib b/imglib
--- a/imglib
+++ b/imglib
@@ -1 +1 @@
-Subproject commit c5c6bbaf616d64fbd873df7b7feecebb81b5aee7
+Subproject commit c5c6bbaf616d64fbd873df7b7feecebb81b5aee7-dirty

Update: Thanks to VonC, who points out in the comments below that in git 1.7.2 there is now a “--ignore-submodules” option to git status which can restore the old behaviour and also provides the useful option that only changed files (not untracked files) cause the submodule to be shown as dirty.

Removing a submodule

There are instructions for the several steps required to remove a submodule at the bottom of this page:

http://git.wiki.kernel.org/index.php/GitSubmoduleTutorial

Superficially Improving Google Reader

This is really a post about adapting Helvetireader for netbook-sized screens, but I can’t resist adding a short rant about online RSS aggregators first…

Why Google Reader?

Once upon a time, on the recommendation of Need To Know, I started using Bloglines to keep track of blogs and anything else that published an RSS feed.  It worked pretty well, but had a number of usability problems that were fixed by the excellent Bloglines Beta.  Well, it was excellent apart from one thing: at some point (one or two years ago?) feeds stopped being updated in a timely fashion.  I reported this as a bug but, in contrast to all my previous bug reports to them, I never even received a confirmation email back.  Problems of this type have been widely reported but as far as I know this is still unfixed.  Unfortunately, while I loved the interface of Bloglines Beta this lack of updates meant that it had effectively become useless, so I’ve switched to Google Reader – the other reasonable alternative when I was considering a switch seemed to be “NewsGator Online”, but that failed to import my rather large list of feeds and the service has now shut down anyway.

It’s very sad that Bloglines’s previously excellent service seems to be now so unloved by its current owners and maintainers.

Google Reader’s Shortcomings

(Update: originally this paragraph had a short complaint about how Google Reader couldn’t be configured in a way that I was happy with, but then Jenny pointed out that in fact it could :)  In short, the following settings are the ones that I like:

  • Select “New Items”, which is a toggle for the view, applying to everything until you change it back to “All Items”.
  • (Tedious) Go to every feed and folder in the subscription list on the left and select from the drop-down menus “Sort by oldest”.

Annoyingly the “Sort by oldest” option then only shows the unread items in the last 30 days, ordered chronologically – but this is still much better than the default configuration, in my opinion.)

There are also some superficial annoyances that fortunately can be easily fixed:

  • The appearance of the page is noisy and inelegant.
  • The layout works badly on small screens, such as my netbook.

At some point I was told about Jon Hicks’s “Helvetireader” redesign of the CSS for Google Reader.  This is a vast improvement in the appearance of the application, I think, and it hides much of the noise in the interface that I don’t care about – for example, here are some “before and after” screenshots:

screenshot-basic

And with Helvetireader:

screenshot-helvetireader

However this still uses up a lot of screen real estate at the top of the page for elements that I don’t care about, such as the search box, login details, social features, etc.  Also, on my netbook the pane on the left is so large that you can’t get a single Dinosaur Comics visible in the right :)  So I added a few more instances of display: none to the Helvetireader CSS and adjusted the sizes of the main panes.  The only non-obvious bit about this was that there’s some bit of Google’s javascript that sets the display property of the search box back to block – fortunately, if you use Greasemonkey to change the ID of this element to something else, then it can’t be found by ID and changed back.  This is a bit horrible, but doesn’t seem to obviously break anything else in the interface…

screenshot-helvetireader-reduced

If you want to try this out, then the Greasemonkey script is here, and that points to the CSS here.  Those are tracked in a github gist in case you want to see all the changes back to the original Helvetireader.  This works well in Chromium (so presumably Google Chrome too) and is OK in Firefox, but for an odd bug where the window doesn’t redraw correctly after changing the CSS – you can resize the window to force a redraw, though.

Update: the screenshot there shows a slightly older version of my customization – since then I’ve added back some options for feed settings, etc.

I rather like this approach to changing the appearance of websites to match your needs.  You can find many inventive examples of this idea at userstyles.org.
