# Hashing Flickr Photos

I used to host my photos with a simple set of CGI scripts that basically worked well enough for my simple requirements.  Such web applications are easy and fun to write, but in the end I decided that it wasn’t worth it because:

• Hosting large amounts of data on a generic shell account is typically quite expensive.  Flickr’s “pro” account subscription is a very good deal in comparison: as long as each photo is beneath 20 megabytes in size, you can upload as many as you like for $24.95 a year.
• The community aspect of sites like Flickr is very encouraging – it’s lovely to have random people say nice things about your photographs, and occasionally have people use them in articles, etc. (Some people are put off from using Flickr by the appearance of the site, but its API means that there are plenty of alternative front-ends for viewing or presenting your photos, such as flickriver.)

The slight problem with switching to hosting on Flickr was that previously I’d indexed all my photos by the MD5sum of the original image, so several of my pages had links or inline images that pointed to an MD5sum-based URL on the old site. It occurred to me that it might be useful in general to have “machine tags” on each photo with a hash or checksum of the image, so that, for example:

• You can simply check which photos have already been uploaded.
• You can find URLs for all the different image sizes, etc. based on the content of the file.

Unfortunately, I hadn’t done this when uploading the files in the first place, so I had to write a script (flickr-checksum-tags.py) which takes the slightly extraordinary step of downloading the original version of every photo that doesn’t have the checksum tags to a temporary file, hashing each file, adding the tags and deleting the temporary file. This adds tags for the MD5sum and the SHA1sum, using a namespace and keys suggested in this discussion, where someone suggests taking the same approach.
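The hashing step can be sketched in a few lines of Python. This is only an illustration, not the code from flickr-checksum-tags.py: the function name `checksum_tags` is made up for this example, and it deliberately leaves out downloading the photo and talking to the Flickr API.

```python
import hashlib
import io


def checksum_tags(fileobj, chunk_size=64 * 1024):
    """Return machine tags for the MD5 and SHA-1 of a binary file object."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read in chunks so that large originals need not fit in memory.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        md5.update(chunk)
        sha1.update(chunk)
    return [
        "checksum:md5=" + md5.hexdigest(),
        "checksum:sha1=" + sha1.hexdigest(),
    ]


if __name__ == "__main__":
    # Stand-in for a downloaded original; a real run would open the
    # temporary file in "rb" mode instead.
    for tag in checksum_tags(io.BytesIO(b"example image bytes")):
        print(tag)
```

Computing both digests in a single pass means each downloaded file only has to be read once.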
These tags are of the form:

    checksum:md5=c629c63f8508cfd1a5e6ba6b4b3253a8
    checksum:sha1=df44fc771660fbe7a2d6b2e284ae61e9ed3e377c

The same script can return URLs for a given checksum:

    # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --short
    > http://flic.kr/p/7oQxqK
    # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 -p
    > [... the Flickr photo page URL, which WordPress insists on turning into an image ...]
    # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --size=b
    > http://farm3.static.flickr.com/2552/4196574615_491c6387f8_b.jpg

The repository also has a script to pick out files that haven’t been uploaded, and a simple uploader script which will upload an image and add the checksum tags. The scripts are based on the very useful Python flickrapi module and you’ll need to put your Flickr API key and secret in ~/.flickr-api.

Anyway, these have been useful for me so maybe of some interest to someone out there…

# Most “Controversial” Guardian Crosswords

I contribute to a collaborative blog called fifteensquared (under the name mhl) where each day people explain the clues for the Guardian, Observer, Independent, and Financial Times crosswords, as well as a number of more difficult and specialized crosswords. This is a great way to improve at cryptic crosswords – each day you have your best go at the puzzle and then can find out from the blog what you were missing and why.

One interesting aspect of this is that the posts on certain crosswords generate far more comments than others. The Guardian week-day crosswords have substantially more comments than any other category, so all of these examples are from these posts. I wouldn’t take this measure of “controversy” all that seriously, since I’m not compensating for the overall variation of the number of comments: there used to be very few comments on any day, and there have been various periods where off-topic chatter has been more strongly discouraged.
Nevertheless, I thought it might be interesting to do a post on what it was that made these crosswords particularly “controversial”. I’ve started with those that got 53 or more responses, which is an arbitrary threshold designed to include the memorable Auster crossword that had the answer “HUMP THE BLUEY” :)

Of course, the crosswords that people like most tend not to get nearly so many comments as those where there’s ambiguity in interpretation, or disagreements over the fairness – it would be nice if the site had a simple mechanism for rating crosswords so that it would be easy to pick out the really great ones.

It has been fun to read through these posts again, and be reminded that while it can be tiresome to read lots of complaints about a particular crossword, there are plenty of commenters on fifteensquared who consistently add lots of interest and fun to doing the puzzles that I would otherwise miss.

## 53 comments:

### Guardian 24544 / Auster

I still think that this crossword had an over-the-top reaction, which was largely because of two clues:

• “What Aussie swagmen do to obey Jesus’ instruction? (John 5:8) (4,3,5)” => HUMP THE BLUEY. To quote from my post, this is a ‘[d]ouble definition, the first of which is rather difficult: Chambers defines “hump the bluey” as “(Aust) to travel on foot, carrying a bundle of possessions”, and Jesus’s instruction in John 5:8 is “Rise, take up thy bed, and walk” (King James Version)’
• “Place where night finally gives way to day in a line between the poles (7)” => EQUADOR. It seems that this was an error, since it was later changed in the online version to “Place where Queen is replaced by Charlie and night finally gives way to day in a line between the poles (7)” => ECUADOR.

Otherwise the puzzle was rather easy, with a very high number of anagrams. The more I think about the former clue, the more it makes me smile, so even though I couldn’t solve it unaided I’m glad it was in the crossword.
## 53 comments:

### Guardian 24,621 / Araucaria

[ fifteensquared post by Eileen | original crossword ]

This puzzle, which the majority seemed to enjoy, had a theme of two bicentenaries: the births of Charles Darwin and Abraham Lincoln. It’s unusual in this list because there were very few criticisms of the puzzle, but still many comments. (Quite a lot of the comments were rather off-topic, but not in a way that I thought was inappropriate.)

## 53 comments:

### Guardian 24643 / Gordius

[ fifteensquared post by Uncle Yap | original crossword ]

This crossword had a couple of mini-themes relating to euphemisms and a scattering of biblical clues. The most discussed clues in this one were:

• “7 for what Saul did in Engedi (8,3,4)”, where the answer to 7 is EUPHEMISM => COVERING ONES FEET. A very difficult clue, where the answer is an obscure euphemism for defecating taken from the story of Saul relieving himself in 1 Samuel 24 verse 3 – the Hebrew is literally “covered his feet”.
• “Gent’s son, perhaps is (7)” => BELGIAN. A very tough cryptic definition, where you need to know that “Gent” is the native spelling of Ghent, and the “[blah]’s son” or “son of [blah]” expression to mean “someone from [blah]”.
• The clue for 15 across was missing in some versions.
• “Aramaic skull by barbarian in gaolbreak” => GOLGOTHA. There’s some discussion of which languages Golgotha means “place of the skull” in, and “gaolbreak” for (GAOL)* is, as you would expect, regarded by some as unfair.
• There were a number of answers with difficult vocabulary, in particular DIABASE, LARBOARD and ANIMADVERT.

I defended a few of these in the comments at the time, but in retrospect I think this was too tough for a daily puzzle – it probably justified the number of comments.
## 53 comments:

### Guardian 24,801 / Rover

[ fifteensquared post by Andrew | original crossword ]

This was considered easy by most, but with many clues that people felt were either unsound or unsatisfactory in some way, e.g.:

• “Man from Naples is rescued from a riot (5)” => MARIO. (Unless “from” is doing double duty, there’s no indication that this is a hidden answer – even if it’s just “‘s” or “of” there needs to be something extra there…)

Other complaints were of clues that weren’t cryptic enough or double definitions where both parts were very similar.

## 55 comments:

### Guardian 24,766 / Araucaria

[ fifteensquared post by manehi | original crossword ]

The large number of comments here were mostly genuine discussions of how to parse the clues, e.g. in what sense “of the French” can be DE, or whether ALBAT sounds the same as “Albert”. This was just a difficult puzzle for a weekday, I think, with some typically Araucarian touches in the cluing.

## 55 comments:

### Guardian 24,638 / Chifonie

[ fifteensquared post by Eileen | original crossword ]

A quite do-able crossword, and most of the discussion is taken up with the question of whether particular abbreviations are reasonable, in particular:

• whether “relation” to give PI should be allowed
• the (rarely seen) T for “Troy”, referring to an abbreviation for the unit of weight sometimes used for precious metals

The former question comes up quite frequently, and unfortunately tends to provoke tedious discussion. In the online archive it’s only been used by Chifonie, as far as I can tell.

## 56 comments:

### Guardian 24,872 / Araucaria

[ fifteensquared post by Eileen | original crossword ]

This puzzle was very well received by the regular contributors for its humour and a couple of entertaining liberties.
The controversy here was generated by a comment from don which rewrote negative comments from the previous day’s blog post to apply to this puzzle, to make the point, as I understand it, that reactions to puzzles are strongly biased by the name of the setter.

## 59 comments:

### Guardian 24,620 / Chifonie

[ fifteensquared post by Andrew | original crossword ]

The comments consist largely of off-topic banter (e.g. about Paul’s clue competition), except for these issues:

• “Archbishop hit little girl (7)” => LAMBETH: it turns out that LAMBETH being synonymous with “the Archbishop of Canterbury” is supported by Collins and the OED.
• The clue “Ring student beset by siren (5)” => CIRCE provoked comments that Circe was not a Siren, but it was later argued that she was still a siren in the less specific sense :)

## 64 comments:

### Guardian 24,777 / Araucaria

[ fifteensquared post by Andrew | original crossword ]

A puzzle with a mini-theme of literature, which generated a bit of generic outrage. I think that two of the clues which upset people were genuinely sub-standard, though:

• “Poem cut and edited to be on standby (4)” => IDLY, which is IDYL[l] then “edited” to rearrange L and Y. As well as that rather indirect construction, I don’t think one can substitute IDLY for “standby”, “on standby” or “to be on standby” in a sentence.
• “Cheat to ask for oil over the water, say (7)” => BEGUILE. The most convincing two options were BEG = “ask” + UILE = sounds like the French for oil (“huile”) or an Irish pronunciation of “oil”. In either case “over the water … say” would indicate a non-mainland homophone.

## 66 comments:

### Guardian 24,734 / Enigmatist

[ fifteensquared post by Andrew | original crossword ]

The controversy here mostly arose from some of the answers and constructions being very hard: e.g. it contained the words COLOSTRUM, RELIEVO, MONOPHTHONG, EREMITE and ELEMI and several more that are less than obvious.
On the other hand, many of the clues had Enigmatist’s characteristic humorous touches – I remember laughing at several of them. The constructions that provoked the most discussion were:

• “River spot in which I’ll get lost, say. No circumnavigating Backs (7)” => YANGTZE. C. G. Rishikesh explains this as EG = “say” + NAY = “no” around (“circumnavigating”) Z[i]T = “spot in which I’ll get lost”, with “Backs” indicating reversal of everything.
• “Man by joiner in a whirl? The reverse (9)” => ALEXANDER, which IanN14 explains as A + REEL = “whirl” reversed around X = “by” + AND = “joiner”.

Unfortunately the end of the discussion degenerated into some bad-tempered back-and-forth, tangentially related to the frequently seen “cattle” meaning of “neat”.

## 70 comments:

### Guardian 24591 / Logodaedalus

[ fifteensquared post by mhl | original crossword ]

Everyone seemed to find this pretty easy, and it was uncontroversial apart from the odd instance of an adjectival phrase defining a noun and the use of a number of words in the clue directly in the answer. The mostly off-topic comments include some discussion of whether accents should matter in crosswords, how strict cluing should be, and regrettably some trolling.

## 76 comments:

### Guardian 24,567 / Araucaria

[ fifteensquared post by Uncle Yap | original crossword ]

A very tricky crossword themed around a quotation from Thomas Babington Macaulay’s tribute to John Milton, on the occasion of the 400th anniversary of Milton’s birth. Of the many clues that caused problems, there were:

• “Parrot on top of one of the Roses? (4)” => LORY. “Lancaster OR York” were the Roses in the Wars of the Roses.
• “Ross’s leader following his follower, mostly one that suffers (6)” => MARTYR. Ross’s follower is Cromarty (as in Ross and Cromarty), so “mostly” might give you [cro]MARTY followed by Ross’s leader (R).
• “Forties cry that’s on the up when winning first gold (6)” => EXCELS.
I think the definition, somewhat bizarrely, is a homophone for XLs (“Forties” in Roman numerals). Geoff Moss explained that the subsidiary is that EXCELSIOR = “on the up” can be obtained by adding I = “first” + OR = “gold”.
• ‘Shocking omission of “the plural of mou_e (7)”’ => SEISMIC. A nice clue, I thought – if you insert SEISMIC into “the plural of mou_e” you get “the plural of mouSE IS MICe”, the definition being “shocking”.
• “Bird on pole, no friend to our friend (5)” => CROWN. CROW = “bird” + N = “pole”. The definition refers to Milton’s opposition to the monarchy.

Some difficult answers as well (MONDAY CLUB, INANITION for me) made this all round a serious puzzle, and justified the large number of comments on it. I remember really liking Geoff Moss’s comment about how to consider a crossword after finishing it.

## 82 comments:

### Guardian 24,603 / Araucaria

[ fifteensquared post by Andrew | original crossword ]

This puzzle was themed after capital cities, all of which were missing the definition part – 13 were hidden in the grid, all but one being 6 letters long. (The rubric gave quite a big hint, in fact: “Thirteen solutions are of a set, one of which is here translated into its own language. None of these is further defined.”) The most difficult clue for people seemed to be:

• “1 in 2 (4)” => WIEN. 2 was LONDON, London is sometimes known as “The Great WEN” and WIEN is the German for Vienna. (This was the solution “translated into its own language” referred to in the clue.) Undoubtedly a very difficult clue.

The large number of comments on this crossword was partly due to it starting the (rather frequent) debate on whether Araucaria deserves the high standing in which he is held. My feeling is that fifteensquared would be better off without these debates, since people’s preferences are so personal, but that would be rather hard to enforce. I think Eileen’s comment on this one sums up how I feel about difficult but fair crosswords.
However, I think it’s clear from the irritation that people express when obscure words or constructions come up in the daily crosswords that lots of people do take it much more personally.

## 86 comments:

### Guardian 24,615 / Rover

[ fifteensquared post by mhl | original crossword ]

And the winner of the grand prize is Rover! The problematic clues in this crossword were:

• “It’s pretty to behold what Platonic friends discuss at leisure (4-2-8)” => LOVE-IN-IDLENESS. Love-in-Idleness is a flower, so “It’s pretty to behold” is the somewhat weak definition – LOVE is what friends discuss in Plato’s Symposium, and IN-IDLENESS is “at leisure”.
• ‘Translator of “The German Eating Fish” (7)’ => DECODER. People generally assumed this was a mistake (COD = “Fish” in DER = “The German”) but perhaps it was meant to be CODE = “Fish” (a cipher used in WWII) in DER = “The German”. However, that latter interpretation would need the clue to have “Fish, Say” or “Fish, Perhaps”.
• “Almost general tutorials (7)” => CLASSES. This should be read as “genera” (“Almost general”). In retrospect, I don’t think this should have caused so many problems.

Again, it’s worth noting that despite provoking lots of discussion and criticism, there aren’t that many really problematic clues, as Sil van den Hoek points out. If I were a setter for a national newspaper, I’m not sure I would cope well with reading these discussions, given how tough it is to write a single good and original clue.

# Replacing my iPod with a Sansa Clip

Eventually, I reached a point with my long-suffering and much-repaired iPod where it didn’t seem to be worth continuing to pay to get it fixed up again, especially since I could reasonably switch to a device with solid-state storage instead of a hard disk. Since I’m trying to avoid using Apple products because of both ideological and pragmatic concerns, I went for the strategy of trying to buy the cheapest option in Dixons (sorry, “Currys.digital”) when I was running for a train.
(This approach minimised the amount of time I could spend on the decision and so created extra happiness in itself.) The device I came away with was an ex-display model of the Sandisk Sansa Clip, with an extra discount because it was actually missing the clip bit. That came to £30 for an 8 gigabyte device, whereas an iPod nano with the same storage would have been about £140. (Of course, it doesn’t have the gorgeous screen of the nano and can’t play video – on the other hand, 8GB would get filled up fast if I put any video on it.)

The biggest advantage of this cute little device over the iPod, of course, is that it’s much less likely that the manufacturer will deliberately prevent the device from working with my computer, as Apple have repeatedly done to GNU/Linux users who’ve bought their devices.

Anyway, I’m basically very happy with this replacement. There are a few small user interface problems, e.g. I miss the click-wheel for seeking within a track – fast-forward and rewind on the Clip are a bit sluggish initially. Additionally it takes a couple of seconds to wake up after it’s been paused for a while, which is surprisingly irritating. On the other hand:

• It plays Ogg Vorbis files!
• It plays FLAC files!
• The little screen is bright and clear
• It’s really small and light
• It has a “sleep” function
• You can use it to take voice notes (although annoyingly you don’t seem to be able to set the time and date, so the timestamps are useless)
• The sound is good
• The UI is generally very responsive, so the small screen isn’t too bad for searching for music
• It remembers the point at which you stopped listening to an audiobook, and offers to resume from that point when you return to it later

Sadly, gapless playback doesn’t work, which is basically par-for-the-course for MP3 files (even when they have correct delay and padding in the metadata), but rather surprising in the case of FLAC and Ogg Vorbis files.
The only other problem was getting it to present podcasts in the way that I like, which is probably due to my odd preferences rather than the device itself – the rest of this post is about that.

### Podcasts

This may be unusual, but I find the most useful way to go through podcasts on any audio player is to have all of the most recently downloaded episodes from any source in one playlist, ordered from oldest to newest. With gtkpod and my old iPod, you could easily set up a “smart playlist” to do this, but I had to write a short script to create this on the Clip. It’s simple enough, apart from one point:

• I found that very few episodes were actually appearing in the playlist, and it turns out that this was because the Clip cares deeply about the TCON ID3v2 tag: if this contains “Podcast”, the track won’t appear in any playlist under the “Music” menu – it’ll only appear in the “Podcasts” menu. So, there’s an option (which I always use) to wipe out that tag if it contains “podcast”.

(Incidentally, I heartily recommend hpodder as a podcatcher – it’s the only one I’ve found that does what I want in its default configuration.)

In case it’s of any use to you, the script is included below – it’s hosted as a gist on github, so you can clone it (or download the raw file) from the links below the file.

# LyX Tips for Thesis Writing

LyX is a lovely bit of software for preparing beautiful documents – you get the high quality output of LaTeX and the advantages of logical document description in a usable interface and without having to remember TeX syntax. There are a few aspects of using LyX that puzzled me while writing a certain large document, however – many of these are dealt with in the LyX FAQ, but I thought it would be worth collecting those that were most useful to me here.

### Use pdflatex for Output

There are various different options for generating PDF output in LyX, but it will save you trouble if you do everything using pdflatex in the first place.
(I think this is the upshot of the slightly unclear advice in the FAQ on the subject.) This turned out to be particularly important because when your document is 50000 words long and has over 100 figures, the other methods take over 10 minutes to generate a PDF; pdflatex would finish in a couple of seconds. If you take this advice then you have to change the Document -> Settings -> Document Class option to pdfTeX, or you get some surprising errors.

Also, I would strongly recommend that you only use PNG files for bitmap images and PDF for vector graphics. (PNG is obviously sensible, but in the case of vector graphics I found PS and EPS files unexpectedly awkward in terms of getting the orientation and clipping right.)

### Incorrect Colours in Bitmap Graphics

I came across a bizarre problem where the colours would be slightly wrong for certain PNG files that I include in the document. (I suppose I should say “colors” too, just for the sake of searchers using American English.) This turned out to be a problem with full-colour PNG images with transparency (i.e. RGBA images), which my notes say is discussed further in these posts. Setting the PDF version as suggested in the first of those posts didn’t help me at all, so I had to convert all my RGBA PNG files to RGB. If you want to check for these files you can use file(1), something like:

    find . -type f -iname '*.png' -print0 | xargs -n 1 -0 file | egrep RGBA

… and I fixed them by feeding the filenames (one per line) to a script like:

    #!/bin/sh
    set -e
    # Convert each PNG named on the command line to 24-bit RGB, in place.
    while [ $# -ne 0 ]
    do
        t=$(mktemp)
        convert "$1" png24:"$t" && mv "$t" "$1"
        shift
    done

Obviously you need imagemagick installed for the “convert” command.
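If file(1) isn’t to hand, the same check can be sketched in Python by reading the colour type straight out of the PNG header. This is an illustrative alternative rather than what I actually ran: the function names are invented for this example, and it only flags colour type 6 (truecolour with alpha), not greyscale images with an alpha channel.

```python
import os
import struct
import sys

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
RGBA_COLOUR_TYPE = 6  # "truecolour with alpha" in the PNG specification


def png_colour_type(header):
    """Return the colour type from the first 26 bytes of a PNG, or None if it is not a PNG."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # IHDR data starts at byte 16: width (4), height (4), bit depth (1), colour type (1).
    _width, _height, _bit_depth, colour_type = struct.unpack(">IIBB", header[16:26])
    return colour_type


def is_rgba_png(path):
    """True if the file at path is a PNG whose header declares RGBA pixels."""
    with open(path, "rb") as f:
        return png_colour_type(f.read(26)) == RGBA_COLOUR_TYPE


if __name__ == "__main__":
    # Walk the directory given on the command line (default: current directory).
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".png"):
                path = os.path.join(dirpath, name)
                if is_rgba_png(path):
                    print(path)
```

It prints one filename per line, which is the form the conversion script above is fed with.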

### Footnotes

By default there is no extra vertical space to separate footnotes, but I much prefer there to be a small gap. To do that, add the following line to the document preamble:

\setlength{\footnotesep}{12pt}

### Captions

By default, the formatting of caption text in floated figures looks very similar to the main body text. Somewhere on the web I found the recommendation to use the “caption” package to change this, e.g. by adding the following to the preamble:

\usepackage[margin=10pt,font=small,labelfont=bf,labelsep=endash]{caption}

### Fitting Tables Onto Pages

Making tables fit on the page is annoying – just changing the text size often doesn’t reduce the overall size or causes a horrible font to be used. Resizing the whole table is the best way I found. Before the table (either in a float or in the normal flow of the document) I added the following in ERT:

 \resizebox{\textwidth}{!}{%

and then immediately after the table added, again in ERT:

 }

This scales the table such that the width of the table fits the page width.

### Suppressing Page Numbers For Full Page Figures

If you want to use a whole page for a floated figure, the page number can overlap with the figure or just look odd.  However, the second tip here: http://wiki.lyx.org/FAQ/UnfloatingFigureOnEmptyPage works well to remove page numbers from all-page figures. To summarize, add the following to the preamble:

\usepackage{floatpag}
\floatpagestyle{plain} % Default page style for pages with only floats

Then, in ERT before the figure (but still in the float) add:

\thisfloatpagestyle{empty}
\vspace{-\headsep}

… and similarly, after the figure but above the caption add:

\vspace{0.3cm}

… or you may find the caption too close to the graphics.

### Better On-Screen Fonts in PDFs

As explained in the second question in this mini-FAQ on generating PDFs from LyX you should use the outline font version of Computer Modern instead of the bitmapped versions. For me, this boiled down to going to Document -> Settings -> Fonts and setting the Roman font option to “AE (Almost European)”.

You can further improve the rendering of text in your output by using microtype. Just add

\usepackage{microtype}

… to the preamble. (These suggestions only apply if you’re using the pdflatex workflow as suggested above.)

You might notice that adding footnotes in table cells doesn’t work. One answer to this is to add them manually in ERT with \footnotemark and \footnotetext, possibly adjusting the counter as described in that FAQ entry.

# Thesis Visualization

I submitted my PhD thesis over a month ago now (on the 11th of September) and I’ve still not recovered properly from the experience.  Perhaps that’s to be expected after 5 years of it.  At some point I’ll have to try to write something coherent about what it has been like, but all I can really say at the moment is that I still stand by my advice that embarking on PhD research is a bad idea for almost everyone. Anyway, as a way of trying to put it all into perspective I wrote a few scripts to visualize my thesis and the process of writing it, so I’ve collected a few of these here.

The first of these is pretty simple to do, since I just collected some word frequency data and fed that into Wordle:

This next graph shows how the number of lines in my thesis document slowly increased over time. The flat period for a year at the beginning really represents starting small bits of chapters and then realizing that much of the work and analysis would have to be redone:

(In case you’re wondering, the thesis was about 50000 words in the end, which corresponds to about 40000 lines of the LyX document, since the LyX format is very verbose – it does roughly correspond to how the thesis as a whole progressed, though.)

Throughout writing the thesis I wondered what the graph of citations would look like, but didn’t have time to do anything about it until after submitting.  I was hoping I could use Google Scholar (or some similar online archive) to discover the “A cites B” relationship, but there isn’t an API for it at the moment, and I didn’t think webscraping these data would be worth it. However, I kept all the papers I could find in PDF format in  my thesis git repository, consistently named as papers/[BIBTEX-KEY].pdf, so it was simple to write a short Python script which searched for each paper’s title in the text of every other paper. This means that it will miss quite a lot of relationships, since pdftotext doesn’t work satisfactorily on many of the papers, some have OCR errors, etc. etc. but I’m pleased that it seems to have extracted so many of them:

The colours indicate how recently the paper was published, from purple (1967) to red (2009). The script outputs the relationships in graphviz’s dot format, and that image was rendered with “neato”.  I excluded any apparently unconnected papers. In case you’re interested in the rather shoddy script, I’ve put it online.

Finally, I thought it might be nice to include a section of one of the images from my thesis to add a flavour of what I’ve been doing – this shows the primary paths of some neurons which were traced with my “Simple Neurite Tracer” tool and registered with CMTK:
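The title-matching idea is simple enough to sketch in Python. This is not the script I put online, just a cut-down illustration: the function names are invented, the input is assumed to be a dictionary from BibTeX key to (title, extracted text) that something like pdftotext has already produced, and matching is a plain case-insensitive substring test after collapsing whitespace. Colouring by publication year is left out.

```python
import re


def normalise(text):
    """Lower-case and collapse whitespace so that line breaks do not hide a title."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_citations(papers):
    """Yield (citing_key, cited_key) pairs.

    papers maps a BibTeX key to a (title, extracted_text) tuple.
    """
    normalised = {
        key: (normalise(title), normalise(text))
        for key, (title, text) in papers.items()
    }
    for citing, (_own_title, text) in sorted(normalised.items()):
        for cited, (title, _other_text) in sorted(normalised.items()):
            # A paper always contains its own title, so skip self-matches.
            if citing != cited and title and title in text:
                yield citing, cited


def to_dot(edges):
    """Render the edges as a graphviz digraph (keys are assumed not to contain quotes)."""
    lines = ["digraph citations {"]
    for citing, cited in edges:
        lines.append('    "%s" -> "%s";' % (citing, cited))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, purely to show the shape of the input.
    example = {
        "smith2001": ("Tracing Neurons", "We build on\nImaging   Brains (2000)."),
        "jones2000": ("Imaging Brains", "No references here."),
    }
    print(to_dot(find_citations(example)))
```

Because only edges are written out, papers with no detected relationship never appear in the graph.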

# Fastest-Talking MSPs

I recently made a change to the Scottish Parliament parser in ukparse so that it would preserve as accurately as possible where the timestamps occur within the text of the Official Report. (These are now in <placeholder> tags throughout the XML.)  One of the things this lets us do is get a rough estimate of how fast each MSP talks.  This isn’t meant to be taken particularly seriously, for the various reasons given below, but it’s perhaps a nice example of how you can use the structured data version of the Scottish Parliament’s Official Report that I created for They Work For You Scotland for simple analyses of what’s being said in parliament.

### Top 25 Fastest-Talking MSPs

This league table was very quickly put together, so I apologise for any errors – it’s only really meant as a demonstration anyway.  I’ve only included the top 25, since the slower end of the table tends to be distorted by non-speech actions in the parliament falling in between timestamps and appearing to make a speech slower than it actually was. On the other hand, there is no converse effect (speeches appearing to be faster than they were) that can arise except through errors in the timestamps in the Official Report.

It’s worth noting that since speeches are typically time-limited by the Presiding Officer, there is an incentive to talk fast.  Also the variance in words-per-minute in this top 25 is not large, and having watched videos of these MSPs’ speeches, it’s clear that there isn’t an obvious relationship between clarity and average words-per-minute.

| Rank | MSP | Words Per Minute | Total Words | Total Time (Minutes) | Measured Passages |
|---|---|---|---|---|---|
| 1 | Christina McKelvie | 188.5 | 11501 | 61 | 14 |
| 2 | Jackson Carlaw | 180.8 | 23139 | 128 | 24 |
| 3 | Keith Brown | 178.0 | 17622 | 99 | 20 |
| 4 | Aileen Campbell | 176.7 | 22617 | 128 | 23 |
| 5 | Liam McArthur | 176.6 | 28437 | 161 | 33 |
| 6 | Kenneth Macintosh | 176.4 | 49403 | 280 | 58 |
| 7 | Alison McInnes | 176.1 | 15500 | 88 | 18 |
| 8 | Bill Wilson | 175.6 | 14929 | 85 | 17 |
| 9 | David Whitton | 175.5 | 22813 | 130 | 27 |
| 10 | Richard Baker | 174.2 | 33619 | 193 | 49 |
| 11 | Anne McLaughlin | 173.3 | 3812 | 22 | 4 |
| 12 | Iain Smith | 173.2 | 56111 | 324 | 64 |
| 13 | Peter Peacock | 172.3 | 31702 | 184 | 38 |
| 14 | Richard Lochhead | 171.9 | 64458 | 375 | 77 |
| 15 | Stewart Maxwell | 171.3 | 41454 | 242 | 44 |
| 16 | Claire Baker | 170.6 | 13477 | 79 | 15 |
| 17 | Alasdair Allan | 169.5 | 15594 | 92 | 18 |
| 18 | Eleanor Scott | 169.1 | 63076 | 373 | 88 |
| 19 | Johann Lamont | 169.1 | 75237 | 445 | 84 |
| 20 | Alasdair Morrison | 168.9 | 22628 | 134 | 26 |
| 21 | Mike Watson | 168.6 | 16019 | 95 | 23 |
| 22 | Sarah Boyack | 168.1 | 72778 | 433 | 70 |
| 23 | Kenneth Gibson | 167.5 | 62293 | 372 | 77 |
| 24 | Andrew Wilson | 167.2 | 5351 | 32 | 9 |
| 25 | Nanette Milne | 167.0 | 77828 | 466 | 101 |

### More Details

You can find the fastest-msps.py script in the ukparse repository as usual.

The script only takes notice of a speech when there is a single speaker between two consecutive timepoints. Unfortunately, this doesn’t happen as often as I would like, so you only get a small sample of each MSP’s speeches represented here – another reason to be suspicious of the results.

The script ignores timestamps that have the same time as the previous one and any speeches that contain text like “meeting suspended”, which often indicates that the next timestamp falls at the end of the break in proceedings. It also ignores any speech less than 2 minutes in length, since these have the highest error. “Words” are just considered to be anything in the speech that’s separated by whitespace.
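Those filtering rules can be sketched roughly as follows. This is not fastest-msps.py itself: the function name and the (speaker, start, end, text) tuple layout are assumptions made for this example, and it presumes the XML has already been reduced to passages with a single speaker between two consecutive timestamps.

```python
from collections import defaultdict
from datetime import datetime

MIN_MINUTES = 2  # shorter passages are ignored, since they have the highest error


def words_per_minute(passages):
    """Return {speaker: words per minute} from (speaker, start, end, text) tuples."""
    words = defaultdict(int)
    minutes = defaultdict(float)
    for speaker, start, end, text in passages:
        duration = (end - start).total_seconds() / 60.0
        if duration < MIN_MINUTES:
            # This also drops repeated timestamps, which give a zero duration.
            continue
        if "meeting suspended" in text.lower():
            # The next timestamp probably falls after a break in proceedings.
            continue
        # "Words" are anything separated by whitespace.
        words[speaker] += len(text.split())
        minutes[speaker] += duration
    return {speaker: words[speaker] / minutes[speaker] for speaker in words}


if __name__ == "__main__":
    # Made-up passage, purely to show the shape of the input.
    example = [
        ("Example MSP",
         datetime(2009, 1, 1, 14, 0),
         datetime(2009, 1, 1, 14, 4),
         "word " * 600),
    ]
    for speaker, rate in words_per_minute(example).items():
        print(speaker, round(rate, 1))
```

The rate is total words over total minutes for each speaker, rather than an average of per-passage rates, so longer passages carry more weight.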

I’ve also excluded the following speakers:

• Anyone who has spoken in the Official Report who isn’t an MSP, such as Her Majesty the Queen and the speakers at Time for Reflection.
• Anyone who has presided over the parliament, in particular the Presiding Officer and Deputy Presiding Officers. Whoever is chairing proceedings often has to introduce breaks, divisions or other actions which have no reported speech attached to them but nonetheless take up time between the placeholders, so their apparent speaking speed is much too low.

# Avoiding Crossword Applets

This post discusses a script that converts one frequently used crossword file format into another one with the advantage that it can be loaded into a free software crossword client.

The brilliant cryptic crossword in The Independent is available for free online, but only in the form of a Java applet, generated with the non-free software Crossword Compiler. It’s wonderful that the crossword is published online now, but the way it’s published is irritating for me because of the following things:

• Unbelievably, given how long they’ve been around, Java applets still have terrible usability problems: in particular they’re slow to load and hang Firefox while they’re doing so.
• If you accidentally navigate away from the page you lose everything you’ve done in the crossword so far.
• The applet is fiddly to use – it has to be carefully arranged to fit in the browser window on my netbook and then you can only see a couple of clues at a time. The tiny scrollbar buttons are awkward to hit.
• There’s no simple way to print the crossword from the applet.

An easy solution to these problems was suggested to me by a post on Dafydd’s livejournal. I hadn’t heard of xword before, but it’s a pretty nice gtk interface for doing crosswords and it reads the AcrossLite .PUZ format appropriately liberally, e.g. ignoring the checksums that AcrossLite itself requires. So, this Python 3 script, called ccj-parse.py, will parse the .bin or .ccj file that the applet loads and generate a .puz file which is acceptable to xword. Example usage:

  ./ccj-parse.py -o foo.puz -t "Independent" -c "© independent.co.uk" \
-d 2009-07-24 < c_240709.bin

And here’s an example screenshot showing xword with The Independent crossword from a couple of days ago:

This solution works pretty well for me. xword isn’t perfect by any means, but has the advantage that it works well on my netbook’s small screen, and, perhaps most importantly, autosaves your progress. I also like that if you’ve used the “Solve Word” (or “Cheat”) button, the lights are marked with a red triangle in the corner to indicate which ones you had to give up on.

I’ve tested this script on lots of The Independent’s cryptic crosswords, and quickly checked that it works on one from the Glasgow Herald, but I’ve no idea how generally it will work.

Someone created a sourceforge project called ccj2puz that suggests it would do the same, but it’s never had any source code uploaded, as far as I can tell, and the author hasn’t replied to the message I sent asking about it.  The basic file format is quite easily guessable from the output of hexdump -C, so I don’t think this is a particularly big deal.

You can get the script from the ccj-to-puz repository on github.

### Dependencies

You need the version of xword with Dafydd’s patches to support British-style crosswords, which is version 1.0-4 in Debian. (It’s not in Ubuntu yet, but the .deb file installs without any issues on Ubuntu.)

# Cryptic Crossword – Numpty / 3 (Answers)

This post discusses the answers to the last cryptic crossword I posted on this blog. This one also has a “ghost theme” relating to one of my favourite films – Eternal Sunshine of the Spotless Mind. There are quite a few things I wasn’t happy with in this effort, and I’ve noted these below.

#### Across:

9. AQUITAINE: QUIT = “leave” + A1 = “the best” replaces N = “end of excursion” in ANNE = “Princess”

12. FLOODED: LOO = “lavatory” in (D = “diamond” in FED = “FBI agent”)

13. TWITS: T + WITS

14. ARCHETYPE: (CHEE[se] PARTY) *; the S and the E to drop are “starters of stilton and emmental”

16. OVERSPECIALIZES: OVER = “on”  + (ELIZA’S SPICE)*

19. SORROWING:  (RINGO’S)* around ROW = “disagreement”

21. HINDI: [s]HINDI[g]

22. PEOPLED: = (POPE LED)*

23. ETERNAL: LAN = “a network” + RETE reversed. I wasn’t terribly happy with this – RETE for network is a bit too Azed-ish for this kind of crossword

24. WEEPS: WEE = “small” + PS = “note at the end of letter”

25. ADDRESSEE: DRESS = “a frock” in A = “a” +  DEE = “flower”

#### Down:

1. VARIATIONS: cryptic definition; probably much too easy – I find good cryptic definitions very hard to write.

2. SUNSHINE: = (HUSSEIN’S N)*

3. OTHERS: (M)OTHER’S; I probably should avoid M for Mark, since as a currency abbreviation it’s a bit obscure now.

4. MIND: = initial letters of “move in new direction”; the charity is the National Association for Mental Health, known as Mind.

5. PERFECTING: PER = “for” + (DE)FECTING = “leaving for another country” without DE = “Germany”; Chambers supports the “for” sense of PER, but I’m a bit unhappy about using it, since I’m aiming for daily crossword difficulty.

6. ARBOREAL: LARA = “woman” reversed over BORE = “hole”

7. GONDRY = GON(E) DRY = “out of ideas” without the middle letter, referring to Michel Gondry, director of the wonderful film “23a 2d of the 17d 4d”. I’m not sure about “out of ideas”, but I was having trouble clueing this in a satisfactory way.

14. AMERINDIAN: DI = “princess” in (MARIANNE)*

15. ENSHIELDED: E = “drug” + LD = “lethal dose”, as in LD50, + ED = “journalist”, all after SH = “silence” in [b]ENI[n]

17. SPOTLESS: (SOS SPELT)*

18. ZANINESS = IN AZ = “in map” reversed + NESS = “headland”

20. REOPEN: (ROPE)* + NE reversed; this clue got a commendation in Paul’s clue competition

21. HEEDED: HE + (CONC)EDED

22. PAWN: PA = “father” + initial letters of “would not”

# Public Whip for the Scottish Parliament

### Summary

You can now use The Public Whip to track the voting record of MSPs (Members of the Scottish Parliament) on the issues that you care about.  If you’re interested in this, please help out by creating “policies” on the site that represent how an imaginary “single issue MSP” would vote.  This will enable us to add a summary of each MSP’s voting record to their page on TheyWorkForYou.

The broader point I’m making is that it’s easy for anyone to contribute to projects like TheyWorkForYou and PublicWhip – you can help to make the proceedings of parliament more transparent and accessible.

Here are some possibly interesting facts that came out of this work, subject to the usual proviso that since we’re working from web-scraped data, there may well be errors in these (and bear in mind the caveat about how a “rebellion” is defined on Public Whip):

Please let the Public Whip team know if you spot any errors, or are interested in doing some more complex analysis of voting records than the “policies” system seems to allow.

### Background

A couple of years ago, as a Christmas holiday project that got rather out-of-hand, I did some volunteer work on adding data from the Scottish Parliament to the mySociety website They Work For You.  In case you’re not familiar with TheyWorkForYou, it takes the official records of parliament and presents them in a more accessible and compelling way – just compare side-by-side the presentation of the same debate in TheyWorkForYou and the Official Reports, for example:

Hopefully you’ll agree that the latter presentation is much more involving to read.

TheyWorkForYou also offers to email you an alert when your MP or MSP speaks in parliament or tables a question. These are just a couple of the services that are only possible as a result of programmers having scraped the reports published by the parliaments into a logically structured data format.

Doing this scraping and parsing is deeply irritating, and it’s one of those strange tasks where your ultimate aim is to provoke someone else into making all that tedious work redundant – all the UK parliaments should publish proceedings in a structured data format rather than horribly mangled HTML, and if they ever do, I’ll certainly be cheering.  However, until that day, we still want to be able to build tools that make the workings of parliament more accessible, and to do that at the moment means webscraping…

Anyway, one of the most popular features of TheyWorkForYou is that there is a summary of the voting record of each MP on their page – for example, have a look at the bottom of Gordon Brown’s page to see how he votes. These data are actually calculated by a distinct project called The Public Whip (a creation of Julian Todd and Francis Irving) which uses the same structured version of data from parliament to track how MPs and Lords vote on particular issues. This site provides an incredibly valuable service in that it gives you, for example, a simple way to check whether what your MP says in the next election campaign actually corresponds with their voting record.

In order to produce those summaries of voting records on a particular issue, Francis and Julian developed the idea of letting users create “policies” each of which is like a “dream MP” for a particular point of view – they only vote in divisions relevant to one position they care about (e.g. “for replacing Trident”, “against up-front tuition fees”) and then vote in the way that’s best aligned with that principle. It’s not necessarily easy to create these policies, since you need to do some research into what were the important and relevant votes, but for anyone interested in how politics works, it’s a great exercise.

Unfortunately, when doing my original work on TheyWorkForYou Scotland, I didn’t have time to finish adding support for the Scottish Parliament to Public Whip as well, although I had parsed all the divisions of the parliament that appear in the Official Reports.

Recently I was provoked into picking this up again, and with the help of Francis Irving we’ve now got basic support for the Scottish Parliament up on Public Whip. While the implementation of this was not, on the whole, very interesting, there are a few points that are perhaps worth mentioning:

• Making changes of this kind is essentially an exercise in checking which assumptions made in the original version of the site need to be revised.  For example, as soon as you introduce the data from the Scottish Parliament the following assumptions are broken:
• The name of a constituency and a date no longer determine a particular representative, since there are multiple regional MSPs for the same named constituency.
• Constituency names are not unique to particular parliaments – some of the constituencies in the Scottish Parliament have the same names as those in the House of Commons.
• What constitutes a rebellion isn’t necessarily the same.  In the House of Commons the closest thing to an explicit abstention is to vote both Aye and No in a division (marked as “Both” in Public Whip), whereas when voting in the Scottish Parliament there is an explicit option to abstain.  Since it seems that parties in the Scottish Parliament do whip their members to abstain in particular votes, I have counted as a rebellion any difference between the vote of the majority of the party and an individual, including abstentions.
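The rebellion rule in the last point can be sketched roughly as follows. This is only an illustration, not Public Whip’s actual code: the function name `find_rebels`, the vote labels `"for"`, `"against"` and `"abstain"`, and the two dictionaries are all hypothetical. In particular, treating a tied party as having no majority (and so no rebels) is an assumption of this sketch, not something the site is known to do.

```python
from collections import Counter


def find_rebels(votes, party_of):
    """Return the set of members whose vote differs from their party's
    majority vote in one division, counting abstentions as votes.

    votes:    {member: "for" | "against" | "abstain"}  (hypothetical labels)
    party_of: {member: party name}
    """
    # Tally how each party's members voted, abstentions included.
    tallies = {}
    for member, vote in votes.items():
        tallies.setdefault(party_of[member], Counter())[vote] += 1

    rebels = set()
    for member, vote in votes.items():
        ranked = tallies[party_of[member]].most_common(2)
        majority_vote, majority_count = ranked[0]
        # Assumption: if the top two options are tied there is no
        # majority position, so nobody in that party is a rebel.
        if len(ranked) > 1 and ranked[1][1] == majority_count:
            continue
        if vote != majority_vote:
            rebels.add(member)
    return rebels
```

With this rule, a member who votes “for” when most of their party abstains counts as a rebel, which is the case that the Westminster notion of rebellion doesn’t cover.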

You have to be careful about other assumptions that might seem obvious – for example, as of the time of writing, Alex Salmond is still both an MP and an MSP at the same time. There are not only examples of people who have been both MPs and MSPs, but cases like Lord Steel of Aikwood, who has been an MP, MSP and Lord. (Historically, it gets even more awkward – Francis came across the case of the two Rowland Blennerhassetts who were the two MPs for Kerry in 1880 – constituencies in those days could have two members.)

### Creating Policies

To make the information in Public Whip easily understandable by people, we really need volunteers to help create policies. Depending on the complexity of the issue, and your level of interest in the Scottish Parliament, this might be easy or difficult. However, for anyone interested in Scottish politics, I think this should be something worth doing. For example, there have been a lot of votes on the issue of tuition fees and the graduate endowment fee, but you need to read quite a lot of the proceedings of the parliament in order to find out which were the really substantive votes and which were the peripheral ones. In addition, phrasing the policy is a little tricky – just reusing one of those designed for Westminster doesn’t really work, since the clearest point of view that seems to be at issue is whether higher education should be free for students from Scotland, not students in general.

[Update: I’ve removed a pointless section about version control software from the end of this post.]

# Cryptic Crossword – Numpty / 3

This is another cryptic crossword that I’d been writing clues for in dribs and drabs for ages.  I’m not very satisfied with this either – my impression is that it’s mostly rather too easy, but for a couple of slightly obscure words.  However, there are a few that I’m quite proud of. Anyway, comments and suggestions are very welcome.