
Why should you learn SPARQL? Wikidata!

Why learn SPARQL now?

I think that SPARQL has something of a bad reputation in the open data community – my impression is that this came about because, when organisations published data by making a SPARQL endpoint accessible, people had problems like:

  • Many SPARQL endpoints not working reliably
  • Writing queries that joined data between different endpoints never quite worked
  • The very variable quality of data sources
  • People just wanted CSV files, which they already know how to deal with, not an endpoint they had to learn a query language to use

The last of these was, I think, particularly important: people who just wanted to get data they could use felt they were being sold semantic web tech that they didn’t need, and that it was getting in the way of more straightforward ways of accessing the data they needed.

However, nowadays I think there is a compelling reason to have a good command of SPARQL, and that is: the Wikidata project.

What is Wikidata?

Wikidata is an amazing project, with an ambitious goal on the scale of OpenStreetMap or Wikipedia: to provide a structured database of the world’s knowledge maintained by volunteers (and, like Wikipedia, anyone can edit that data).

One of the motivations for the Wikidata project is that lots of data that you see on Wikipedia pages, particularly in the infoboxes, is just stored as text, not structured data. This creates the strange situation where data like the population of a city might be different on the German and French Wikipedias, with each language community maintaining it independently – a massive duplication of effort. Some of these infoboxes are now being migrated to be generated from Wikidata, which is a huge win, particularly for the smaller Wikipedia communities (e.g. Welsh Wikipedia has a much smaller community than Spanish Wikipedia, so benefits a lot from being able to concentrate on things other than redundantly updating data).

However, because of its potential as a repository for structured data about anything notable, the project has become much more than just a way of providing data for Wikipedia infoboxes or sitelinks – data in Wikidata doesn’t necessarily appear in Wikipedia at all.

A project that I’m involved in at work, for example, is writing tools to help to get data about all the world’s politicians into Wikidata. This effort grew out of the EveryPolitician project to ensure its long term sustainability.

Why might Wikidata change your perception of SPARQL?

The standard way to run queries against Wikidata nowadays is to write SPARQL queries using the Wikidata Query Service.

Going back to those points at the top of this post about why people might have felt that there was no point in learning SPARQL, I don’t think they apply in the same way when you’re talking about SPARQL specifically for querying Wikidata:

  • The Wikidata Query Service seems to work very reliably. We’ve been using it heavily at work, and I’ve been happy enough with its reliability to launch a service based on making live queries against it.
  • Because of the extraordinarily ambitious scope of Wikidata, there’s no real reason to make queries across different SPARQL endpoints – you just use the Wikidata Query Service, and deal entirely with information in Wikidata. (This also means you can basically ignore anything in generic SPARQL tutorials about namespaces, PREFIX or federation, which simplifies learning it a lot.)
  • Data quality varies a lot between different subjects in Wikidata (e.g. data about genes is very high quality, about politicians less so – so far!) but it’s getting better all the time, and you can always help to immediately fix errors or gaps in the data when you find them, unlike with many other data projects. For lots of things you might want to do, though, the data is already good enough.
  • Lots of open data that governments, for example, release has a naturally tabular structure, so the question “why should I learn SPARQL just to get a CSV file?” is completely valid, but Wikidata’s data model is (like the real world) very far from being naturally tabular – it’s a highly interconnected graph, supporting multiple claims about particular facts, with references, qualifiers and so on. If all you want is a CSV file, you (or someone else) can write some SPARQL that you can use in a URL that returns a CSV file (there’s a small sketch of this just after this list), but you wouldn’t want to try to cope with all of Wikidata as a CSV file – or even all the information about a particular person that way.
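To make that CSV point concrete, here’s a minimal sketch (my own illustration, not from any official tooling) of fetching query results as CSV over HTTP. It assumes Python with the requests library; the query is the ‘television series’ example that appears later in this post, and the endpoint is the standard Wikidata Query Service SPARQL endpoint:

# A sketch: fetch SPARQL results from the Wikidata Query Service as CSV.
# Assumes the 'requests' library; the endpoint honours an Accept: text/csv header.
import requests

QUERY = """
SELECT ?series WHERE {
  ?series wdt:P31 wd:Q5398426  # instance of 'television series'
}
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY},
    headers={"Accept": "text/csv"},
)
response.raise_for_status()
print(response.text)  # a small CSV: one column of item URIs

In practice you’d paste the query into the query service first to check it does what you want, and only then wrap the resulting URL or script into whatever actually needs the CSV.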

Some motivating examples

You can find some fun examples of the kinds of things you can query with SPARQL in Wikidata at the WikidataFacts Twitter account.

These examples also demonstrate a couple of other things:

  • There are some nice ways of visualizing results built into the Wikidata Query Service, like timelines, graphs, images, maps, etc.
  • The musical instruments example is clearly incomplete and likely biased by which collections of art are best represented in Wikidata; lots of queries you might make are like this – they’re interesting despite being incomplete, and the data will only get better over time. (Also, looking at the results can give you ways of thinking about how best to improve the data – e.g. what’s the biggest missing source?)

As an example of something I made that uses Wikidata (as a fun holiday project) here’s a site that can suggest a random episode of your favourite TV programme to watch (there are links at the bottom of the page to see the queries used in generating each page) and suggests ways of improving the data.

Finally, here’s a silly query I just wrote – I knew that Catmando was unusual in being a cat that was a leader of a political party, but wondered if there are other animals that have an occupation of “politician”. It turns out that there are! (click the “play” icon to run the query and see the results).

Great! How do I get started?

There’s an excellent tutorial on using SPARQL to extract data from Wikidata. I’d definitely recommend starting there, rather than with more generic SPARQL tutorials, which will tell you lots of things you don’t need to know for Wikidata and miss out lots of useful practical tips. Also, the examples are all ones you can try or play with directly from the tutorial.

It’s a long tutorial, but you’ll get a lot out of it even if you don’t go through the whole thing.

Tips that helped me to write SPARQL more effectively

The original idea of this post was to pass on some tips that helped me to write SPARQL better (which I mostly got from Tony Bowden and Lucas Werkmeister) – though it turns out that lots of these are in the tutorial I linked to above in some form or other! Nonetheless, I figure it might be useful to someone to reiterate or expand on some of these here. Some are quite basic, while others will probably only make sense after you’ve been writing SPARQL for a bit longer.

1. Use the Wikidata Query Service for writing queries

There are a couple of features of the Wikidata Query Service that mean it’s the best way I know of to start writing queries from scratch:

  • If you don’t know which property or item you want, you can follow “wdt:” or “wd:” with some plain English, and hit “ctrl-space” to autocomplete it.
  • If you mouse-over a property or item, the tooltip will give you a human readable description of it.

Both of these will reduce your reliance on post-it notes with property numbers :)

Of course, writing queries in the Wikidata Query Service form also means you can try them out just with a button press (or control-enter :)).

2. Use Reasonator or Squid for browsing relationships of an item

The item browser at Wikidata.org is a bit limited: the main way in which this is frustrating is that the item pages only show claims that the item is the subject of. For example, the item page for the X-Files looks like:

… and you can’t get from there to the episodes of the series, or its seasons, since those are the objects of the ‘series’ and ‘part of’ predicates, not the subjects. The Reasonator and Squid pages do let you follow these relationships backwards, however:

(Also, those alternatives show images and other media related to the items, which is nice :))

3. Learn what wdt:P31/wdt:P279* means

Wikidata’s very flexible way of modelling data means that trying to find a particular “type of thing” can be complicated. For example, if you want to find all television series, then the following query:

SELECT * WHERE {
  ?series wdt:P31 wd:Q5398426
}

(which asks for any item that is an ‘instance of’ (P31) the item ‘television series’ (Q5398426)) wouldn’t return The Simpsons, since that’s an instance of ‘animated series’ (Q581714) instead.

‘animated series’, however, is a ‘subclass of’ (P279) ‘television series’. This means that if you change the query to:

SELECT * WHERE {
  ?series wdt:P31/wdt:P279* wd:Q5398426
}

… it will include The Simpsons. That new version of the query essentially asks for any item that is an instance of ‘television series’ or anything that is a ‘subclass of’ ‘television series’ if you keep following that ‘subclass of’ relationship up the hierarchy.

You’ll probably need to use this quite frequently. A more advanced variant that you might need to use is:

SELECT * WHERE {
  ?series p:P31/ps:P31/wdt:P279* wd:Q5398426
}

… which will consider an ‘instance of’ relationship even if there are multiple ‘instance of’s and another one has the ‘preferred’ rank. I’ve written more about the ‘p:’ and ‘ps:’ predicates below.

4. Add labels to your query results

It can be hard to tell whether your query has worked properly if all you’re seeing is a list of item numbers in your output – for instance in this query to find all episodes of season 1 of The West Wing:

SELECT ?episode WHERE {
  ?episode wdt:P361 wd:Q3730536
}

One way of improving this is to add variables with the “labels” of Wikidata items. The easiest way to do this in the Wikidata Query Service is to type “SERVICE” and hit control-space – the second option (beginning “SERVICE wikibase:label…”) will add a SERVICE clause like this:

SERVICE wikibase:label {
  bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}

With this clause added in to the WHERE clause, you’ll find that you have some extra variables you can include in those you’re SELECTing, which are named with a suffix of “Label” on the end of the existing variable name. So to see the names of those West Wing episodes, the full query would be:

SELECT ?episode ?episodeLabel WHERE {
  ?episode wdt:P361 wd:Q3730536 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

This should be much easier to understand in your output.

Here are some more advanced notes on this tip:

  • The default languages of “[AUTO_LANGUAGE],en” won’t be what you want in every situation. For example, in queries for politicians from Argentina, it made more sense for us to use “es,en”, which is saying to use the Spanish label if present, or otherwise an English one.
  • You should be aware that Wikidata has more flexible ways of representing names of people than the labels attached to the person’s item.
  • Sometimes the “Label” suffixed variable name won’t be what you want, and you’ll want to customize the variable name. (For example, this might come up if you care about the column headers of CSV output, which are based on the variable names.) In these cases you could rename them using the rdfs:label property within the SERVICE clause. In the example above we could do this like:
SELECT ?episode ?episode_name WHERE {
  ?episode wdt:P361 wd:Q3730536 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
    ?episode rdfs:label ?episode_name.
  }
}

5. Learn the helpful semi-colon, comma and square bracket syntax to abbreviate your queries

In your WHERE clauses, if you’re making lots of constraints with the same subject, you can avoid repeating it by using a semi-colon (;). There’s a nice example of this in the tutorial. With helpful indentation, this can make your queries shorter and easier to read.

That same section of the tutorial introduces two other shorthands that are sometimes very helpful (there’s a small combined example just after this list):

  • a comma (,) to reduce repetition of the object in a constraint
  • using square brackets ([]) to substitute in the subject of another constraint
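As a sketch of my own (using only items that appear elsewhere in this post), here’s a query that uses both the semi-colon and the square bracket shorthands, wrapped in a small Python script that runs it against the Wikidata Query Service and prints the episode labels – you could equally just paste the query itself into the query service:

# Sketch: the ';' and '[ ... ]' shorthands in a query about The West Wing.
# Assumes the 'requests' library and the standard Wikidata Query Service endpoint.
import requests

QUERY = """
SELECT ?episode ?episodeLabel WHERE {
  # ';' repeats the subject (?episode) so it doesn't have to be typed twice;
  # a ',' would similarly repeat both the subject and the predicate.
  # '[ wdt:P361 wd:Q3577037 ]' is an anonymous item: "something (a season)
  # that is itself part of The West Wing".
  ?episode wdt:P361 [ wdt:P361 wd:Q3577037 ] ;
           wdt:P179 ?season .  # the episode's 'series' statement, i.e. its season
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["episodeLabel"]["value"])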

6. Understand the different predicate types

A big conceptual jump for me in writing SPARQL queries for Wikidata was realising that you can do quite a lot just using the “wdt:” predicates, but that they only allow you to query a limited subset of the information in Wikidata. For lots of things you might want to query, you need to use predicates with other prefixes. To explain the problem, here’s an example of the data that a wdt:P69 (“educated at”) pattern for Douglas Adams extracts:

As you can see, there are two “educated at” statements for Douglas Adams, and the only information that a wdt:P69 pattern extracts is the items representing the institutions themselves. As the diagram shows, there’s lots more information associated with those statements, like references and qualifiers, which you can’t get at just with a wdt: prefix. A further limitation is that if there are multiple statements, the wdt: predicates only extract the most “truthy” statements – so if one of them is given preferred rank, that’s the only one that’ll match.

Fortunately, there are other predicate types that you can use to get everything you see in that diagram. Firstly, you need a pattern based on a p: predicate, which relates a subject to a statement value – the statement values are the bits in the orange boxes here:

Then you can use other predicate types like pq:, ps:, pr: and prov:wasDerivedFrom to access details within those boxes. You can read about those predicates in more detail in this page I wrote (which is what I made these diagrams for).

To give an example (which might take some careful reading to understand!) suppose you want to find all the seasons of The West Wing, with their season numbers. Each season is a ‘part of’ (P361) the series, and also has a ‘series’ (P179) relationship to the series. However, the number of the season within the series is only available as a ‘series ordinal’ (P1545) qualifier on the series statement, so you need to use the p:, ps: and pq: predicates like this:

SELECT ?season ?seasonLabel ?seasonNumber WHERE {
  ?season wdt:P361 wd:Q3577037 .
  ?season p:P179 ?seasonSeriesStatement .
  ?seasonSeriesStatement ps:P179 wd:Q3577037.
  ?seasonSeriesStatement pq:P1545 ?seasonNumber .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} ORDER BY xsd:integer(?seasonNumber)

7. Use OPTIONAL when you don’t require a relationship to exist

Because lots of data in Wikidata is currently incomplete, you’ll often want to use OPTIONAL clauses in your WHERE query in situations where there’s extra information that might be helpful, but you don’t want its absence to prevent items from being returned.

Note also that you can nest OPTIONAL clauses, if you have OPTIONAL data that might depend on other OPTIONAL data.

For example, for TV programmes, the ‘series ordinal’ qualifier (on ‘series’ statements), provides both the number of a season and the number of the episode within a season. The latter is done much less consistently than the former, however, so you might want to make that part of your query optional. There are two parts that you might want to make optional:

  • Whether there’s a ‘series’ statement relating the episode to its season at all.
  • If there is such a ‘series’ statement, whether there’s a ‘series ordinal’ qualifier on it.

That’s a case where you might want to nest OPTIONAL blocks to make sure you get as much data as possible, even if this modelling is incomplete. For example, you could use this query:

SELECT ?episode ?episodeLabel ?season ?seasonNumber ?numberWithinSeason WHERE {
  ?episode wdt:P361 ?season .
  ?season p:P179 ?seasonSeriesStatement .
  ?seasonSeriesStatement ps:P179 wd:Q3577037 .
  ?seasonSeriesStatement pq:P1545 ?seasonNumber .
  OPTIONAL {
    ?episode p:P179 ?episodeSeasonSeriesStatement .
    ?episodeSeasonSeriesStatement ps:P179 ?season .
    OPTIONAL {
      ?episodeSeasonSeriesStatement pq:P1545 ?numberWithinSeason
    }
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} ORDER BY xsd:integer(?seasonNumber) xsd:integer(?numberWithinSeason)

(In that example, all the data is now present so the OPTIONAL blocks aren’t necessary, but for other TV series where the episode number modelling isn’t complete, they would be.)

Dealing with awkward subtitle problems in Handbrake

I know very little about Handbrake; these are just some notes on what I personally do to reduce my confusion about why subtitles aren’t being ripped properly, and to fix that manually – I almost certainly can’t answer any questions about issues you might be having! This is just here in the hope that it might be useful to someone…

I’ve been on a long-running project to rip some of our DVDs to network attached storage, so that playing them is a much more pleasant experience: we can then easily play any of our DVDs around the flat without suffering cute-but-unusable DVD menus, condescending anti-piracy warnings or trashy trailers. With some DVDs that are already quite poor quality I’ll just image the disk and copy the ISO to the NAS, but in most cases I use Handbrake to transcode just the titles I want from the DVD using a codec with rather better compression than MPEG-2. There’s inevitably a loss of quality doing this transcoding, of course, but in most cases I don’t mind.

DVD Subtitles

One slightly vexing issue that sometimes comes up when doing this is what to do about the subtitle tracks on the DVD. For plenty of films or TV shows there aren’t any points at which you’d expect a subtitle to be shown, so you don’t need to worry about them. However, plenty of our DVDs do have brief sections in foreign languages, for example, that you’d expect to be subtitled.

There are various choices about how you can handle these when ripping, but I prefer what’s really the option that loses most information: “burning in” any subtitles that are there for the purposes of translating foreign language sections into English or are added for comic effect. “Burning in” these images from a subtitle track means that you’re overlaying the images before the source video is transcoded, so you can’t remove those subtitles or switch them to other languages afterwards. Obviously lots of people prefer not to burn in the subtitles for that reason, but I tend to do this because it means I’m not subject to the vagaries of how different video players handle subtitles, and these rips are only for my own personal use anyway.

As well as “burning in” subtitles, another concept you might need to know about is of “forced subtitles”. Any subtitles in a given subtitle track might be marked as “forced” – this is used to tell the DVD player to display these even if the person hasn’t chosen to see all subtitles from a particular subtitle track – the intended use of this is exactly for brief sections of foreign languages in films, as I understand it.

In most cases, what works fine in Handbrake for what I want to do is to use the “Foreign Audio Search” option in the “Subtitle List” tab, selecting “Burned In” and then guessing whether to tick the “Forced Subtitles Only” box or not – generally I’ll leave that unchecked, unless when I look at the list of available subtitles (under “Add”) there’s only one English language subtitle track, in which case it’s probably forcing subtitles for any foreign language sections that are subtitled. This option should look through all the subtitle tracks for ones that appear less than 10% of the time, look for forced subtitles, and make a sensible choice about what to use: the Handbrake documentation explains this.

However, there are various ways this can go wrong – the “Foreign Audio Search” option sometimes makes the wrong choice in peculiar ways – e.g. I’ve seen it pick the right track when you’re just ripping one chapter from a title, but the wrong one when you’ve selected all the chapters (!). Also, there’s just very little consistency in how DVD producers choose whether to mark subtitles as forced.

When it goes wrong, here’s the method I use to manually pick the right subtitle track to burn in – essentially this is to do the “Foreign Audio Search” scan on just one chapter that I know both has audio that should and should not be subtitled, look through the “Activity Log” to see the results of that scan, and then manually select the right subtitle track based on that.

Examples

To step through that in more detail, here’s what I’d do:

  • Select from the “Subtitle List” tab the “Foreign Audio Search” option, and add that to the subtitles list. It doesn’t matter what options you choose, since we’re just adding this to get the results of the search into the activity log.
  • Find a chapter in the title you want to rip that has some audio you’d want to have subtitles for, and some that you don’t want to be subtitled. (You can select multiple chapters to achieve this if you want – the point of just choosing a small number of chapters is only to make the scan quicker.) I’d normally do this by (a) knowing a bit of the film with such audio and (b) finding that in bit of the film in Totem, which conveniently shows the title and chapter number in the window title.
  • Select just those chapters to rip, and add them to the queue
  • Start encoding
  • Open the activity log window
  • Once the “subtitle scan” pass has completed (it should be the first thing that Handbrake does) scroll up in the activity log to find the lines that look something like this:

[13:35:12] Subtitle track 0 (id 0x20bd) ‘English’: 87 hits (0 forced)
[13:35:12] Subtitle track 1 (id 0x21bd) ‘English’: 89 hits (0 forced)
[13:35:12] Subtitle track 2 (id 0x22bd) ‘English’: 6 hits (0 forced)

  • That means that subtitle track 2 is the right one, because there are subtitles for only some of the audio – the other two probably have subtitles for every line of dialogue, even if it’s in English. So, now we want to set up Handbrake to rip the complete title but with that subtitle track added manually:
  • Remove the “Foreign Audio Search” option from the subtitle list.
  • Click “Add”  in the subtitle list tab, and select subtitle track 3 (n.b. not 2, since the graphical user interface (GUI) for Handbrake numbers subtitle tracks starting at 1, not 0.) Make sure you don’t select “Forced Subtitles Only” since the subtitles on that track aren’t forced (see “0 forced” in the output above).  (I would also select “Burn in” whenever adding a subtitle track manually like this, for the reasons discussed above – but you might well have different views about that.)
  • Then select all the chapters of the title, add the title to the queue and rip it as normal.

As other examples of things you might see in that list of “hits” in the results of the foreign audio search pass, consider the following:

[16:57:06] Subtitle track 0 (id 0x20bd) ‘English’: 77 hits (0 forced)
[16:57:06] Subtitle track 1 (id 0x21bd) ‘English’: 78 hits (0 forced)
[16:57:06] Subtitle track 14 (id 0x2ebd) ‘English’: 8 hits (8 forced)

In this example, it happens that the subtitles are “forced”, but it’s still clear that track 14 (0-based) is the one which just has the subtitles for the foreign language section, so I’d add track 15 (14 +1) manually in the GUI as above – and in this case it doesn’t matter whether you selected “Forced Subtitles Only” or not, since all of the subtitles on that track appear to be forced.

As a final example, you might see output like this:

[18:23:55] Subtitle track 0 (id 0x20bd) ‘English’: 92 hits (11 forced)

In that example there’s a single English subtitle track, which marks some subtitles as “forced” to indicate that those are for foreign language audio. In that case, in the GUI I would manually add subtitle track 1 (0 + 1) but would have to select the “Forced Subtitles Only” option to avoid getting subtitles for everything.

Migrating YourNextRepresentative from PopIt to django-popolo

This post was originally intended for the mySociety blog, but because of its length and technical content people suggested to me it might be more suitable for my own blog. It is about the development of the YourNextRepresentative project (GitHub link), and in particular why and how we migrated its storage system from using the PopIt web service to instead use PostgreSQL via django-popolo. Hopefully this might be of interest to those who have been following the development of YourNextRepresentative (previously YourNextMP).

Background

YourNextMP for 2015

For the 2015 General Election in the UK, we developed a new codebase to crowd-source the candidates who were standing in every constituency; this site was YourNextMP. (Edmund von der Burg kindly let us use the yournextmp.com domain name, which he’d used for his site with the same intent for the 2010 election.) We were developing this software to help out Democracy Club, who ran the site and built up the community of enthusiastic contributors that made YourNextMP so successful.

At the time, we saw this crowd-sourcing challenge as a natural fit for another technology we had been developing called PopIt, which was a data store and HTTP-based API for information about people and the organizations they hold positions in. PopIt used the Popolo data model, which is a very carefully thought-out specification for storing open government data. The Popolo standard helps you avoid common mistakes in modelling data about people and organisations, and PopIt provided interfaces for helping people to create structured data that conformed to that data model, and also made it easily available.

The original intention was that YourNextMP would be quite a thin front-end for PopIt but, as development progressed, it became clear that this would provide a very poor user experience for editors. So, the YourNextMP wrapper, even in the very early versions, had to provide a lot of features that weren’t available in PopIt, such as:

  • A user experience specific to the task of crowd-sourcing election candidate data, rather than the very generic and unconstrained editing interface of PopIt’s web-based UI.
  • Versioning of information about people, including being able to revert to earlier versions.
  • Lookup of candidates by postcode.
  • Summary statistics of progress and completion of the crowdsourcing effort.
  • Logging of actions taken by users, so recent changes could be tracked and checked.
  • CSV export of the data as well as the JSON based API.
  • etc. etc.

Later on in development we also added more advanced features such as a photo moderation queue. To help support all of this, the YourNextMP front-end used a traditional RDBMS (in our case usually PostgreSQL), but effectively PopIt was the primary data store, which had all the information about people and the posts they were standing for.

This system worked fine for the general election, and I think everyone considered the YourNextMP project to be a great success  – we had over a million unique users, and the data was extensively reused, including by Google for their special election information widget. (You can see some more about the project in this presentation.)

Turning YourNextMP into YourNextRepresentative

There had been a considerable demand to reuse the code from YourNextMP for other elections internationally, so our development efforts then focussed on making the code reusable for other elections. The key parts of this were:

  • Separating out any UK-specific code from the core application
  • Supporting multiple elections generically (the site was essentially hard-coded to only know about two elections – the 2010 and 2015 UK general elections)
  • Internationalizing all text in the codebase, so that it could be localized into other languages.

We worked on this with the target of supporting a similar candidate crowd-sourcing effort for the general election in Argentina in 2015, which was being run by Congreso Interactivo, Open Knowledge Argentina and the Fundación Legislativo. This new version of the code was deployed as part of their Yo Quiero Saber site.

Since the name “YourNextMP” implies that the project is specific to electoral systems with  Members of Parliament, we also changed the name to YourNextRepresentative. This name change wasn’t just about international re-use of the code – it’s going to be supporting the 2016 elections in the UK as well, where the candidates won’t be aspiring to be MPs, but rather MSPs, MLAs, AMs, mayors, PCCs and local councillors.

Problems with PopIt

It had become apparent to us throughout this development history that PopIt was creating more problems than it was solving for us. In particular:

  • The lack of foreign key constraints between objects in PopIt, and across the interface between data in PostgreSQL and PopIt, meant that we were continually having to deal with data integrity problems.
  • PopIt was based on MongoDB, and while it used JSON schemas to constrain the data that could be stored in the Popolo-defined fields, all other data you added was unconstrained. We spent a lot of time writing scripts that just fixed data in PopIt that had been accidentally introduced or broken.
  • PopIt made it difficult to efficiently query for a number of things that would have been simple to do in SQL. (For example, counts of candidates per seat and per party were calculated offline from a cron-job and stored in the database.)
  • Developers who wanted to work on the codebase would have to set up a PopIt locally or use our hosted version; this unusual step made it much more awkward for other people to set up a development system compared to any conventional django-based site.
  • PopIt’s API had a confusing split between data that was available from its collection-based endpoints and the search endpoints; the latter meant using Elasticsearch’s delightfully named “Simple Query String Query” which was powerful but difficult for people to use.
  • It was possible (and common) to POST data to PopIt that it could store in MongoDB but couldn’t be stored in Elasticsearch, but no error would be returned. This long-standing bug meant that the results you got from the collections API (MongoDB-backed) and search API (Elasticsearch-backed) were confusingly inconsistent.
  • The latency of requests to the API under high load meant we had to have a lot of caching. (We discovered this on the first leaders’ debate, which was a good indication of how much traffic we’d have to cope with.) Getting the cache invalidation correct was tricky, in accordance with the usual aphorism.

At the same time, we were coming to the broader conclusion that the PopIt project hadn’t been achieving the goals that we’d hoped it would, despite our having put a lot of development time into it, and we decided to stop all development on PopIt. (For many applications of PopIt we now felt the user need was better served by the new EveryPolitician project, but in YourNextRepresentative’s case that didn’t apply, since EveryPolitician only tracks politicians after they’re elected, not when they’re candidates.)

As developers we wanted to be able to use a traditional RDBMS again (through the Django ORM) while still getting the benefits of the carefully thought-out Popolo data model. And, happily, there was a project that would help us to do exactly that – the django-popolo package developed by openpolis.

Migrating YourNextRepresentative to django-popolo

django-popolo provides Django models which correspond to the Popolo Person, Organization, Membership, Post and Area classes (and their associated models like names and contact details).

We had used django-popolo previously for SayIt, and have been maintaining a fork of the project which is very close to the upstream codebase, except for removing the requirement that one uses django-autoslug for managing the id field of its models. We opted to use the mySociety fork for two related reasons:

  • Using the standard Django integer primary key id field rather than a character-based id field seems to be closer to Django’s “golden path”.
  • There are interesting possibilities for using SayIt in a YourNextRepresentative site (e.g. to capture campaign speeches, or promises) and using the same version of django-popolo will make that much easier

Extending django-popolo’s models

Perhaps the biggest technical decision about how to use the django-popolo models (the main ones being Person, Organization, Post and Membership) is how we extended those models to add the various extra fields we needed for YourNextRepresentative’s use case. (With PopIt you could simply post a JSON object with extra attributes.) The kinds of extra-Popolo data we recorded were:

  • The one or many elections that each Post is associated with.
  • The particular election that a Membership representing a candidacy is associated with.
  • The set of parties that people might stand for in a particular Post. (e.g. there’s a different set of available parties in Great Britain and Northern Ireland).
  • The ‘versions’ attribute of a Person, which records all the previous versions of that person’s data. (We considered switching to a versioning system that’s integrated with Django’s ORM, like one of these, but instead we decided to make the smallest incremental step as part of this migration – keeping the versions array and the same JSON serialization that was used previously – and save switching the versioning system for the future.)
  • Multiple images for each Person and Organization. (django-popolo just has a single ‘image’ URL field.)
  • Whether the candidates for a particular Post are locked or not. (We introduced a feature where you could lock a post once we knew all the candidates were correct.)
  • To support proportional representation systems where candidates are elected from a party list, each Membership representing a candidacy needs a “party_list_position” attribute to indicate that candidate’s position on the party list.
  • etc.

Perhaps the most natural way of adding this data would be through multi-table inheritance; indeed, that is how SayIt uses django-popolo. However, we were wary of this approach because of the warnings in Two Scoops of Django and elsewhere that using multi-table inheritance can land you with difficult performance problems because queries on the parent model will use OUTER JOINs whether you need them or not. We decided instead to follow the Two Scoops of Django suggestion and make the one-to-one relationship between parent and child table explicit by creating new models called PersonExtra, PostExtra, etc. with a `base` attribute which is a OneToOneField to Person, Post, etc., respectively. This means that the code that uses these models is slightly less clear than it would be otherwise (since sometimes you use person, sometimes person.extra) but we do have control over when joins between these tables are done by the ORM.
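As a concrete illustration of that pattern, here’s a minimal sketch – the field names are illustrative rather than YourNextRepresentative’s actual schema, and it assumes django-popolo’s models are importable from popolo.models:

# Sketch of the explicit one-to-one "Extra" pattern described above.
# Field names are illustrative, not the project's actual schema.
from django.db import models
from popolo.models import Person, Post

class PersonExtra(models.Model):
    # An explicit link to the django-popolo model, rather than multi-table
    # inheritance, so joins only happen when we actually ask for person.extra.
    base = models.OneToOneField(Person, related_name='extra',
                                on_delete=models.CASCADE)
    versions = models.TextField(blank=True)  # JSON-serialized edit history

class PostExtra(models.Model):
    base = models.OneToOneField(Post, related_name='extra',
                                on_delete=models.CASCADE)
    candidates_locked = models.BooleanField(default=False)

With this arrangement, code that only needs Popolo data touches person alone, while code that needs the extra fields goes through person.extra – which is exactly the person / person.extra split mentioned above.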

Data migration

Once the Extra models were created, we wrote the main data migration. The idea of this was that if your installation of YourNextRepresentative was configured as one of the known existing installations at the time (i.e. the ELECTION_APP setting specified the St Paul, Burkina Faso, Argentina or the UK site) this data migration would download a JSON export of the data from the corresponding PopIt instance and load it into the django-popolo models and the *Extra models that extend them.

As the basis for this migration, we contributed a PopIt importer class and management command upstream to django-popolo. This should make it easier for any project that used to use PopIt to migrate to django-popolo, if it makes sense for them to do so. Then the data migration in YourNextRepresentative subclassed the django-popolo PopItImporter class, extending it to also handle the extra-Popolo data we needed to import.

(A perhaps surprising ramification of this approach is that once an installation has been migrated to django-popolo we should change the name of that country’s ELECTION_APP, as otherwise someone setting up a new site with that ELECTION_APP configured will have to wait a long time for out-of-date data to be imported on running the initial migrations. So we will shortly be renaming the “uk_general_election_2015” application to just “uk”. To support people who do want that feature (cloning an existing site to a development instance) we’ve added a new “candidates_import_from_live_site” management command that uses the new API to mirror the current version of an existing instance.)

Another issue that came up in working on this data migration is that we needed to preserve any identifiers that were used in URLs on the site previously so that after upgrading each site every bookmarked or search-engine-indexed URL would still show the same content. In the case of the Person model, this was easy because it used an integer ID previously. Parties and posts, however, used strings as their IDs. We migrated these IDs to fields called ‘slug’ (perhaps a slightly misleading name) on the OrganisationExtra and PostExtra models.

This turns out to be quite a slow migration to run – as well as importing the core data, it also downloads every image of people and parties, which is pretty slow even on a fast connection.

Updating views, tests and management commands

The next part of the migration was updating all the code that previously used PopIt’s API to instead use the new Django models. This was a significant amount of work, which left very little code in the project unchanged by the end. In general we tried to update a test at a time and then change the core code such that the test passed, but we knew there was quite a bit of code that wasn’t exercised by the tests. (One nice side-effect of this work is that we greatly improved the coverage of the test suite.)

We did think about whether we could avoid doing this update of the code essentially in one go – it felt rather like a “stop-the-world refactoring”; was there an incremental approach that would have worked better? Maybe so, but we didn’t come up with one that might reasonably have saved time. If the old code that interacted with the PopIt API had been better encapsulated, perhaps it would have made sense to use proxy models which in the migration period updated both PopIt’s API and the database, but this seemed like it would be at least as much work, and it was lucky that we did have a period of time we could set aside for this work during which the sites weren’t being actively updated.

Moving other code and configuration to the database

We also took the opportunity of this migration to introduce some new Django models for data that had previously been defined in code or as Django settings. In particular:

  • We introduced Election and AreaType models (previously the elections that a site supported and the types of geographical boundary they were associated with were defined in a dictionary in the per-country settings).
  • We introduced a PartySet model – this is to support the very common requirement that different sets of parties can field candidates in different areas of the country.
  • We replaced the concept of a “post group” (defined in code previously) with a “group” attribute on PostExtra.

These all had the effect of simplifying the setup of a new site – more of the initial setup can now be done in the Django admin interface, rather than needing to be done by someone who’s happy to edit source code.

Replacing PopIt’s API

One of the nice aspects of using PopIt as the data store for the site was that it supplied a RESTful API for the core data of the site, so we previously hadn’t had to put much thought into the API other than adding some basic documentation of what was there already. However, after the migration away from PopIt we still wanted to provide some HTTP-based API for users of the site. We chose to use Django REST framework to provide this; it seems to be the most widely recommended tool at the moment for providing a RESTful API for a Django site (and lots of people and talks at djangocon EU in Cardiff had independently recommended it too). Their recommendations certainly weren’t misplaced – it was remarkably quick to add the basic read-only API to YourNextRepresentative. We’re missing the more sophisticated search API and various other features that the old PopIt API provided, but there’s already enough information available via the API to completely mirror an existing site, and Django REST framework provides a nice framework for extending the API in response to developers’ needs.
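To give a feel for why that was so quick, here’s a minimal sketch of what a read-only endpoint looks like in Django REST framework – the model and field choices here are illustrative, not the project’s actual API code:

# Sketch: a read-only API endpoint with Django REST framework.
# Model and field choices are illustrative, not YourNextRepresentative's API code.
from popolo.models import Person
from rest_framework import routers, serializers, viewsets

class PersonSerializer(serializers.ModelSerializer):
    class Meta:
        model = Person
        fields = ('id', 'name', 'birth_date')

class PersonViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Person.objects.all()
    serializer_class = PersonSerializer

router = routers.DefaultRouter()
router.register(r'persons', PersonViewSet)
# router.urls then gets included under the API prefix in the project's urls.py

Most of the work is declarative, which is a large part of why adding a basic read-only API takes so little time.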

The end result

As is probably apparent from the above, this migration was a lot of work, but we’re seeing the benefits every day:

  • Working on the codebase has become a vastly more pleasant experience; I find I’m looking forward to working on it much more than I ever did previously.
  • We’ve already seen signs that other developers appreciate that it’s much easier to set up than previously.
  • Although the tests are still far from perfect, they’re much easier to work with than previously. (Previously we mocked out a large number of the external PopIt API requests and returned JSON from the repository instead; this would have been a lot better if we’d used a library like betamax to record and update these API responses, but not having to worry about this at all and just create test data with factory_boy is better yet, I think.)

I think it’s also worth adding a note that if you’re starting a new site that deals with data about people and their memberships of organizations, then using the Popolo standard (and if you’re a Django developer, django-popolo in particular) can save you time – it’s easy to make mistakes in data modelling in this domain (e.g. even just related to names). It should also help with interoperability with other projects that use the same standard (although it’s a bit more complicated than that – this post is long enough already, though :))

The UK instance of YourNextRepresentative (at edit.yournextmp.com) has been using the new codebase for some time now, and that will be relaunched shortly to collect data on the 2016 elections in the UK.

If you’re interested in starting up a candidate crowd-sourcing site using YourNextRepresentative, please get in touch with our international team or email ynr@mysociety.org for any other issues.

Printing out GitHub issues for triage or estimation

One of the projects I’ve been working on at mySociety had a large number of open issues, most of which hadn’t been looked at for some time, and didn’t have estimates of difficulty. To address this we tried a variation of an exercise that’s suggested in “The Scrum Field Guide” in the chapter called “Prioritizing and Estimating Large Backlogs”, which involves printing out all the stories in the product backlog onto index cards and arranging them on a wall roughly in order of difficulty from left to right. (A second stage is to let the product owner adjust the height of each story on the wall to reflect its priority.) It seemed as if this might be a helpful way of estimating our large number of open issues, even though they’re not written as user stories.

An obvious problem here was that we didn’t have an easy way of printing GitHub tickets en masse, so I wrote a script that would generate a PDF for each open issue (excluding those that are pull requests) in a repository’s issue tracker using Pandoc. That script is here in case it’s of use to anyone.
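In outline, it does something like the following sketch (not the original code: it assumes the ‘requests’ library, the public GitHub issues API and pandoc with a LaTeX engine installed, and the repository name is a placeholder):

# Sketch (not the original script): render each open issue in a repository
# to its own PDF via pandoc, skipping pull requests.
import subprocess
import requests

REPO = "someowner/somerepo"  # placeholder repository name
url = "https://api.github.com/repos/{0}/issues".format(REPO)
page = 1
while True:
    response = requests.get(
        url, params={"state": "open", "per_page": 100, "page": page})
    response.raise_for_status()
    issues = response.json()
    if not issues:
        break
    for issue in issues:
        if "pull_request" in issue:
            continue  # the issues API also returns pull requests; skip them
        markdown = "# #{0}: {1}\n\n{2}\n".format(
            issue["number"], issue["title"], issue["body"] or "")
        subprocess.run(
            ["pandoc", "-f", "markdown",
             "-o", "issue-{0}.pdf".format(issue["number"])],
            input=markdown.encode("utf-8"), check=True)
    page += 1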

As it turned out, this session was pretty successful – as well as the value of generating lots of difficulty estimates, it was a good way of getting a picture of the history of the project, and starting lots of discussions of issues that only one member of the team knew about.

Making an encrypted partition on a USB drive

On Ubuntu or Debian, it’s really simple to create an encrypted partition on a newly-purchased USB mass storage device.  In my case, I had bought a 1TB hard drive which had very mixed reviews, some people saying their drives had failed very early.  I wanted to be able to return the drive under warranty if it broke without worrying about personal data.

It turns out that if you want to reformat a partition on an external USB drive so that it’s encrypted, this is just a matter of doing the following:

sudo luksformat -t ext4 /dev/partitiondevice

…. where /dev/partitiondevice is the device for the drive partition you want to overwrite.  Obviously, this will destroy everything that was previously on that partition.

I like to use a proper filesystem for USB mass storage devices, but if you leave out the -t ext4 then the default is to use VFAT.

When you next plug in that drive, you’ll be prompted to enter the password that you picked when creating the partition – if you type that correctly, the drive will be mounted and usable.  (If you mistype it, you’re not given another chance to enter the password, so you’ll need to go to the command line and do: gvfs-mount -d /dev/partitiondevice to try again.)

One small thing is that the mount point in /media will be based on a UUID by default, but if you set the ext4 partition label, it’ll be mounted under that name in /media/ instead.  To do this, starting from when your disk is mounted, you can run mount without parameters to find the unencrypted device name and then unmount it and change the label:

$ umount /dev/mapper/udisks-luks-uuid-b7bbb2c8-etc
$ e2label /dev/mapper/udisks-luks-uuid-b7bbb2c8-etc topsekrit

If you unplug and plug in the disk again, it should be mounted on /media/topsekrit

The Guardian on your Kindle

Update: It is now possible to buy an official subscription to The Guardian and The Observer. The rest of this post is now largely of historical interest if you just want The Guardian on your Kindle, but I’ve left the rest of the content unchanged for people who are interested in how I generated my unofficial version.


If you just want a copy of today’s edition of The Guardian or The Observer for your Kindle, you can download one from this automatically generated page.  This post describes the script that generates the file and the motivation for it.

Since moving to Switzerland, I’ve found that I really miss being able to get The Guardian in the morning on my way into work.  Unfortunately, reading the website on a phone (or any other device) is no substitute if you’re relying on data over the mobile phone networks – one really wants all the articles cached for fast navigation through the paper.  The solution for this should be my shiny new Kindle, but sadly subscriptions to The Guardian aren’t available in the Kindle store.  (There are many other papers available.)  Fortunately, The Guardian has an excellent API for accessing its content, and the lovely interface produced by Phil Gyford for reading the paper in a cleaner interface suggested that I could similarly generate a bare bones version of the paper for my Kindle.  I believe that this is permitted under the terms and conditions of the Guardian Open Platform, since I’m (a) including the advertisement linked to from each article, (b) linking back to the original article and (c) acknowledging that the data is supplied by that service.  If I’ve misunderstood, and in fact this is not allowed, please let me know.

To generate a book in the Kindle’s preferred format, you have to generate a .opf file describing the contents of the book, which refers to other files describing its text, images, structure, etc.  Then you can run a binary called “kindlegen” to generate a .mobi archive from those files that will work on your Kindle.  (The samples in the kindlegen archive and the Amazon Kindle Publishing Guidelines are quite sufficient to figure out how to do this.)  My script to generate the .opf and supporting files is far from elegant, but I’m very happy with the results that it produces – it’s a really lovely reading experience.  You can use the normal page forward / back buttons to go from page to page, while the left and right buttons on the five-way skip to the next article’s headline.  This means you can skip quickly through the articles that you aren’t interested in, but each article you do want to read is presented very clearly on the amazing eInk display:

There are a few articles for which the API won’t return the text, saying that rights for redistribution are not available – I’m still including the other metadata for these articles and the link to the original article, so that you know what’s missing:

At the end of each article is the advertisement image that’s included – this is to comply with the requirements of the Guardian Open Platform:

You can download the generated .mobi file for today’s Guardian (or The Observer on Sunday) from this page:

A Kindle version of today’s Guardian or Observer

You can bookmark that page in your Kindle’s web browser. Then, whenever you select the bookmark and then “Reload”, it’ll be refreshed with a link to that day’s generated edition of the newspaper for your Kindle, which you can download straight from that page.

If you’re interested in this project, or have any comments or suggestions, you can contact me by email at:

"mark" followed by a dash, then "guardian" then "kindle" then an "at" sign then longair dot net

… or leave a comment below.  The script for generating this version of The Guardian is available on GitHub.

Hashing Flickr Photos

I used to host my photos with a simple set of CGI scripts that basically worked well enough for my simple requirements.  Such web applications are easy and fun to write, but in the end I decided that it wasn’t worth it because:

  • Hosting large amounts of data on a generic shell account is typically quite expensive.  Flickr’s “pro” account subscription is a very good deal in comparison: as long as each photo is beneath 20 megabytes in size, you can upload as many as you like for $24.95 a year.
  • The community aspect of sites like Flickr is very encouraging – it’s lovely to have random people say nice things about your photographs, and occasionally have people use them in articles, etc.

(Some people are put off from using Flickr by the appearance of the site, but its API means that there are plenty of alternative front-ends for viewing or presenting your photos, such as flickriver.)

The slight problem with switching to hosting on Flickr was that previously I’d indexed all my photos by the MD5sum of the original image, so several of my pages had links or inline images that pointed to an MD5sum-based URL on the old site.  It occurred to me that it might be useful in general to have “machine tags” on each photo with a hash or checksum of the image, so that, for example:

  • You can simply check which photos have already been uploaded.
  • You can find URLs for all the different image sizes, etc. based on the content of the file.

Unfortunately, I hadn’t done this when uploading the files in the first place, so had to write a script (flickr-checksum-tags.py) which takes the slightly extraordinary step of downloading the original version of every photo that doesn’t have the checksum tags to a temporary file, hashing each file, adding the tags and deleting the temporary file.  This adds tags for the MD5sum and the SHA1sum, using a namespace and keys suggested in this discussion, where someone suggests taking the same approach. These tags are of the form:

  checksum:md5=c629c63f8508cfd1a5e6ba6b4b3253a8
  checksum:sha1=df44fc771660fbe7a2d6b2e284ae61e9ed3e377c
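Computing those two checksums for a local file only needs Python’s hashlib – the part the script adds is the Flickr API plumbing for fetching originals and attaching the tags. A minimal sketch of the checksum/tag-formatting part (the function name here is hypothetical):

# Sketch: compute the checksum machine tags for a local image file.
# (Fetching originals and attaching the tags via the Flickr API is what
# flickr-checksum-tags.py handles.)
import hashlib
import sys

def checksum_tags(path):
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
            sha1.update(chunk)
    return ['checksum:md5=' + md5.hexdigest(),
            'checksum:sha1=' + sha1.hexdigest()]

if __name__ == '__main__':
    for tag in checksum_tags(sys.argv[1]):
        print(tag)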

The same script can return URLs for a given checksum:

  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --short
  > http://flic.kr/p/7oQxqK
  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 -p
  > [... the Flickr photo page URL, which WordPress insists on turning into an image ...]
  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --size=b
  > http://farm3.static.flickr.com/2552/4196574615_491c6387f8_b.jpg

The repository also has a script to pick out files that haven’t been uploaded, and a simple uploader script which will upload an image and add the checksum tags.  The scripts are based on the very useful Python flickrapi module and you’ll need to put your Flickr API key and secret in ~/.flickr-api

Anyway, these have been useful for me so maybe of some interest to someone out there…

LyX Tips for Thesis Writing

LyX is a lovely bit of software for preparing beautiful documents – you get the high quality output of LaTeX and the advantages of logical document description in a usable interface and without having to remember TeX syntax.  There are a few aspects of using LyX that puzzled me while writing a certain large document, however – many of these are dealt with in the LyX FAQ, but I thought it would be worth collecting those that were most useful to me here.

Use pdflatex for Output

There are various options for generating PDF output in LyX, but it will save you trouble if you do everything using pdflatex in the first place. (I think this is the upshot of the slightly unclear advice in the FAQ on the subject.) This turned out to be particularly important because when your document is 50000 words long and has over 100 figures, the other methods take over 10 minutes to generate a PDF; pdflatex would finish in a couple of seconds. If you take this advice then you have to change the Document -> Settings -> Document Class option to pdfTeX, or you get some surprising errors. Also, I would strongly recommend that you only use PNG files for bitmap images and PDF for vector graphics. (PNG is obviously sensible, but in the case of vector graphics I found PS and EPS files unexpectedly awkward in terms of getting the orientation and clipping right.)

Incorrect Colours in Bitmap Graphics

I came across a bizarre problem where the colours would be slightly wrong for certain PNG files that I include in the document. (I suppose I should say “colors” too, just for the sake of searchers using American English.) This turned out to be a problem with full-colour PNG images with transparency (i.e. RGBA images), which my notes say is discussed further in these posts. Setting the PDF version as suggested in the first of those posts didn’t help me at all, so I had to convert all my RGBA PNG files to RGB. If you want to check for these files you can use file(1), something like:

find . -type f -iname '*.png' -print0 | xargs -n 1 -0 file | egrep RGBA

… and I fixed them by passing the filenames as arguments (e.g. via xargs) to a script like this:

#!/bin/sh

set -e

while [ $# -ne 0 ]
do
 t=`mktemp`
 convert "$1" png24:"$t" && mv "$t" "$1"
 shift
done

Obviously you need imagemagick installed for the “convert” command.

Footnotes

By default there is no extra vertical space to separate footnotes, but I much prefer there to be a small gap. To do that, add the following line to the document preamble:

\setlength{\footnotesep}{12pt}

Captions

By default, the formatting of caption text in floated figures looks very similar to the main body text. Somewhere on the web I found the recommendation to use the “caption” package to change this, e.g. by adding the following to the preamble:

\usepackage[margin=10pt,font=small,labelfont=bf,labelsep=endash]{caption}

Fitting Tables Onto Pages

Making tables fit on the page is annoying – just changing the text size often doesn’t reduce the overall size or causes a horrible font to be used. Resizing the whole table is the best way I found. Before the table (either in a float or in the normal flow of the document) I added the following in ERT:

 \resizebox{\textwidth}{!}{%

and then immediately after the table added, again in ERT:

 }

This scales the table such that the width of the table fits the page width.

Suppressing Page Numbers For Full Page Figures

If you want to use a whole page for a floated figure, the page number can overlap with the figure or just look odd.  However, the second tip here: http://wiki.lyx.org/FAQ/UnfloatingFigureOnEmptyPage works well for removing page numbers from full-page figures. To summarize, add the following to the preamble:

\usepackage{floatpag}
\floatpagestyle{plain} % Default page style for pages with only floats

Then, in ERT before the figure (but still in the float) add:

\thisfloatpagestyle{empty}
\vspace{-\headsep}

… and similarly, after the figure but above the caption add:

\vspace{0.3cm}

… or you may find the caption too close to the graphics.

Better On-Screen Fonts in PDFs

As explained in the second question in this mini-FAQ on generating PDFs from LyX you should use the outline font version of Computer Modern instead of the bitmapped versions. For me, this boiled down to going to Document -> Settings -> Fonts and setting the Roman font option to “AE (Almost European)”.

You can further improve the rendering of text in your output by using microtype. Just add

\usepackage{microtype}

… to the preamble. (These suggestions only apply if you’re using the pdflatex workflow as suggested above.)

Adding Footnotes in Tables

You might notice that adding footnotes in table cells doesn’t work. One answer to this is to manually add them in ERT with \footnotemark and \footnotetext, possibly adjusting the counter as described in that FAQ entry.