This post was originally intended for the mySociety blog, but because of its length and technical content people suggested to me it might be more suitable for my own blog. This about the development of the YourNextRepresentative project (GitHub link), and in particular why and how we migrated its storage system from using the PopIt web service to instead use PostgreSQL via django-popolo. Hopefully this might be of interest to those who have been following the development of YourNextRepresentative (previously YourNextMP).
YourNextMP for 2015
For the 2015 General Election in the UK, we developed a new codebase to crowd-source the candidates who were standing in every constituency; this site was YourNextMP. (Edmund von der Burg kindly let us use the yournextmp.com domain name, which he’d used for his site with the same intent for the 2010 election.) We were developing this software to help out Democracy Club, who ran the site and built up the community of enthusiastic contributors that made YourNextMP so successful.
At the time, we saw this crowd-sourcing challenge as a natural fit for another technology we had been developing called PopIt, which was a data store and HTTP-based API for information about people and the organizations they hold positions in. PopIt used the Popolo data model, which is a very carefully thought-out specification for storing open government data. The Popolo standard helps you avoid common mistakes in modelling data about people and organisations, and PopIt provided interfaces for helping people to create structured data that conformed to that data model, and also made it easily available.
The original intention was that YourNextMP would be quite a thin front-end for PopIt but, as development progressed, but it became clear that this would provide a very poor user experience for editors. So, the YourNextMP wrapper, even in the very early versions, had to provide a lot of features that weren’t available in PopIt, such as:
- A user experience specific to the task of crowd-sourcing election candidate data, rather than the very generic and unconstrained editing interface of PopIt’s web-based UI.
- Versioning of information about people, including being able to revert to earlier versions.
- Lookup of candidates by postcode.
- Summary statistics of progress and completion of the crowdsourcing effort.
- Logging of actions taken by users, so recent changes could be tracked and checked.
- CSV export of the data as well as the JSON based API.
- etc. etc.
Later on in development we also added more advanced features such as a photo moderation queue. To help support all of this, the YourNextMP front-end used a traditional RDBMS (in our case usually PostgreSQL), but effectively PopIt was the primary data store, which had all the information about people and the posts they were standing for.
This system worked fine for the general election, and I think everyone considered the YourNextMP project to be a great success – we had over a million unique users, and the data was extensively reused, including by Google for their special election information widget. (You can see some more about the project in this presentation.)
Turning YourNextMP into YourNextRepresentative
There had been a considerable demand to reuse the code from YourNextMP for other elections internationally, so our development efforts then focussed on making the code reusable for other elections. The key parts of this were:
- Separating out any UK-specific code from the core application
- Supporting multiple elections generically (the site was essentially hard-coded to only know about two elections – the 2010 and 2015 UK general elections)
- Internationalizing all text in the codebase, so that it could be localized into other languages.
We worked on this with the target of supporting a similar candidate crowd-sourcing effort for the general election in Argentina in 2015, which was being run by Congreso Interactivo, Open Knowledge Argentina and the Fundación Legislativo. This new version of the code was deployed as part of their Yo Quiero Saber site.
Since the name “YourNextMP” implies that the project is specific to electoral systems with Members of Parliament, we also changed the name to YourNextRepresentative. This name change wasn’t just about international re-use of the code – it’s going to be supporting the 2016 elections in the UK as well, where the candidates won’t be aspiring to be MPs, but rather MSPs, MLAs, AMs, mayors, PCCs and local councillors.
Problems with PopIt
It had become apparent to us throughout this development history that PopIt was creating more problems than it was solving for us. In particular:
- The lack of foreign keys constraints between objects in PopIt, and across the interface between data in PostgreSQL and PopIt meant that we were continually having to deal with data integrity problems.
- PopIt was based on MongoDB, and while it used JSON schemas to constrain the data that could be stored in the Popolo defined fields, all other data you added was unconstrained. We spent a lot of time of time writing scripts that just fixed data in PopIt that had been accidentally introduced or broken.
- PopIt made it difficult to efficiently query for a number of things that would have been simple to do in SQL. (For example, counts of candidates per seat and per party were calculated offline from a cron-job and stored in the database.)
- Developers who wanted to work on the codebase would have to set up a PopIt locally or use our hosted version; this unusual step made it much more awkward for other people to set up a development system compared to any conventional django-based site.
- PopIt’s API had a confusing split between data that was available from its collection-based endpoints and the search endpoints; the latter meant using Elasticsearch’s delightfully named “Simple Query String Query” which was powerful but difficult for people to use.
- It was possible (and common) to POST data to PopIt that it could store in MongoDB but couldn’t stored in Elasticsearch, but no error would be returned. This long-standing bug meant that the results you got from the collections API (MongoDB-backed) and search API (Elasticsearch-backed) were confusingly inconsistent.
- The latency of requests to the API under high load meant we had to have a lot of caching. (We discovered this on the first leaders’ debate, which was a good indication of how much traffic we’d have to cope with.) Getting the cache invalidation correct was tricky, in accordance with the usual aphorism.
At the same time, we were coming to the more broad conclusion that the PopIt project hadn’t been achieving the goals that we’d hoped it would, despite having putting a lot of development time into it, and we decided to stop all development on PopIt. (For many applications of PopIt we now felt the user need was better served by the new EveryPolitician project, but in YourNextRepresentative’s case that didn’t apply, since EveryPolitician only tracks politicians after they’re elected, not when they’re candidates.)
As developers we wanted to be able to use a traditional RDBMS again (through the Django ORM) while still getting the benefits of the carefully thought-out Popolo data model. And, happily, there was a project that would help use to do exactly that – the django-popolo package developed by openpolis.
Migrating YourNextRepresentative to django-popolo
django-popolo provides Django models which correspond to the Popolo Person, Organisation, Membership, Post and Area classes (and their associated models like names and contact details).
We had used django-popolo previously for SayIt, and have been maintaining a fork of the project which is very close to the upstream codebase, except for removing the requirement that one uses django-autoslug for managing the id field of its models. We opted to use the mySociety fork for related reasons:
- Using the standard Django integer primary key id field rather than a character-based id field seems to be closer to Django’s “golden path”.
- There are interesting possibilities for using SayIt in a YourNextRepresentative site (e.g. to capture campaign speeches, or promises) and using the same version of django-popolo will make that much easier
Extending django-popolo’s models
Perhaps the biggest technical decision about how to use the django-popolo models (the main ones being Person, Organization, Post and Membership) is how we extended those models to add the various extra fields we needed for YourNextRepresentative’s use case. (With PopIt you could simply post a JSON object with extra attributes.) The kinds of extra-Popolo data we recorded were:
- The one or many elections that each Post is associated with.
- The particular election that a Membership representing a candidacy is associated with.
- The set of parties that people might stand for in a particular Post. (e.g. there’s a different set of available parties in Great Britain and Northern Ireland).
- The ‘versions’ attribute of a Person, which records all the previous versions of that person’s data. (We considered switching to a versioning system that’s integrated with Django’s ORM, like one of these, but instead we decided to just make the smallest incremental step as part of this migration, which meant keeping the versions array and the same JSON serialization that was used previously, and save switching the versioning system for the future.
- Multiple images for each Person and Organization. (django-popolo just has a single ‘image’ URL field.)
- Whether the candidates for a particular Post are locked or not. (We introduced a feature where you could lock a post once we knew all the candidates were correct.)
- To support proportional representation systems where candidates are elected from a party list, each Membership representing a candidacy needs a “party_list_position” attribute to indicate that candidate’s position on the party list.
Perhaps the most natural way of adding this data would be through multi-table inheritance; indeed, that is how SayIt uses django-popolo. However, we were wary of this approach because of the warnings in Two Scoops of Django and elsewhere that using multi-table inheritance can land you with difficult performance problems because queries on the parent model will use OUTER JOINs whether you need them or not. We decided instead to follow the Two Scoops of Django suggestion and make the one-to-one relationship between parent and child table explicit by creating new models called PersonExtra, PostExtra, etc. with a `base` attribute which is a OneToOneField to Person, Post, etc., respectively. This means that the code that uses these models is slightly less clear than it would be otherwise (since sometimes you use person, sometimes person.extra) but we do have control over when joins between these tables are done by the ORM.
Once the Extra models were created, we wrote the main data migration. The idea of this was that if your installation of YourNextRepresentative was configured as one of the known existing installations at the time (i.e. the ELECTION_APP setting specified the St Paul, Burkina Faso, Argentina or the UK site) this data migration would download a JSON export of the data from the corresponding PopIt instance and load it into the django-popolo models and the *Extra models that extend them.
As the basis for this migration, we contributed a PopIt importer class and management command upstream to django-popolo. This should make it easier for any project that used to use PopIt to migrate to django-popolo, if it makes sense for them to do so. Then the data migration in YourNextRepresentative subclassed the django-popolo PopItImporter class, extending it to also handle the extra-Popolo data we needed to import.
(A perhaps surprising ramification of this approach is that once an installation has been migrated to django-popolo we should change the name of that country’s ELECTION_APP, or otherwise someone setting up a new site with that ELECTION_APP configured will have to wait for a long time for out-of-date data to be imported on running the initial migrations. So we will shortly be renaming the “uk_general_election_2015” application to just “uk”. To support people who want that feature (cloning an existing site to a development instance) we’ve added a new “candidates_import_from_live_site” management command that uses its new API to mirror the current version of an existing instance.)
Another issue that came up in working on this data migration is that we needed to preserve any identifiers that were used in URLs on the site previously so that after upgrading each site every bookmarked or search-engine-indexed URL would still show the same content. In the case of the Person model, this was easy because it used an integer ID previously. Parties and posts, however, used strings as their IDs. We migrated these IDs to fields called ‘slug’ (perhaps a slightly misleading name) on the OrganisationExtra and PostExtra models.
This turns out to be quite a slow migration to run – as well as importing the core data, it also downloads every image of people and parties, which is pretty slow even on a fast connection.
Updating views, tests and management commands
The next part of the migration was updating all the code that previously used PopIt’s API to instead use the new Django models. This was a significant amount of work, which left very little code in the project unchanged by the end. In general we tried to update a test at a time and then change the core code such that the test passed, but we knew there was quite a bit of code that wasn’t exercised by the tests. (One nice side-effect of this work is that we greatly improved the coverage of the test suite.)
We did think about whether we could avoid doing this update of the code essentially in one go – it felt rather like a “stop-the-world refactoring”; was there an incremental approach that would have worked better? Maybe so, but we didn’t come up with one that might reasonably have saved time. If the old code that interacted with the PopIt API had been better encapsulated, perhaps it would have made sense to use proxy models which in the migration period updated both PopIt’s API and the database, but this seemed like it would be at least as much work, and it was lucky that we did have a period of time we could set aside for this work during which the sites weren’t being actively updated.
Moving other code and configuration to the database
We also took the opportunity of this migration to introduce some new Django models for data that had previously been defined in code or as Django settings. In particular:
- We introduced Election and AreaType models (previously the elections that a site supported and the types of geographical boundary they were associated with were defined in a dictionary in the per-country settings).
- We introduced a PartySet model – this is to support the very common requirement that different sets of parties can field candidates in different areas of the country.
- We replaced the concept of a “post group” (definied in code previously) with a “group” attribute on PostExtra
These all had the effect of simplifying the setup of a new site – more of the initial setup can now be done in the Django admin interface, rather than needing to be done by someone who’s happy to edit source code.
Replacing PopIt’s API
One of the nice aspects of using PopIt as the data store for the site was that it supplied a RESTful API for the core data of the site, so we previously hadn’t had to put much thought into the API other than adding some basic documentation of what was there already. However, after the migration away from PopIt we still wanted to provide some HTTP-based API for users of the site. We chose to use Django REST framework to provide this; it seems to be the most widely recommended tool at the moment for providing a RESTful API for a Django site (and lots of people and talks at djangocon EU in Cardiff had independently recommended it too). Their recommendations certainly weren’t misplaced – it was remarkably quick to add the basic read-only API to YourNextRepresentative. We’re missing the more sophisticated search API and various other features that the old PopIt API provided, but there’s already enough information available via the API to completely mirror an existing site, and Django REST framework provides a nice framework for extending the API in response to developers’ needs.
The end result
As is probably apparent from the above, this migration was a lot of work, but we’re seeing the benefits every day:
- Working on the codebase has become a vastly more pleasant experience; I find I’m looking forward to it working on it much more than I ever did previously.
- We’ve already seen signs that other developers appreciate that it’s much more easy to set up than previously.
- Although the tests are still far from perfect, they’re much more easy to work with than previously. (Previously we mocked out a large number of the external PopIt API requests and returned JSON from the repository instead; this would have been a lot better if we’d used a library like betamax instead to record and update these API responses, but not having to worry about this at all and just create test data with factory_boy is better yet, I think.)
I think it’s also worth adding a note that if you’re starting a new site that deals with data about people and their memberships of organizations, then using the Popolo standard (and if you’re a Django developer, django-popolo in particular) can save you time – it’s easy to make mistakes in data modelling in this domain (e.g. even just related to names). It should also help with interoperability with other projects that use the same standard (although it’s a bit more complicated than that – this post is long enough already, though :))
The UK instance of YourNextRepresentative (at edit.yournextmp.com) has been using the new codebase for some time now, and that will be relaunched shortly to collect data on the 2016 elections in the UK.