Why should you learn SPARQL? Wikidata!

Why learn SPARQL now?

I think that SPARQL has something of a bad reputation in the open data community – my impression is that this came about because, when organisations published data by making a SPARQL endpoint accessible, people had problems like:

  • Many SPARQL endpoints not working reliably
  • Writing queries that joined data between different endpoints never quite worked
  • The very variable quality of data sources
  • People just wanted CSV files, which they already know how to deal with, not an endpoint they had to learn a query language to use

The last of these was, I think, particularly important: people who just wanted to get data they could use felt they were being sold semantic web tech that they didn’t need, and that was getting in the way of more straightforward ways of accessing the data they needed.

However, nowadays I think there is a compelling reason to have a good command of SPARQL, and that is: the Wikidata project.

What is Wikidata?

Wikidata is an amazing project, with an ambitious goal on the scale of OpenStreetMap or Wikipedia: to provide a structured database of the world’s knowledge maintained by volunteers (and, like Wikipedia, anyone can edit that data).

One of the motivations for the Wikidata project is that lots of data that you see on Wikipedia pages, particularly in the infoboxes, is just stored as text, not structured data. This creates the strange situation where data like the population of a city might be different on the German and French Wikipedia and each language community maintains that independently – a massive duplication of effort. Some of these infoboxes are now being migrated to be generated from Wikidata, which is a huge win, particularly for the smaller Wikipedia communities (e.g. Welsh Wikipedia has a much smaller community than Spanish Wikipedia so benefits a lot from being able to concentrate on things other than redundantly updating data).

However, because of its potential as a repository for structured data about anything notable, the project has become much more than just a way of providing data for Wikipedia infoboxes, or sitelinks – data in Wikidata doesn’t necessarily appear in Wikipedia at all.

A project that I’m involved in at work, for example, is writing tools to help to get data about all the world’s politicians into Wikidata. This effort grew out of the EveryPolitician project to ensure its long term sustainability.

Why might Wikidata change your perception of SPARQL?

The standard way to run queries against Wikidata nowadays is to write SPARQL queries using the Wikidata Query Service.

Going back to those points at the top of this post about why people might have felt that there was no point in learning SPARQL, I don’t think they apply in the same way when you’re talking about SPARQL specifically for querying Wikidata:

  • The Wikidata Query Service seems to work very reliably. We’ve been using it heavily at work, and I’ve been happy enough with its reliability to launch a service based on making live queries against it.
  • Because of the extraordinarily ambitious scope of Wikidata, there’s no real reason to make queries across different SPARQL endpoints – you just use the Wikidata Query Service, and deal entirely with information in Wikidata. (This also means you can basically ignore anything in generic SPARQL tutorials about namespaces, PREFIX or federation, which simplifies learning it a lot.)
  • Data quality varies a lot between different subjects in Wikidata (e.g. data about genes is very high quality, about politicians less so – so far!) but it’s getting better all the time, and you can always help to immediately fix errors or gaps in the data when you find them, unlike with many other data projects. For lots of things you might want to do, though, the data is already good enough.
  • Lots of open data that governments, for example, release has a naturally tabular structure, so the question “why should I learn SPARQL just to get a CSV file?” is completely valid, but Wikidata’s data model is (like the real world) very far from being naturally tabular – it’s a highly interconnected graph, supporting multiple claims about particular facts, with references, qualifiers and so on. If all you want is a CSV file, you (or someone else) can write some SPARQL that you can use in a URL that returns a CSV file, but you wouldn’t want to try to cope with all of Wikidata as a CSV file – or even all the information about a particular person that way.

Some motivating examples

You can find some fun examples of the kinds of things you can query with SPARQL in Wikidata at the WikidataFacts Twitter account, e.g.

These examples also demonstrate a couple of other things:

  • There are some nice ways of visualizing results built into the Wikidata Query Service, like timelines, graphs, images, maps, etc.
  • The musical instruments example is clearly incomplete and likely biased by which collections of art are best represented in Wikidata; lots of queries you might make are like this – they’re interesting despite being incomplete, and the data will only get better over time. (Also, looking at the results can give you ways of thinking about how best to improve the data – e.g. what’s the biggest missing source?)

As an example of something I made that uses Wikidata (as a fun holiday project) here’s a site that can suggest a random episode of your favourite TV programme to watch (there are links at the bottom of the page to see the queries used in generating each page) and suggests ways of improving the data.

Finally, here’s a silly query I just wrote – I knew that Catmando was unusual in being a cat that was a leader of a political party, but wondered if there are other animals that have an occupation of “politician”. It turns out that there are! (click the “play” icon to run the query and see the results).

Great! How do I get started?

There’s an excellent tutorial on using SPARQL to extract data from Wikidata. I’d definitely recommend starting there rather than with more generic SPARQL tutorials: those will tell you lots of things you don’t need to know for Wikidata, and miss out lots of useful practical tips. Also, the examples are all ones you can try or play with directly from the tutorial.

It’s a long tutorial, but you’ll get a lot out of it even if you don’t go through the whole thing.

Tips that helped me to write SPARQL more effectively

The original idea of this post was to pass on some tips that helped me to write SPARQL better (which I mostly got from Tony Bowden and Lucas Werkmeister) – though it turns out that lots of these are in the tutorial I linked to above in some form or other! Nonetheless, I figure it might be useful to someone to reiterate or expand on some of these here. Some are quite basic, while others will probably only make sense after you’ve been writing SPARQL for a bit longer.

1. Use the Wikidata Query Service for writing queries

There are a couple of features of the Wikidata Query Service that mean it’s the best way I know of to start writing queries from scratch:

  • If you don’t know which property or item you want, you can follow “wdt:” or “wd:” with some plain English, and hit “ctrl-space” to autocomplete it.
  • If you mouse-over a property or item, the tooltip will give you a human readable description of it.

Both of these will reduce your reliance on post-it notes with property numbers :)

Of course, writing queries in the Wikidata Query Service form also means you can try them out just with a button press (or control-enter :)).

2. Use Reasonator or Squid for browsing relationships of an item

The item browser at Wikidata.org is a bit limited: the main way in which this is frustrating is that the item pages only show claims that the item is the subject of. For example, the item page for the X-Files looks like:

… and you can’t get from there to the episodes of the series, or its seasons, since those are the objects of the ‘series’ and ‘part of’ predicates, not the subjects. The Reasonator and Squid pages do let you follow these relationships backwards, however:

(Also, those alternatives show images and other media related to the items, which is nice :))

3. Learn what wdt:P31/wdt:P279* means

Wikidata’s very flexible way of modelling data means that trying to find a particular “type of thing” can be complicated. For example, if you want to find all television series, then the following query:

  ?series wdt:P31 wd:Q5398426

(which asks for any item that is an ‘instance of’ (P31) the item ‘television series’ (Q5398426)) wouldn’t return The Simpsons, since that’s an instance of ‘animated series’ (Q581714) instead.

‘animated series’, however, is a ‘subclass of’ (P279) ‘television series’. This means that if you change the query to:

  ?series wdt:P31/wdt:P279* wd:Q5398426

… it will include The Simpsons. That new version of the query essentially asks for any item that is an instance of ‘television series’ itself, or of any class that leads to ‘television series’ if you keep following the ‘subclass of’ relationship up the hierarchy.

You’ll probably need to use this quite frequently. A more advanced variant that you might need is:

  ?series p:P31/ps:P31/wdt:P279* wd:Q5398426

… which will consider an ‘instance of’ relationship even if there are multiple ‘instance of’s and another one has the ‘preferred’ rank. I’ve written more about the ‘p:’ and ‘ps:’ predicates below.
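
To try the basic pattern from this tip as a complete query, something like the following sketch should work in the Wikidata Query Service. The LIMIT is an addition for illustration (an arbitrary cap to keep the result list small), and the SERVICE clause for labels is the one explained in the next tip.

```sparql
# Items that are an 'instance of' (P31) 'television series' (Q5398426)
# or of any of its subclasses, reached via 'subclass of' (P279).
SELECT ?series ?seriesLabel WHERE {
  ?series wdt:P31/wdt:P279* wd:Q5398426 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
LIMIT 20  # arbitrary cap, just for illustration
```

Swapping the first pattern for the p:P31/ps:P31/wdt:P279* variant should give you the rank-insensitive version of the same query.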

4. Add labels to your query results

Understanding if your query has worked properly or not is confusing if you’re just seeing a list of item numbers in your output – for instance in this query to find all episodes of season 1 of the West Wing:

SELECT ?episode WHERE {
  ?episode wdt:P361 wd:Q3730536
}

One way of improving this is to add variables with the “labels” of Wikidata items. The easiest way to do this in the Wikidata Query Service is to type “SERVICE” and hit control-space – the second option (beginning “SERVICE wikibase:label…”) will add a SERVICE clause like this:

SERVICE wikibase:label {
  bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}

With this clause added in to the WHERE clause, you’ll find that you have some extra variables you can include in those you’re SELECTing, which are named with a suffix of “Label” on the end of the existing variable name. So to see the names of those West Wing episodes, the full query would be:

SELECT ?episode ?episodeLabel WHERE {
  ?episode wdt:P361 wd:Q3730536 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

This should be much easier to understand in your output.

Here are some more advanced notes on this tip:

  • The default languages of “[AUTO_LANGUAGE],en” won’t be what you want in every situation. For example, in queries for politicians from Argentina, it made more sense for us to use “es,en”, which is saying to use the Spanish label if present, or otherwise an English one.
  • You should be aware that Wikidata has more flexible ways of representing names of people than the labels attached to the person’s item.
  • Sometimes the “Label” suffixed variable name won’t be what you want, and you’ll want to customize the variable name. (For example, this might come up if you care about the column headers of CSV output, which are based on the variable names.) In these cases you can rename them using the rdfs:label property within the SERVICE clause. In the example above we could do this like:
SELECT ?episode ?episode_name WHERE {
  ?episode wdt:P361 wd:Q3730536 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
    ?episode rdfs:label ?episode_name.
  }
}

5. Learn the helpful semi-colon, comma and square bracket syntax to abbreviate your queries

In your WHERE clauses, if you’re making lots of constraints with the same subject, you can avoid repeating it by using a semi-colon (;). There’s a nice example of this in the tutorial. With helpful indentation, this can make your queries shorter and easier to read.

That same section of the tutorial introduces two other shorthands that are sometimes very helpful:

  • a comma (,) to avoid repeating both the subject and the predicate when a constraint has several objects
  • square brackets ([]) to stand in for a variable that is only needed as the subject of other constraints
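
As a rough sketch of how these shorthands look in practice (this is not the tutorial’s own example), here is a query for the seasons of The West Wing that uses the items and properties from elsewhere in this post. The long form is shown in the leading comment; the ?statement variable name there is purely illustrative.

```sparql
# Written out in full, the WHERE clause would be:
#   ?season wdt:P361 wd:Q3577037 .
#   ?season p:P179 ?statement .
#   ?statement ps:P179 wd:Q3577037 .
#   ?statement pq:P1545 ?seasonNumber .
SELECT ?season ?seasonNumber WHERE {
  # Semicolon: keep the subject (?season), give a new predicate and object.
  ?season wdt:P361 wd:Q3577037 ;
          # Square brackets: stand in for the ?statement variable,
          # which isn't needed anywhere else in the query.
          p:P179 [ ps:P179 wd:Q3577037 ;
                   pq:P1545 ?seasonNumber ] .
}
ORDER BY xsd:integer(?seasonNumber)
```

A comma works the same way one level further along: the pattern “?series wdt:P31 wd:Q5398426 , wd:Q581714 .” asks for items that are an ‘instance of’ both ‘television series’ and ‘animated series’ without repeating the subject or the predicate.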

6. Understand the different predicate types

A big conceptual jump for me in writing SPARQL queries for Wikidata was that you can do quite a lot just using the “wdt:” predicates, but they only allow you to query a limited subset of the information in Wikidata. For lots of things you might want to query, you need to use predicates with other prefixes. To explain the problem, here’s an example of the data that a wdt:P69 (“educated at”) pattern for Douglas Adams extracts:

As you can see, there are two “educated at” statements for Douglas Adams, and the only information that a wdt:P69 pattern extracts is the items representing the institutions themselves. That diagram also shows that there’s lots more information associated with those statements, like references and qualifiers, which you can’t get at with just a wdt: predicate. A further limitation is that if there are multiple statements, the wdt: predicates only extract the most “truthy” statements – so if one of them is given preferred rank, that’s the only one that’ll match.

Fortunately, there are other predicate types that you can use to get everything you see in that diagram. Firstly, you need a pattern based on a p: predicate, which relates a subject to a statement value – the statement values are the bits in the orange boxes here:

Then you can use other predicate types like pq:, ps:, pr: and prov:wasDerivedFrom to access details within those boxes. You can read about those predicates in more detail on this page I wrote (which is what I made these diagrams for).
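
As a small sketch of how these predicate types fit together, the following query asks for Douglas Adams’s “educated at” statements along with their dates. Some of the identifiers are not taken from the diagrams and are supplied here as assumptions: Q42 for Douglas Adams, and ‘start time’ (P580) and ‘end time’ (P582) as the qualifiers on these statements. The OPTIONAL blocks (covered in tip 7) are there so that a statement with a missing qualifier still appears.

```sparql
SELECT ?institution ?institutionLabel ?start ?end WHERE {
  # p: goes from the item to each 'educated at' (P69) statement as a whole
  wd:Q42 p:P69 ?statement .
  # ps: goes from the statement to its main value (the institution)
  ?statement ps:P69 ?institution .
  # pq: goes from the statement to a qualifier value
  # (P580 'start time' and P582 'end time' are assumed qualifiers)
  OPTIONAL { ?statement pq:P580 ?start . }
  OPTIONAL { ?statement pq:P582 ?end . }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
```

Because this goes through p: rather than wdt:, it should return every “educated at” statement regardless of rank, not only the “truthy” ones.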

To give an example (which might take some careful reading to understand!) suppose you want to find all the seasons of The West Wing, with their season numbers. Each season is a ‘part of’ (P361) the series, and also has a ‘series’ (P179) relationship to the series. However, the number of the season within the series is only available as a ‘series ordinal’ (P1545) qualifier on the series statement, so you need to use the p:, ps: and pq: predicates like this:

SELECT ?season ?seasonLabel ?seasonNumber WHERE {
  ?season wdt:P361 wd:Q3577037 .
  ?season p:P179 ?seasonSeriesStatement .
  ?seasonSeriesStatement ps:P179 wd:Q3577037.
  ?seasonSeriesStatement pq:P1545 ?seasonNumber .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} ORDER BY xsd:integer(?seasonNumber)

7. Use OPTIONAL when you don’t require a relationship to exist

Because lots of data in Wikidata is currently incomplete, you’ll often want to use OPTIONAL blocks inside your WHERE clause in situations where there’s extra information that might be helpful, but you don’t want its absence to prevent items from being returned.
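
In its simplest form that looks something like the sketch below, which lists the episodes of season 1 of the West Wing and adds a date wherever one is recorded. Using ‘publication date’ (P577) is an assumption made for illustration; these episodes may well record their dates with a different property, in which case the ?date column would simply stay empty.

```sparql
SELECT ?episode ?episodeLabel ?date WHERE {
  ?episode wdt:P361 wd:Q3730536 .
  # Episodes without this property are still returned, with ?date unbound.
  # P577 ('publication date') is an assumed choice of property.
  OPTIONAL { ?episode wdt:P577 ?date . }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
```

Without the OPTIONAL wrapper, any episode lacking that property would drop out of the results entirely.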

Note also that you can nest OPTIONAL clauses, if you have OPTIONAL data that might depend on other OPTIONAL data.

For example, for TV programmes, the ‘series ordinal’ qualifier (on ‘series’ statements) provides both the number of a season and the number of the episode within a season. The latter is done much less consistently than the former, however, so you might want to make that part of your query optional. There are two parts that you might want to make optional:

  • Whether there’s a ‘series’ statement relating the episode to its season at all.
  • If there is such a ‘series’ statement, whether there’s a ‘series ordinal’ qualifier on it.

That’s a case where you might want to nest OPTIONAL blocks to make sure you get as much data as possible, even if this modelling is incomplete. For example, you could use this query:

SELECT ?episode ?episodeLabel ?season ?seasonNumber ?numberWithinSeason WHERE {
  ?episode wdt:P361 ?season .
  ?season p:P179 ?seasonSeriesStatement .
  ?seasonSeriesStatement ps:P179 wd:Q3577037 .
  ?seasonSeriesStatement pq:P1545 ?seasonNumber .
  OPTIONAL {
    ?episode p:P179 ?episodeSeasonSeriesStatement .
    ?episodeSeasonSeriesStatement ps:P179 ?season .
    OPTIONAL {
      ?episodeSeasonSeriesStatement pq:P1545 ?numberWithinSeason
    }
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} ORDER BY xsd:integer(?seasonNumber) xsd:integer(?numberWithinSeason)

(In that example, all the data is now present so the OPTIONAL blocks aren’t necessary, but for other TV series where the episode number modelling isn’t complete, they would be.)
