Introduction to Querying the Wikidata Knowledge Graph using SPARQL
Most developers are familiar with query languages like SQL for querying relational databases. For the past several years, we have been hearing a lot about knowledge graphs. The focus of this article is to query an open knowledge graph called Wikidata using SPARQL.
Before explaining SPARQL, we need to talk about data formats like RDF. RDF data consists of statements in the form of subject-predicate-object triples. Suppose, for example, that three programming languages, C, Python, and C++, are stored in our database with the following identifiers.
<https://example.com/C> <https://example.com/C++> <https://example.com/Python>
We use < and > to enclose the identifiers for these languages and make use of the example domain https://example.com/, which can be replaced by any other user-specified domain.
Let’s also assume that we have another entity called Programming language, represented in the following manner.
<https://example.com/ProgrammingLanguage>
We need to represent the relationship between our example programming languages and the entity called ProgrammingLanguage. For this purpose, we may need to introduce a relationship called IsA, which is identified in the following manner:
<https://example.com/IsA>
Now our goal is to make the following statements:
C is a Programming language.
C++ is a Programming language.
Python is a Programming language.
This may be done in the following manner:
<https://example.com/C> <https://example.com/IsA> <https://example.com/ProgrammingLanguage>
<https://example.com/C++> <https://example.com/IsA> <https://example.com/ProgrammingLanguage>
<https://example.com/Python> <https://example.com/IsA> <https://example.com/ProgrammingLanguage>
The difference between the above statements using identifiers and the ones written in natural language (English) is that the former can be easily processed by machines and can be queried.
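As a minimal sketch of how a machine might hold these statements, the three triples can be stored as plain tuples in Python. The helper name to_ntriples is an assumption made for illustration; it writes each statement in the N-Triples style, which ends every statement with a period.

```python
# Base IRI taken from the example above.
BASE = "https://example.com/"

# Each statement is a (subject, predicate, object) tuple of IRIs.
triples = [
    (BASE + "C", BASE + "IsA", BASE + "ProgrammingLanguage"),
    (BASE + "C++", BASE + "IsA", BASE + "ProgrammingLanguage"),
    (BASE + "Python", BASE + "IsA", BASE + "ProgrammingLanguage"),
]


def to_ntriples(statements):
    """Serialize IRI-only triples, one statement per line (hypothetical helper)."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


print(to_ntriples(triples))
```

Once the statements are data rather than sentences, a program can loop over them, compare them, and answer questions about them.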
And this brings us to SPARQL queries. What if I want to ask the following queries?
- Give me all the programming languages.
- How many programming languages are there in my database?
- What is C, or Python, or C++?
SPARQL is a query language for semantic web data in RDF format. SPARQL queries are also built from triple patterns. The above questions can be translated to SPARQL in the following way.
Give me all the programming languages.
SELECT ?proglang {
?proglang <https://example.com/IsA> <https://example.com/ProgrammingLanguage>
}
Here, we use the triple pattern seen above but replace the subject with a variable, ?proglang. The SPARQL query engine must find all the values of this variable for which the pattern matches a statement in the data.
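To make the idea of matching concrete, here is a toy matcher in Python. It is only a sketch of the principle, not how a real SPARQL engine is implemented; the function name match and the convention that terms starting with "?" are variables are assumptions for this example.

```python
BASE = "https://example.com/"
TRIPLES = [
    (BASE + "C", BASE + "IsA", BASE + "ProgrammingLanguage"),
    (BASE + "C++", BASE + "IsA", BASE + "ProgrammingLanguage"),
    (BASE + "Python", BASE + "IsA", BASE + "ProgrammingLanguage"),
]


def match(pattern, triples):
    """Return one binding dict per triple that matches the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in a pattern must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so this triple is a solution.
            results.append(binding)
    return results


rows = match(("?proglang", BASE + "IsA", BASE + "ProgrammingLanguage"), TRIPLES)
print([row["?proglang"] for row in rows])
```

Each row of the result is one way of filling in the variable so that the pattern becomes a statement present in the data.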
How many programming languages are there in my database?
SELECT (count(?proglang) as ?count) {
?proglang <https://example.com/IsA> <https://example.com/ProgrammingLanguage>
}
As you can see, though we reuse the query seen above, we also make use of the aggregate function count, which counts the number of values of the variable ?proglang that match our pattern.
What is C++?
SELECT ?type {
<https://example.com/C++> <https://example.com/IsA> ?type
}
We change the position of our variable and use a new variable name type to obtain the type of C++.
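The last two queries differ only in where the variable sits and in whether the matches are counted. A rough Python equivalent over the same toy data (the constant names are assumptions for this sketch) looks like this:

```python
BASE = "https://example.com/"
IS_A = BASE + "IsA"
PROGRAMMING_LANGUAGE = BASE + "ProgrammingLanguage"

TRIPLES = [
    (BASE + "C", IS_A, PROGRAMMING_LANGUAGE),
    (BASE + "C++", IS_A, PROGRAMMING_LANGUAGE),
    (BASE + "Python", IS_A, PROGRAMMING_LANGUAGE),
]

# Variable in the subject position, then counted: how many programming languages?
count = sum(1 for s, p, o in TRIPLES if p == IS_A and o == PROGRAMMING_LANGUAGE)

# Variable in the object position: what is C++?
types = [o for s, p, o in TRIPLES if s == BASE + "C++" and p == IS_A]

print(count)  # 3
print(types)  # ['https://example.com/ProgrammingLanguage']
```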
However, instead of repeating the example domain https://example.com/, it is possible to declare a namespace prefix using the keyword PREFIX. Because + is not allowed unescaped in the local part of a prefixed name, C++ is written as C\+\+. The above query now becomes:
PREFIX example: <https://example.com/>
SELECT ?type {
example:C\+\+ example:IsA ?type
}
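A prefix is only an abbreviation: the engine expands every prefixed name back to the full identifier before matching. The following sketch shows that expansion in Python; the function name expand is an assumption, and the sketch ignores the escaping rules that real SPARQL applies to special characters in local names.

```python
PREFIXES = {"example": "https://example.com/"}


def expand(name, prefixes):
    """Expand a prefixed name such as example:IsA into a full IRI."""
    prefix, _, local = name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local


print(expand("example:IsA", PREFIXES))     # https://example.com/IsA
print(expand("example:Python", PREFIXES))  # https://example.com/Python
```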
But SPARQL can handle much more complex queries. In real life, we may have a lot of information on programming languages, like the date of the first release, the names of creators and designers, etc. Our databases are also not limited to programming languages. They may have information on human beings, natural languages, rivers, mountains, etc. Some of this information may not be present in our database, and we may need to query external databases. These use cases are discussed in the rest of this article.
Wikidata
To demonstrate SPARQL with real-life data, we now use Wikidata, which is an open-data store for information related to a large number of domains and not just programming languages. It also has a dedicated SPARQL endpoint where you can run the queries given below and see the responses.
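The queries below can be pasted into the web interface of the endpoint, but they can also be sent from a program. The following is a minimal sketch using only the Python standard library. It assumes the public endpoint URL https://query.wikidata.org/sparql; the function names and the User-Agent string are placeholders chosen for this example, and the User-Agent should be replaced with one that identifies your own tool.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumed here; check the service documentation).
ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="sparql-intro-example/0.1 (placeholder)"):
    """Build a GET request that asks for results in the SPARQL JSON format."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,
        },
    )


def run_query(query):
    """Send the query over the network and return the decoded JSON document."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return json.load(response)


# Usage (needs network access, so it is left commented out):
# data = run_query("SELECT ?proglang { ?proglang wdt:P31 wd:Q9143 } LIMIT 5")
# for binding in data["results"]["bindings"]:
#     print(binding["proglang"]["value"])
```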
Basic SPARQL queries
Let’s reuse some of the above examples.
Give me a list of the programming languages
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?proglang {
?proglang wdt:P31 wd:Q9143
}
Take a look at the two prefixes that we used in the above query. wdt: is used to specify the relationships and wd: the entities. In this case, wdt:P31 ("instance of") plays the role of the IsA relationship seen above, and wd:Q9143 specifies the entity "programming language".
So far, we have been using only one triple pattern in the query. What if there are two or more triple patterns in the query? For example, wdt:P571 ("inception") gives the date when a programming language first appeared. The query engine now needs to match both patterns in the data store, as demonstrated by the next query. The remaining queries assume the same two PREFIX declarations.
Give me a list of the programming languages along with the dates of their inception.
SELECT ?proglang ?year {
?proglang wdt:P31 wd:Q9143.
?proglang wdt:P571 ?year.
}
However, we need not repeat ?proglang every time. We can drop the subsequent appearances by ending a pattern with ; instead of a period, which carries the subject over to the next pattern.
SELECT ?proglang ?year {
?proglang wdt:P31 wd:Q9143;
wdt:P571 ?year.
}
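With two patterns, a solution must satisfy both at once, so an item without an inception date drops out of the result. The sketch below illustrates this with a toy join in Python. It is a teaching aid, not how Wikidata's engine works; the function names, the Inception predicate on the example domain, and the years in the toy data are assumptions for illustration, and C++ is deliberately left without a date.

```python
EX = "https://example.com/"

# Toy data, not Wikidata output. C++ has no Inception statement on purpose.
TRIPLES = [
    (EX + "C", EX + "IsA", EX + "ProgrammingLanguage"),
    (EX + "C++", EX + "IsA", EX + "ProgrammingLanguage"),
    (EX + "Python", EX + "IsA", EX + "ProgrammingLanguage"),
    (EX + "C", EX + "Inception", "1972"),
    (EX + "Python", EX + "Inception", "1991"),
]


def unify(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None on mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Reuse an earlier value for this variable, or bind it now.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(patterns, triples):
    """Return every binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


rows = solve(
    [
        ("?proglang", EX + "IsA", EX + "ProgrammingLanguage"),
        ("?proglang", EX + "Inception", "?year"),
    ],
    TRIPLES,
)
for row in rows:
    print(row["?proglang"], row["?year"])
```

Because ?proglang appears in both patterns, the value found by the first pattern constrains the second, which is exactly what the shared subject expresses in the query.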
SPARQL queries using expressions
In real-life situations, we may not be interested in listing all the programming languages. We may wish to filter our results. The SPARQL query language supports expressions that can be used to specify conditions and keep only the relevant results. Let’s take an example query.
Give me a list of the programming languages released after the year 2000 along with the dates of their inception.
SELECT ?proglang ?year {
?proglang wdt:P31 wd:Q9143;
wdt:P571 ?year.
FILTER (year(?year) > 2000).
}
Now our SPARQL query engine will not only try to match the two triple patterns but also verify whether the inception year is greater than 2000. In the above example, we make use of the keyword FILTER for filtering the results. The function year extracts the year from the inception date.
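In other words, FILTER is applied to each candidate solution after the patterns have matched. A rough Python equivalent is shown below; the rows are made up for illustration and are not Wikidata output, and the timestamp layout is an assumption about how dates appear in the results.

```python
from datetime import datetime

# Made-up rows shaped like (?proglang, ?year) solutions; not real Wikidata data.
rows = [
    {"proglang": "example:LanguageA", "year": "1995-01-01T00:00:00Z"},
    {"proglang": "example:LanguageB", "year": "2009-01-01T00:00:00Z"},
]


def year(timestamp):
    """Mimic SPARQL's year(): extract the year from an ISO 8601 timestamp."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").year


# Keep only the solutions for which the filter expression is true.
released_after_2000 = [row for row in rows if year(row["year"]) > 2000]
print([row["proglang"] for row in released_after_2000])  # ['example:LanguageB']
```

Note that the variable named ?year in the query actually holds a full date; only the call to year reduces it to a number.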
Aggregate SPARQL queries
What if we are not interested in listing the programming languages but want to count the available information instead? The language provides several aggregate functions, like count, for this purpose; count is used in the following two examples.
Give me the count of available programming languages.
SELECT (count(?proglang) as ?count) {
?proglang wdt:P31 wd:Q9143.
}
The above code will give the number of programming languages stored in Wikidata. It is also possible to combine aggregates with expressions and get the count of filtered results, as demonstrated below.
Give me the count of programming languages released after the year 2000.
SELECT (count(?proglang) as ?count) {
?proglang wdt:P31 wd:Q9143;
wdt:P571 ?year.
FILTER (year(?year) > 2000).
}
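The above code will give the number of programming languages stored in Wikidata whose inception year is after 2000. Note that count(?proglang) counts matching rows, so a language with several inception dates is counted once per date; count(DISTINCT ?proglang) counts each language once.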
Advanced SPARQL queries
But with SPARQL, we can also try some advanced queries, like asking Wikidata if there are some programming languages stored in the datastore.
Is there any programming language?
ASK {
?proglang wdt:P31 wd:Q9143.
}
In the example below, we ask Wikidata whether it has any information about the programming languages and their inceptions.
Is there information about programming languages and their inception?
ASK {
?proglang wdt:P31 wd:Q9143;
wdt:P571 ?year.
}
The answers to these queries may be true or false, depending on the availability of the data.
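When the results are requested in the standard SPARQL JSON results format, an ASK answer arrives as a single boolean field, while SELECT answers (including counts) arrive as a list of bindings whose values are strings. The sketch below parses both shapes. The function name parse_results is an assumption, and the two sample documents are hand-written for illustration; the value 42 is not a real count from Wikidata.

```python
import json


def parse_results(document):
    """Return a bool for ASK results, or a list of {variable: value} rows."""
    data = json.loads(document)
    if "boolean" in data:
        return data["boolean"]
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


# Hand-written sample documents (illustrative values only).
ask_response = '{"head": {}, "boolean": true}'
count_response = """
{
  "head": {"vars": ["count"]},
  "results": {"bindings": [
    {"count": {"type": "literal",
               "datatype": "http://www.w3.org/2001/XMLSchema#integer",
               "value": "42"}}
  ]}
}
"""

print(parse_results(ask_response))  # True
rows = parse_results(count_response)
print(int(rows[0]["count"]))  # 42
```

The count comes back as text, so it has to be converted with int before it can be used as a number.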
SPARQL queries using Federation
In real life, no single datastore can store all the information. We may need to make use of multiple datastores to get a (probably) complete view of the different entities. In our final example, we query another data store called DBPedia to see if we can obtain additional information. For example, DBPedia has a lot of information on the C programming language that may not be present on Wikidata.
Is there some additional information about programming languages on DBPedia?
SELECT ?proglang ?resource ?homepage {
?proglang wdt:P31 wd:Q9143.
SERVICE <http://dbpedia.org/sparql> {
?resource rdf:type wd:Q9143;
owl:sameAs ?proglang;
foaf:homepage ?homepage.
}
}
LIMIT 10
See the use of the keyword SERVICE that specifies the SPARQL endpoint of DBPedia for obtaining the relevant information. The interesting part of such queries is that they can be run on the Wikidata SPARQL endpoint and the query engine will call other services like DBPedia for obtaining the data. Such queries are called federated queries.
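Conceptually, a federated query joins solutions from the local store with solutions from a remote one. The toy Python sketch below imitates that with two in-memory lists; every identifier and URL in it is hypothetical, and a real engine may send the remote pattern in a single request rather than once per local row.

```python
# Toy "local" store: items typed as programming languages (hypothetical IDs).
LOCAL = [
    ("local:L1", "IsA", "ProgrammingLanguage"),
    ("local:L2", "IsA", "ProgrammingLanguage"),
]

# Toy "remote" store, standing in for a second endpoint (hypothetical data).
REMOTE = [
    ("remote:R1", "sameAs", "local:L1"),
    ("remote:R1", "homepage", "https://example.com/l1-home"),
]


def remote_service(local_id):
    """Stand-in for the SERVICE block: resources equal to local_id with a homepage."""
    resources = [s for s, p, o in REMOTE if p == "sameAs" and o == local_id]
    return [
        (resource, o)
        for resource in resources
        for s, p, o in REMOTE
        if s == resource and p == "homepage"
    ]


def federated_query():
    """Join local programming languages with what the remote store knows."""
    rows = []
    for s, p, o in LOCAL:
        if p == "IsA" and o == "ProgrammingLanguage":
            for resource, homepage in remote_service(s):
                rows.append({"proglang": s, "resource": resource, "homepage": homepage})
    return rows


print(federated_query())
```

Only local:L1 appears in the output, because the remote store has nothing linked to local:L2; in the same way, the federated query returns only the languages that DBPedia can match.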
This article presented an introduction to several key aspects of the SPARQL query language. Though we used wd:Q9143 for obtaining information related to programming languages, some of the above queries can be adapted to obtain information related to software (wd:Q7397), mountains (wd:Q8502), parks (wd:Q22698), etc.
Originally published at https://johnsamuel.info.