I want to query Wikidata entities and their labels in multiple languages, but for some reason querying the labels is very slow.
My base query, which finds 3 life forms that have Unicode characters associated with them, takes around 200 ms to run:
SELECT ?lifeform ?unicode_character WHERE {
?lifeform wdt:P31 wd:Q16521;
wdt:P487 ?unicode_character.
}
LIMIT 3
What I want is to add labels in 4 languages to the result: English (en), German (de), Spanish (es) and French (fr). In my opinion that shouldn't make the query any harder, because the labels are just additional information about the results. They don't change the number of results or which results are found.
Here is what an answer would look like:
| lifeform | unicode_character | label_en | label_de | label_fr | label_es |
|---|---|---|---|---|---|
| wd:Q80117 | 🍤 | Caridea | Caridea | Caridea | camarón |
| wd:Q726 | 🐎 | horse | Hauspferd | cheval | caballo |
| wd:Q71516 | 🐪 | Camelus dromedarius | Dromedar | dromadaire | dromedario |
My first approach was based on this Stack Overflow answer:
SELECT ?lifeform ?unicode_character ?label_en ?label_de ?label_fr ?label_es WHERE {
?lifeform wdt:P31 wd:Q16521;
wdt:P487 ?unicode_character.
OPTIONAL { ?lifeform rdfs:label ?label_en filter (lang(?label_en) = "en"). }
OPTIONAL { ?lifeform rdfs:label ?label_de filter (lang(?label_de) = "de"). }
OPTIONAL { ?lifeform rdfs:label ?label_fr filter (lang(?label_fr) = "fr"). }
OPTIONAL { ?lifeform rdfs:label ?label_es filter (lang(?label_es) = "es"). }
}
LIMIT 3
This works, but brings the time up to almost 7000 ms.
So I tried this answer next, resulting in this query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?lifeform ?unicode_character ?label_en ?label_de ?label_fr ?label_es WHERE {
?lifeform wdt:P31 wd:Q16521;
wdt:P487 ?unicode_character.
?lifeform rdfs:label ?label_en, ?label_de, ?label_fr, ?label_es.
FILTER(
((LANG(?label_de)) = "de") &&
((LANG(?label_en)) = "en") &&
((LANG(?label_fr)) = "fr") &&
((LANG(?label_es)) = "es"))
}
LIMIT 3
This doesn't even terminate (but I think it is close to terminating, because if I reduce it to 2 or 3 languages it does).
(I also asked an AI for help and it made a lot of suggestions, none of which was even a valid query.)
Honestly, I am a bit baffled by how hard this seems to be. I get that searching for patterns in a graph might be hard, but this seems to be a very easy task: just look up a few associated values for those 3 entities. How does this justify a roughly 35x increase in time (from around 200 ms to almost 7000 ms)? Adding other variables to the query, like optional images and parent taxons of the ?lifeform, doesn't seem to hurt performance nearly as much.
Later I hope to increase the size of the request (let's say 10 items with 10 languages each). So I am interested in a better approach, not a trick that makes this one query work but doesn't generalize.
So my questions are:
- Why do my queries take so long to complete when the base query was reasonably fast?
- How can I query labels in multiple languages in a more performant way?
SELECT ?lifeform ?unicode_character ?label_en ?label_de ?label_fr ?label_es WHERE {
{
SELECT * {
?lifeform wdt:P31 wd:Q16521;
wdt:P487 ?unicode_character.
}
LIMIT 3
}
OPTIONAL { ?lifeform rdfs:label ?label_en filter (lang(?label_en) = "en"). }
OPTIONAL { ?lifeform rdfs:label ?label_de filter (lang(?label_de) = "de"). }
OPTIONAL { ?lifeform rdfs:label ?label_fr filter (lang(?label_fr) = "fr"). }
OPTIONAL { ?lifeform rdfs:label ?label_es filter (lang(?label_es) = "es"). }
}
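The query above applies the LIMIT inside a subquery and only then attaches the OPTIONAL label lookups to the rows that survive. Since the question mentions scaling to something like 10 items with 10 languages, the OPTIONAL clauses could be generated instead of written by hand. Below is a minimal Python sketch of that idea. The function name `build_label_query` and its parameters are hypothetical, not part of any library, and it only builds the query string; sending it to an endpoint is left out.

```python
def build_label_query(languages, limit=3):
    """Build a SPARQL query that limits first, then adds one label per language.

    `languages` is a list of language codes such as ["en", "de"].
    This helper is a hypothetical illustration, not an existing API.
    """
    label_vars = []
    optionals = []
    for lang in languages:
        # SPARQL variable names cannot contain "-", so "pt-br" becomes "pt_br".
        var = "?label_" + lang.replace("-", "_")
        label_vars.append(var)
        optionals.append(
            f'  OPTIONAL {{ ?lifeform rdfs:label {var} '
            f'FILTER(LANG({var}) = "{lang}") }}'
        )

    return (
        f"SELECT ?lifeform ?unicode_character {' '.join(label_vars)} WHERE {{\n"
        "  {\n"
        # Inner subquery: same pattern as the base query, limited up front.
        "    SELECT ?lifeform ?unicode_character WHERE {\n"
        "      ?lifeform wdt:P31 wd:Q16521;\n"
        "                wdt:P487 ?unicode_character.\n"
        "    }\n"
        f"    LIMIT {limit}\n"
        "  }\n"
        # Outer part: one OPTIONAL label lookup per requested language.
        + "\n".join(optionals) + "\n"
        "}"
    )


if __name__ == "__main__":
    print(build_label_query(["en", "de", "fr", "es"]))
```

Calling `build_label_query(["en", "de", "fr", "es"], limit=10)` would produce the same shape of query as above with `LIMIT 10`. Whether the timing stays acceptable with 10 languages is something to measure against the endpoint rather than assume.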