GraphDB - very slow sparql query with two connections

Question

My database has information about documents, where each document has a category, e.g.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX: <http://example.com>

:doc1 :hasCategory :category1 .
:category1 rdfs:label "Law" .

There are about 100k statements like this.

Running a simple query to get counts of documents per category:

SELECT ?category (count(distinct ?doc) as ?count) WHERE {
    ?doc :hasCategory ?category .
} GROUP BY ?category

takes about 0.1s to run.

But to return the category labels as well:

SELECT ?category ?label (count(distinct ?doc) as ?count) WHERE {
            ?doc :hasCategory ?category .
            ?category rdfs:label ?label .
} GROUP BY ?category ?label

this query takes more than 7s to run.

Why would the difference be so large, and is there a more optimised query I can use to get the labels?

You can check and compare the query plans of both queries. Do you really need to group by label in each category? If there is only a single label per category, try SELECT ?category (sample(?label) as ?l) (count(distinct ?doc) as ?count) WHERE { ...} GROUP BY ?category — UninformedUser
– UninformedUser, Commented Jun 14, 2018 at 12:51
The estimated number of iterations is 39926 for both queries. sample reduced the time taken by a second, but it's still about 6s — user888734
– user888734, Commented Jun 14, 2018 at 14:47
Actually it's 39926 for the first query and 39926.4551683254 for the second query. — user888734
– user888734, Commented Jun 14, 2018 at 14:51
The number of unique objects for the first statement ?doc :hasCategory ?category . is only 133 . I would have imagined that it only has to find labels for 133 categories — user888734
– user888734, Commented Jun 14, 2018 at 15:01

user888734 · Accepted Answer · 2018-06-14 15:13:04Z

3

I found I can get the desired result in 0.2s with the following query:

SELECT ?category (sample(?lbl) as ?label) ?count WHERE {
    ?category rdfs:label ?lbl .
    {
        SELECT ?category (count(distinct ?doc) as ?count) WHERE {
            ?doc :hasCategory ?category .
        } GROUP BY ?category 
    }
} GROUP BY ?category ?count

But I don't really understand why it's more efficient.

answered Jun 14, 2018 at 15:13

user888734

3,9175 gold badges40 silver badges74 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

vassil_momtchev · Accepted Answer · 2018-06-14 15:29:04Z

2

GraphDB versions before 8.6 release implement the GROUP BY operation with a naive LinkedHashMap, where the hash key is composed of all elements part of the projection. To calculate the hashcode the engine will translate the internal identifier to a RDF value. If the strings are longer, they will be read from an external collection resulting an extra disk operation and additional CPU to calculate the hashcode.

The only way to optimise the query is to switch to GraphDB 8.6 (currently it's a late release candidate), which implements a more optimised aggregate algorithm or reduce the GROUP BY projection like you did in your answer.

answered Jun 14, 2018 at 15:29

vassil_momtchev

1,2036 silver badges11 bronze badges

Collectives™ on Stack Overflow

GraphDB - very slow sparql query with two connections

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related