I wanted to know, for each US president, which state they were born in, so I wrote this Wikidata query:

SELECT * WHERE {
  # P31 = instance of
  # Q5 = human (excludes fictional characters)
  ?president wdt:P31 wd:Q5.

  # P39 = held office
  # Q11696 = POTUS
  ?president wdt:P39 wd:Q11696.

  # P19 = place of birth
  ?president wdt:P19 ?birthPlace.

  # P131 = located in the administrative territorial entity
  ?birthPlace wdt:P131* ?state.

  # P31 = instance of
  # Q35657 = US State
  #?state wdt:P31 wd:Q35657.
}

This query returns all birth places and containing territories of those birth places for all presidents, yielding 140 results in 743 ms. For example, (Theodore Roosevelt, Manhattan) and (Theodore Roosevelt, New York) are both in the result.

When I uncomment the last triple pattern (restricting the 140 territories in the previous result to US states only), the query exceeds the 60-second timeout. Why is that, and how can I simplify the query?


My debugging:

I haven't observed a chain of P131 longer than 4. They are usually Neighborhood/Borough -> City -> State -> US, and the US is not contained in any administrative territory. On average the paths have length 140 results / 45 presidents = 3.1.

I looked at the Blazegraph query details for both the first (all-territories) query and the second (US-states-only) query. Note that the details page for the second query does not render because of the timeout, so click "view source" to see it. Filtered versions of both plans are shown below.

To my untrained eye, the relevant difference is the join order. The first plan filters on P39 Q11696 (held office: POTUS) and then looks up the birth places (P19), whereas the second plan looks up the birth places (P19) and then filters on P39 Q11696.

  • Is my conclusion right?
  • Based on the query planner's own estimated cardinalities, isn't it better to filter (cardinality=63) then lookup rather than lookup (cardinality=3720417) then filter?
  • If so, how can I guide the query planner to do the filter before the lookup?
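
For the last question, one possible approach is a Blazegraph named subquery (`WITH { ... } AS %name` combined with `INCLUDE`), which computes a solution set first and then joins the outer pattern against it. This is only a sketch and has not been run against the live endpoint: the name `%presidents` is made up for illustration, and whether it actually avoids the timeout is an assumption.

```sparql
SELECT ?president ?birthPlace ?state
WITH {
  # Small set first: people who held the office of POTUS, with their birth places.
  SELECT ?president ?birthPlace WHERE {
    ?president wdt:P39 wd:Q11696;   # P39 = held office, Q11696 = POTUS
               wdt:P31 wd:Q5;       # P31 = instance of, Q5 = human
               wdt:P19 ?birthPlace. # P19 = place of birth
  }
} AS %presidents
WHERE {
  INCLUDE %presidents.
  # Walk up the administrative hierarchy from each birth place.
  ?birthPlace wdt:P131* ?state.
  # Keep only the containing entities that are US states.
  ?state wdt:P31 wd:Q35657.
}
```

The intent is that the expensive P131* path is evaluated starting from a small set of birth places rather than from every US state downward.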

Filtered versions of both query plans in BlazeGraph:

# First, all-territories query

ChunkedMaterializationOp[14](ProjectionOp[13])[ vars=[president, birthPlace, state] ]
  ProjectionOp[13](HashJoinOp[12])[ select=[president, birthPlace, state] ]
    HashJoinOp[12](Pathop[11])[ namedSetRef=NamedSolutionSetRef{localName=set-7,joinVars=[birthPlace]} ]
      Pathop[11](HashIndexOp[10])[ subquery=Join[9]()[predicate=SPO[8](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287222]], leftTerm=birthPlace, rightTerm=state, projectInVars=[birthPlace], dropVars=[tVarLeft, tVarRight] ]
      @subquery:
        Join[9]()[ predicate=SPO[8](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287222] ]
        HashIndexOp[10](Join[6])[ namedSetRef=NamedSolutionSetRef{localName=set-7,joinVars=[birthPlace]} ]
          Join[6](Join[4])[ predicate=SPO[5](president=null, P31, Q5)[estimatedCardinality=12613937] ]
            Join[4](Join[2])[ predicate=SPO[3](president=null, P19, birthPlace=null)[estimatedCardinality=3720416]]
              Join[2]()[ predicate=SPO[1](president=null, P39, Q11696)[estimatedCardinality=63] ]

# Second, only-states query

ChunkedMaterializationOp[16](ProjectionOp[15])[ vars=[president, birthPlace, state] ]
  ProjectionOp[15](Join[14])[ sharedState=true, select=[president, birthPlace, state] ]
    Join[14](Join[12])[ predicate=SPO[13](president=null, P31, Q5)[estimatedCardinality=12613938] ]
      Join[12](Join[10])[ predicate=SPO[11](president=null, P39, Q11696)[estimatedCardinality=63] ]
        Join[10](HashJoinOp[8])[ predicate=SPO[9](president=null, P19, birthPlace=null)[estimatedCardinality=3720417] ]
          HashJoinOp[8](Pathop[7])[ namedSetRef=NamedSolutionSetRef{localName=set-3,joinVars=[state]} ]
            Pathop[7](HashIndexOp[6])[ subquery=Join[5]()[predicate=SPO[4](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287225]], leftTerm=birthPlace, rightTerm=state, projectInVars=[state], dropVars=[tVarLeft, tVarRight] ]
            @subquery:
              Join[5]()[ predicate=SPO[4](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287225] ]
              HashIndexOp[6](Join[2])[ HashjoinVars=[state], namedSetRef=NamedSolutionSetRef{localName=set-3,joinVars=[state]} ]
                Join[2]()[ predicate=SPO[1](state=null, P31, Q35657)[estimatedCardinality=50] ]
  • @DarkBee, why remove "SPARQL" from the title? Commented Oct 23 at 21:33
  • You could disable the query optimizer and put the more selective POTUS triple pattern at the top: SELECT * WHERE { hint:Query hint:optimizer "None". ?president wdt:P39 wd:Q11696. ?president wdt:P31 wd:Q5. ?president wdt:P19 ?birthPlace. ?birthPlace wdt:P131* ?state. ?state wdt:P31 wd:Q35657. } Commented Oct 24 at 5:45
  • You could also choose another SPARQL endpoint that hosts Wikidata, e.g. QLever. Note that queries using Blazegraph-specific features won't work the same way there; e.g. there is no label service like the one you might have seen in Wikidata example queries. Commented Oct 24 at 5:51
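
For readability, the query suggested in the comment above can be laid out as follows. It is the same query, only reformatted and annotated; it relies on the Blazegraph-specific `hint:Query hint:optimizer "None"` hint, so it is not expected to work on other engines, and it assumes the `wd:`, `wdt:`, and `hint:` prefixes are predeclared as they are on the Wikidata Query Service.

```sparql
SELECT * WHERE {
  # Blazegraph-specific: turn off join reordering, so the patterns
  # below are evaluated in the order they are written.
  hint:Query hint:optimizer "None".

  # Most selective pattern first: P39 = held office, Q11696 = POTUS.
  ?president wdt:P39 wd:Q11696.

  # P31 = instance of, Q5 = human (excludes fictional characters).
  ?president wdt:P31 wd:Q5.

  # P19 = place of birth.
  ?president wdt:P19 ?birthPlace.

  # P131 = located in the administrative territorial entity.
  ?birthPlace wdt:P131* ?state.

  # Q35657 = US state.
  ?state wdt:P31 wd:Q35657.
}
```

With the optimizer disabled, the written order is what matters, which is why the POTUS pattern (estimated cardinality 63 in the plans above) comes first and the state filter comes last.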
