I wanted to know, for each US president, the state in which they were born, so I wrote this Wikidata query:
SELECT * WHERE {
# P31 = instance of
# Q5 = human (excludes fictional characters)
?president wdt:P31 wd:Q5.
# P39 = held office
# Q11696 = POTUS
?president wdt:P39 wd:Q11696.
# P19 = place of birth
?president wdt:P19 ?birthPlace.
# P131 = located in the administrative territorial entity
?birthPlace wdt:P131* ?state.
# P31 = instance of
# Q35657 = US State
#?state wdt:P31 wd:Q35657.
}
This query returns all birth places and containing territories of those birth places for all presidents, yielding 140 results in 743 ms. For example, (Theodore Roosevelt, Manhattan) and (Theodore Roosevelt, New York) are both in the result.
When I uncomment the last line (to restrict the 140 territories in the previous result to US states only), the query hits the 60-second timeout. Why is that, and how can I simplify the query?
My debugging:
I haven't observed a chain of P131 longer than 4. Chains are usually Neighborhood/Borough -> City -> State -> US, and the US is not contained in any administrative territory. On average, the paths have length 140 results / 45 presidents = 3.1.
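The chain lengths can be checked per president by grouping the results of the first query. This is only a sketch: the variable name ?territories is made up for illustration, and because wdt:P131* also matches the zero-length path, each count includes the birth place itself (and a president with more than one recorded birth place would have the territories of all of them counted together).

```sparql
SELECT ?president (COUNT(DISTINCT ?state) AS ?territories) WHERE {
  # P39 = held office, Q11696 = POTUS
  ?president wdt:P39 wd:Q11696.
  # P31 = instance of, Q5 = human
  ?president wdt:P31 wd:Q5.
  # P19 = place of birth
  ?president wdt:P19 ?birthPlace.
  # P131* = zero or more "located in" steps, so ?state also binds to ?birthPlace
  ?birthPlace wdt:P131* ?state.
}
GROUP BY ?president
ORDER BY DESC(?territories)
```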
I looked at the BlazeGraph query details for the first (all-territories) query and the second (US-states-only) query. Note that the second BlazeGraph query does not render due to the timeout, so click "view source" to see it. See below for a filtered version.
It appears to my untrained eye that the relevant difference is that the first one filters on P39 Q11696 (holds-office POTUS) and then looks up the birth places (P19) whereas the second query looks up the birth places (P19) and then filters on P39-Q11696 (holds-office POTUS).
- Is my conclusion right?
- Based on the query planner's own estimated cardinalities, isn't it better to filter (cardinality=63) and then look up, rather than look up (cardinality=3720417) and then filter?
- If so, how can I guide the query planner to do the filter before the lookup?
Filtered versions of both query plans in BlazeGraph:
# First, all-territories query
ChunkedMaterializationOp[14](ProjectionOp[13])[ vars=[president, birthPlace, state] ]
ProjectionOp[13](HashJoinOp[12])[ select=[president, birthPlace, state] ]
HashJoinOp[12](Pathop[11])[ namedSetRef=NamedSolutionSetRef{localName=set-7,joinVars=[birthPlace]} ]
Pathop[11](HashIndexOp[10])[ subquery=Join[9]()[predicate=SPO[8](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287222]], leftTerm=birthPlace, rightTerm=state, projectInVars=[birthPlace], dropVars=[tVarLeft, tVarRight] ]
@subquery:
Join[9]()[ predicate=SPO[8](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287222] ]
HashIndexOp[10](Join[6])[ namedSetRef=NamedSolutionSetRef{localName=set-7,joinVars=[birthPlace]} ]
Join[6](Join[4])[ predicate=SPO[5](president=null, P31, Q5)[estimatedCardinality=12613937] ]
Join[4](Join[2])[ predicate=SPO[3](president=null, P19, birthPlace=null)[estimatedCardinality=3720416]]
Join[2]()[ predicate=SPO[1](president=null, P39, Q11696)[estimatedCardinality=63] ]
# Second, only-states query
ChunkedMaterializationOp[16](ProjectionOp[15])[ vars=[president, birthPlace, state] ]
ProjectionOp[15](Join[14])[ sharedState=true, select=[president, birthPlace, state] ]
Join[14](Join[12])[ predicate=SPO[13](president=null, P31, Q5)[estimatedCardinality=12613938] ]
Join[12](Join[10])[ predicate=SPO[11](president=null, P39, Q11696)[estimatedCardinality=63] ]
Join[10](HashJoinOp[8])[ predicate=SPO[9](president=null, P19, birthPlace=null)[estimatedCardinality=3720417] ]
HashJoinOp[8](Pathop[7])[ namedSetRef=NamedSolutionSetRef{localName=set-3,joinVars=[state]} ]
Pathop[7](HashIndexOp[6])[ subquery=Join[5]()[predicate=SPO[4](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287225]], leftTerm=birthPlace, rightTerm=state, projectInVars=[state], dropVars=[tVarLeft, tVarRight] ]
@subquery:
Join[5]()[ predicate=SPO[4](tVarLeft=null, P131, tVarRight=null)[estimatedCardinality=14287225] ]
HashIndexOp[6](Join[2])[ HashjoinVars=[state], namedSetRef=NamedSolutionSetRef{localName=set-3,joinVars=[state]} ]
Join[2]()[ predicate=SPO[1](state=null, P31, Q35657)[estimatedCardinality=50] ]
SELECT * WHERE {
hint:Query hint:optimizer "None".
?president wdt:P39 wd:Q11696.
?president wdt:P31 wd:Q5.
?president wdt:P19 ?birthPlace.
?birthPlace wdt:P131* ?state.
?state wdt:P31 wd:Q35657.
}
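A possible alternative to disabling the optimizer for the whole query, as the query above does, is to hint only the direction in which the property path is evaluated. The sketch below is untested and assumes that Blazegraph's gearing hint, placed directly after the path pattern via hint:Prior, makes the path start from the already-bound ?birthPlace instead of from the 50 states. Whether it is enough to avoid the timeout here is an assumption, not a measured result.

```sparql
SELECT * WHERE {
  ?president wdt:P39 wd:Q11696.
  ?president wdt:P31 wd:Q5.
  ?president wdt:P19 ?birthPlace.
  ?birthPlace wdt:P131* ?state.
  # Assumption: evaluate the path forward, from ?birthPlace towards ?state
  hint:Prior hint:gearing "forward".
  ?state wdt:P31 wd:Q35657.
}
```

The design difference is scope: hint:Query hint:optimizer "None" fixes the join order of every pattern to the order written, while the gearing hint only affects the single path pattern it follows.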