For the sake of a simple example, let's say that I have a database with the columns text, ind, and sentid. You can think of it as a database with one row per word, holding the word's text, its position in the sentence, and the ID of the sentence.
To query for specific occurrences of n words, I join the table with itself n times. That table does not have a unique column except for the default rowid. (Note: the first script below still uses the original, flattened layout with the n columns ind0, ind1, ind2 in a single table; the JOINed version is given further down in an edit.) I often want to query these rows in such a way that the integers in the n ind columns are sequential, without any integer between them. Sometimes the order matters, sometimes it does not. At the same time, each of the n joined rows also needs to fulfil some requirement, e.g. n0.text = 'a' AND n1.text = 'b' AND n2.text = 'c'. Put differently: in every sentence (unique sentid), find all occurrences of a b c, either ordered or in any order (but sequential).
The tougher nut to crack is those cases where the integers have to be sequential but the order does not matter (e.g., 7 9 8, 3 1 2, 11 10 9). My current approach is "brute forcing": simply generating all possible permutations of orders (e.g., (ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind0 = ind1 + 1 AND ind2 = ind0 + 1) OR ...). But as n grows, this becomes a huge list of possibilities, and my query speed seems to suffer badly from it. For example, for n=6 (the maximum I need) this generates 720 potential orders separated by OR. A working version of this approach is given as a minimal but documented example below for you to try out.

    import sqlite3
    from itertools import permutations
    from pathlib import Path
    from random import choices, sample
    from string import ascii_lowercase


    def generate_all_possible_sequences(n: int) -> str:
        """Given an integer, generates all possible permutations of the 'n' given indices with respect to
        order in SQLite. What this means is that it will generate all possible permutations, e.g., for '3':
        0, 1, 2; 0, 2, 1; 1, 0, 2; 1, 2, 0 etc. and then build corresponding SQLite requirements, e.g.,
            0, 1, 2: ind1 = ind0 + 1 AND ind2 = ind1 + 1
            0, 2, 1: ind2 = ind0 + 1 AND ind1 = ind2 + 1
            ...
        and all these possibilities are then concatenated with OR to allow every possibility:
            ((ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind2 = ind0 + 1 AND ind1 = ind2 + 1) OR ...)
        """
        idxs = list(range(n))

        order_perms = []
        for perm in permutations(idxs):
            this_perm_orders = []
            for i in range(1, len(perm)):
                this_perm_orders.append(f"ind{perm[i]} = ind{perm[i-1]} + 1")
            order_perms.append(f"({' AND '.join(this_perm_orders)})")

        return f"({' OR '.join(order_perms)})"


    def main():
        pdb = Path("temp.db")

        if pdb.exists():
            pdb.unlink()

        conn = sqlite3.connect(str(pdb))
        db_cur = conn.cursor()
        db_cur.execute("CREATE TABLE tbl(text TEXT, ind0 INTEGER, ind1 INTEGER, ind2 INTEGER)")

        # Generate 200 row values: a random str and three distinct integers between 0 and 9
        vals = [(''.join(choices(ascii_lowercase, k=20)), *map(str, sample(range(10), 3))) for _ in range(200)]

        # Wrap the values in single quotes for SQLite
        vals = [(f"'{v}'" for v in val) for val in vals]

        # Convert values into INSERT commands
        cmds = [f"INSERT INTO tbl VALUES ({','.join(val)})" for val in vals]

        # Build DB
        db_cur.executescript(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;")

        # Query DB for sequential occurrences in ind0, ind1, and ind2: the order does not matter
        # but they have to be sequential
        query = f"""SELECT tbl.text, tbl.ind0, tbl.ind1, tbl.ind2
                    FROM tbl
                    WHERE {generate_all_possible_sequences(3)}"""

        for res in db_cur.execute(query).fetchall():
            print("\t".join(map(str, res)))

        db_cur.close()
        conn.commit()
        conn.close()
        pdb.unlink()


    if __name__ == '__main__':
        main()

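To make the growth of the brute-force condition concrete, here is a small sketch that counts how many OR-ed alternatives and how many individual comparisons the generated WHERE clause contains for a given n. The helper names `count_order_clauses` and `count_comparisons` are made up for this illustration; the counts follow directly from the way the condition is built (one alternative per permutation, n - 1 comparisons per alternative).

```python
from itertools import permutations
from math import factorial


def count_order_clauses(n: int) -> int:
    """Count the OR-ed alternatives the brute-force condition needs for n indices."""
    return sum(1 for _ in permutations(range(n)))


def count_comparisons(n: int) -> int:
    """Each alternative chains n - 1 equality comparisons with AND."""
    return count_order_clauses(n) * (n - 1)


for n in range(2, 7):
    # One alternative per permutation, so the count is n!
    assert count_order_clauses(n) == factorial(n)
    print(n, count_order_clauses(n), count_comparisons(n))
```

For n=6 this gives the 720 alternatives mentioned above, with 5 comparisons each, so 3600 comparisons in a single WHERE clause.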
As requested, here is an updated script where the data is actually JOINed. Note, of course, that this example will still be quite fast, but in reality the database contains millions of rows and there can be up to 6 JOINs.
Fiddle with data and current implementation here; reproducible Python code below.
Note that it is possible to get multiple results per sentid, but only if at least one of the word indices differs among the three matched words. I.e., permutations of the results themselves are not needed. E.g., 1-2-0 and 3-2-1 can both be valid results for one sentid, but 1-2-0 and 2-1-0 can't both be.

    import sqlite3
    from itertools import permutations
    from pathlib import Path
    from random import shuffle


    def generate_all_possible_sequences(n: int) -> str:
        """Given an integer, generates all possible permutations of the 'n' given indices with respect to
        order in SQLite. What this means is that it will generate all possible permutations, e.g., for '3':
        0, 1, 2; 0, 2, 1; 1, 0, 2; 1, 2, 0 etc. and then build corresponding SQLite requirements, e.g.,
            0, 1, 2: w1.ind = w0.ind + 1 AND w2.ind = w1.ind + 1
            0, 2, 1: w2.ind = w0.ind + 1 AND w1.ind = w2.ind + 1
            ...
        and all these possibilities are then concatenated with OR to allow every possibility:
            ((w1.ind = w0.ind + 1 AND w2.ind = w1.ind + 1) OR (w2.ind = w0.ind + 1 AND w1.ind = w2.ind + 1) OR ...)
        """
        idxs = list(range(n))

        order_perms = []
        for perm in permutations(idxs):
            this_perm_orders = []
            for i in range(1, len(perm)):
                this_perm_orders.append(f"w{perm[i]}.ind = w{perm[i-1]}.ind + 1")
            order_perms.append(f"({' AND '.join(this_perm_orders)})")

        return f"({' OR '.join(order_perms)})"


    def main():
        pdb = Path("temp.db")

        if pdb.exists():
            pdb.unlink()

        conn = sqlite3.connect(str(pdb))
        db_cur = conn.cursor()
        # Create a table of words, where each word has its text, its position in the sentence, and the ID of its sentence
        db_cur.execute("CREATE TABLE tbl(text TEXT, ind INTEGER, sentid INTEGER)")

        # Create dummy data
        vals = []
        for sent_id in range(320):
            shuffled = ["a", "b", "c", "d", "e", "a", "c"]
            shuffle(shuffled)
            for word_id, word in enumerate(shuffled):
                vals.append((word, word_id, sent_id))

        # Wrap the values in single quotes for SQLite
        vals = [(f"'{v}'" for v in val) for val in vals]

        # Convert values into INSERT commands
        cmds = [f"INSERT INTO tbl VALUES ({','.join(val)})" for val in vals]

        # Build DB
        db_cur.executescript(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;")
        print(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;\n")

        # Query DB for sequential occurrences in JOIN'd indices (`ind`): the order does not matter
        # but they have to be sequential
        query = f"""SELECT w0.ind, w1.ind, w2.ind, w0.sentid
                    FROM tbl AS w0
                    JOIN tbl AS w1 USING (sentid)
                    JOIN tbl AS w2 USING (sentid)
                    WHERE w0.text = 'a'
                    AND w1.text = 'b'
                    AND w2.text = 'c'
                    AND {generate_all_possible_sequences(3)}"""

        print(query)
        print()

        print("a_idx\tb_idx\tc_idx\tsentid")
        for res in db_cur.execute(query).fetchall():
            print("\t".join(map(str, res)))

        db_cur.close()
        conn.commit()
        conn.close()
        pdb.unlink()


    if __name__ == '__main__':
        main()
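
For comparison, the ordered case mentioned at the top only needs a single chain of conditions, since each word must directly follow the previous one. The sketch below illustrates that on a tiny, hand-made in-memory table; the two example sentences and the helper name `find_ordered` are assumptions made up for this illustration, and it uses explicit ON conditions instead of USING. It does not address the unordered case, which is what the question is about.

```python
import sqlite3


def find_ordered(conn, words):
    """Return (sentid, start_ind) for every in-order, consecutive run of `words`.

    Assumes `words` is a non-empty list of strings.
    """
    # Each extra word is joined to the previous one: same sentence, next position
    joins = [
        f"JOIN tbl AS w{i} ON w{i}.sentid = w{i - 1}.sentid AND w{i}.ind = w{i - 1}.ind + 1"
        for i in range(1, len(words))
    ]
    # One text requirement per joined row, passed as bound parameters
    conds = [f"w{i}.text = ?" for i in range(len(words))]
    query = (
        "SELECT w0.sentid, w0.ind FROM tbl AS w0 "
        + " ".join(joins)
        + " WHERE " + " AND ".join(conds)
        + " ORDER BY w0.sentid, w0.ind"
    )
    return conn.execute(query, words).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl(text TEXT, ind INTEGER, sentid INTEGER)")

# Two hand-made sentences: "a b c d" (ordered match) and "c b a d" (same words, reversed)
sentences = {0: ["a", "b", "c", "d"], 1: ["c", "b", "a", "d"]}
rows = [(w, i, sid) for sid, sent in sentences.items() for i, w in enumerate(sent)]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)

result = find_ordered(conn, ["a", "b", "c"])
print(result)  # [(0, 0)]: only sentence 0 matches, starting at position 0
```

With the order fixed, the number of conditions grows linearly with n, whereas the any-order version above grows with n!.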