
I have a DB file that has a table with two columns, 'a' and 'b', and about 11 million rows.

When I load the table into a pandas.DataFrame and apply a simple filter like

df = df[ abs(df['a']-df['b']) > 0.0001 ]

the processing takes less than 500 ms.

However, when I query the database directly in the sqlite3 shell like this

SELECT a, b
FROM table
WHERE abs(a-b)>0.0001

the query takes about 3 s. In my actual work I need a more complex query, which will probably be even slower. I also need to change the filtering condition interactively, which means running the query many times to arrive at the final table.

I know that a pandas DataFrame lives in memory while the table is on disk. Is there a simple way to load tables into memory and filter the rows as fast as boolean indexing does in pandas?

2 Answers

1

You can play with settings such as the cache size or memory-mapping the database, but with relational databases, SQLite included, the way to improve query performance is an appropriate index. In particular, SQLite supports indexes on expressions:

CREATE INDEX table_idx_abs_a_b ON table(abs(a-b));

Compare the query plans before and after this index:

sqlite> CREATE TABLE foo(a, b);
sqlite> EXPLAIN QUERY PLAN SELECT a, b FROM foo WHERE abs(a-b)>0.0001;
QUERY PLAN
`--SCAN TABLE foo
sqlite> CREATE INDEX foo_idx_abs_a_b ON foo(abs(a-b));
sqlite> EXPLAIN QUERY PLAN SELECT a, b FROM foo WHERE abs(a-b)>0.0001;
QUERY PLAN
`--SEARCH TABLE foo USING INDEX foo_idx_abs_a_b (<expr>>?)

Without the index, SQLite has to scan the entire table and look at every row. With the index, it can directly look up the rows whose value is greater than the compared-to value and skip those that are less than or equal to it, which saves a lot of time if many rows are skipped. (If the condition is true for most of your rows, an index brings little benefit.)
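As a rough illustration of the comparison above, the following sketch repeats it from Python's built-in sqlite3 module. The sample data (1000 rows, 10 of which match) and the helper name plan are made up for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo(a, b)")

# Hypothetical sample data: only every 100th row has a noticeable difference.
rows = [(float(i), float(i) + (0.5 if i % 100 == 0 else 0.0)) for i in range(1000)]
conn.executemany("INSERT INTO foo VALUES (?, ?)", rows)

query = "SELECT a, b FROM foo WHERE abs(a-b) > 0.0001"


def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)  # expected to describe a full table scan
conn.execute("CREATE INDEX foo_idx_abs_a_b ON foo(abs(a-b))")
plan_after = plan(query)   # expected to mention foo_idx_abs_a_b

matches = conn.execute(query).fetchall()
print(plan_before)
print(plan_after)
print(len(matches), "matching rows")
```

The expression in the WHERE clause has to match the indexed expression for the planner to consider the index, so it is worth checking the plan again after changing the query.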

Another option is to calculate the abs(a-b) value ahead of time in another column (and add an index on it). The upcoming SQLite 3.31 will have generated columns for this sort of thing, but for now, triggers on insert and update that keep the column in sync with the a and b values are the way to go.
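One possible shape for the trigger approach, sketched with Python's sqlite3 module. The column name abs_diff and the trigger names foo_ai and foo_au are illustrative assumptions, not something from the question, and the sketch assumes an ordinary rowid table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo(a, b, abs_diff);
CREATE INDEX foo_idx_abs_diff ON foo(abs_diff);

-- Fill in abs_diff for every newly inserted row.
CREATE TRIGGER foo_ai AFTER INSERT ON foo
BEGIN
    UPDATE foo SET abs_diff = abs(NEW.a - NEW.b) WHERE rowid = NEW.rowid;
END;

-- Recompute abs_diff whenever a or b changes.
CREATE TRIGGER foo_au AFTER UPDATE OF a, b ON foo
BEGIN
    UPDATE foo SET abs_diff = abs(NEW.a - NEW.b) WHERE rowid = NEW.rowid;
END;
""")

conn.execute("INSERT INTO foo(a, b) VALUES (1.0, 1.5)")
conn.execute("INSERT INTO foo(a, b) VALUES (2.0, 2.0)")
conn.execute("UPDATE foo SET b = 5.0 WHERE a = 2.0")

# The filter now compares a plain indexed column instead of an expression.
result = conn.execute(
    "SELECT a, b, abs_diff FROM foo WHERE abs_diff > 0.0001 ORDER BY a"
).fetchall()
print(result)
```

Because the update trigger is limited to changes OF a, b, the write to abs_diff inside the trigger bodies does not fire it again.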


1 Comment

Thank you for this answer! I created an index as you suggested. Now the execution time is only 2ms.
0

SQLite does support pure in-memory databases (see the SQLite documentation). You would need to manage persistence yourself. Also, even an in-memory SQLite database will benefit from correctly specified keys. "Correct" in this case is determined by the exact nature of your queries.

5 Comments

I created an in-memory database and attached the one on disk. Then I tried the same filtering query. The query takes 4 s. It seems an in-memory database does not really improve speed.
I'm not a regular user of either in-memory or attached databases. However, I would expect that attaching a table from another database would still operate on that table in its original location (on disk). To operate in memory, you would have to INSERT the data from the on-disk database into a table in the in-memory database (paying a time cost for that operation, obviously) and then SELECT against the table resident in the in-memory database. Also, there is no guarantee that SQLite, even in-memory, will be as fast as pandas; they're different architectures for different purposes.
Also, if you do transfer the data into the in-memory database I'd perform the abs() calculation at that time and store the calculated number in the in-memory database. That should definitely improve performance.
I created an in-memory database and then copied the table to it using this query: "insert into tab1 select * from attached_db.table". Then I ran the same query as in my original question. The query still takes 4 s. I think the overhead comes from the 'WHERE' clause, as it has to be evaluated for 11 million rows.
@Nownuri An index on abs(a-b) would help.
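For reference, a minimal sketch of the combination discussed in these comments: copy the on-disk data into an in-memory database, then index the expression there. It uses Connection.backup from Python's sqlite3 module (Python 3.7+) instead of the INSERT ... SELECT shown above; the file name example.db and the three sample rows are placeholders standing in for the real database.

```python
import os
import sqlite3
import tempfile

# Build a small on-disk database standing in for the real file (placeholder data).
path = os.path.join(tempfile.mkdtemp(), "example.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE tab1(a, b)")
disk.executemany(
    "INSERT INTO tab1 VALUES (?, ?)",
    [(1.0, 1.0), (2.0, 2.5), (3.0, 3.00001)],
)
disk.commit()

# Copy the whole database into memory, then close the on-disk connection.
mem = sqlite3.connect(":memory:")
disk.backup(mem)
disk.close()

# The in-memory copy still needs the index to avoid scanning every row.
mem.execute("CREATE INDEX tab1_idx_abs_a_b ON tab1(abs(a-b))")
rows = mem.execute("SELECT a, b FROM tab1 WHERE abs(a-b) > 0.0001").fetchall()
print(rows)
```

Any changes made to the in-memory copy are lost when the connection closes unless they are written back to disk explicitly.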
