I have two dataframes: a (~600M rows) and b (~2M rows). What is the best approach for joining b onto a using one equality condition and two inequality conditions on the respective columns?
- a_1 = b_1
- a_2 >= b_2
- a_3 >= b_3
I have explored the following paths so far:
- Polars:
  - join_asof(): only allows one inequality condition
  - join_where() with filter(): even with a small tolerance window, the intermediate join exceeds the 4.3B-row limit of the standard Polars installation, and the polars-u64-idx installation runs out of memory (512 GB) - a sketch of this attempt is shown after this list
- DuckDB: ASOF LEFT JOIN: also only allows one inequality condition (see the second sketch after this list)
- Numba: Since none of the above worked, I tried to write my own join_asof()-style function - see the code at the end. It works fine, but it becomes prohibitively slow as the length of a grows. I tried various configurations of for/while loops and filtering, all with similar results.
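
For reference, here is roughly what the join_where() attempt looks like, using tiny dummy frames in place of the real a and b (the column names and the tolerance window are illustrative):

```python
import polars as pl

# Tiny stand-ins for the real a (~600M rows) and b (~2M rows).
a = pl.DataFrame({"a_1": [1, 1, 2], "a_2": [10, 20, 30], "a_3": [5, 5, 5]})
b = pl.DataFrame({"b_1": [1, 2], "b_2": [8, 25], "b_3": [3, 9], "b_4": [100, 200]})

# All three conditions as join_where() predicates, plus a tolerance filter
# intended to keep the intermediate result small (window size illustrative).
result = (
    a.join_where(
        b,
        pl.col("a_1") == pl.col("b_1"),
        pl.col("a_2") >= pl.col("b_2"),
        pl.col("a_3") >= pl.col("b_3"),
    )
    .filter((pl.col("a_2") - pl.col("b_2")) <= 100)
)
print(result)
```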
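
And here is the rough shape of the DuckDB attempt, again with dummy frames. The ASOF join accepts the equality condition plus a single ordering inequality (a_2 >= b_2 here), but there is no way to add the second inequality a_3 >= b_3:

```python
import duckdb
import polars as pl

a = pl.DataFrame({"a_1": [1, 1, 2], "a_2": [10, 20, 30], "a_3": [5, 5, 5]})
b = pl.DataFrame({"b_1": [1, 2], "b_2": [8, 25], "b_3": [3, 9], "b_4": [100, 200]})

# DuckDB picks up the Polars frames from the local scope (via its Arrow
# integration); the ASOF join matches the latest b_2 <= a_2 within each
# a_1 = b_1 group, but can't express the extra a_3 >= b_3 condition.
result = duckdb.sql("""
    SELECT a.*, b.b_4
    FROM a
    ASOF LEFT JOIN b
      ON a.a_1 = b.b_1 AND a.a_2 >= b.b_2
""").pl()
print(result)
```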
Now I'm running out of ideas... What would be a more efficient way to implement this?
Thank you
import numba as nb
import numpy as np
import polars as pl
import time
@nb.njit(nb.int32[:](nb.int32[:], nb.int32[:], nb.int32[:],
                     nb.int32[:], nb.int32[:], nb.int32[:], nb.int32[:]),
         parallel=True)
def join_multi_ineq(a_1, a_2, a_3, b_1, b_2, b_3, b_4):
    # For each row of a, scan b from the end and take b_4 of the first row
    # encountered (i.e. the last row of b) that satisfies all three join
    # conditions. Rows of a without a match keep the default value 0.
    output = np.zeros(len(a_1), dtype=np.int32)
    for i in nb.prange(len(a_1)):
        for j in range(len(b_1) - 1, -1, -1):
            if a_1[i] == b_1[j]:
                if a_2[i] >= b_2[j]:
                    if a_3[i] >= b_3[j]:
                        output[i] = b_4[j]
                        break
    return output
length_a = 5_000_000
length_b = 2_000_000
start_time = time.time()
output = join_multi_ineq(a_1=np.random.randint(1, 1_000, length_a, dtype=np.int32),
                         a_2=np.random.randint(1, 1_000, length_a, dtype=np.int32),
                         a_3=np.random.randint(1, 1_000, length_a, dtype=np.int32),
                         b_1=np.random.randint(1, 1_000, length_b, dtype=np.int32),
                         b_2=np.random.randint(1, 1_000, length_b, dtype=np.int32),
                         b_3=np.random.randint(1, 1_000, length_b, dtype=np.int32),
                         b_4=np.random.randint(1, 1_000, length_b, dtype=np.int32))
print(f"Duration: {(time.time() - start_time):.2f} seconds")