I noticed a significant performance regression in the polars DataFrame join function after upgrading polars from 1.30.0 to 1.31.0. The code snippet to reproduce it is below:
import polars as pl
import time
import numpy as np
print(pl.__version__)
np.random.seed(0)
indices = np.arange(2_000)
columns = [f"col_{i}" for i in range(20_000)]
df_1 = pl.DataFrame({
    "index": indices,
    **{col: np.random.rand(len(indices)) for col in columns}
})
df_2 = pl.DataFrame({
    "index": indices,
    **{col: np.random.rand(len(indices)) for col in columns}
})
print("DataFrames created.")
t0 = time.time()
df_merged = df_1.join(df_2, on="index", how="left", suffix="_right")
t1 = time.time()
print(f"Time taken to merge: {t1 - t0:.2f} seconds")
When using polars 1.30.0, the merge step takes 0.06 seconds,
1.30.0
DataFrames created.
Time taken to merge: 0.06 seconds
but when using polars 1.31.0, the merge step takes almost 30 seconds:
1.31.0
DataFrames created.
Time taken to merge: 27.68 seconds
Does anyone know why this happens?
With .lazy() on both frames and a run with .collect(engine="streaming"), it is fast again. You could report it on GitHub as a performance issue.
Time taken to merge: 0.06 seconds on my (Debian) Linux. I do not reproduce the issue in 1.35.1. They probably fixed the problem in a later minor update :)