I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small DataFrames, but when the DataFrame is large, some rows seem to be missing after the write.
Observations:
Reading back from Redis via the Spark-Redis connector returns fewer rows than the original DataFrame.
Reading directly by key or using scan_iter also returns fewer entries.
There are no duplicate rows in the DataFrame.
This issue only happens with large datasets; small datasets are written correctly.
Question:
Why does Spark-Redis drop rows when writing large DataFrames?
Are there any recommended settings, configurations, or approaches to reliably write large datasets to Redis using Spark-Redis?
Example Code:
import redis
from pyspark.sql import functions as F

# Prepare the Redis key column: the uid wrapped in curly braces (Redis hash-tag syntax)
df_to_redis = df.withColumn("key", F.concat(F.lit("{"), F.col("uid"), F.lit("}"))).select("key", "lang")
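# Sanity check (sketch): each row is stored as one Redis hash under its key,
# so distinct rows that share a uid silently overwrite each other. Compare
# total rows with distinct keys; any gap means keys collide.
total_rows = df_to_redis.count()
distinct_keys = df_to_redis.select("key").distinct().count()
print(f"Total rows: {total_rows}, distinct keys: {distinct_keys}")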
# Write to Redis
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .mode("append") \
    .save()
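# For context, a sketch of the same write with batching/connection options
# I believe the Spark-Redis connector supports ("timeout", "max.pipeline.size").
# Option names and values are assumptions to verify against the connector's README;
# I have not confirmed they change the outcome.
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .option("timeout", 10000) \
    .option("max.pipeline.size", 1000) \
    .mode("append") \
    .save()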
# Reading back from Redis using Spark-Redis
df_redis = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .load()
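# Measure the discrepancy directly (sketch): compare the row count of the
# source DataFrame with what Spark-Redis reads back.
print(f"Rows in source DataFrame: {df_to_redis.count()}")
print(f"Rows read back via Spark-Redis: {df_redis.count()}")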
# Reading all keys directly from Redis using redis-py keys()
r = redis.Redis(host="REDIS_HOST", port=6379, db=0)
all_keys = r.keys("info:*")
print(f"Number of keys read via keys(): {len(all_keys)}")
# Reading all keys from Redis using scan_iter()
keys = list(r.scan_iter("info:*"))
print(f"Number of keys read via scan_iter: {len(keys)}")