I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small DataFrames, but when the DataFrame is large, some rows seem to be missing after the write.
Observations:
Reading back from Redis via the Spark-Redis connector returns fewer rows than the original DataFrame.
Reading directly by key or using scan_iter also returns fewer entries.
There are no duplicate rows in the DataFrame.
This issue only happens with large datasets; small datasets are written correctly.
Question:
Why does Spark-Redis drop rows when writing large DataFrames?
Are there any recommended settings, configurations, or approaches to reliably write large datasets to Redis using Spark-Redis?
Example Code:
import redis
from pyspark.sql import functions as F

# Prepare the Redis key column: the uid wrapped in curly braces (Redis hash-tag syntax)
df_to_redis = df.withColumn("key", F.concat(F.lit("{"), F.col("uid"), F.lit("}"))).select("key", "lang")
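# Sanity check (sketch): each row is stored as one Redis hash under its key,
# so distinct rows that share a uid silently overwrite each other. Compare
# total rows with distinct keys; any gap means keys collide.
total_rows = df_to_redis.count()
distinct_keys = df_to_redis.select("key").distinct().count()
print(f"Total rows: {total_rows}, distinct keys: {distinct_keys}")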
# Write to Redis
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .mode("append") \
    .save()
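# For context, a sketch of the same write with batching/connection options
# I believe the Spark-Redis connector supports ("timeout", "max.pipeline.size").
# Option names and values are assumptions to verify against the connector's README;
# I have not confirmed they change the outcome.
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .option("timeout", 10000) \
    .option("max.pipeline.size", 1000) \
    .mode("append") \
    .save()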
# Reading back from Redis using Spark-Redis
df_redis = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .load()
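# Measure the discrepancy directly (sketch): compare the row count of the
# source DataFrame with what Spark-Redis reads back.
print(f"Rows in source DataFrame: {df_to_redis.count()}")
print(f"Rows read back via Spark-Redis: {df_redis.count()}")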
# Reading all keys directly from Redis using redis-py keys()
r = redis.Redis(host="REDIS_HOST", port=6379, db=0)
all_keys = r.keys("info:*")
print(f"Number of keys read via keys(): {len(all_keys)}")
# Reading all keys from Redis using scan_iter()
keys = list(r.scan_iter("info:*"))
print(f"Number of keys read via scan_iter: {len(keys)}")