
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.

Details:

  • I have a DataFrame with millions of rows.

  • Writing to Redis works correctly for small DataFrames, but when the DataFrame is large, some rows seem to be missing after the write.

Observations:

  1. Reading back from Redis via the Spark-Redis connector returns fewer rows than the original DataFrame.

  2. Reading directly by key or using scan_iter also returns fewer entries.

  3. There are no duplicate rows in the DataFrame.

  4. This issue only happens with large datasets; small datasets are written correctly.

Question:

  • Why does Spark-Redis drop rows when writing large DataFrames?

  • Are there any recommended settings, configurations, or approaches to reliably write large datasets to Redis using Spark-Redis?

Example Code

# Prepare Redis key column
df_to_redis = df.withColumn("key", F.concat(F.lit("{"), F.col("uid"), F.lit("}"))).select("key", "lang")

# Write to Redis
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .mode("append") \
    .save()
# Reading back from Redis using Spark-Redis
df_redis = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .load()
# Reading all keys directly from Redis using redis-py keys()
# (KEYS blocks the server while it scans the keyspace; fine for a one-off check,
#  but prefer SCAN-based iteration in production)
r = redis.Redis(host="REDIS_HOST", port=6379, db=0)
all_keys = r.keys("info:*")
print(f"Number of keys read via keys(): {len(all_keys)}")

# Reading all keys using scan_iter(), a non-blocking cursor-based scan
keys = list(r.scan_iter("info:*"))
print(f"Number of keys read via scan_iter(): {len(keys)}")
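With the DataFrame row count and the key counts in hand, a tiny helper (the function name is my own, not part of any library) can quantify the gap; it is plain Python, so it needs no live Redis or Spark session:

```python
def missing_report(expected_rows, found_keys):
    """Summarize how many entries are missing after a write.

    expected_rows: row count of the source DataFrame (df.count()).
    found_keys: number of keys read back (e.g. len(keys) from scan_iter).
    """
    missing = expected_rows - found_keys
    pct = 100.0 * missing / expected_rows if expected_rows else 0.0
    return {"expected": expected_rows, "found": found_keys,
            "missing": missing, "missing_pct": round(pct, 2)}

print(missing_report(1_000_000, 998_500))
# {'expected': 1000000, 'found': 998500, 'missing': 1500, 'missing_pct': 0.15}
```

Tracking this percentage across runs of different sizes can show whether the loss grows with the dataset, which would point to a resource limit rather than a logic bug.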
  • Did you set the correct timeout? A connection timeout might have caused the write to fail before all the data was persisted. I’ve seen a similar issue in this thread; check whether the solution there helps: stackoverflow.com/questions/67924823/… Commented Nov 19 at 5:07
  • I’ve run into this issue before. It turned out not to be a Redis write or data-loss problem; the real cause was that Redis had run out of memory. Did you check the memory usage? You can inspect it with r.info("memory"). Commented 2 days ago
  • @MorganFisher, thank you, it was indeed Redis running out of memory. Commented 2 days ago
  • @gianfrancodesiena if you have the time, please post this comment as an answer and accept it; this will help anyone who runs into the same issue in the future. Thanks! Commented 16 hours ago

1 Answer


I’ve seen this issue before, and it wasn’t caused by a Spark-Redis write failure or data loss. The real problem was that Redis had run out of memory. You can check this with the INFO command; example code follows.

If used_memory_human and maxmemory_human show the same value, Redis has reached its memory limit and cannot store additional data. Depending on the configured maxmemory-policy, Redis will then either reject new writes (noeviction) or silently evict existing keys (e.g. allkeys-lru), which is why rows appear to be missing after a large write.

import redis

# Connect to Redis
r = redis.Redis(
    host="your-redis-host",
    port=port_num,
    db=db_num
)

# Fetch memory usage
memory_info = r.info("memory")
usage = {
    "used_memory_human": memory_info.get("used_memory_human"),
    "maxmemory_human": memory_info.get("maxmemory_human")
}

print(usage)
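Beyond eyeballing the human-readable fields, the raw byte counters (used_memory and maxmemory, present in the same INFO dict) make the check programmatic. A minimal sketch, with the function name my own and stubbed dicts standing in for a live r.info("memory") call:

```python
def at_memory_limit(memory_info):
    """Return True if Redis has hit its configured maxmemory.

    memory_info is the dict returned by redis-py's r.info("memory").
    maxmemory == 0 means "no limit configured", so only a nonzero
    limit counts as exhaustible.
    """
    used = memory_info.get("used_memory", 0)
    limit = memory_info.get("maxmemory", 0)
    return limit > 0 and used >= limit

# With a live connection you would pass r.info("memory") directly;
# stubbed dicts are used here so the logic is clear without a server:
full = {"used_memory": 1_073_741_824, "maxmemory": 1_073_741_824}  # 1 GiB of 1 GiB
ok = {"used_memory": 52_428_800, "maxmemory": 1_073_741_824}       # 50 MiB of 1 GiB
print(at_memory_limit(full), at_memory_limit(ok))  # True False
```

If the limit is reached, also look at r.config_get("maxmemory-policy") and the evicted_keys counter in r.info("stats"): under noeviction the write errors out, while allkeys-lru and similar policies silently evict existing keys, which matches rows disappearing after a large write.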