
I have a MongoDB (v8.0.9) sharded cluster running in Kubernetes with the following setup:

  • 3 shards, each with 2 replicas
  • An empty collection was created with a hashed shard key (over a UUID field)
  • Sharding was enabled using:
    sh.enableSharding("db")
    sh.shardCollection("db.collection", { shardkey: "hashed" })
    
  • The balancer was started:
    sh.startBalancer()
    
  • I started importing data at a rate of about 2,500 documents per second
  • I stopped the import after reaching approximately 716 million documents

Data distribution is perfect (about 33.3% of the data on each shard), but the problem is that there are only 10 chunks in total. Here is the chunk distribution output, taken two days after the import finished:

Shard-0
{
  data: '884.5GiB',
  docs: 238,840,277,
  chunks: 1,
  'estimated data per chunk': '884.5GiB',
  'estimated docs per chunk': 238,840,277
}
Shard-1
{
  data: '884.4GiB',
  docs: 238,813,353,
  chunks: 4,
  'estimated data per chunk': '221.1GiB',
  'estimated docs per chunk': 59,703,338
}
Shard-2
{
  data: '884.54GiB',
  docs: 238,851,801,
  chunks: 5,
  'estimated data per chunk': '176.9GiB',
  'estimated docs per chunk': 47,770,360
}
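
For reference, per-shard summaries in this shape are what getShardDistribution() prints in mongosh; presumably the output above comes from a call like:

    // run from mongosh connected to the mongos router
    db.getSiblingDB("db").collection.getShardDistribution()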

My question:
Why are there only 10 chunks in total, even though the collection contains over 700 million documents and nearly 900 GiB of data per shard?


Additional info:

  • The balancer is running.
  • The collection was empty before import.
  • The shard key is a hashed UUID field.
  • chunksize: 128 (MB)

What could be causing the low number of chunks? Shouldn’t there be many more chunks given the data volume?
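
For reference, a minimal sketch of how the chunk metadata behind these numbers can be inspected directly (namespace and field names taken from above; since MongoDB 5.0, config.chunks references each collection by its UUID):

    // run from mongosh connected to the mongos router
    const cfg = db.getSiblingDB("config")
    const meta = cfg.collections.findOne({ _id: "db.collection" })  // collection metadata, including its UUID
    cfg.chunks.countDocuments({ uuid: meta.uuid })                  // total number of chunks for the collection
    cfg.settings.findOne({ _id: "chunksize" })                      // configured chunk size; value is in MB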


What I tried:

  • I attempted to manually split the chunks using the sh.splitFind() method (a sketch of the commands is shown after this list).
  • After executing the splits, the chunks were indeed split as expected.
  • However, after a few minutes, the chunks reverted to their previous state (i.e., the splits were undone).
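
A hedged sketch of the kind of split that was attempted ("someUuidValue" is a placeholder, not a value from the real data set), together with the background auto-merger that is the likely reason the splits were undone:

    // split the chunk containing a given shard key value (placeholder value)
    sh.splitFind("db.collection", { shardkey: "someUuidValue" })

    // MongoDB 7.0+ runs an automatic chunk merger in the background, which is the likely
    // reason contiguous same-shard splits get merged back after a few minutes; it can be
    // paused cluster-wide while experimenting (verify this helper against the 8.0 docs)
    sh.stopAutoMerger()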

Update (2025-05-27):

I continued testing and added 3 new shards to the cluster (so there are now 6 shards in total). The balancer was enabled and running. However, after adding the new shards, the data was not automatically redistributed to them: everything remained on the original 3 shards, and the chunk distribution did not change.
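
A minimal sketch of the checks that can show whether the balancer is actually migrating anything for this collection after the new shards were added (namespace assumed from above):

    sh.getBalancerState()                          // is the balancer enabled?
    sh.isBalancerRunning()                         // is a balancing round in progress right now?
    sh.balancerCollectionStatus("db.collection")   // does the balancer consider this collection balanced?

    // recent migrations recorded by the cluster, newest first
    db.getSiblingDB("config").changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(5)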

My questions:

  • What is the best approach to trigger data rebalancing after adding new shards to an existing cluster?
  • Is it effective for the balancer to work with such large chunks (e.g., ~884 GB per chunk), or should I try to split them into smaller chunks first?
  • Is there a recommended way to force or accelerate the redistribution of data to the new shards in this scenario?

Thank you for any insights!

Comments
  • MongoDB 8 spreads the data across the shards so that each has roughly the same size. What would be the benefit of having more chunks? Commented May 21 at 23:53
  • 128 GB chunk size is just the initial and minimum size. The balancer can merge chunks. If you don't modify any data and wait some more time, then most likely you will get even just four chunks - one in each shard. Commented May 22 at 5:04
  • @Joe: I thought that having more chunks with smaller sizes would allow the balancer to move smaller portions of data between shards, which could help maintain balance as the dataset grows or as new shards are added. Commented May 22 at 7:15
  • @Wernfried Domscheit: The default chunk size is 128 MB, not GB, so I expected to have many chunks of this size. Commented May 22 at 7:15
  • In the meantime, I found this response from the Sharding Product Manager: mongodb.com/community/forums/t/…. It seems that in newer versions of MongoDB, it is normal to have a large data size per chunk. If anyone knows the technical reasons behind this change, and why larger chunks are now preferred over smaller ones, I would appreciate any insights. Commented May 22 at 7:15

1 Answer


In a MongoDB sharded cluster, the balancer thread runs on the config server primary and is responsible for performing chunk migrations to ensure an even distribution of data across shards. The goal is to have each shard own approximately the same amount of data for a given sharded collection.

Prior to MongoDB 6.1, the balancer focused solely on distributing chunks, not the actual data size. This meant that if chunks were unevenly sized, the cluster could appear balanced in terms of chunk count, while the underlying data distribution remained skewed.

Starting with MongoDB 6.1 (and backported to 6.0.3 with the Feature Compatibility Version (FCV) set to "6.0"), the balancer distributes data based on data size rather than the number of chunks. This change coincides with the removal of the automatic chunk splitter, so a small number of very large chunks per shard is expected behaviour rather than a sign of a problem: the balancer still keeps the data itself evenly distributed across shards.
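
For completeness, a hedged sketch of the per-collection controls available since 6.0 for exactly this situation: the chunk size can be raised for a single collection and a defragmentation (merge) pass can be requested. The values below are illustrative, not a recommendation:

    // run against the mongos with an appropriately privileged user
    db.adminCommand({
      configureCollectionBalancing: "db.collection",
      chunkSize: 256,              // per-collection chunk size in MB (illustrative)
      defragmentCollection: true   // ask the balancer to merge contiguous chunks
    })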


1 Comment

Thanks for your explanation. I’ve updated my original post with more details (adding new shards to the cluster...). Could you please take a look and share your valuable thoughts? Thanks
