importance of pre-splitting; edited sharding FAQ #57

Merged 6 commits on Aug 29, 2012
55 changes: 55 additions & 0 deletions source/faq/sharding.txt
@@ -254,3 +254,58 @@ size of the connection pool. This setting prevents the
:program:`mongos` from causing connection spikes on the individual
:term:`shards <shard>`. Spikes like these may disrupt the operation
and memory allocation of the :term:`shard cluster`.

What are the best ways to insert large volumes of data into a sharded collection?
------------------------------------------------------------------------------------------------

- What is pre-splitting?

In sharded environments, MongoDB distributes data into :term:`chunks
<chunk>`, each defined by a range of shard key values. Pre-splitting is the process of running
the split command prior to data insertion to specify the shard key values at which to divide chunks.

- Pre-splitting is useful before large inserts into a sharded collection when:

1. inserting data into an empty collection

If a collection is empty, the database takes time to determine the optimal key
distribution. If you insert many documents in rapid succession, MongoDB will initially
direct writes to a single chunk, which can significantly degrade performance.
Predefining splits may improve write performance in the early stages of a bulk import by
eliminating the database's "learning" period.

2. data is not evenly distributed

Even if the sharded collection was previously populated with documents and contains multiple
chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed; in
other words, if the inserts have shard key values that fall within a small number of chunks.

3. monotonically increasing shard key

If you attempt to insert data with monotonically increasing shard keys, the writes will
always hit the last chunk in the collection. Predefining splits may help to cycle a large
write operation around the cluster; however, pre-splitting in this instance will not
prevent consecutive inserts from hitting a single shard.

- When does pre-splitting not matter?

If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without
impacting performance. In addition, if the collection already has chunks with an even key
distribution, pre-splitting may not be necessary.

See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more
information.


Is it necessary to pre-split data before high volume inserts into a sharded collection?
---------------------------------------------------------------------------------------

The answer depends on the shard key, the existing distribution of chunks, and how
evenly distributed the insert operation is. If a collection is empty prior to a
bulk insert, the database will take time to determine the optimal key
distribution. Predefining splits improves write performance in the early stages
of a bulk import.

Pre-splitting is also important if the write operation isn't evenly distributed.
When using a monotonically increasing shard key, for example, pre-splitting can help spread
a large write operation around the cluster, although it cannot keep consecutive inserts from
hitting a single shard.
161 changes: 161 additions & 0 deletions source/tutorial/inserting-documents-into-a-sharded-collection.txt
@@ -0,0 +1,161 @@
=============================================
Inserting Documents into a Sharded Collection
=============================================

Shard Keys
----------

.. TODO

   outline the ways that insert operations work given shard keys of the following types

Even Distribution
~~~~~~~~~~~~~~~~~

If the data's distribution of keys is even, MongoDB should be able to distribute writes
evenly around the cluster once the chunk key ranges are established. MongoDB will
automatically split chunks when they grow to a certain size (~64 MB by default) and will
balance the number of chunks across shards.
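
For reference, an overridden chunk size is stored in the "settings" collection of the config
database. A minimal sketch of inspecting and adjusting it from the mongo shell (the value is
in megabytes; 64 is the default):

> use config
> db.settings.find( { _id : "chunksize" } )
> db.settings.save( { _id : "chunksize", value : 64 } )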

When inserting data into a new collection, it may be important to pre-split the key ranges.
See the section below on pre-splitting and manually moving chunks.

Uneven Distribution
~~~~~~~~~~~~~~~~~~~

If you need to insert data into a collection that isn't evenly distributed, or if the shard
keys of the data being inserted aren't evenly distributed, you may need to pre-split your
data before doing high volume inserts. As an example, consider a collection sharded by last
name with the following key distribution:

(-∞, "Henri")
["Henri", "Peters")
["Peters",+∞)

Although the chunk ranges may be split evenly, inserting many users with a common last name
such as "Smith" can hammer a single shard. Making the chunk ranges more granular in this
portion of the alphabet may improve write performance, as in the following distribution and
the split commands sketched after it:

(-∞, "Henri")
["Henri", "Peters")
["Peters", "Smith"]
["Smith", "Tyler"]
["Tyler",+∞)

Monotonically Increasing Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. what will happen if you try to do inserts.

Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always
be inserted into the last chunk in a collection. To illustrate why, consider a sharded
collection with two chunks, the second of which has an unbounded upper limit.

(-∞, 100)
[100,+∞)

If the data being inserted has an increasing key, at any given time writes will always hit
the shard containing the chunk with the unbounded upper limit, a problem that is not
alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the
cluster's performance by placing a significant load on a single shard.

If, however, a single shard can handle the write volume, an increasing shard key may have
some advantages. For example, if you need to do queries based on document insertion time,
sharding on the ObjectID ensures that documents created around the same time exist on the
same shard. Data locality helps to improve query performance.
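
For instance, because an ObjectID embeds its creation timestamp in its first four bytes, a
query on insertion time can be expressed as an _id range. A rough sketch, assuming a
hypothetical "events" collection sharded on _id:

> // build an ObjectID whose embedded timestamp is 00:00:00 UTC on Aug 1, 2012
> var secs = Math.floor( Date.UTC( 2012, 7, 1 ) / 1000 )
> var boundary = ObjectId( secs.toString(16) + "0000000000000000" )
> db.events.find( { _id : { $gte : boundary } } )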

If you decide to use a monotonically increasing shard key and anticipate large inserts,
one solution may be to store a hash of the shard key as a separate field. Hashing distributes
data more evenly around the cluster and may reduce the need to balance chunks. You can
create the hash client-side. In the future, MongoDB may support automatic hashing:
https://jira.mongodb.org/browse/SERVER-2001
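
A minimal sketch of creating such a hash client-side in the mongo shell, assuming a
hypothetical "events" collection that is sharded on the hashed field (here called "keyHash"):

> var doc = { _id : new ObjectId(), name : "Smith" }
> doc.keyHash = hex_md5( doc._id.str )    // hex_md5() is a built-in shell helper
> db.events.insert( doc )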

Operations
----------

.. TODO

   outline the procedures and rationale for each process.

Pre-Splitting
~~~~~~~~~~~~~

.. when to do this
.. procedure for this process

Pre-splitting is the process of specifying shard key ranges for chunks prior to data
insertion. This process may be important prior to large inserts into a sharded collection
as a way of ensuring that the write operation is evenly spread around the cluster. You
should consider pre-splitting before large inserts if the sharded collection is empty, if
the collection's data or the data being inserted is not evenly distributed, or if the shard
key is monotonically increasing.

In the example below, the split command splits the chunk in which the _id value 99 would
reside, using that value as the split point. Note that the split point does not need to
correspond to an existing document; the resulting chunk may even be empty.

The first step is to create a sharded collection to contain the data, which can be done in
three steps:

> use admin
> db.runCommand({ enableSharding : "test" })

Next, we ensure the collection "test.foo" has a unique index on the shard key:

> use test
> db.foo.ensureIndex({ _id : 1 }, { unique : true })

Finally we shard the collection (which contains no data) on the _id field:

> use admin
> db.runCommand({ shardCollection : "test.foo", key : { _id : 1 } })

With the sharded collection in place, the split command specifies the split point:

> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

Once the key range is specified, chunks can be moved around the cluster using the moveChunk
command:

> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard0000" } )

You can repeat these steps as many times as necessary to create or move chunks around the
cluster (a scripted sketch appears at the end of this section). To get information about the
two chunks created in this example:

> db.printShardingStatus()
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
{ "_id" : "shard0001", "host" : "localhost:30001" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
test.foo chunks:
shard0001 1
shard0000 1
{ "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
{ "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

Once the chunks and the key ranges are evenly distributed, you can proceed with a
high volume insert.
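
To illustrate repeating the process, the following rough sketch pre-splits on the _id values
100, 200, ..., 900 and alternates the resulting chunks between the two shards shown above:

> use admin
> for ( var i = 1; i < 10; i++ ) {
      db.runCommand( { split : "test.foo" , middle : { _id : i * 100 } } );
      db.runCommand( { moveChunk : "test.foo" , find : { _id : i * 100 } ,
                       to : ( i % 2 == 0 ? "shard0000" : "shard0001" ) } );
  }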

Changing Shard Key
~~~~~~~~~~~~~~~~~~

There is no automatic support for changing the shard key for a collection. In addition,
since a document's location within the cluster is determined by its shard key value,
changing the shard key could force data to move from machine to machine, potentially a
highly expensive operation.

Thus it is very important to choose the right shard key up front.

If you do need to change a shard key, an export and import is likely the best solution.
Create a new pre-sharded collection, and then import the exported data into it. If desired,
use a dedicated mongos for the export and the import.

https://jira.mongodb.org/browse/SERVER-4000
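
For small collections, the copy step can also be approximated directly in the mongo shell; a
rough sketch, assuming the existing collection is "test.foo" and a new collection "test.foo2"
has already been sharded and pre-split on the new key (mongoexport and mongoimport are better
suited to large data sets):

> use test
> db.foo.find().forEach( function( doc ) { db.foo2.insert( doc ); } )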

Pre-allocating Documents
~~~~~~~~~~~~~~~~~~~~~~~~