diff --git a/source/faq/sharding.txt b/source/faq/sharding.txt
index 89178f7fe59..76cefd23656 100644
--- a/source/faq/sharding.txt
+++ b/source/faq/sharding.txt
@@ -254,3 +254,58 @@
size of the connection pool. This setting prevents the :program:`mongos`
from causing connection spikes on the individual :term:`shards <shard>`.
Spikes like these may disrupt the operation and memory allocation of the
:term:`shard cluster`.

What are the best ways to insert large volumes of data into a sharded collection?
-------------------------------------------------------------------------------------

In sharded environments, MongoDB distributes data into :term:`chunks <chunk>`,
each defined by a range of shard key values. Pre-splitting is the process of
specifying, before inserting any data, the shard key values at which chunks
should be split.

Pre-splitting is useful before large inserts into a sharded collection when:

1. You are inserting data into an empty collection.

   If a collection is empty, the database takes time to determine the optimal
   key distribution. If you insert many documents in rapid succession, MongoDB
   initially directs all writes to a single chunk, which can significantly
   degrade performance. Predefining splits may improve write performance in
   the early stages of a bulk import by eliminating the database's "learning"
   period.

2. The data is not evenly distributed.

   Even if the sharded collection was previously populated with documents and
   contains multiple chunks, pre-splitting may be beneficial if the write
   operation is not evenly distributed, in other words, if the inserts have
   shard key values concentrated in a small number of chunks.

3. The shard key is monotonically increasing.

   If you insert data with monotonically increasing shard keys, the writes
   always hit the last chunk in the collection. Predefining splits may help to
   cycle a large write operation around the cluster; however, pre-splitting in
   this instance will not prevent consecutive inserts from hitting a single
   shard.

Pre-splitting matters less in other situations. If data insertion is not
rapid, MongoDB may have enough time to split and migrate chunks without
impacting performance. In addition, if the collection already has chunks with
an even key distribution, pre-splitting may not be necessary.

See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`"
tutorial for more information.

Is it necessary to pre-split data before high volume inserts into a sharded collection?
------------------------------------------------------------------------------------------

The answer depends on the shard key, the existing distribution of chunks, and
how evenly distributed the insert operation is. If a collection is empty prior
to a bulk insert, the database takes time to determine the optimal key
distribution; predefining splits can improve write performance in the early
stages of the import.

Pre-splitting is also important if the write operation is not evenly
distributed. When using an increasing shard key, for example, pre-splitting
can prevent writes from hammering a single shard.
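For example, here is a minimal sketch of pre-splitting an empty collection
before a bulk import. The ``test.users`` namespace, the numeric ``userId``
shard key, the split points, and the ``shard0001`` target are all illustrative;
adjust them to your own data and to the shard names reported by
``db.printShardingStatus()``.

> use admin
> db.runCommand( { enableSharding : "test" } )
> db.runCommand( { shardCollection : "test.users" , key : { userId : 1 } } )
> // split the initial chunk at illustrative userId boundaries before inserting
> db.runCommand( { split : "test.users" , middle : { userId : 10000 } } )
> db.runCommand( { split : "test.users" , middle : { userId : 20000 } } )
> // optionally move one of the resulting chunks to another shard
> db.runCommand( { moveChunk : "test.users" , find : { userId : 20000 } , to : "shard0001" } )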
diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt
new file mode 100644
index 00000000000..d0e751c7bef
--- /dev/null
+++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt
@@ -0,0 +1,161 @@
=============================================
Inserting Documents into a Sharded Collection
=============================================

Shard Keys
----------

.. TODO

   outline the ways that insert operations work given shard keys of the
   following types

Even Distribution
~~~~~~~~~~~~~~~~~

If the shard key values of the data are evenly distributed, MongoDB should be
able to distribute writes evenly around the cluster once the chunk key ranges
are established. MongoDB automatically splits chunks when they grow to a
certain size (~64 MB by default) and balances the number of chunks across
shards.

When inserting data into a new collection, it may be important to pre-split
the key ranges. See the section below on pre-splitting and manually moving
chunks.

Uneven Distribution
~~~~~~~~~~~~~~~~~~~

If you need to insert data into a collection that isn't evenly distributed, or
if the shard keys of the data being inserted aren't evenly distributed, you
may need to pre-split your data before doing high volume inserts. As an
example, consider a collection sharded by last name with the following key
distribution:

(-∞, "Henri")
["Henri", "Peters")
["Peters", +∞)

Although the chunk ranges may be split evenly, inserting many users with a
common last name such as "Smith" will potentially hammer a single shard.
Making the chunk ranges more granular in this portion of the alphabet may
improve write performance:

(-∞, "Henri")
["Henri", "Peters")
["Peters", "Smith")
["Smith", "Tyler")
["Tyler", +∞)

Monotonically Increasing Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. what will happen if you try to do inserts.

Documents with monotonically increasing shard keys, such as the BSON ObjectId,
will always be inserted into the last chunk in a collection. To illustrate
why, consider a sharded collection with two chunks, the second of which has an
unbounded upper limit:

(-∞, 100)
[100, +∞)

If the data being inserted has an increasing key, writes will at any given
time always hit the shard containing the chunk with the unbounded upper limit,
a problem that is not alleviated by splitting the "hot" chunk. High volume
inserts, therefore, could hinder the cluster's performance by placing a
significant load on a single shard.

If, however, a single shard can handle the write volume, a monotonically
increasing shard key may have some advantages. For example, if you need to run
queries based on document insertion time, sharding on the ObjectId ensures
that documents created around the same time exist on the same shard. This data
locality helps to improve query performance.

If you decide to use a monotonically increasing shard key and anticipate large
inserts, one solution may be to store a hash of the shard key value in a
separate field and use that field as the shard key. Hashing distributes data
evenly around the cluster, which may reduce the need to balance chunks. You
can create the hash client-side. In the future, MongoDB may support automatic
hashing: https://jira.mongodb.org/browse/SERVER-2001
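For illustration, here is a minimal sketch of computing such a hash in the
mongo shell. The ``users`` collection and the ``keyHash`` field are
illustrative names, and ``hex_md5()`` is simply one hash helper available in
the shell; any client-side hash function will work.

> var doc = { _id : new ObjectId() , name : "Smith" }
> // store an MD5 hash of the ObjectId's hex string in the field used as the shard key
> doc.keyHash = hex_md5( doc._id.str )
> db.users.insert( doc )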
Operations
----------

.. TODO

   outline the procedures and rationale for each process.

Pre-Splitting
~~~~~~~~~~~~~

.. when to do this
.. procedure for this process

Pre-splitting is the process of specifying shard key ranges for chunks prior
to data insertion. This process may be important before large inserts into a
sharded collection as a way of ensuring that the write operation is evenly
spread around the cluster. You should consider pre-splitting before large
inserts if the sharded collection is empty, if the collection's data or the
data being inserted is not evenly distributed, or if the shard key is
monotonically increasing.

In the example below, the split command splits the chunk where the _id value
99 would reside, using that value as the split point. Note that a key value
need not exist for a chunk to use it in its range; the chunk may even be
empty.

The first step is to create a sharded collection to contain the data, which
can be done in three steps. First, enable sharding on the database:

> use admin
> db.runCommand( { enableSharding : "test" } )

Next, we add a unique index to the collection "test.foo", which is required
for the shard key:

> use test
> db.foo.ensureIndex( { _id : 1 } , { unique : true } )

Finally, we shard the collection (which contains no data) using the _id value:

> use admin
> db.runCommand( { shardCollection : "test.foo" , key : { _id : 1 } } )

With the collection sharded, we can split the chunk containing _id 99 at that
value:

> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

Once the key range is specified, chunks can be moved around the cluster using
the moveChunk command:

> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard0000" } )

You can repeat these steps as many times as necessary to create or move chunks
around the cluster. To get information about the two chunks created in this
example:

> db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
      { "_id" : "shard0000", "host" : "localhost:30000" }
      { "_id" : "shard0001", "host" : "localhost:30001" }
  databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.foo chunks:
                  shard0001    1
                  shard0000    1
          { "_id" : { "$MinKey" : true } } -->> { "_id" : 99 } on : shard0001 { "t" : 2000, "i" : 1 }
          { "_id" : 99 } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

Once the chunks and the key ranges are evenly distributed, you can proceed
with a high volume insert.

Changing Shard Key
~~~~~~~~~~~~~~~~~~

There is no automatic support for changing the shard key of a collection. In
addition, since a document's location within the cluster is determined by its
shard key value, changing the shard key could force data to move from machine
to machine, which is potentially a very expensive operation.

Thus it is very important to choose the right shard key up front.

If you do need to change a shard key, an export and import is likely the best
solution. Create a new pre-sharded collection, and then import the exported
data into it. If desired, use a dedicated mongos for the export and the
import.

https://jira.mongodb.org/browse/SERVER-4000

Pre-allocating Documents
~~~~~~~~~~~~~~~~~~~~~~~~
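One common approach, sketched here with illustrative collection and field
names and placeholder sizes: if documents grow after insertion, for example
because fields are added or arrays are appended to over time, each growth
event may force MongoDB to move the document on disk. Inserting documents
pre-padded to roughly their eventual size can avoid those moves.

> // insert the document at roughly its eventual size, padded with a throwaway field
> var id = new ObjectId()
> db.records.insert( { _id : id , values : [ ] , padding : new Array( 1024 ).join( "x" ) } )
> // on the first real update, drop the padding so later growth happens in place
> db.records.update( { _id : id } , { $unset : { padding : "" } , $push : { values : 7 } } )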