From f06b38760d617e60fbc81974f7eb0bafcee560cc Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Mon, 9 Jul 2012 19:10:39 -0400 Subject: [PATCH 1/6] importance of pre-splitting --- source/faq/sharding.txt | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/source/faq/sharding.txt b/source/faq/sharding.txt index 89178f7fe59..d20cd2d754b 100644 --- a/source/faq/sharding.txt +++ b/source/faq/sharding.txt @@ -55,6 +55,13 @@ collection data to the different shards in the cluster. The cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another. +Is it necessary to pre-split data before high volume inserts into a sharded collection? +--------------------------------------------------------------------------------------- + +The answer depends on the shard key, the existing distribution of chunks, and how evenly distributed the insert operation is. If a collection is empty prior to a bulk insert, the database will take time to determine the optimal key distribution. Predefining splits improves write performance in the early stages of a bulk import. + +Pre-splitting is also important if the write operation isn't evenly distributed. When using an increasing shard key, for example, pre-splitting data can prevent writes from hammering a single shard. + What happens if a client updates a document in a chunk during a migration? -------------------------------------------------------------------------- From 7fba3453803f54add4a87eb3ea43f1e28256d0c8 Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Mon, 16 Jul 2012 23:10:49 -0400 Subject: [PATCH 2/6] edited pre-splitting faq i'm not 100% happy with this writeup, but it's a start (hopefully helpful). I'm still working on the "inserting-into-a-sharded-collection" doc. let's chat tomorrow; i want to make sure i'm including the appropriate amount of detail. --- source/faq/sharding.txt | 62 ++++++++++++++++--- ...ng-documents-into-a-sharded-collection.txt | 41 ++++++++++++ 2 files changed, 96 insertions(+), 7 deletions(-) create mode 100644 source/tutorial/inserting-documents-into-a-sharded-collection.txt diff --git a/source/faq/sharding.txt b/source/faq/sharding.txt index d20cd2d754b..76cefd23656 100644 --- a/source/faq/sharding.txt +++ b/source/faq/sharding.txt @@ -55,13 +55,6 @@ collection data to the different shards in the cluster. The cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another. -Is it necessary to pre-split data before high volume inserts into a sharded collection? ---------------------------------------------------------------------------------------- - -The answer depends on the shard key, the existing distribution of chunks, and how evenly distributed the insert operation is. If a collection is empty prior to a bulk insert, the database will take time to determine the optimal key distribution. Predefining splits improves write performance in the early stages of a bulk import. - -Pre-splitting is also important if the write operation isn't evenly distributed. When using an increasing shard key, for example, pre-splitting data can prevent writes from hammering a single shard. - What happens if a client updates a document in a chunk during a migration? -------------------------------------------------------------------------- @@ -261,3 +254,58 @@ size of the connection pool. This setting prevents the :program:`mongos` from causing connection spikes on the individual :term:`shards `. Spikes like these may disrupt the operation and memory allocation of the :term:`shard cluster`. + +What are the best ways to successfully insert larger volumes of data into as sharded collection? +------------------------------------------------------------------------------------------------ + +- what is pre-splitting + + In sharded environments, MongoDB distributes data into :term:`chunks + `, each defined by a range of shard key values. Pre-splitting is a command run + prior to data insertion that specifies the shard key values on which to split up chunks. + +- Pre-splitting is useful before large inserts into a sharded collection when: + +1. inserting data into an empty collection + +If a collection is empty, the database takes time to determine the optimal key +distribution. If you insert many documents in rapid succession, MongoDB will initially +direct writes to a single chunk, potentially having significant impacts on performance. +Predefining splits may improve write performance in the early stages of a bulk import by +eliminating the database's "learning" period. + +2. data is not evenly distributed + +Even if the sharded collection was previously populated with documents and contains multiple +chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed, in +other words, if the inserts have shard keys values contained on a small number of chunks. + +3. monotomically increasing shard key + +If you attempt to insert data with monotonically increasing shard keys, the writes will +always hit the last chunk in the collection. Predefining splits may help to cycle a large +write operation around the cluster; however, pre-splitting in this instance will not +prevent consecutive inserts from hitting a single shard. + +- when does it not matter + +If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without +impacts on performance. In addition, if the collection already has chunks with an even key +distribution, pre-splitting may not be necessary. + +See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more +information. + + +Is it necessary to pre-split data before high volume inserts into a sharded collection? +--------------------------------------------------------------------------------------- + +The answer depends on the shard key, the existing distribution of chunks, and how +evenly distributed the insert operation is. If a collection is empty prior to a +bulk insert, the database will take time to determine the optimal key +distribution. Predefining splits improves write performance in the early stages +of a bulk import. + +Pre-splitting is also important if the write operation isn't evenly distributed. +When using an increasing shard key, for example, pre-splitting data can prevent +writes from hammering a single shard. diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt new file mode 100644 index 00000000000..8e44a5a6815 --- /dev/null +++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -0,0 +1,41 @@ +============================================= +Inserting Documents into a Sharded Collection +============================================= + +Shard Keys +---------- + +.. TODO + + outline the ways that insert operations work given shard keys of the following types + +Monotonically Increasing Values +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. what will happen if you try to do inserts. +.. pros/cons. + +Even Distribution +~~~~~~~~~~~~~~~~~ + +Uneven Distribution +~~~~~~~~~~~~~~~~~~~ + +Operations +---------- + +.. TODO + + outline the procedures and rationale for each process. + +Pre-Splitting +~~~~~~~~~~~~~ + +.. when to do this +.. procedure for this process + +Changing Shard Key +~~~~~~~~~~~~~~~~~~ + +Incremental Inserts +~~~~~~~~~~~~~~~~~~~ \ No newline at end of file From fce9d4f83778d62245df7f1733cd2bf93cb22e1c Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Tue, 17 Jul 2012 11:29:39 -0400 Subject: [PATCH 3/6] more edits --- ...ng-documents-into-a-sharded-collection.txt | 73 ++++++++++++++++++- 1 file changed, 71 insertions(+), 2 deletions(-) diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt index 8e44a5a6815..d2688e1838d 100644 --- a/source/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -13,7 +13,22 @@ Monotonically Increasing Values ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. what will happen if you try to do inserts. -.. pros/cons. + +Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always +be inserted into the last chunk in a collection. Although MongoDB can automatically split +and migrate chunks when they reach a certain size, using an increasing shard key means that +writes will always hit a single shard at any given time. High volume inserts, therefore, +could hinder the cluster's performance by placing a significant load on a single replica +set. + +If, however, a single shard can handle the write volume, an increasing shard key may have +some advantages. For example, if you frequently do queries based on insertion time, +sharding on the ObjectID ensures that documents created around the same time exist on the +same shard. Data locality helps to improve query performance. + +If you decide to use an monotonically increasing shard key and anticipate large inserts, +one solution may be to store the hash of the shard key as a separate field. + Even Distribution ~~~~~~~~~~~~~~~~~ @@ -34,8 +49,62 @@ Pre-Splitting .. when to do this .. procedure for this process -Changing Shard Key +Pre-splitting is the process of specifying shard key ranges for chunks prior to data +insertion. This process may be important prior to large inserts into a sharded collection +as a way of ensuring that the write operation is evenly spread around the cluster. You +should consider pre-splitting before large inserts if the sharded collection is empty, if +the collection's data or the data being inserted is not evenly distributed, or if the shard +key is monotonically increasing. + +In the example below the pre-split command splits the chunk where the _id 99 would reside +using that key as the split point. Note that a key need not exist for a chunk to use it in +its range. The chunk may even be empty. + +> use admin +switched to db admin +> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } ) + +Once the key range is specified, chunks can be moved around the cluster using the moveChunk +command. + +> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } ) + +You can repeat these steps as many times as necessary to create or move chunks around the +cluster. To get information about the two chunks created in this example: + +> db.printShardingStatus() +--- Sharding Status --- + sharding version: { "_id" : 1, "version" : 3 } + shards: + { "_id" : "shard0000", "host" : "localhost:30000" } + { "_id" : "shard0001", "host" : "localhost:30001" } + databases: + { "_id" : "admin", "partitioned" : false, "primary" : "config" } + { "_id" : "test", "partitioned" : true, "primary" : "shard0001" } + test.foo chunks: + shard0001 1 + shard0000 1 + { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 } + { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 } + +Once the chunks and the key ranges are evenly distributed, you can proceed with a +high volume insert. + +Changing Shard Key ~~~~~~~~~~~~~~~~~~ +There is no automatic support for changing the shard key for a collection. In addition, +since a document's location within the cluster is determined by its shard key value, +changing the shard key could force data to move from machine to machine, potentially a +highly expensive operation. + +Thus it is very important to choose the right shard key up front. + +If you do need to change a shard key, an export and import is likely the best solution. +Create a new pre-sharded collection, and then import the exported data to it. If desired +use a dedicated mongos for the export and the import. + +https://jira.mongodb.org/browse/SERVER-4000 + Incremental Inserts ~~~~~~~~~~~~~~~~~~~ \ No newline at end of file From d1cc2bf5ad5cfac517ee370eae59971029c4c63e Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Tue, 17 Jul 2012 13:28:01 -0400 Subject: [PATCH 4/6] updated sharding FAQ and sharding tutorial --- ...ng-documents-into-a-sharded-collection.txt | 42 +++++++++++++++---- 1 file changed, 35 insertions(+), 7 deletions(-) diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt index d2688e1838d..3b8ba9fc6c6 100644 --- a/source/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -15,27 +15,55 @@ Monotonically Increasing Values .. what will happen if you try to do inserts. Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always -be inserted into the last chunk in a collection. Although MongoDB can automatically split -and migrate chunks when they reach a certain size, using an increasing shard key means that -writes will always hit a single shard at any given time. High volume inserts, therefore, -could hinder the cluster's performance by placing a significant load on a single replica -set. +be inserted into the last chunk in a collection. To illustrate why, consider a sharded +collection with two chunks, the second of which has an unbounded upper limit. + +[-∞, 100) +[100,+∞) + +If the data being inserted has an increasing key, at any given time writes will always hit +the shard containing the chunk with the unbounded upper limit, a problem that is not +alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the +cluster's performance by placing a significant load on a single shard. If, however, a single shard can handle the write volume, an increasing shard key may have -some advantages. For example, if you frequently do queries based on insertion time, +some advantages. For example, if you need to do queries based on document insertion time, sharding on the ObjectID ensures that documents created around the same time exist on the same shard. Data locality helps to improve query performance. If you decide to use an monotonically increasing shard key and anticipate large inserts, -one solution may be to store the hash of the shard key as a separate field. +one solution may be to store the hash of the shard key as a separate field. Hashing may +prevent the need to balance chunks by distributing data equally around the cluster. You can +create a hash client-side. In the future, MongoDB may support automatic hashing: +https://jira.mongodb.org/browse/SERVER-2001 Even Distribution ~~~~~~~~~~~~~~~~~ + Uneven Distribution ~~~~~~~~~~~~~~~~~~~ +If you need to insert data into a collection that isn't evenly distributed, or if the shard +keys of the data being inserted aren't evenly distributed, you may need to pre-split your +data before doing high volume inserts. As an example, consider a collection sharded by last +name with the following key distribution: + +[-∞, "Henri") +["Henri", "Peters") +["Peters",+∞) + +Although the chunk ranges may be split evenly, inserting lots of users with with a common +last name such as "Smith" will potentially hammer a single shard. Making the chunk range +more granular in this portion of the alphabet may improve write performance. + +[-∞, "Henri") +["Henri", "Peters") +["Peters", "Smith"] +["Smith", "Tyler"] +["Tyler",+∞) + Operations ---------- From 6e380bf9ae5d3545b20df4e1953a35ac00caec6a Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Tue, 17 Jul 2012 14:03:07 -0400 Subject: [PATCH 5/6] edited tutorial on inserting docs into sharded collection --- ...ng-documents-into-a-sharded-collection.txt | 75 ++++++++++++------- 1 file changed, 49 insertions(+), 26 deletions(-) diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt index 3b8ba9fc6c6..5f97330a175 100644 --- a/source/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -9,38 +9,19 @@ Shard Keys outline the ways that insert operations work given shard keys of the following types -Monotonically Increasing Values -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. what will happen if you try to do inserts. - -Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always -be inserted into the last chunk in a collection. To illustrate why, consider a sharded -collection with two chunks, the second of which has an unbounded upper limit. -[-∞, 100) -[100,+∞) - -If the data being inserted has an increasing key, at any given time writes will always hit -the shard containing the chunk with the unbounded upper limit, a problem that is not -alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the -cluster's performance by placing a significant load on a single shard. - -If, however, a single shard can handle the write volume, an increasing shard key may have -some advantages. For example, if you need to do queries based on document insertion time, -sharding on the ObjectID ensures that documents created around the same time exist on the -same shard. Data locality helps to improve query performance. - -If you decide to use an monotonically increasing shard key and anticipate large inserts, -one solution may be to store the hash of the shard key as a separate field. Hashing may -prevent the need to balance chunks by distributing data equally around the cluster. You can -create a hash client-side. In the future, MongoDB may support automatic hashing: -https://jira.mongodb.org/browse/SERVER-2001 Even Distribution ~~~~~~~~~~~~~~~~~ +If the data's distribution of keys is evenly, MongoDB should be able to distribute writes +evenly around a the cluster once the chunk key ranges are established. MongoDB will +automatically split chunks when they grow to a certain size (~64 MB by default) and will +balance the number of chunks across shards. + +When inserting data into a new collection, it may be important to pre-split the key ranges. +See the section below on pre-splitting and manually moving chunks. Uneven Distribution ~~~~~~~~~~~~~~~~~~~ @@ -64,6 +45,34 @@ more granular in this portion of the alphabet may improve write performance. ["Smith", "Tyler"] ["Tyler",+∞) +Monotonically Increasing Values +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. what will happen if you try to do inserts. + +Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always +be inserted into the last chunk in a collection. To illustrate why, consider a sharded +collection with two chunks, the second of which has an unbounded upper limit. + +[-∞, 100) +[100,+∞) + +If the data being inserted has an increasing key, at any given time writes will always hit +the shard containing the chunk with the unbounded upper limit, a problem that is not +alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the +cluster's performance by placing a significant load on a single shard. + +If, however, a single shard can handle the write volume, an increasing shard key may have +some advantages. For example, if you need to do queries based on document insertion time, +sharding on the ObjectID ensures that documents created around the same time exist on the +same shard. Data locality helps to improve query performance. + +If you decide to use an monotonically increasing shard key and anticipate large inserts, +one solution may be to store the hash of the shard key as a separate field. Hashing may +prevent the need to balance chunks by distributing data equally around the cluster. You can +create a hash client-side. In the future, MongoDB may support automatic hashing: +https://jira.mongodb.org/browse/SERVER-2001 + Operations ---------- @@ -88,6 +97,20 @@ In the example below the pre-split command splits the chunk where the _id 99 wou using that key as the split point. Note that a key need not exist for a chunk to use it in its range. The chunk may even be empty. +The first step is to create a sharded collection to contain the data, which can be done in +three steps: + +> use admin +> db.runCommand({ enableSharding : "foo" }) + +Next, we add a unique index to the collection "foo.bar" which is required for the shard +key. + +> use foo +> db.bar.ensureIndex({ _id : 1 }, { unique : true }) + +Finally we shard the collection (which contains no data) using the _id value. + > use admin switched to db admin > db.runCommand( { split : "test.foo" , middle : { _id : 99 } } ) From f4a5d75a07007660633e7e5c57f1ca5770c542d2 Mon Sep 17 00:00:00 2001 From: Jenna deBoisblanc Date: Thu, 19 Jul 2012 21:07:59 -0400 Subject: [PATCH 6/6] edited bounds --- .../inserting-documents-into-a-sharded-collection.txt | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/source/tutorial/inserting-documents-into-a-sharded-collection.txt b/source/tutorial/inserting-documents-into-a-sharded-collection.txt index 5f97330a175..d0e751c7bef 100644 --- a/source/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/source/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -31,7 +31,7 @@ keys of the data being inserted aren't evenly distributed, you may need to pre-s data before doing high volume inserts. As an example, consider a collection sharded by last name with the following key distribution: -[-∞, "Henri") +(-∞, "Henri") ["Henri", "Peters") ["Peters",+∞) @@ -39,7 +39,7 @@ Although the chunk ranges may be split evenly, inserting lots of users with with last name such as "Smith" will potentially hammer a single shard. Making the chunk range more granular in this portion of the alphabet may improve write performance. -[-∞, "Henri") +(-∞, "Henri") ["Henri", "Peters") ["Peters", "Smith"] ["Smith", "Tyler"] @@ -54,7 +54,7 @@ Documents with monotonically increasing shard keys, such as the BSON ObjectID, w be inserted into the last chunk in a collection. To illustrate why, consider a sharded collection with two chunks, the second of which has an unbounded upper limit. -[-∞, 100) +(-∞, 100) [100,+∞) If the data being inserted has an increasing key, at any given time writes will always hit @@ -157,5 +157,5 @@ use a dedicated mongos for the export and the import. https://jira.mongodb.org/browse/SERVER-4000 -Incremental Inserts -~~~~~~~~~~~~~~~~~~~ \ No newline at end of file +Pre-allocating Documents +~~~~~~~~~~~~~~~~~~~~~~~~