importance of pre-splitting; edited sharding FAQ #57

Merged 6 commits on Aug 29, 2012
55 changes: 55 additions & 0 deletions source/faq/sharding.txt
@@ -254,3 +254,58 @@ size of the connection pool. This setting prevents the
:program:`mongos` from causing connection spikes on the individual
:term:`shards <shard>`. Spikes like these may disrupt the operation
and memory allocation of the :term:`shard cluster`.

What are the best ways to insert large volumes of data into a sharded collection?
------------------------------------------------------------------------------------------------

- What is pre-splitting?

In sharded environments, MongoDB distributes data into :term:`chunks
<chunk>`, each defined by a range of shard key values. Pre-splitting is the process of running
the split command prior to data insertion to specify the shard key values at which to divide chunks.

- Pre-splitting is useful before large inserts into a sharded collection when:

1. inserting data into an empty collection

If a collection is empty, the database takes time to determine the optimal key
distribution. If you insert many documents in rapid succession, MongoDB will initially
direct writes to a single chunk, which can significantly degrade performance.
Predefining splits may improve write performance in the early stages of a bulk import by
eliminating the database's "learning" period.

2. data is not evenly distributed

Even if the sharded collection was previously populated with documents and contains multiple
chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed; in
other words, if the inserts have shard key values that fall within a small number of chunks.

3. monotonically increasing shard key

If you attempt to insert data with monotonically increasing shard keys, the writes will
always hit the last chunk in the collection. Predefining splits may help to cycle a large
write operation around the cluster; however, pre-splitting in this instance will not
prevent consecutive inserts from hitting a single shard.

- When does pre-splitting not matter?

If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without
impacting performance. In addition, if the collection already has chunks with an even key
distribution, pre-splitting may not be necessary.

See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more
information.


Is it necessary to pre-split data before high volume inserts into a sharded collection?
---------------------------------------------------------------------------------------

The answer depends on the shard key, the existing distribution of chunks, and how
evenly distributed the insert operation is. If a collection is empty prior to a
bulk insert, the database will take time to determine the optimal key
distribution. Predefining splits improves write performance in the early stages
of a bulk import.

Pre-splitting is also important if the write operation isn't evenly distributed.
When using a monotonically increasing shard key, for example, pre-splitting can help spread
a large write operation around the cluster, although it cannot keep consecutive inserts from
hitting a single shard.
161 changes: 161 additions & 0 deletions source/tutorial/inserting-documents-into-a-sharded-collection.txt
@@ -0,0 +1,161 @@
=============================================
Inserting Documents into a Sharded Collection
=============================================

Shard Keys
----------

.. TODO

   outline the ways that insert operations work given shard keys of the following types

Even Distribution
~~~~~~~~~~~~~~~~~

If the data's distribution of keys is even, MongoDB should be able to distribute writes
evenly around the cluster once the chunk key ranges are established. MongoDB will
automatically split chunks when they grow to a certain size (~64 MB by default) and will
balance the number of chunks across shards.
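
For reference, an overridden chunk size is stored in the "settings" collection of the config
database. A minimal sketch of inspecting and adjusting it from the mongo shell (the value is
in megabytes; 64 is the default):

> use config
> db.settings.find( { _id : "chunksize" } )
> db.settings.save( { _id : "chunksize", value : 64 } )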

When inserting data into a new collection, it may be important to pre-split the key ranges.
See the section below on pre-splitting and manually moving chunks.

Uneven Distribution
~~~~~~~~~~~~~~~~~~~

If you need to insert data into a collection that isn't evenly distributed, or if the shard
keys of the data being inserted aren't evenly distributed, you may need to pre-split your
data before doing high volume inserts. As an example, consider a collection sharded by last
name with the following key distribution:

(-∞, "Henri")
["Henri", "Peters")
["Peters",+∞)

Although the chunk ranges may be split evenly, inserting many users with a common last name
such as "Smith" can hammer a single shard. Making the chunk ranges more granular in this
portion of the alphabet may improve write performance, as in the following distribution and
the split commands sketched after it:

(-∞, "Henri")
["Henri", "Peters")
["Peters", "Smith"]
["Smith", "Tyler"]
["Tyler",+∞)

Monotonically Increasing Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. what will happen if you try to do inserts.

Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always
be inserted into the last chunk in a collection. To illustrate why, consider a sharded
collection with two chunks, the second of which has an unbounded upper limit.

(-∞, 100)
[100,+∞)

If the data being inserted has an increasing key, at any given time writes will always hit
the shard containing the chunk with the unbounded upper limit, a problem that is not
alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the
cluster's performance by placing a significant load on a single shard.

If, however, a single shard can handle the write volume, an increasing shard key may have
some advantages. For example, if you need to do queries based on document insertion time,
sharding on the ObjectID ensures that documents created around the same time exist on the
same shard. Data locality helps to improve query performance.
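
For instance, because an ObjectID embeds its creation timestamp in its first four bytes, a
query on insertion time can be expressed as an _id range. A rough sketch, assuming a
hypothetical "events" collection sharded on _id:

> // build an ObjectID whose embedded timestamp is 00:00:00 UTC on Aug 1, 2012
> var secs = Math.floor( Date.UTC( 2012, 7, 1 ) / 1000 )
> var boundary = ObjectId( secs.toString(16) + "0000000000000000" )
> db.events.find( { _id : { $gte : boundary } } )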

If you decide to use a monotonically increasing shard key and anticipate large inserts,
one solution may be to store a hash of the shard key as a separate field. Hashing distributes
data more evenly around the cluster and may reduce the need to balance chunks. You can
create the hash client-side. In the future, MongoDB may support automatic hashing:
https://jira.mongodb.org/browse/SERVER-2001
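
A minimal sketch of creating such a hash client-side in the mongo shell, assuming a
hypothetical "events" collection that is sharded on the hashed field (here called "keyHash"):

> var doc = { _id : new ObjectId(), name : "Smith" }
> doc.keyHash = hex_md5( doc._id.str )    // hex_md5() is a built-in shell helper
> db.events.insert( doc )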

Operations
----------

.. TODO

   outline the procedures and rationale for each process.

Pre-Splitting
~~~~~~~~~~~~~

.. when to do this
.. procedure for this process

Pre-splitting is the process of specifying shard key ranges for chunks prior to data
insertion. This process may be important prior to large inserts into a sharded collection
as a way of ensuring that the write operation is evenly spread around the cluster. You
should consider pre-splitting before large inserts if the sharded collection is empty, if
the collection's data or the data being inserted is not evenly distributed, or if the shard
key is monotonically increasing.

In the example below, the split command splits the chunk in which the _id value 99 would
reside, using that value as the split point. Note that the split point does not need to
correspond to an existing document; the resulting chunk may even be empty.

The first step is to create a sharded collection to contain the data, which can be done in
three steps:

> use admin
> db.runCommand({ enableSharding : "test" })

Next, we ensure the collection "test.foo" has a unique index on the shard key:

> use test
> db.foo.ensureIndex({ _id : 1 }, { unique : true })

Finally we shard the collection (which contains no data) on the _id field:

> use admin
> db.runCommand({ shardCollection : "test.foo", key : { _id : 1 } })

With the sharded collection in place, the split command specifies the split point:

> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

Once the key range is specified, chunks can be moved around the cluster using the moveChunk
command:

> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard0000" } )

You can repeat these steps as many times as necessary to create or move chunks around the
cluster (a scripted sketch appears at the end of this section). To get information about the
two chunks created in this example:

> db.printShardingStatus()
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
{ "_id" : "shard0001", "host" : "localhost:30001" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
test.foo chunks:
shard0001 1
shard0000 1
{ "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
{ "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

Once the chunks and the key ranges are evenly distributed, you can proceed with a
high volume insert.
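
To illustrate repeating the process, the following rough sketch pre-splits on the _id values
100, 200, ..., 900 and alternates the resulting chunks between the two shards shown above:

> use admin
> for ( var i = 1; i < 10; i++ ) {
      db.runCommand( { split : "test.foo" , middle : { _id : i * 100 } } );
      db.runCommand( { moveChunk : "test.foo" , find : { _id : i * 100 } ,
                       to : ( i % 2 == 0 ? "shard0000" : "shard0001" ) } );
  }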

Changing Shard Key
~~~~~~~~~~~~~~~~~~

There is no automatic support for changing the shard key for a collection. In addition,
since a document's location within the cluster is determined by its shard key value,
changing the shard key could force data to move from machine to machine, potentially a
highly expensive operation.

Thus it is very important to choose the right shard key up front.

If you do need to change a shard key, an export and import is likely the best solution.
Create a new pre-sharded collection, and then import the exported data into it. If desired,
use a dedicated mongos for the export and the import.

https://jira.mongodb.org/browse/SERVER-4000
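
For small collections, the copy step can also be approximated directly in the mongo shell; a
rough sketch, assuming the existing collection is "test.foo" and a new collection "test.foo2"
has already been sharded and pre-split on the new key (mongoexport and mongoimport are better
suited to large data sets):

> use test
> db.foo.find().forEach( function( doc ) { db.foo2.insert( doc ); } )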

Pre-allocating Documents
~~~~~~~~~~~~~~~~~~~~~~~~