=============================================
Inserting Documents into a Sharded Collection
=============================================

Shard Keys
----------

.. TODO

   outline the ways that insert operations work given shard keys of the following types

Even Distribution
~~~~~~~~~~~~~~~~~

If the distribution of shard key values in the data is even, MongoDB should be able to
distribute writes evenly around the cluster once the chunk key ranges are established.
MongoDB automatically splits chunks when they grow to a certain size (64 MB by default) and
balances the number of chunks across shards.
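
If the default maximum chunk size does not suit your workload, it can be changed through
the config database. A minimal sketch, assuming you are connected through a mongos and
that 64 is the size you want, in megabytes:

> use config
> db.settings.find( { _id : "chunksize" } )
> db.settings.save( { _id : "chunksize", value : 64 } )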

When inserting data into a new collection, it may be important to pre-split the key ranges.
See the section below on pre-splitting and manually moving chunks.

Uneven Distribution
~~~~~~~~~~~~~~~~~~~

If you need to insert data into a collection that isn't evenly distributed, or if the shard
keys of the data being inserted aren't evenly distributed, you may need to pre-split your
data before doing high volume inserts. As an example, consider a collection sharded by last
name with the following key distribution:

(-∞, "Henri")
["Henri", "Peters")
["Peters", +∞)

Although the chunk ranges may be split evenly, inserting many users with a common last name
such as "Smith" will potentially hammer a single shard. Making the chunk ranges more
granular in this portion of the alphabet may improve write performance:

(-∞, "Henri")
["Henri", "Peters")
["Peters", "Smith")
["Smith", "Tyler")
["Tyler", +∞)
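
As a sketch of how those extra boundaries might be created, assuming a collection named
"test.users" sharded on "lastname" (both names are placeholders for this example), the
additional split points could be added with the split command before the bulk insert:

> use admin
> db.runCommand( { split : "test.users", middle : { lastname : "Smith" } } )
> db.runCommand( { split : "test.users", middle : { lastname : "Tyler" } } )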

Monotonically Increasing Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. what will happen if you try to do inserts.

Documents with monotonically increasing shard keys, such as the BSON ObjectId, will always
be inserted into the last chunk of a collection. To illustrate why, consider a sharded
collection with two chunks, the second of which has an unbounded upper limit.

(-∞, 100)
[100, +∞)

If the data being inserted has an increasing key, writes will always hit the shard
containing the chunk with the unbounded upper limit, a problem that is not alleviated by
splitting the "hot" chunk. High volume inserts, therefore, could hinder the cluster's
performance by placing a significant load on a single shard.

If, however, a single shard can handle the write volume, an increasing shard key may have
some advantages. For example, if you need to run queries based on document insertion time,
sharding on the ObjectId ensures that documents created around the same time exist on the
same shard. This data locality helps to improve query performance.
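
As a rough illustration of that locality, assuming a collection "test.foo" sharded on _id
with ObjectId values, a query for documents inserted in the last hour becomes a range scan
that touches only the most recent chunk or chunks (the ObjectId construction below is for
demonstration only):

> var secs = Math.floor( Date.now() / 1000 ) - 3600
> var oneHourAgo = ObjectId( secs.toString(16) + "0000000000000000" )
> db.foo.find( { _id : { $gte : oneHourAgo } } )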

If you decide to use a monotonically increasing shard key and anticipate large inserts, one
solution may be to store a hash of the shard key as a separate field and shard on that
field instead. Hashing may remove the need to balance chunks by distributing data evenly
around the cluster. You can create the hash client-side. In the future, MongoDB may support
automatic hashing:
https://jira.mongodb.org/browse/SERVER-2001
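
A minimal sketch of the client-side approach, using the legacy shell's built-in hex_md5()
helper and a hypothetical collection "test.bar" that would be sharded on the extra field
rather than on _id:

> var doc = { _id : ObjectId(), user : "example" }
> doc.hashedId = hex_md5( doc._id.str )   // hash of the ObjectId's hex string
> db.bar.insert( doc )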

Operations
----------

.. TODO

   outline the procedures and rationale for each process.

Pre-Splitting
~~~~~~~~~~~~~

.. when to do this
.. procedure for this process

Pre-splitting is the process of specifying shard key ranges for chunks prior to data
insertion. This process may be important before large inserts into a sharded collection as
a way of ensuring that the write operations are spread evenly around the cluster. You
should consider pre-splitting before large inserts if the sharded collection is empty, if
the collection's data or the data being inserted is not evenly distributed, or if the shard
key is monotonically increasing.

In the example below, the split command splits the chunk where the _id value 99 would
reside, using that value as the split point. Note that a key value need not exist in the
collection to be used as a chunk boundary; the chunk may even be empty.

The first step is to create a sharded collection to contain the data, which can be done in
three steps:

> use admin
> db.runCommand({ enableSharding : "test" })

Next, we add a unique index on _id to the collection "test.foo", which is required for the
shard key.

> use test
> db.foo.ensureIndex({ _id : 1 }, { unique : true })

Finally, we shard the (still empty) collection on the _id value:

> use admin
> db.runCommand({ shardCollection : "test.foo", key : { _id : 1 } })

With the collection sharded, we can create the split point at _id 99 using the split
command:

> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

Once the key range is specified, chunks can be moved around the cluster using the moveChunk
command:

> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard0000" } )

You can repeat these steps as many times as necessary to create or move chunks around the
cluster. To get information about the two chunks created in this example:

> db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
      { "_id" : "shard0000", "host" : "localhost:30000" }
      { "_id" : "shard0001", "host" : "localhost:30001" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
            test.foo chunks:
                    shard0001    1
                    shard0000    1
                { "_id" : { "$MinKey" : true } } -->> { "_id" : 99 } on : shard0001 { "t" : 2000, "i" : 1 }
                { "_id" : 99 } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

Once the chunks and the key ranges are evenly distributed, you can proceed with a high
volume insert.
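
When many split points are needed, the split and moveChunk commands can be scripted from
the shell. A rough sketch, assuming the "test.foo" collection from the example above and
purely illustrative numeric split points at every multiple of 100:

> use admin
> for ( var i = 100; i <= 1000; i += 100 ) { db.runCommand( { split : "test.foo", middle : { _id : i } } ) }

The resulting chunks can then be distributed by the balancer, or placed explicitly with
moveChunk as shown earlier.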

Changing Shard Key
~~~~~~~~~~~~~~~~~~

There is no automatic support for changing the shard key for a collection. In addition,
since a document's location within the cluster is determined by its shard key value,
changing the shard key could force data to move from machine to machine, potentially a
highly expensive operation.

Thus it is very important to choose the right shard key up front.

If you do need to change a shard key, an export and import is likely the best solution.
Create a new collection sharded (and, if needed, pre-split) on the new key, and then import
the exported data into it. If desired, use a dedicated mongos for the export and the
import.
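
A rough sketch of that workflow using the standard tools, run from the command line, where
"mongos.example.net", "test", "foo", and "foonew" are all placeholders for your own mongos
host, database, old collection, and new pre-sharded collection:

mongoexport --host mongos.example.net -d test -c foo -o foo.json
mongoimport --host mongos.example.net -d test -c foonew --file foo.json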

https://jira.mongodb.org/browse/SERVER-4000

Pre-allocating Documents
~~~~~~~~~~~~~~~~~~~~~~~~