Commit 134eb2a

Author: Sam Kleinman
merge: DOCS-458
2 parents 9ecd4b3 + f4a5d75

File tree: 2 files changed, +215 -0 lines changed


draft/faq-sharding-addition.txt

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
What are the best ways to successfully insert larger volumes of data into a sharded collection?
-------------------------------------------------------------------------------------------------

- what is pre-splitting

In sharded environments, MongoDB distributes data into :term:`chunks
<chunk>`, each defined by a range of shard key values. Pre-splitting is a command run
prior to data insertion that specifies the shard key values on which to split up chunks.
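
For example, a minimal sketch of a single pre-split (the namespace "records.people" and the
shard key field "zipcode" are hypothetical, not drawn from this FAQ):

> use admin
> // split the chunk that would contain zipcode "63109", using that value as the boundary
> db.runCommand({ split : "records.people", middle : { zipcode : "63109" } })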

- Pre-splitting is useful before large inserts into a sharded collection when:

1. inserting data into an empty collection

If a collection is empty, the database takes time to determine the optimal key
distribution. If you insert many documents in rapid succession, MongoDB will initially
direct writes to a single chunk, which can significantly impact performance.
Predefining splits may improve write performance in the early stages of a bulk import by
eliminating the database's "learning" period.

2. data is not evenly distributed

Even if the sharded collection was previously populated with documents and contains multiple
chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed; in
other words, if the inserted documents have shard key values that fall on a small number of chunks.

3. monotonically increasing shard key

If you attempt to insert data with monotonically increasing shard keys, the writes will
always hit the last chunk in the collection. Predefining splits may help to cycle a large
write operation around the cluster; however, pre-splitting in this instance will not
prevent consecutive inserts from hitting a single shard.

- when does it not matter

If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without
impacting performance. In addition, if the collection already has chunks with an even key
distribution, pre-splitting may not be necessary.

See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more
information.


Is it necessary to pre-split data before high volume inserts into a sharded collection?
---------------------------------------------------------------------------------------

The answer depends on the shard key, the existing distribution of chunks, and how
evenly distributed the insert operation is. If a collection is empty prior to a
bulk insert, the database will take time to determine the optimal key
distribution. Predefining splits improves write performance in the early stages
of a bulk import.

Pre-splitting is also important if the write operation isn't evenly distributed.
When using an increasing shard key, for example, pre-splitting data can prevent
writes from hammering a single shard.
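
As a rough sketch of that last point (the namespace "test.foo", the split value, and the
shard name are illustrative assumptions), you can split the chunk that is currently
receiving new keys at a value the inserts have not yet reached and move it to another
shard before the bulk operation begins:

> use admin
> // split at a key value ahead of the current maximum (illustrative value)
> db.runCommand({ split : "test.foo", middle : { _id : 1000 } })
> // move the resulting top chunk so the next range of inserts lands on a different shard
> db.runCommand({ moveChunk : "test.foo", find : { _id : 1000 }, to : "shard0000" })
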
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
=============================================
Inserting Documents into a Sharded Collection
=============================================

Shard Keys
----------

.. TODO

   outline the ways that insert operations work given shard keys of the following types

Even Distribution
~~~~~~~~~~~~~~~~~

If the data's distribution of keys is even, MongoDB should be able to distribute writes
evenly around the cluster once the chunk key ranges are established. MongoDB will
automatically split chunks when they grow to a certain size (~64 MB by default) and will
balance the number of chunks across shards.
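
Before a bulk load you can check the maximum chunk size the balancer is working with; a
minimal check, assuming you are connected to a mongos (the "chunksize" document may be
absent if the 64 MB default has never been changed):

> use config
> // the value is expressed in megabytes
> db.settings.find({ _id : "chunksize" })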

When inserting data into a new collection, it may be important to pre-split the key ranges.
See the section below on pre-splitting and manually moving chunks.

Uneven Distribution
~~~~~~~~~~~~~~~~~~~

If you need to insert data into a collection that isn't evenly distributed, or if the shard
keys of the data being inserted aren't evenly distributed, you may need to pre-split your
data before doing high volume inserts. As an example, consider a collection sharded by last
name with the following key distribution:

(-∞, "Henri")
["Henri", "Peters")
["Peters", +∞)

Although the chunk ranges may be split evenly, inserting lots of users with a common
last name such as "Smith" will potentially hammer a single shard. Making the chunk ranges
more granular in this portion of the alphabet may improve write performance:

(-∞, "Henri")
["Henri", "Peters")
["Peters", "Smith")
["Smith", "Tyler")
["Tyler", +∞)

Monotonically Increasing Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. what will happen if you try to do inserts.

Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always
be inserted into the last chunk in a collection. To illustrate why, consider a sharded
collection with two chunks, the second of which has an unbounded upper limit.

(-∞, 100)
[100, +∞)

If the data being inserted has an increasing key, at any given time writes will always hit
the shard containing the chunk with the unbounded upper limit, a problem that is not
alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the
cluster's performance by placing a significant load on a single shard.

If, however, a single shard can handle the write volume, an increasing shard key may have
some advantages. For example, if you need to do queries based on document insertion time,
sharding on the ObjectID ensures that documents created around the same time exist on the
same shard. Data locality helps to improve query performance.

If you decide to use a monotonically increasing shard key and anticipate large inserts,
one solution may be to store a hash of the shard key as a separate field and shard the
collection on that field instead. Hashing may reduce the need to balance chunks by
distributing data evenly around the cluster. You can create the hash client-side. In the
future, MongoDB may support automatic hashing:
https://jira.mongodb.org/browse/SERVER-2001
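
As a rough illustration of the client-side hashing idea (the collection "events" and the
field "idHash" are hypothetical; hex_md5 is simply one hash function available in the
mongo shell):

> use test
> var doc = { _id : new ObjectId(), user : "alice" }
> // store a hash of the monotonically increasing _id in a separate field; the collection
> // would then be sharded on idHash rather than on _id
> doc.idHash = hex_md5( doc._id.str )
> db.events.insert(doc)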

Operations
----------

.. TODO

   outline the procedures and rationale for each process.

Pre-Splitting
~~~~~~~~~~~~~

.. when to do this
.. procedure for this process

Pre-splitting is the process of specifying shard key ranges for chunks prior to data
insertion. This process may be important prior to large inserts into a sharded collection
as a way of ensuring that the write operation is evenly spread around the cluster. You
should consider pre-splitting before large inserts if the sharded collection is empty, if
the collection's data or the data being inserted is not evenly distributed, or if the shard
key is monotonically increasing.

In the example below, the pre-split command splits the chunk where the _id value 99 would
reside, using that value as the split point. Note that a document with that key value need
not exist for a chunk to use the value as a boundary; the chunk may even be empty.

The first step is to create a sharded collection to contain the data, which can be done in
three steps:

> use admin
> db.runCommand({ enableSharding : "test" })

Next, we add a unique index to the collection "test.foo", which is required for the shard
key:

> use test
> db.foo.ensureIndex({ _id : 1 }, { unique : true })

Finally, we shard the (still empty) collection on the _id field:

> use admin
> db.runCommand({ shardCollection : "test.foo", key : { _id : 1 } })

With the collection sharded, we can now split the chunk at the _id value 99:

> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

Once the key range is specified, chunks can be moved around the cluster using the moveChunk
command:

> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard0000" } )

You can repeat these steps as many times as necessary to create or move chunks around the
cluster. To get information about the two chunks created in this example:

> db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
      { "_id" : "shard0000", "host" : "localhost:30000" }
      { "_id" : "shard0001", "host" : "localhost:30001" }
  databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
          test.foo chunks:
                  shard0001    1
                  shard0000    1
              { "_id" : { "$MinKey" : true } } -->> { "_id" : 99 } on : shard0001 { "t" : 2000, "i" : 1 }
              { "_id" : 99 } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

Once the chunks and the key ranges are evenly distributed, you can proceed with a
high volume insert.
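
If you need several split points rather than one, you can issue the split command in a
loop from the shell; a minimal sketch (the namespace "test.foo" matches the example above,
but the particular split points are arbitrary assumptions):

> use admin
> // create additional split points at _id 100, 200, ... 900 so a large insert spreads across many chunks
> for ( var x = 100; x < 1000; x += 100 ) { db.runCommand( { split : "test.foo", middle : { _id : x } } ) }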

Changing Shard Key
~~~~~~~~~~~~~~~~~~

There is no automatic support for changing the shard key for a collection. In addition,
since a document's location within the cluster is determined by its shard key value,
changing the shard key could force data to move from machine to machine, potentially a
highly expensive operation.

Thus it is very important to choose the right shard key up front.

If you do need to change a shard key, an export and import is likely the best solution.
Create a new collection that is sharded (and pre-split) on the new key, and then import
the exported data into it. If desired, use a dedicated mongos for the export and the
import.

https://jira.mongodb.org/browse/SERVER-4000
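
As a rough sketch of the export and import approach (the host, database, and collection
names below are placeholders; "foo_new" is assumed to already exist, sharded and
pre-split on the new key as described above):

# export the existing collection through a mongos (placeholder host and namespace)
mongoexport --host mongos0.example.net --db test --collection foo --out foo.json

# import the data into the new collection that is sharded on the new key
mongoimport --host mongos0.example.net --db test --collection foo_new --file foo.json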

Pre-allocating Documents
~~~~~~~~~~~~~~~~~~~~~~~~
