Commit 4c26eca

Sharding-internals draft review and updates
1 parent 3a283e1 commit 4c26eca

1 file changed: +74 -75 lines changed

draft/core/sharding-internals.txt

Lines changed: 74 additions & 75 deletions
@@ -10,19 +10,19 @@ Sharding Internals
 
 This document introduces lower level sharding concepts for users who
 are familiar with :term:`sharding` generally and want to learn more
-about the internals of sharding in MongoDB. The
-":doc:`/core/sharding`" document provides an overview of higher level
-sharding concepts while the ":doc:`/administration/sharding`" provides
-an overview of common administrative tasks.
+about the internals of sharding in MongoDB. The ":doc:`/core/sharding`"
+document provides an overview of higher level sharding concepts while
+the ":doc:`/administration/sharding`" provides an overview of common
+administrative tasks.
 
 .. index:: shard key; internals
 .. _sharding-internals-shard-keys:
 
 Shard Keys
 ----------
 
-Shard keys are the field in the collection that MongoDB uses to
-distribute :term:`documents` among a shard cluster. See the
+Shard keys are the field in a collection that MongoDB uses to
+distribute :term:`documents` within a sharded cluster. See the
 :ref:`overview of shard keys <sharding-shard-keys>` for an
 introduction these topics.
 
@@ -32,36 +32,38 @@ introduction these topics.
 Cardinality
 ~~~~~~~~~~~
 
-Cardinality refers to the property of the data set that allows MongoDB
-to split it into :term:`chunks`. For example, consider a collection
-of data such as an "address book" that stores address records:
+In the context of MongoDB, Cardinality, which generally refers to the
+concept of counting or measuring the number of items in a set, represents
+the number of possible :term:`chunks` that data can be partitioned into.
 
-- Consider using a ``state`` field:
+For example, consider a collection of data such as an "address book"
+that stores address records:
 
-This would hold the US state for an address document, as a shard
-key. This field has a *low cardinality*. All documents that have the
+- Consider the use of a ``state`` field as a shard key:
+
+The state key's value holds the US state for a given address document.
+This field has a *low cardinality* as all documents that have the
 same value in the ``state`` field *must* reside on the same shard,
-even if the chunk exceeds the chunk size.
+even if a particular state's chunk exceeds the maximum chunk size.
 
 Because there are a limited number of possible values for this
-field, it is easier for your data may not be evenly distributed, you
-risk having data distributed unevenly among a fixed or small number
-of chunks. In this may have a number of effects:
+field, you risk having data unevenly distributed among a small
+number of fixed chunks. This may have a number of effects:
 
 - If MongoDB cannot split a chunk because it all of its documents
-have the same shard key, migrations involving these chunk will take
-longer than other migrations, and it will be more difficult for
-your data to balance evenly.
+have the same shard key, migrations involving these un-splittable
+chunks will take longer than other migrations, and it will be more
+difficult for your data to stay balanced.
 
-- If you have a fixed maximum number of chunks you will never be
+- If you have a fixed maximum number of chunks, you will never be
 able to use more than that number of shards for this collection.
 
-- Consider using the ``postal-code`` field (i.e. zip code:)
+- Consider the use of a ``postal-code`` field (i.e. zip code) as a shard key:
 
 While this field has a large number of possible values, and thus has
 *higher cardinality,* it's possible that a large number of users
 could have the same value for the shard key, which would make this
-chunk of users un-splitable.
+chunk of users un-splittable.
 
 In these cases, cardinality depends on the data. If your address book
 stores records for a geographically distributed contact list
@@ -70,18 +72,17 @@ of data such as an "address book" that stores address records:
 more geographically concentrated (e.g "ice cream stores in Boston
 Massachusetts,") then you may have a much lower cardinality.
 
-- Consider using the ``phone-number`` field:
+- Consider the use of a ``phone-number`` field as a shard key:
 
 The contact's telephone number has a *higher cardinality,* because
 most users will have a unique value for this field, MongoDB will be
 able to split in as many chunks as needed.
 
 While "high cardinality," is necessary for ensuring an even
-distribution of data, having a high cardinality does not garen tee
+distribution of data, having a high cardinality does not guarantee
 sufficient :ref:`query isolation <sharding-shard-key-query-isolation>`
-or appropriate :ref:`write scaling
-<sharding-shard-key-write-scaling>`. Continue reading for more
-information on these topics.
+or appropriate :ref:`write scaling <sharding-shard-key-write-scaling>`.
+Continue reading for more information on these topics.
 
 .. index:: shard key; write scaling
 .. _sharding-shard-key-write-scaling:
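
For illustration of the cardinality trade-off above, a minimal mongo shell sketch; the ``records`` database, ``addresses`` collection, and field names are hypothetical and used only as an example.

.. code-block:: javascript

   // run against a mongos for a hypothetical "address book" deployment
   sh.enableSharding("records")

   // low cardinality: only ~50 possible values of ``state``, so the data
   // could never be partitioned into more than ~50 splittable chunks
   // sh.shardCollection("records.addresses", { "state": 1 })

   // higher cardinality: many distinct postal codes, although one very
   // common value can still produce an un-splittable chunk
   sh.shardCollection("records.addresses", { "postal-code": 1 })
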
@@ -94,15 +95,15 @@ the increased write capacity that the shard cluster can provide, while
 others do not. Consider the following example where you shard by the
 default :term:`_id` field, which holds an :term:`ObjectID`.
 
-The ``ObjectID`` holds a value, computed upon creation, that is a
-unique identifier for the object. However, the most significant data in
-this value a is time stamp, which means that they increment
+The ``ObjectID`` holds a value, computed upon document creation, that is a
+unique identifier for the object. However, the most significant bits of data
+in this value represent a time stamp, which means that they increment
 in a regular and predictable pattern. Even though this value has
-:ref:`high cardinality <sharding-shard-key-cardinality>`, when
-this, or *any date or other incrementing number* as the shard key all
-insert operations will always end up on the same shard. As a result,
-the capacity of this node will define the effective capacity of the
-cluster.
+:ref:`high cardinality <sharding-shard-key-cardinality>`, when using
+this, or *any date or other monotonically increasing number* as the shard
+key, all insert operations will be storing data into a single chunk, and
+therefore, a single shard. As a result, the write capacity of this node
+will define the effective write capacity of the cluster.
 
 In most cases want to avoid these kinds of shard keys, except in some
 situations: For example if you have a very low insert rate, most of
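
For reference, the time stamp prefix described in the hunk above can be observed directly in the mongo shell; this snippet is purely illustrative.

.. code-block:: javascript

   // the most significant bytes of an ObjectId encode its creation time,
   // so successively generated values sort in insertion order and land
   // in the same (highest) chunk
   var first = ObjectId();
   sleep(1000);              // shell helper: pause for one second
   var second = ObjectId();

   first.getTimestamp();     // e.g. ISODate("2012-07-19T17:21:04Z")
   second.getTimestamp();    // one second later; second sorts after first
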
@@ -113,7 +114,7 @@ have *both* high cardinality and that will generally distribute write
 operations across the *entire cluster*.
 
 Typically, a computed shard key that has some amount of "randomness,"
-such as ones that include a cryptograpphic hash (i.e. MD5 or SHA1) of
+such as ones that include a cryptographic hash (i.e. MD5 or SHA1) of
 other content in the document, will allow the cluster to scale write
 operations. However, random shard keys do not typically provide
 :ref:`query isolation <sharding-shard-key-query-isolation>`, which is
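
A minimal sketch of the "computed shard key" approach described above, assuming the application stores a cryptographic hash of an existing field at insert time; the ``content`` database, ``articles`` collection, and ``titleHash`` field are hypothetical.

.. code-block:: javascript

   // assumes sh.enableSharding("content") has already been run and the
   // shell's current database is ``content``
   db.articles.ensureIndex({ "titleHash": 1 })
   sh.shardCollection("content.articles", { "titleHash": 1 })

   // the application computes the hash before each insert; hex_md5() is
   // a built-in shell helper, used here only for illustration
   var doc = { title: "Sharding Internals", body: "..." };
   doc.titleHash = hex_md5(doc.title);
   db.articles.insert(doc);
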
@@ -122,16 +123,16 @@ another important characteristic of shard keys.
 Querying
 ~~~~~~~~
 
-The :program:`mongos` provides an interface for applications that use
-sharded database instances. The :program:`mongos` hides all of the
-complexity of :term:`partitioning <partition>` from the
-application. The :program:`mongos` receives queries from applications,
-and then using the metadata from the :ref:`config server
-<sharding-config-database>` to route the query to the
-:program:`mongod` instances that provide the :term:`shards
-<shard>`. While the :program:`mongos` succeeds in making all querying
-operational in sharded environments, the :term:`shard key` you select
-can have a profound affect on query performance.
+The :program:`mongos` program provides an interface for applications
+that query sharded clusters and :program:`mongos` hides all of the
+complexity of data :term:`partitioning <partition>` from the
+application. A :program:`mongos` receives queries from applications,
+and then, using the metadata from the :ref:`config server
+<sharding-config-database>`, routes queries to the :program:`mongod`
+instances that hold the appropriate the data. While the :program:`mongos`
+succeeds in making all querying operational in sharded environments,
+the :term:`shard key` you select can have a profound affect on query
+performance.
 
 .. seealso:: The ":ref:`mongos and Sharding <sharding-mongos>`" and
 ":ref:`config server <sharding-config-server>`" sections for a more
@@ -153,15 +154,15 @@ application, which can be a long running operation.
 If your query includes the first component of a compound :term:`shard
 key` [#shard-key-index], then the :program:`mongos` can route the
 query directly to a single shard, or a small number of shards, which
-provides much greater performance. Even you query values of the shard
+provides much greater performance. Even if you query values of the shard
 key that reside in different chunks, the :program:`mongos` will route
-queires directly to the specific shard.
+queries directly to the specific shard.
 
-To select a shard key for a collection: determine which fields your
-queries select by most frequently and then which of these operations
+To select a shard key for a collection: determine which fields are included
+most frequently in queries for a given application and which of these operations
 are most performance dependent. If this field is not sufficiently
 selective (i.e. has low cardinality) you can add a second field to the
-compound shard key to make the cluster more splitable.
+compound shard key to make the data more splittable.
 
 .. see:: ":ref:`sharding-mongos`" for more information on query
 operations in the context of sharded clusters.
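
To make the routing behavior described above concrete, a hedged example using a hypothetical compound shard key; queries that include the first component of the key can be targeted, while queries that omit it must be sent to every shard.

.. code-block:: javascript

   // the first component targets queries; the second adds cardinality
   // so that chunks remain splittable
   sh.shardCollection("records.addresses", { "zipcode": 1, "name": 1 })

   // includes the leading shard key component, so mongos can route the
   // query to the shard(s) holding the matching chunk(s)
   db.addresses.find({ "zipcode": "02134" })

   // omits the shard key, so mongos must query every shard and merge
   // the results
   db.addresses.find({ "name": "Steve" })
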
@@ -197,13 +198,13 @@ are:
 - to ensure that :program:`mongos` can isolate most to specific
 :program:`mongod` instances.
 
-In addition, consider the following operation consideration that the
+In addition, consider the following operational factors that the
 shard key can affect.
 
 Because each shard should be a :term:`replica set`, if a specific
 :program:`mongod` instance fails, the replica set will elect another
-member of that set to :term:`primary` and continue function. However,
-if an entire shard is unreachable or fails for some reason then that
+member of that set to be :term:`primary` and continue to function. However,
+if an entire shard is unreachable or fails for some reason, that
 data will be unavailable. If your shard key distributes data required
 for every operation throughout the cluster, then the failure of the
 entire shard will render the entire cluster unusable. By contrast, if
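
As a companion to the replica set point above, a hypothetical sketch of adding a shard that is backed by a replica set rather than a standalone :program:`mongod`; the set name and hostnames are placeholders.

.. code-block:: javascript

   // a single member failure triggers an election within the set rather
   // than taking the shard's data offline
   sh.addShard("shard0/node0.example.net:27017,node1.example.net:27017")

   // list the shards currently registered with the cluster
   db.adminCommand({ listShards: 1 })
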
@@ -236,23 +237,23 @@ are three options:
 is insignificant in your use case given limited write volume,
 expected data size, or query patterns and demands.
 
-From a decision making stand point, begin by finding the the field
+From a decision making stand point, begin by finding the field
 that will provide the required :ref:`query isolation
 <sharding-shard-key-query-isolation>`, ensure that :ref:`writes will
 scale across the cluster <sharding-shard-key-query-isolation>`, and
 then add an additional field to provide additional :ref:`cardinality
 <sharding-shard-key-cardinality>` if your primary key does not have
-split-ability.
+sufficient split-ability.
 
 .. index:: balancing; internals
 .. _sharding-balancing-internals:
 
 Sharding Balancer
 -----------------
 
-The :ref:`balancer <sharding-balancing>` process is responsible for
+The :ref:`balancer <sharding-balancing>` sub-process is responsible for
 redistributing chunks evenly among the shards and ensuring that each
-member of the cluster is responsible for the same amount of data.
+member of the cluster is responsible for the same volume of data.
 
 This section contains complete documentation of the balancer process
 and operations. For a higher level introduction see
@@ -261,35 +262,33 @@ the :ref:`Balancing <sharding-balancer>` section.
 Balancing Internals
 ~~~~~~~~~~~~~~~~~~~
 
-The balancer originates from an arbitrary :program:`mongos`
+A balancing round originates from an arbitrary :program:`mongos`
 instance. Because your shard cluster can have a number of
 :program:`mongos` instances, when a balancer process is active it
-creates a "lock" document in the ``locks`` collection of the
-``config`` database on the :term:`config server`.
+acquires a "lock" by modifying a document on the :term:`config server`.
 
 By default, the balancer process is always running. When the number of
 chunks in a collection is unevenly distributed among the shards, the
 balancer begins migrating :term:`chunks` from shards with a
-disproportionate number of chunks to a shard with fewer number of
-chunks. The balancer will continue migrating chunks, one at a time
-beginning with the shard that has the lowest shard key, until the data
-is evenly distributed among the shards (i.e. the difference between
-any two chunks is less than 2 chunks.)
-
-While these automatic chunk migrations crucial for distributing data
-they carry some overhead in terms of bandwidth and system workload,
+disproportionate number of chunks to shards with a fewer number of
+chunks. The balancer will continue migrating chunks, one at a time,
+until the data is evenly distributed among the shards (i.e. the
+difference between any two shards is less than 2 chunks.)
+
+While these automatic chunk migrations are crucial for distributing
+data, they carry some overhead in terms of bandwidth and workload,
 both of which can impact database performance. As a result, MongoDB
 attempts to minimize the effect of balancing by only migrating chunks
 when the disparity between numbers of chunks on a shard is greater
 than 8.
 
 .. index:: balancing; migration
 
-The migration process ensures consistency and maximize availability of
+The migration process ensures consistency and maximizes availability of
 chunks during balancing: when MongoDB begins migrating a chunk, the
 database begins copying the data to the new server and tracks incoming
 write operations. After migrating the chunks, the "from"
-:program:`mongod` sends all new writes, to the "to" server, and *then*
+:program:`mongod` sends all new writes, to the receiving server, and *then*
 updates the chunk record in the :term:`config database` to reflect the
 new location of the chunk.
 
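
For context on the "lock" wording in the hunk above, the balancer's lock is an ordinary document in the ``config`` database that can be inspected from any :program:`mongos`; the following is an illustrative shell sketch rather than a prescribed procedure.

.. code-block:: javascript

   // connected to a mongos; switch to the config database
   var config = db.getSiblingDB("config")

   // the balancer coordinates through this document; a state of 0
   // indicates that no mongos currently holds the balancer lock
   config.locks.find({ _id: "balancer" })
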
@@ -301,14 +300,14 @@ Chunk Size
 
 .. TODO link this section to <glossary:chunk size>
 
-The default :term:`chunk` size in MongoDB is 64 megabytes.
+The default maximum :term:`chunk` size in MongoDB is 64 megabytes.
 
-When chunks grow beyond the :ref:`specified chunk size
+When chunks grow beyond the :ref:`specified maximum chunk size
 <sharding-chunk-size>` a :program:`mongos` instance will split the
 chunk in half, which will eventually lead to migrations, when chunks
 become unevenly distributed among the cluster, the :program:`mongos`
-instances will initiate a round migrations to redistribute data in the
-cluster.
+instances will initiate a round of migrations to redistribute data
+in the cluster.
 
 Chunk size is somewhat arbitrary and must account for the
 following effects:
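
Related to the default chunk size discussed in the hunk above, the cluster-wide maximum chunk size is stored in the ``settings`` collection of the ``config`` database; a brief sketch of inspecting and lowering it, assuming a connection to a :program:`mongos` (the value is in megabytes).

.. code-block:: javascript

   var config = db.getSiblingDB("config")

   // 64 megabytes is the default maximum chunk size
   config.settings.find({ _id: "chunksize" })

   // lower the maximum chunk size to 32 megabytes for the whole cluster
   config.settings.save({ _id: "chunksize", value: 32 })
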
