diff --git a/draft/core/sharding-internals.txt b/draft/core/sharding-internals.txt
index 966c96c11c9..847e220c9a2 100644
--- a/draft/core/sharding-internals.txt
+++ b/draft/core/sharding-internals.txt
@@ -10,10 +10,10 @@ Sharding Internals
 
 This document introduces lower level sharding concepts for users who
 are familiar with :term:`sharding` generally and want to learn more
-about the internals of sharding in MongoDB. The
-":doc:`/core/sharding`" document provides an overview of higher level
-sharding concepts while the ":doc:`/administration/sharding`" provides
-an overview of common administrative tasks.
+about the internals of sharding in MongoDB. The ":doc:`/core/sharding`"
+document provides an overview of higher level sharding concepts while
+the ":doc:`/administration/sharding`" provides an overview of common
+administrative tasks.
 
 .. index:: shard key; internals
 .. _sharding-internals-shard-keys:
@@ -21,8 +21,8 @@ an overview of common administrative tasks.
 Shard Keys
 ----------
 
-Shard keys are the field in the collection that MongoDB uses to
-distribute :term:`documents` among a shard cluster. See the
+The shard key is the field in a collection that MongoDB uses to
+distribute :term:`documents` within a sharded cluster. See the
 :ref:`overview of shard keys ` for an
 introduction these topics.
 
@@ -32,36 +32,38 @@ introduction these topics.
 Cardinality
 ~~~~~~~~~~~
 
-Cardinality refers to the property of the data set that allows MongoDB
-to split it into :term:`chunks`. For example, consider a collection
-of data such as an "address book" that stores address records:
+In the context of MongoDB, cardinality, which generally refers to the
+number of distinct items in a set, represents the number of possible
+:term:`chunks` into which your data can be partitioned.
 
-- Consider using a ``state`` field:
+For example, consider a collection of data such as an "address book"
+that stores address records:
 
-  This would hold the US state for an address document, as a shard
-  key. This field has a *low cardinality*. All documents that have the
+- Consider the use of a ``state`` field as a shard key:
+
+  The state key's value holds the US state for a given address document.
+  This field has a *low cardinality*, as all documents that have the
   same value in the ``state`` field *must* reside on the same shard,
-  even if the chunk exceeds the chunk size.
+  even if a particular state's chunk exceeds the maximum chunk size.
 
   Because there are a limited number of possible values for this
-  field, it is easier for your data may not be evenly distributed, you
-  risk having data distributed unevenly among a fixed or small number
-  of chunks. In this may have a number of effects:
+  field, you risk having data unevenly distributed among a small,
+  fixed number of chunks. This may have a number of effects:
 
   - If MongoDB cannot split a chunk because it all of its documents
-    have the same shard key, migrations involving these chunk will take
-    longer than other migrations, and it will be more difficult for
-    your data to balance evenly.
+    have the same shard key, migrations involving these un-splittable
+    chunks will take longer than other migrations, and it will be more
+    difficult for your data to stay balanced.
 
-  - If you have a fixed maximum number of chunks you will never be
+  - If you have a fixed maximum number of chunks, you will never be
     able to use more than that number of shards for this collection.
 
-- Consider using the ``postal-code`` field (i.e. zip code:)
+- Consider the use of a ``postal-code`` field (i.e. zip code) as a shard key:
 
   While this field has a large number of possible values, and thus has
   *higher cardinality,* it's possible that a large number of users
   could have the same value for the shard key, which would make this
-  chunk of users un-splitable.
+  chunk of users un-splittable.
 
   In these cases, cardinality depends on the data. If your address
   book stores records for a geographically distributed contact list
@@ -70,18 +72,17 @@ of data such as an "address book" that stores address records:
   more geographically concentrated (e.g "ice cream stores in Boston
   Massachusetts,") then you may have a much lower cardinality.
 
-- Consider using the ``phone-number`` field:
+- Consider the use of a ``phone-number`` field as a shard key:
 
   The contact's telephone number has a *higher cardinality,* because
   most users will have a unique value for this field, MongoDB will be
   able to split in as many chunks as needed.
 
 While "high cardinality," is necessary for ensuring an even
-distribution of data, having a high cardinality does not garen tee
+distribution of data, having a high cardinality does not guarantee
 sufficient :ref:`query isolation `
-or appropriate :ref:`write scaling
-`. Continue reading for more
-information on these topics.
+or appropriate :ref:`write scaling `.
+Continue reading for more information on these topics.
 
 .. index:: shard key; write scaling
 .. _sharding-shard-key-write-scaling:
@@ -94,15 +95,15 @@ the increased write capacity that the shard cluster can provide, while
 others do not. Consider the following example where you shard by the
 default :term:`_id` field, which holds an :term:`ObjectID`.
 
-The ``ObjectID`` holds a value, computed upon creation, that is a
-unique identifier for the object. However, the most significant data in
-this value a is time stamp, which means that they increment
+The ``ObjectID`` holds a value, computed upon document creation, that is a
+unique identifier for the object. However, the most significant bits of data
+in this value represent a time stamp, which means that they increment
 in a regular and predictable pattern. Even though this value has
-:ref:`high cardinality `, when
-this, or *any date or other incrementing number* as the shard key all
-insert operations will always end up on the same shard. As a result,
-the capacity of this node will define the effective capacity of the
-cluster.
+:ref:`high cardinality `, when using
+this, or *any date or other monotonically increasing number* as the shard
+key, all insert operations will write data to a single chunk, and
+therefore to a single shard. As a result, the write capacity of this node
+will define the effective write capacity of the cluster.
 
 In most cases want to avoid these kinds of shard keys, except in some
 situations: For example if you have a very low insert rate, most of
@@ -113,7 +114,7 @@ have *both* high cardinality and that will generally distribute write
 operations across the *entire cluster*.
 
 Typically, a computed shard key that has some amount of "randomness,"
-such as ones that include a cryptograpphic hash (i.e. MD5 or SHA1) of
+such as ones that include a cryptographic hash (i.e. MD5 or SHA1) of
 other content in the document, will allow the cluster to scale write
 operations. However, random shard keys do not typically provide
 :ref:`query isolation `, which is
@@ -122,16 +123,16 @@ another important characteristic of shard keys.
 Querying
 ~~~~~~~~
 
-The :program:`mongos` provides an interface for applications that use
-sharded database instances. The :program:`mongos` hides all of the
-complexity of :term:`partitioning ` from the
-application. The :program:`mongos` receives queries from applications,
-and then using the metadata from the :ref:`config server
-` to route the query to the
-:program:`mongod` instances that provide the :term:`shards
-`. While the :program:`mongos` succeeds in making all querying
-operational in sharded environments, the :term:`shard key` you select
-can have a profound affect on query performance.
+The :program:`mongos` program provides an interface for applications
+that query sharded clusters, and :program:`mongos` hides all of the
+complexity of data :term:`partitioning ` from the
+application. A :program:`mongos` receives queries from applications,
+and then, using the metadata from the :ref:`config server
+`, routes queries to the :program:`mongod`
+instances that hold the appropriate data. While the :program:`mongos`
+succeeds in making all querying operational in sharded environments,
+the :term:`shard key` you select can have a profound effect on query
+performance.
 
 .. seealso:: The ":ref:`mongos and Sharding `" and
    ":ref:`config server `" sections for a more
@@ -153,15 +154,15 @@ application, which can be a long running operation. If your query
 includes the first component of a compound :term:`shard key`
 [#shard-key-index], then the :program:`mongos` can route the query
 directly to a single shard, or a small number of shards, which
-provides much greater performance. Even you query values of the shard
+provides much greater performance. Even if you query values of the shard
 key that reside in different chunks, the :program:`mongos` will route
-queires directly to the specific shard.
+queries directly to the specific shards.
 
-To select a shard key for a collection: determine which fields your
-queries select by most frequently and then which of these operations
+To select a shard key for a collection: determine which fields are included
+most frequently in queries for a given application and which of these operations
 are most performance dependent. If this field is not sufficiently
 selective (i.e. has low cardinality) you can add a second field to the
-compound shard key to make the cluster more splitable.
+compound shard key to make the data more splittable.
 
 .. see:: ":ref:`sharding-mongos`" for more information on query
    operations in the context of sharded clusters.
@@ -197,13 +198,13 @@ are:
 - to ensure that :program:`mongos` can isolate most to specific
   :program:`mongod` instances.
 
-In addition, consider the following operation consideration that the
+In addition, consider the following operational factors that the
 shard key can affect.
 
 Because each shard should be a :term:`replica set`, if a specific
 :program:`mongod` instance fails, the replica set will elect another
-member of that set to :term:`primary` and continue function. However,
-if an entire shard is unreachable or fails for some reason then that
+member of that set to be :term:`primary` and continue to function. However,
+if an entire shard is unreachable or fails for some reason, that
 data will be unavailable. If your shard key distributes data required
 for every operation throughout the cluster, then the failure of the
 entire shard will render the entire cluster unusable. By contrast, if
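+
+For example, you can see which chunk ranges each shard holds from any
+:program:`mongos`, which makes it easier to reason both about how well
+a shard key isolates queries and about the impact of losing a single
+shard. The following is only a minimal sketch; the ``records.addresses``
+namespace is hypothetical and the output format varies between MongoDB
+versions:
+
+.. code-block:: javascript
+
+   // print a summary of shards, databases, and chunk distribution
+   sh.status()
+
+   // or query the cluster metadata in the config database directly
+   var configDB = db.getSiblingDB("config")
+   configDB.chunks.find( { ns: "records.addresses" } ).sort( { min: 1 } )
+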
@@ -236,13 +237,13 @@ are three options:
   is insignificant in your use case given limited write volume,
   expected data size, or query patterns and demands.
 
-From a decision making stand point, begin by finding the the field
+From a decision-making standpoint, begin by finding the field
 that will provide the required :ref:`query isolation
 `, ensure that :ref:`writes will scale across
 the cluster `, and then add an additional field
 to provide additional :ref:`cardinality ` if your
 primary key does not have
-split-ability.
+sufficient split-ability.
 
 .. index:: balancing; internals
 .. _sharding-balancing-internals:
@@ -250,9 +251,9 @@ split-ability.
 Sharding Balancer
 -----------------
 
-The :ref:`balancer ` process is responsible for
+The :ref:`balancer ` is a background process responsible for
 redistributing chunks evenly among the shards and ensuring that each
-member of the cluster is responsible for the same amount of data.
+member of the cluster is responsible for the same volume of data.
 
 This section contains complete documentation of the balancer process
 and operations. For a higher level introduction see
@@ -261,23 +262,21 @@ the :ref:`Balancing ` section.
 Balancing Internals
 ~~~~~~~~~~~~~~~~~~~
 
-The balancer originates from an arbitrary :program:`mongos`
+A balancing round originates from an arbitrary :program:`mongos`
 instance. Because your shard cluster can have a number of
 :program:`mongos` instances, when a balancer process is active it
-creates a "lock" document in the ``locks`` collection of the
-``config`` database on the :term:`config server`.
+acquires a "lock" by modifying a document on the :term:`config server`.
 
 By default, the balancer process is always running. When the number
 of chunks in a collection is unevenly distributed among the shards,
 the balancer begins migrating :term:`chunks` from shards with a
-disproportionate number of chunks to a shard with fewer number of
-chunks. The balancer will continue migrating chunks, one at a time
-beginning with the shard that has the lowest shard key, until the data
-is evenly distributed among the shards (i.e. the difference between
-any two chunks is less than 2 chunks.)
-
-While these automatic chunk migrations crucial for distributing data
-they carry some overhead in terms of bandwidth and system workload,
+disproportionate number of chunks to shards with fewer
+chunks. The balancer will continue migrating chunks, one at a time,
+until the data is evenly distributed among the shards (i.e. the
+difference between any two shards is less than 2 chunks.)
+
+While these automatic chunk migrations are crucial for distributing
+data, they carry some overhead in terms of bandwidth and workload,
 both of which can impact database performance. As a result, MongoDB
 attempts to minimize the effect of balancing by only migrating chunks
 when the disparity between numbers of chunks on a shard is greater
@@ -285,11 +284,11 @@ than 8.
 
 .. index:: balancing; migration
 
-The migration process ensures consistency and maximize availability of
+The migration process ensures consistency and maximizes availability of
 chunks during balancing: when MongoDB begins migrating a chunk, the
 database begins copying the data to the new server and tracks incoming
 write operations. After migrating the chunks, the "from"
-:program:`mongod` sends all new writes, to the "to" server, and *then*
+:program:`mongod` sends all new writes to the receiving server, and *then*
 updates the chunk record in the :term:`config database` to reflect the
 new location of the chunk.
 
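+
+For example, you can observe the balancer and its lock from any
+:program:`mongos`. This is a minimal sketch: the helpers below exist in
+the :program:`mongo` shell, but the exact format of the lock document
+varies between MongoDB versions.
+
+.. code-block:: javascript
+
+   // is the balancer enabled, and is a balancing round in progress?
+   sh.getBalancerState()
+   sh.isBalancerRunning()
+
+   // the balancer lock is stored as a document in the config database
+   db.getSiblingDB("config").locks.find( { _id: "balancer" } )
+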
@@ -301,14 +300,14 @@ Chunk Size
 
 .. TODO link this section to
 
-The default :term:`chunk` size in MongoDB is 64 megabytes.
+The default maximum :term:`chunk` size in MongoDB is 64 megabytes.
 
-When chunks grow beyond the :ref:`specified chunk size
+When chunks grow beyond the :ref:`specified maximum chunk size
 ` a :program:`mongos` instance will split the
 chunk in half, which will eventually lead to migrations, when chunks
 become unevenly distributed among the cluster, the :program:`mongos`
-instances will initiate a round migrations to redistribute data in the
-cluster.
+instances will initiate a round of migrations to redistribute data
+in the cluster.
 
 Chunk size is somewhat arbitrary and must account for the following
 effects:
diff --git a/draft/core/sharding.txt b/draft/core/sharding.txt
index 1d0ebd03611..ad3c10876c0 100644
--- a/draft/core/sharding.txt
+++ b/draft/core/sharding.txt
@@ -250,7 +250,7 @@ The ideal shard key:
 
 - is easily divisible which makes it easy for MongoDB to distribute
   content among the shards. Shard keys that have a limited number of
-  possible values are un-ideal, as they can result in some chunks that
+  possible values are not ideal as they can result in some chunks that
   are "un-splitable." See the ":ref:`sharding-shard-key-cardinality`"
   section for more information.
 
@@ -327,7 +327,7 @@ up time is not required for a functioning shard cluster. As a result,
 backing up the config servers is not difficult. Backups of config
 servers are crucial as shard clusters become totally inoperable when
 you loose all configuration instances and data. Precautions to ensure
-that the config servers remain available and intact are critial.
+that the config servers remain available and intact are critical.
 
 .. index:: mongos
 .. _sharding-mongos:
@@ -403,7 +403,7 @@ possible. Operations have the following targeting characteristics:
 - All single :func:`update() ` operations target to
   one shard. This includes :term:`upsert` operations.
 
-- The :program:`mongos` brodcasts multi-update operations to every
+- The :program:`mongos` broadcasts multi-update operations to every
   shard.
 
-- The :program:`mongos` broadcasts :func:`remove()
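+
+For example, whether a write operation is targeted or broadcast depends
+on whether :program:`mongos` can route it using the shard key. The
+following is a minimal sketch that assumes a hypothetical ``addresses``
+collection sharded on a ``zipcode`` field:
+
+.. code-block:: javascript
+
+   // the query includes the shard key, so mongos can target one shard
+   db.addresses.update( { zipcode: "02114", name: "A. Smith" },
+                        { $set: { verified: true } } )
+
+   // a multi-update is broadcast to every shard
+   db.addresses.update( { verified: false },
+                        { $set: { followUp: true } },
+                        { multi: true } )
+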