From 20bd361045f3a699f87c26ec0a7d63dcde512f96 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Tue, 2 Oct 2012 18:15:24 -0400 Subject: [PATCH 1/2] DOCS-467 new info on repl set troubleshooting --- source/administration/replica-sets.txt | 150 +++++++++++++++++++++---- 1 file changed, 127 insertions(+), 23 deletions(-) diff --git a/source/administration/replica-sets.txt b/source/administration/replica-sets.txt index 908e9227672..0d448dfc9c5 100644 --- a/source/administration/replica-sets.txt +++ b/source/administration/replica-sets.txt @@ -385,7 +385,7 @@ Removing Members ~~~~~~~~~~~~~~~~ You may remove a member of a replica at any time. Use the -:method:`rs.remove()` function in the :program:`mongo` shell while +:method:`rs.remove()` method in the :program:`mongo` shell while connected to the current :term:`primary`. Issue the :method:`db.isMaster()` command when connected to *any* member of the set to determine the current primary. Use a command in either @@ -561,43 +561,76 @@ OpenSSL package to generate "random" content for use in a key file: Key file permissions are not checked on Windows systems. -Troubleshooting ---------------- +Troubleshooting Replica Sets +---------------------------- -This section defines reasonable troubleshooting processes for common -operational challenges. While there is no single causes or guaranteed -response strategies for any of these symptoms, the following sections -provide good places to start a troubleshooting investigation with +This section describes common strategies for troubleshooting :term:`replica sets `. .. seealso:: :doc:`/administration/monitoring`. +.. _replica-set-troubleshooting-check-replication-status: + +Check Replica Set Status +~~~~~~~~~~~~~~~~~~~~~~~~ + +To display the current state of the replica set and current state of +each member, run the :method:`rs.status()` method in a :program:`mongo` +shell connected to the replica set's :term:`primary`. 
For descriptions +of the information displayed by :method:`rs.status()`, see +:doc:`/reference/replica-status`. + +.. note:: The :method:`rs.status()` method is a wrapper that runs the + :doc:`/reference/command/replSetGetStatus` database command. + .. _replica-set-replication-lag: -Replication Lag -~~~~~~~~~~~~~~~ +Check the Replication Lag +~~~~~~~~~~~~~~~~~~~~~~~~~ Replication lag is a delay between an operation on the :term:`primary` and the application of that operation from the :term:`oplog` to the -:term:`secondary`. Such lag can be a significant issue and can +:term:`secondary`. Replication lag can be a significant issue and can seriously affect MongoDB :term:`replica set` deployments. Excessive replication lag makes "lagged" members ineligible to quickly become primary and increases the possibility that distributed read operations will be inconsistent. -Identify replication lag by checking the value of -:data:`members[n].optimeDate` for each member of the replica set -using the :method:`rs.status()` function in the :program:`mongo` -shell. +To check the current duration of replication lag, do one of the following: + +- Run the :doc:`db.printSlaveReplicationInfo() + ` method in a + :program:`mongo` shell connected to the replica set's primary. + + The output displays each member's ``syncedTo`` value, which is the + last time the member read from the oplog, as shown in the following + example: + + .. code-block:: javascript + + source: m1.example.net:30001 + syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT) + = 7475 secs ago (2.08hrs) + source: m2.example.net:30002 + syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT) + = 7475 secs ago (2.08hrs) -Also, you can monitor how fast replication occurs by watching the oplog -time in the "replica" graph in the `MongoDB Monitoring Service`_. Also -see the `documentation for MMS`_. + .. note:: The :method:`rs.status()` method is a wrapper that runs the + :doc:`/reference/command/replSetGetStatus` database command. 
+ +- Run the :method:`rs.status()` method in a :program:`mongo` shell + connected to the replica set's primary. The output displays each + member's `optimeDate` value, which is the last time the member read + from the oplog. + +- Monitor how fast replication occurs by watching the oplog time in the + "replica" graph in the `MongoDB Monitoring Service`_. For more + information see the `documentation for MMS`_. .. _`MongoDB Monitoring Service`: http://mms.10gen.com/ .. _`documentation for MMS`: http://mms.10gen.com/help/ -Possible causes of replication lag include: +If replication lag is too large, check the following: - **Network Latency** @@ -635,9 +668,9 @@ Possible causes of replication lag include: If you are performing a large data ingestion or bulk load operation that requires a large number of writes to the primary, the - secondaries will not be able to read the :term:`oplog` fast enough to keep - up with changes. Setting some level :ref:`write concern `, can - slow the overall progress of the batch, but will prevent the + secondaries will not be able to read the oplog fast enough to keep + up with changes. Setting some level of write concern can + slow the overall progress of the batch but will prevent the secondary from falling too far behind. To prevent this, use write concern so that MongoDB will perform @@ -653,9 +686,79 @@ Possible causes of replication lag include: - :ref:`replica-set-write-concern`. - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document. - The :ref:`replica-set-oplog` topic in the :doc:`/core/replication-internals` document. - - The :ref:`replica-set-procedure-change-oplog-size` topic this document. + - The :ref:`replica-set-procedure-change-oplog-size` topic in this document. - The :doc:`/tutorial/change-oplog-size` tutorial. +.. 
_replica-set-troubleshooting-check-oplog-size: + +Check the Size of the Oplog +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The :term:`oplog` size can be the difference between a :term:`secondary` +staying up-to-date or becoming stale. + +To check the size of the oplog for a given :term:`replica set` member, +connect to the member in a :program:`mongo` shell and run the +:doc:`db.printReplicationInfo() +` method. + +The method displays the size of the oplog and the date ranges of the +operations contained in the oplog. In the following example, the oplog +is about 10MB and is able to fit only about 20 minutes (1200 seconds) of +operations: + +.. code-block:: javascript + + configured oplog size: 10.10546875MB + log length start to end: 1200secs (0.33hrs) + oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT) + oplog last event time: Tue Oct 02 2012 16:31:38 GMT-0400 (EDT) + now: Tue Oct 02 2012 17:04:20 GMT-0400 (EDT) + + +The above example is likely a case where you would want to increase the +size of the oplog. For more information on how oplog size affects +operations, see: + +- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document. +- The :ref:`replica-set-delayed-members` topic in this document. +- The :ref:`replica-set-replication-lag` topic in this document. + +To change oplog size, see :ref:`replica-set-procedure-change-oplog-size` +in this document or see the :doc:`/tutorial/change-oplog-size` tutorial. + +.. _replica-set-troubleshooting-check-connection: + +Test the Connection Between Each Member +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There must be connectivity from every :term:`replica set` member to +every other member in order for replication to work. Problems with +network or firewall rules can prevent this connectivity and prevent +replication from working. To test the connection from every member to +every other member, in both directions, consider the following example: + +.. 
example:: Given a replica set with three members running on three separate + hosts: + + - ``m1.example.net`` + - ``m2.example.net`` + - ``m3.example.net`` + + You can test the connection from ``m1.example.net`` to the other hosts by running + the following operations from ``m1.example.net``: + + .. code-block:: sh + + mongo --host m2.example.net --port 27017" + + mongo --host m3.example.net --port 27017" + + Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``. + + If any of the connections fails, there's a networking or firewall + issue that needs to be diagnosed separately + .. index:: pair: replica set; failover .. _replica-set-failover-administration: .. _failover: @@ -663,7 +766,8 @@ Possible causes of replication lag include: Failover and Recovery ~~~~~~~~~~~~~~~~~~~~~ -.. todo:: Revisit whether this belongs in troubleshooting. Perhaps this should be an H2 before troubleshooting. +.. TODO Revisit whether this belongs in troubleshooting. Perhaps this + should be an H2 before troubleshooting. Replica sets feature automated failover. If the :term:`primary` goes offline or becomes unresponsive and a majority of the original From 68f50ee218e8ba1da1a8b7de7d22e1f51532aa88 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Wed, 3 Oct 2012 15:10:34 -0400 Subject: [PATCH 2/2] DOCS-467 minor: review edits --- source/administration/replica-sets.txt | 76 +++++++++++++++----------- 1 file changed, 43 insertions(+), 33 deletions(-) diff --git a/source/administration/replica-sets.txt b/source/administration/replica-sets.txt index 0d448dfc9c5..9f7b0eed7a8 100644 --- a/source/administration/replica-sets.txt +++ b/source/administration/replica-sets.txt @@ -581,7 +581,7 @@ of the information displayed by :method:`rs.status()`, see :doc:`/reference/replica-status`. .. note:: The :method:`rs.status()` method is a wrapper that runs the - :doc:`/reference/command/replSetGetStatus` database command. + :dbcommand:`replSetGetStatus` database command. .. 
_replica-set-replication-lag:
@@ -596,14 +596,13 @@ replication lag makes "lagged" members ineligible to quickly become
 primary and increases the possibility that distributed read operations
 will be inconsistent.
 
-To check the current duration of replication lag, do one of the following:
+To check the current length of replication lag:
 
-- Run the :doc:`db.printSlaveReplicationInfo()
-  ` method in a
-  :program:`mongo` shell connected to the replica set's primary.
+- In a :program:`mongo` shell connected to the primary, call the
+  :method:`db.printSlaveReplicationInfo()` method.
 
-  The output displays each member's ``syncedTo`` value, which is the
-  last time the member read from the oplog, as shown in the following
+  The output displays the ``syncedTo`` value for each member, which
+  shows when each member last read from the oplog, as shown in the following
   example:
 
  .. code-block:: javascript
 
@@ -616,21 +615,16 @@ To check the current duration of replication lag, do one of the following:
       = 7475 secs ago (2.08hrs)
 
  .. note:: The :method:`rs.status()` method is a wrapper that runs the
-     :doc:`/reference/command/replSetGetStatus` database command.
+     :dbcommand:`replSetGetStatus` database command.
 
-- Run the :method:`rs.status()` method in a :program:`mongo` shell
-  connected to the replica set's primary. The output displays each
-  member's `optimeDate` value, which is the last time the member read
-  from the oplog.
-
-- Monitor how fast replication occurs by watching the oplog time in the
+- Monitor the rate of replication by watching the oplog time in the
   "replica" graph in the `MongoDB Monitoring Service`_. For more
   information see the `documentation for MMS`_.
 
 .. _`MongoDB Monitoring Service`: http://mms.10gen.com/
 .. _`documentation for MMS`: http://mms.10gen.com/help/
 
-If replication lag is too large, check the following:
+Possible causes of replication lag include:
 
 - **Network Latency**
 
@@ -699,31 +693,36 @@ staying up-to-date or becoming stale.
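The arithmetic behind the ``= 7475 secs ago (2.08hrs)`` figure shown earlier can be sketched in plain JavaScript (illustrative only, not mongo-shell code; ``lagSeconds`` is a made-up name, and the primary-side timestamp is back-calculated from the ``7475 secs ago`` figure in the example output):

```javascript
// Compute replication lag in seconds from two timestamps: the time of
// the most recent operation on the primary, and a secondary's syncedTo
// time as reported by db.printSlaveReplicationInfo(). Both arguments
// are Date objects; the sample values reproduce the example output.
function lagSeconds(primaryLastOpTime, secondarySyncedTo) {
  return (primaryLastOpTime.getTime() - secondarySyncedTo.getTime()) / 1000;
}

var lag = lagSeconds(new Date("2012-10-02T13:38:15-04:00"),
                     new Date("2012-10-02T11:33:40-04:00"));
console.log(lag + " secs (" + (lag / 3600).toFixed(2) + "hrs)");
// prints: 7475 secs (2.08hrs)
```

A lag of a few seconds is normal under load; a lag measured in hours, as here, indicates a member that may fall off the back of the oplog.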
To check the size of the oplog for a given :term:`replica set` member,
 connect to the member in a :program:`mongo` shell and run the
-:doc:`db.printReplicationInfo()
-` method.
+:method:`db.printReplicationInfo()` method.
 
-The method displays the size of the oplog and the date ranges of the
+The output displays the size of the oplog and the date ranges of the
 operations contained in the oplog. In the following example, the oplog
-is about 10MB and is able to fit only about 20 minutes (1200 seconds) of
+is about 10MB and is able to fit about 26 hours (94400 seconds) of
 operations:
 
 .. code-block:: javascript
 
     configured oplog size: 10.10546875MB
-    log length start to end: 1200secs (0.33hrs)
+    log length start to end: 94400secs (26.22hrs)
     oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
-    oplog last event time: Tue Oct 02 2012 16:31:38 GMT-0400 (EDT)
-    now: Tue Oct 02 2012 17:04:20 GMT-0400 (EDT)
+    oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
+    now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
 
+The oplog should be long enough to hold all write operations for the
+longest downtime you expect on a secondary. In many cases, an oplog
+should fit at minimum 24 hours of operations. A size of 72 hours is
+often preferred, and it is not uncommon for an oplog to fit a week's
+worth of operations.
+
-The above example is likely a case where you would want to increase the
-size of the oplog. For more information on how oplog size affects
-operations, see:
+For more information on how oplog size affects operations, see:
 
 - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
 - The :ref:`replica-set-delayed-members` topic in this document.
 - The :ref:`replica-set-replication-lag` topic in this document.
 
+.. note:: You normally want the oplog to be the same size on all
+   members. If you resize the oplog, resize it on all members.
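The ``log length start to end`` figure above can be checked against the longest secondary downtime you expect, as a quick sanity test. A minimal sketch in plain JavaScript (illustrative only; ``oplogWindowCovers`` is a made-up name, and 94400 seconds is the value from the example output):

```javascript
// Given the oplog window in seconds, report whether it covers an
// expected downtime expressed in hours. 94400 seconds is roughly
// 26.22 hours, so it covers a 24-hour outage but not a 72-hour one.
function oplogWindowCovers(logLengthSecs, expectedDowntimeHours) {
  return logLengthSecs / 3600 >= expectedDowntimeHours;
}

console.log(oplogWindowCovers(94400, 24));  // a ~26-hour window covers 24 hours
console.log(oplogWindowCovers(94400, 72));  // but not 72 hours: resize the oplog
```

If the check fails for your expected maintenance window, resize the oplog before taking the secondary down, not after it has already gone stale.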
+
 To change oplog size, see :ref:`replica-set-procedure-change-oplog-size`
 in this document or see the :doc:`/tutorial/change-oplog-size` tutorial.
 
@@ -745,19 +744,30 @@ every other member, in both directions, consider the following example:
    - ``m2.example.net``
    - ``m3.example.net``
 
-   You can test the connection from ``m1.example.net`` to the other hosts by running
-   the following operations from ``m1.example.net``:
+   1. Test the connection from ``m1.example.net`` to the other hosts by running
+      the following operations from ``m1.example.net``:
+
+      .. code-block:: sh
+
+         mongo --host m2.example.net --port 27017
 
-   .. code-block:: sh
+         mongo --host m3.example.net --port 27017
 
-      mongo --host m2.example.net --port 27017"
+   #. Test the connection from ``m2.example.net`` to the other two
+      hosts by running the analogous operations from ``m2.example.net``.
 
-      mongo --host m3.example.net --port 27017"
+      This means you have now tested the connection between
+      ``m2.example.net`` and ``m1.example.net`` twice, but each time
+      from a different direction. This is important for verifying
+      connectivity. Network topologies and firewalls might allow a
+      connection in one direction but not the other. Therefore you must
+      make sure to verify that the connection works in both directions.
 
-   Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``.
+   #. Test the connection from ``m3.example.net`` to the other two
+      hosts by running the operations from ``m3.example.net``.
 
-   If any of the connections fails, there's a networking or firewall
-   issue that needs to be diagnosed separately
+   If a connection in any direction fails, there's a networking or
+   firewall issue that needs to be diagnosed separately.
 
 .. index:: pair: replica set; failover
 .. _replica-set-failover-administration: