From 20bd361045f3a699f87c26ec0a7d63dcde512f96 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Tue, 2 Oct 2012 18:15:24 -0400 Subject: [PATCH 1/2] DOCS-467 new info on repl set troubleshooting --- source/administration/replica-sets.txt | 150 +++++++++++++++++++++---- 1 file changed, 127 insertions(+), 23 deletions(-) diff --git a/source/administration/replica-sets.txt b/source/administration/replica-sets.txt index 908e9227672..0d448dfc9c5 100644 --- a/source/administration/replica-sets.txt +++ b/source/administration/replica-sets.txt @@ -385,7 +385,7 @@ Removing Members ~~~~~~~~~~~~~~~~ You may remove a member of a replica at any time. Use the -:method:`rs.remove()` function in the :program:`mongo` shell while +:method:`rs.remove()` method in the :program:`mongo` shell while connected to the current :term:`primary`. Issue the :method:`db.isMaster()` command when connected to *any* member of the set to determine the current primary. Use a command in either @@ -561,43 +561,76 @@ OpenSSL package to generate "random" content for use in a key file: Key file permissions are not checked on Windows systems. -Troubleshooting ---------------- +Troubleshooting Replica Sets +---------------------------- -This section defines reasonable troubleshooting processes for common -operational challenges. While there is no single causes or guaranteed -response strategies for any of these symptoms, the following sections -provide good places to start a troubleshooting investigation with +This section describes common strategies for troubleshooting :term:`replica sets `. .. seealso:: :doc:`/administration/monitoring`. +.. _replica-set-troubleshooting-check-replication-status: + +Check Replica Set Status +~~~~~~~~~~~~~~~~~~~~~~~~ + +To display the current state of the replica set and current state of +each member, run the :method:`rs.status()` method in a :program:`mongo` +shell connected to the replica set's :term:`primary`. 
For descriptions +of the information displayed by :method:`rs.status()`, see +:doc:`/reference/replica-status`. + +.. note:: The :method:`rs.status()` method is a wrapper that runs the + :doc:`/reference/command/replSetGetStatus` database command. + .. _replica-set-replication-lag: -Replication Lag -~~~~~~~~~~~~~~~ +Check the Replication Lag +~~~~~~~~~~~~~~~~~~~~~~~~~ Replication lag is a delay between an operation on the :term:`primary` and the application of that operation from the :term:`oplog` to the -:term:`secondary`. Such lag can be a significant issue and can +:term:`secondary`. Replication lag can be a significant issue and can seriously affect MongoDB :term:`replica set` deployments. Excessive replication lag makes "lagged" members ineligible to quickly become primary and increases the possibility that distributed read operations will be inconsistent. -Identify replication lag by checking the value of -:data:`members[n].optimeDate` for each member of the replica set -using the :method:`rs.status()` function in the :program:`mongo` -shell. +To check the current duration of replication lag, do one of the following: + +- Run the :doc:`db.printSlaveReplicationInfo() + ` method in a + :program:`mongo` shell connected to the replica set's primary. + + The output displays each member's ``syncedTo`` value, which is the + last time the member read from the oplog, as shown in the following + example: + + .. code-block:: javascript + + source: m1.example.net:30001 + syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT) + = 7475 secs ago (2.08hrs) + source: m2.example.net:30002 + syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT) + = 7475 secs ago (2.08hrs) -Also, you can monitor how fast replication occurs by watching the oplog -time in the "replica" graph in the `MongoDB Monitoring Service`_. Also -see the `documentation for MMS`_. + .. note:: The :method:`rs.status()` method is a wrapper that runs the + :doc:`/reference/command/replSetGetStatus` database command. 
+ +- Run the :method:`rs.status()` method in a :program:`mongo` shell + connected to the replica set's primary. The output displays each + member's `optimeDate` value, which is the last time the member read + from the oplog. + +- Monitor how fast replication occurs by watching the oplog time in the + "replica" graph in the `MongoDB Monitoring Service`_. For more + information see the `documentation for MMS`_. .. _`MongoDB Monitoring Service`: http://mms.10gen.com/ .. _`documentation for MMS`: http://mms.10gen.com/help/ -Possible causes of replication lag include: +If replication lag is too large, check the following: - **Network Latency** @@ -635,9 +668,9 @@ Possible causes of replication lag include: If you are performing a large data ingestion or bulk load operation that requires a large number of writes to the primary, the - secondaries will not be able to read the :term:`oplog` fast enough to keep - up with changes. Setting some level :ref:`write concern `, can - slow the overall progress of the batch, but will prevent the + secondaries will not be able to read the oplog fast enough to keep + up with changes. Setting some level of write concern can + slow the overall progress of the batch but will prevent the secondary from falling too far behind. To prevent this, use write concern so that MongoDB will perform @@ -653,9 +686,79 @@ Possible causes of replication lag include: - :ref:`replica-set-write-concern`. - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document. - The :ref:`replica-set-oplog` topic in the :doc:`/core/replication-internals` document. - - The :ref:`replica-set-procedure-change-oplog-size` topic this document. + - The :ref:`replica-set-procedure-change-oplog-size` topic in this document. - The :doc:`/tutorial/change-oplog-size` tutorial. +.. 
_replica-set-troubleshooting-check-oplog-size: + +Check the Size of the Oplog +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The :term:`oplog` size can be the difference between a :term:`secondary` +staying up-to-date or becoming stale. + +To check the size of the oplog for a given :term:`replica set` member, +connect to the member in a :program:`mongo` shell and run the +:doc:`db.printReplicationInfo() +` method. + +The method displays the size of the oplog and the date ranges of the +operations contained in the oplog. In the following example, the oplog +is about 10MB and is able to fit only about 20 minutes (1200 seconds) of +operations: + +.. code-block:: javascript + + configured oplog size: 10.10546875MB + log length start to end: 1200secs (0.33hrs) + oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT) + oplog last event time: Tue Oct 02 2012 16:31:38 GMT-0400 (EDT) + now: Tue Oct 02 2012 17:04:20 GMT-0400 (EDT) + + +The above example is likely a case where you would want to increase the +size of the oplog. For more information on how oplog size affects +operations, see: + +- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document. +- The :ref:`replica-set-delayed-members` topic in this document. +- The :ref:`replica-set-replication-lag` topic in this document. + +To change oplog size, see :ref:`replica-set-procedure-change-oplog-size` +in this document or see the :doc:`/tutorial/change-oplog-size` tutorial. + +.. _replica-set-troubleshooting-check-connection: + +Test the Connection Between Each Member +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There must be connectivity from every :term:`replica set` member to +every other member in order for replication to work. Problems with +network or firewall rules can prevent this connectivity and prevent +replication from working. To test the connection from every member to +every other member, in both directions, consider the following example: + +.. 
example:: Given a replica set with three members running on three separate + hosts: + + - ``m1.example.net`` + - ``m2.example.net`` + - ``m3.example.net`` + + You can test the connection from ``m1.example.net`` to the other hosts by running + the following operations from ``m1.example.net``: + + .. code-block:: sh + + mongo --host m2.example.net --port 27017" + + mongo --host m3.example.net --port 27017" + + Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``. + + If any of the connections fails, there's a networking or firewall + issue that needs to be diagnosed separately + .. index:: pair: replica set; failover .. _replica-set-failover-administration: .. _failover: @@ -663,7 +766,8 @@ Possible causes of replication lag include: Failover and Recovery ~~~~~~~~~~~~~~~~~~~~~ -.. todo:: Revisit whether this belongs in troubleshooting. Perhaps this should be an H2 before troubleshooting. +.. TODO Revisit whether this belongs in troubleshooting. Perhaps this + should be an H2 before troubleshooting. Replica sets feature automated failover. If the :term:`primary` goes offline or becomes unresponsive and a majority of the original From 68f50ee218e8ba1da1a8b7de7d22e1f51532aa88 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Wed, 3 Oct 2012 15:10:34 -0400 Subject: [PATCH 2/2] DOCS-467 minor: review edits --- source/administration/replica-sets.txt | 76 +++++++++++++++----------- 1 file changed, 43 insertions(+), 33 deletions(-) diff --git a/source/administration/replica-sets.txt b/source/administration/replica-sets.txt index 0d448dfc9c5..9f7b0eed7a8 100644 --- a/source/administration/replica-sets.txt +++ b/source/administration/replica-sets.txt @@ -581,7 +581,7 @@ of the information displayed by :method:`rs.status()`, see :doc:`/reference/replica-status`. .. note:: The :method:`rs.status()` method is a wrapper that runs the - :doc:`/reference/command/replSetGetStatus` database command. + :dbcommand:`replSetGetStatus` database command. .. 
_replica-set-replication-lag:
@@ -596,14 +596,13 @@ replication lag makes "lagged" members ineligible to quickly become
 primary and increases the possibility that distributed read operations
 will be inconsistent.
 
-To check the current duration of replication lag, do one of the following:
+To check the current length of replication lag:
 
-- Run the :doc:`db.printSlaveReplicationInfo()
-  ` method in a
-  :program:`mongo` shell connected to the replica set's primary.
+- In a :program:`mongo` shell connected to the primary, call the
+  :method:`db.printSlaveReplicationInfo()` method.
 
-  The output displays each member's ``syncedTo`` value, which is the
-  last time the member read from the oplog, as shown in the following
+  The output displays the ``syncedTo`` value for each member, which
+  shows when each member last read from the oplog, as shown in the following
   example:
 
  .. code-block:: javascript
 
@@ -616,21 +615,16 @@ To check the current duration of replication lag, do one of the following:
       = 7475 secs ago (2.08hrs)
 
  .. note:: The :method:`rs.status()` method is a wrapper that runs the
-     :doc:`/reference/command/replSetGetStatus` database command.
+     :dbcommand:`replSetGetStatus` database command.
 
-- Run the :method:`rs.status()` method in a :program:`mongo` shell
-  connected to the replica set's primary. The output displays each
-  member's `optimeDate` value, which is the last time the member read
-  from the oplog.
-
-- Monitor how fast replication occurs by watching the oplog time in the
+- Monitor the rate of replication by watching the oplog time in the
   "replica" graph in the `MongoDB Monitoring Service`_. For more
   information see the `documentation for MMS`_.
 
 .. _`MongoDB Monitoring Service`: http://mms.10gen.com/
 .. _`documentation for MMS`: http://mms.10gen.com/help/
 
-If replication lag is too large, check the following:
+Possible causes of replication lag include:
 
 - **Network Latency**
 
@@ -699,31 +693,36 @@ staying up-to-date or becoming stale.
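The arithmetic behind the ``= 7475 secs ago (2.08hrs)`` figure shown earlier can be sketched in plain JavaScript (illustrative only, not mongo-shell code; ``lagSeconds`` is a made-up name, and the primary-side timestamp is back-calculated from the ``7475 secs ago`` figure in the example output):

```javascript
// Compute replication lag in seconds from two timestamps: the time of
// the most recent operation on the primary, and a secondary's syncedTo
// time as reported by db.printSlaveReplicationInfo(). Both arguments
// are Date objects; the sample values reproduce the example output.
function lagSeconds(primaryLastOpTime, secondarySyncedTo) {
  return (primaryLastOpTime.getTime() - secondarySyncedTo.getTime()) / 1000;
}

var lag = lagSeconds(new Date("2012-10-02T13:38:15-04:00"),
                     new Date("2012-10-02T11:33:40-04:00"));
console.log(lag + " secs (" + (lag / 3600).toFixed(2) + "hrs)");
// prints: 7475 secs (2.08hrs)
```

A lag of a few seconds is normal under load; a lag measured in hours, as here, indicates a member that may fall off the back of the oplog.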
To check the size of the oplog for a given :term:`replica set` member,
 connect to the member in a :program:`mongo` shell and run the
-:doc:`db.printReplicationInfo()
-` method.
+:method:`db.printReplicationInfo()` method.
 
-The method displays the size of the oplog and the date ranges of the
+The output displays the size of the oplog and the date ranges of the
 operations contained in the oplog. In the following example, the oplog
-is about 10MB and is able to fit only about 20 minutes (1200 seconds) of
+is about 10MB and is able to fit about 26 hours (94400 seconds) of
 operations:
 
 .. code-block:: javascript
 
     configured oplog size: 10.10546875MB
-    log length start to end: 1200secs (0.33hrs)
+    log length start to end: 94400secs (26.22hrs)
     oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
-    oplog last event time: Tue Oct 02 2012 16:31:38 GMT-0400 (EDT)
-    now: Tue Oct 02 2012 17:04:20 GMT-0400 (EDT)
+    oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
+    now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
 
+The oplog should be long enough to hold all write operations for the
+longest downtime you expect on a secondary. In many cases, an oplog
+should fit at minimum 24 hours of operations. A size of 72 hours is
+often preferred, and it is not uncommon for an oplog to fit a week's
+worth of operations.
+
-The above example is likely a case where you would want to increase the
-size of the oplog. For more information on how oplog size affects
-operations, see:
+For more information on how oplog size affects operations, see:
 
 - The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
 - The :ref:`replica-set-delayed-members` topic in this document.
 - The :ref:`replica-set-replication-lag` topic in this document.
 
+.. note:: You normally want the oplog to be the same size on all
+   members. If you resize the oplog, resize it on all members.
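The ``log length start to end`` figure above can be checked against the longest secondary downtime you expect, as a quick sanity test. A minimal sketch in plain JavaScript (illustrative only; ``oplogWindowCovers`` is a made-up name, and 94400 seconds is the value from the example output):

```javascript
// Given the oplog window in seconds, report whether it covers an
// expected downtime expressed in hours. 94400 seconds is roughly
// 26.22 hours, so it covers a 24-hour outage but not a 72-hour one.
function oplogWindowCovers(logLengthSecs, expectedDowntimeHours) {
  return logLengthSecs / 3600 >= expectedDowntimeHours;
}

console.log(oplogWindowCovers(94400, 24));  // a ~26-hour window covers 24 hours
console.log(oplogWindowCovers(94400, 72));  // but not 72 hours: resize the oplog
```

If the check fails for your expected maintenance window, resize the oplog before taking the secondary down, not after it has already gone stale.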
+
 To change oplog size, see :ref:`replica-set-procedure-change-oplog-size`
 in this document or see the :doc:`/tutorial/change-oplog-size` tutorial.
 
@@ -745,19 +744,30 @@ every other member, in both directions, consider the following example:
    - ``m2.example.net``
    - ``m3.example.net``
 
-   You can test the connection from ``m1.example.net`` to the other hosts by running
-   the following operations from ``m1.example.net``:
+   1. Test the connection from ``m1.example.net`` to the other hosts by running
+      the following operations from ``m1.example.net``:
+
+      .. code-block:: sh
+
+         mongo --host m2.example.net --port 27017
 
-   .. code-block:: sh
+         mongo --host m3.example.net --port 27017
 
-      mongo --host m2.example.net --port 27017"
+   #. Test the connection from ``m2.example.net`` to the other two
+      hosts by running the analogous operations from ``m2.example.net``.
 
-      mongo --host m3.example.net --port 27017"
+      This means you have now tested the connection between
+      ``m2.example.net`` and ``m1.example.net`` twice, but each time
+      from a different direction. This is important for verifying
+      connectivity. Network topologies and firewalls might allow a
+      connection in one direction but not the other. Therefore you must
+      make sure to verify that the connection works in both directions.
 
-   Repeat the process on hosts ``m2.example.net`` and ``m3.example.net``.
+   #. Test the connection from ``m3.example.net`` to the other two
+      hosts by running the operations from ``m3.example.net``.
 
-   If any of the connections fails, there's a networking or firewall
-   issue that needs to be diagnosed separately
+   If a connection in any direction fails, there's a networking or
+   firewall issue that needs to be diagnosed separately.
 
 .. index:: pair: replica set; failover
 .. _replica-set-failover-administration: