Commit 151f07b

DOCSP-43348 -- DB and Application Resilience (#26)
* DOCSP-43348 -- WIP
* DOCSP-43348 -- WIP
* DOCSP-43348 -- WIP
* DOCSP-43348 -- draft without examples
* DOCSP-43348 -- minor edits
* DOCSP-43348 -- add links
* DOCSP-43348 -- fix includes
* DOCSP-43348 -- fix includes
* DOCSP-43348 -- WIP
* DOCSP-43348 -- WIP
* DOCSP-43348 -- copy review revisions
* DOCSP-43348 -- copy review revsions
* DCOSP-43348 -- fix build errors
1 parent 2e471d5 commit 151f07b

File tree

3 files changed

+194
-30
lines changed

snooty.toml

Lines changed: 2 additions & 0 deletions
@@ -302,6 +302,8 @@ piv = ":abbr:`PIV (Personal Identity Verification)`"
302302
prometheus = "`Prometheus <https://prometheus.io/>`__"
303303
rag = ":abbr:`RAG (Retrieval-Augmented Generation)`"
304304
rdp = ":abbr:`RDP (Remote Desktop Protocol)`"
305+
rpo = ":abbr:`RPO (Recovery Point Objective)`"
306+
rto = ":abbr:`RTO (Recovery Time Objective)`"
305307
rtpp = ":abbr:`RTPP (Real-Time Performance Panel)`"
306308
rbi = ":abbr:`RBI (Reserve Bank of India)`"
307309
restapi = ":abbr:`REST (Representational State Transfer)` :abbr:`API (Application Programming Interface)`"
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
To improve the resiliency of your cluster, upgrade your cluster to MongoDB 8.0.
2+
MongoDB 8.0 introduces the following performance improvements and new features
3+
related to resilience:
4+
5+
- `Improved memory management <https://www.mongodb.com/docs/atlas/resilient-application/#std-label-resilient-upgraded-tcmalloc>`__
6+
7+
- `Operation rejection filters <https://www.mongodb.com/docs/atlas/resilient-application/#std-label-resilient-operations-rejection-filters>`__ to reactively mitigate expensive queries
8+
9+
- `Cluster-level timeouts <https://www.mongodb.com/docs/atlas/resilient-application/#std-label-resilient-default-read-timeout>`__ for proactive protection against expensive read operations
10+
11+
- Better workload isolation with the `moveCollection command <https://www.mongodb.com/docs/atlas/resilient-application/#std-label-resilient-move-collection>`__

source/resiliency.txt

Lines changed: 181 additions & 30 deletions
@@ -9,45 +9,196 @@ Application and Database Resiliency
99
.. contents:: On this page
1010
:local:
1111
:backlinks: none
12-
:depth: 1
12+
:depth: 2
1313
:class: onecol
1414

15-
Intro statement
15+
MongoDB |service| is a highly performant database that is designed
16+
to maintain uptime regardless of infrastructure outages, system maintenance,
17+
and more. Use the guidance on this page to plan settings to maximize the
18+
resiliency of your application and database.
1619

17-
{+service+} Features and Best Practices for Resiliency
18-
------------------------------------------------------
20+
{+service+} Features and Recommendations for Resiliency
21+
-------------------------------------------------------
1922

20-
Content here
23+
Features
24+
~~~~~~~~
2125

22-
Examples
23-
--------
26+
Database Replication
27+
````````````````````
2428

25-
The following examples <perform this action> using |service|
26-
:ref:`tools for automation <arch-center-automation>`.
29+
|service| {+clusters+} consist of a minimum of three nodes, and you can increase
30+
the node count to any odd number of nodes you require. |service| first writes data
31+
from your application to a primary node, and then |service| incrementally replicates
32+
and stores that data across all secondary nodes within your {+cluster+}.
2733

28-
These examples also apply other recommended configurations, including:
34+
By default, |service| distributes {+cluster+} nodes across availability zones within
35+
one of your chosen cloud provider's availability regions. For example, if your
36+
{+cluster+} is deployed to the cloud provider region ``us-east``, |service| deploys
37+
nodes to ``us-east-a``, ``us-east-b`` and ``us-east-c`` by default.
2938

30-
.. tabs::
39+
To learn more about high availability and node distribution across regions,
40+
see :ref:`arch-center-high-availability`.
3141

32-
.. tab:: Dev and Test Environments
33-
:tabid: devtest
42+
Self-Healing Deployments
43+
````````````````````````
3444

35-
.. include:: /includes/shared-settings-clusters-devtest.rst
45+
|service| {+clusters+} must consist of an odd number of nodes so that elections
46+
can be decided by a majority vote without ties. Only one node can be elected as the
47+
primary node, to and from which your application writes and reads directly.
3648

37-
.. tab:: Staging and Prod Environments
38-
:tabid: stagingprod
39-
40-
.. include:: /includes/shared-settings-clusters-stagingprod.rst
41-
42-
.. tabs::
43-
44-
.. tab:: CLI
45-
:tabid: cli
46-
47-
Content here
48-
49-
.. tab:: Terraform
50-
:tabid: Terraform
51-
52-
Content here
49+
In the event that a primary node is unavailable, because of infrastructure
50+
outages, maintenance windows or any other reason, |service| {+clusters+} self-heal
51+
by converting an existing secondary node into your primary node to maintain
52+
database availability.
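
As a rough illustration of why odd node counts are used, the following sketch computes the majority needed to elect a primary and the number of nodes that can be unavailable while an election can still succeed. It assumes every node holds exactly one vote; the function names are hypothetical and are not part of any |service| API.

```python
def election_majority(voting_nodes: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting nodes."""
    return voting_nodes // 2 + 1


def fault_tolerance(voting_nodes: int) -> int:
    """Nodes that can be unavailable while a primary can still be elected."""
    return voting_nodes - election_majority(voting_nodes)


# Compare a few cluster sizes (assumes one vote per node).
for nodes in (3, 4, 5):
    print(nodes, election_majority(nodes), fault_tolerance(nodes))
```

Under these assumptions, a four-node cluster tolerates the same single node failure as a three-node cluster, so the extra even node adds cost without adding election resilience.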
53+
54+
Maintenance Window Uptime
55+
`````````````````````````
5356

57+
|service| maintains uptime during scheduled maintenance by applying updates in
58+
a rolling fashion to one node at a time. During this process, |service| elects a new
59+
primary when necessary, just as it does during an unplanned primary node
60+
outage.
61+
62+
Monitoring
63+
``````````
64+
65+
|service| provides `built-in tools <https://www.mongodb.com/docs/atlas/monitoring-alerts/>`__
66+
to monitor {+cluster+} performance, query performance and more. Additionally, |service|
67+
integrates easily with `third-party services <https://www.mongodb.com/docs/atlas/tutorial/third-party-service-integrations/#std-label-third-party-integrations>`__.
68+
69+
By actively monitoring your {+clusters+}, you can gain valuable insights into
70+
query and deployment performance. To learn more about monitoring in |service|, see
71+
`Monitor Your Clusters <https://www.mongodb.com/docs/atlas/monitoring-alerts/>`__
72+
and :ref:`Monitoring and Alerts <arch-center-monitoring-alerts>`.
73+
74+
Deployment Resilience Testing
75+
`````````````````````````````
76+
77+
You can simulate various scenarios that require disaster recovery workflows in
78+
order to measure your preparedness for such events. Specifically, with |service|
79+
you can `test primary node failover <https://www.mongodb.com/docs/atlas/tutorial/test-resilience/test-primary-failover/#std-label-test-failover>`__
80+
and `simulate regional outages <https://www.mongodb.com/docs/atlas/tutorial/test-resilience/simulate-regional-outage/#std-label-test-outage>`__.
81+
82+
Cluster Termination Safeguards
83+
``````````````````````````````
84+
85+
You can prevent accidental deletion of |service| {+clusters+} by enabling
86+
`termination protection <https://www.mongodb.com/docs/atlas/cluster-additional-settings/#termination-protection>`__.
87+
To delete a cluster that has termination protection enabled, you must first
88+
disable termination protection. By default, |service| disables termination protection
89+
for all clusters.
90+
91+
Database Backups
92+
````````````````
93+
94+
|service| Cloud Backups facilitate cloud backup storage using the native
95+
snapshot functionality of the cloud service provider on which your {+cluster+} is
96+
deployed. For example, if your cluster is deployed to AWS, you can elect to
97+
back up your {+cluster+}'s data with snapshots taken at configurable intervals
98+
in AWS S3.
99+
100+
To learn more about database backup and snapshot retrieval, see `Back Up Your Cluster <https://www.mongodb.com/docs/atlas/backup/cloud-backup/overview/>`__.
101+
102+
Recommendations
103+
~~~~~~~~~~~~~~~
104+
105+
Use MongoDB 8.0
106+
````````````````
107+
108+
.. include:: /includes/cloud-docs/cluster-resilience.rst
109+
110+
Connecting Your Application to |service|
111+
`````````````````````````````````````````
112+
113+
We recommend that you use the most `current driver version <https://www.mongodb.com/docs/drivers/>`__
114+
for your application's programming language whenever possible. And while the
115+
default connection string |service| provides is a good place to start, you might
116+
want to tune it for performance in the context of your specific application
117+
and deployment architecture. `Tuning your connection pool settings <https://www.mongodb.com/docs/manual/tutorial/connection-pool-performance-tuning/>`__
118+
is particularly important in the context of enterprise-level application deployments.
119+
120+
Connection Pool Considerations for Performant Applications
121+
```````````````````````````````````````````````````````````
122+
123+
Opening database client connections is one of the most resource-intensive processes
124+
involved in maintaining a client connection pool that facilitates application access
125+
to your |service| {+cluster+}.
126+
127+
Because of this, it is worth thinking about how and when you would like this
128+
process of opening client connections to unfold in the context of your specific
129+
application.
130+
131+
For example, if you are scaling your |service| {+cluster+} to meet user demand,
132+
consider the minimum connection pool size your application will
133+
consistently need, so that when the connection pool scales, the additional
134+
networking and compute load that comes with opening new client connections
135+
doesn't undermine your application's time-sensitive need for increased
136+
database operations.
137+
138+
Min and Max Connection Pool Size
139+
````````````````````````````````
140+
141+
If your ``minPoolSize`` and ``maxPoolSize`` values are similar, the majority of your
142+
database client connections will open at application startup. In turn, the
143+
additional networking load that comes with opening such connections will happen
144+
at the same time. However, if there is a large range in size between your
145+
minimum and maximum pool size, additional connections are opened more frequently
146+
during application runtime.
147+
148+
This process of incrementally increasing your connection pool size during
149+
application runtime distributes the total workload of connecting clients from
150+
your application to |service| over a longer period of time, which often makes it
151+
manageable for a given use case, but it is important to note that the associated
152+
increase in network load occurs during application runtime, which has
153+
the potential to impact perceived database performance and, by extension, application
154+
performance for end-users.
155+
156+
Your application's architecture is central to this consideration. If, for example,
157+
you deploy your application as microservices in an elastic environment, consider
158+
which services should call |service| directly as a means of controlling the
159+
dynamic expansion and contraction of your connection pool.
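
The pool size trade-off described above can be sketched as a connection string that sets the pool bounds explicitly. This is a minimal sketch: the host name and the numeric values are placeholders chosen for illustration, credentials are omitted, and the helper function is hypothetical. ``minPoolSize``, ``maxPoolSize``, and ``maxIdleTimeMS`` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode

# Hypothetical host; replace with the host from your own connection string.
HOST = "cluster0.example.mongodb.net"


def build_pooled_uri(host: str, min_pool_size: int, max_pool_size: int,
                     max_idle_time_ms: int) -> str:
    """Return a connection string with explicit connection pool settings."""
    if min_pool_size > max_pool_size:
        raise ValueError("minPoolSize must not exceed maxPoolSize")
    options = {
        "minPoolSize": min_pool_size,      # connections kept open from startup
        "maxPoolSize": max_pool_size,      # upper bound during runtime growth
        "maxIdleTimeMS": max_idle_time_ms, # idle time before a connection closes
    }
    return f"mongodb+srv://{host}/?{urlencode(options)}"


# Similar min and max: most connections open at startup.
startup_heavy = build_pooled_uri(HOST, 40, 50, 60_000)
# Wide range: connections open incrementally during runtime.
runtime_growth = build_pooled_uri(HOST, 5, 100, 60_000)
print(startup_heavy)
print(runtime_growth)
```

Which of the two shapes fits better depends on whether your application can absorb connection setup cost at startup or during runtime.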
160+
161+
Query Timeout
162+
`````````````
163+
Almost invariably, workload-specific queries from your application will vary in
164+
terms of the amount of time they take to execute in |service| and in terms of
165+
the amount of time your application can wait for a response.
166+
167+
Consider defining query classes that handle categories or buckets of similar
168+
request requirements. For example, you can define a query category with a fast
169+
timeout for end-user driven requests, a middle-tier timeout bucket for general
170+
purpose requests, and a long-running query class for things like analytics
171+
queries that require the most time to execute in |service|.
172+
173+
You can set `query timeout <https://www.mongodb.com/docs/manual/tutorial/query-documents/specify-query-timeout/>`__
174+
behavior globally in |service|, and you can also define it at the query level.
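
One way to express the query classes described above is a small lookup that attaches a per-class time limit to each command document. This is a sketch under stated assumptions: the class names, the millisecond values, and the ``orders`` collection are hypothetical. ``maxTimeMS`` is the standard field for limiting the execution time of a command such as ``find``.

```python
from enum import Enum


class QueryClass(Enum):
    """Hypothetical timeout buckets, in milliseconds."""
    INTERACTIVE = 500      # end-user driven requests
    GENERAL = 5_000        # general-purpose requests
    ANALYTICS = 300_000    # long-running analytics queries


def with_timeout(command: dict, query_class: QueryClass) -> dict:
    """Return a copy of the command with the class's time limit applied."""
    return {**command, "maxTimeMS": query_class.value}


# Hypothetical find command against an "orders" collection.
find_open_orders = {"find": "orders", "filter": {"status": "open"}}
interactive_cmd = with_timeout(find_open_orders, QueryClass.INTERACTIVE)
analytics_cmd = with_timeout(find_open_orders, QueryClass.ANALYTICS)
print(interactive_cmd)
print(analytics_cmd)
```

Keeping the limits in one place makes it easier to adjust a whole category of requests without touching individual queries.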
175+
176+
Retryable Database Reads and Writes
177+
```````````````````````````````````
178+
179+
|service| supports `retryable read <https://www.mongodb.com/docs/manual/core/retryable-reads/>`__
180+
and `retryable write <https://www.mongodb.com/docs/manual/core/retryable-writes/>`__
181+
operations. When enabled, |service| retries read and write operations once as a
182+
safeguard against intermittent network outages.
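
The retry-once behavior described above can be illustrated conceptually as follows. This is not driver code: it is a simplified sketch that treats Python's built-in ``ConnectionError`` as the transient failure and uses a hypothetical ``flaky_read`` function to simulate a brief network interruption. In practice, retry behavior is controlled through the ``retryWrites`` and ``retryReads`` connection string options rather than hand-written wrappers.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_single_retry(operation: Callable[[], T]) -> T:
    """Run the operation, retrying exactly once on a connection error."""
    try:
        return operation()
    except ConnectionError:
        # Second and final attempt; any error raised here propagates.
        return operation()


attempts = 0


def flaky_read() -> str:
    """Hypothetical read that fails on the first call only."""
    global attempts
    attempts += 1
    if attempts == 1:
        raise ConnectionError("simulated intermittent network outage")
    return "document"


result = run_with_single_retry(flaky_read)
print(result, attempts)
```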
183+
184+
Configure Read and Write Concerns
185+
`````````````````````````````````
186+
187+
|service| {+clusters+} eventually replicate all data across all nodes. However,
188+
you can configure the number of nodes across which data must be replicated before
189+
a read or write operation is reported to have been successful. You can define
190+
`read concerns <https://www.mongodb.com/docs/manual/reference/read-concern/>`__ and
191+
`write concerns <https://www.mongodb.com/docs/manual/reference/write-concern/>`__
192+
globally in |service|, and you can also define them at the client level in your
193+
connection string.
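
As a sketch of the client-level configuration mentioned above, the following builds a connection string that requests majority read and write concerns. The host name and the timeout value are placeholders, and credentials are omitted. ``w``, ``wtimeoutMS``, and ``readConcernLevel`` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode

# Hypothetical host; replace with the host from your own connection string.
CONCERN_HOST = "cluster0.example.mongodb.net"

concern_options = {
    "w": "majority",                 # writes acknowledged by a majority of nodes
    "wtimeoutMS": 5000,              # placeholder limit on waiting for that acknowledgment
    "readConcernLevel": "majority",  # reads return majority-acknowledged data
}

concern_uri = f"mongodb+srv://{CONCERN_HOST}/?{urlencode(concern_options)}"
print(concern_uri)
```

Stricter concerns trade some latency for stronger durability and consistency guarantees, so the right values depend on the workload.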
194+
195+
Disaster Recovery
196+
`````````````````
197+
198+
We strongly recommend that you prepare a comprehensive disaster recovery (DR) plan
199+
that includes elements such as `data backup policies <https://www.mongodb.com/docs/atlas/backup/cloud-backup/backup-compliance-policy/#std-label-backup-compliance-policy>`__,
200+
your designated recovery point objective (RPO), your designated recovery time objective (RTO), and any automated processes that
201+
facilitate alignment with these objectives.
202+
203+
To learn more about DR with |service|, see
204+
`Data Resilience with MongoDB <https://www.mongodb.com/resources/products/capabilities/data-resilience-strategy-with-mongodb-atlas>`__.
