Commit 932cc58

DOCS-661 data modeling page
1 parent 1dc2547 commit 932cc58

2 files changed: +62 -56 lines


source/core/data-modeling.txt

Lines changed: 47 additions & 41 deletions
@@ -10,23 +10,26 @@ Overview
 Collections in MongoDB have flexible schema; they do not define nor
 enforce the fields of its documents. Each document can have only the
 fields that are relevant to that entity, although in practice, you
-would generally choose to store similar documents in each collection.
-With this flexible schema, you can model your data to reflect more
-closely the actual entity rather than enforce a rigid data structure.
-
-In MongoDB, data modeling takes into consideration not only how data
-relates to each other, but also how the data is used, how the data will
-grow and be maintained. These considerations involve decisions about
-whether to embed data within a single document or reference data among
-different documents, which fields to index, and whether to use special
-features.
-
-Choosing the correct data model can provide both performance and
-maintenance gains for your applications.
-
-This document provide some general guidelines for data modeling and
-possible options. These guidelines and options may not be appropriate
-for your situation.
+would generally choose to maintain a consistent structure across
+documents in each collection. With this flexible schema, you can model
+your data to reflect more closely the actual application-level entity
+rather than enforce a rigid data structure.
+
+In MongoDB, data modeling takes into consideration not only the
+inherent properties of the data entities themselves and how they relate
+to each other, but also how the data is used, how the data will grow
+and possibly change over time, and how the data will be maintained.
+These considerations involve decisions about whether to embed data
+within a single document or to reference data in different documents,
+which fields to index, and whether to take advantage of rich document
+features, such as arrays.
+
+Choosing the best data model for your application can provide
+significant performance and maintenance advantages.
+
+This document provides some general guidelines and principles for
+schema design and highlights possible data modeling options. Not all
+guidelines and options may be appropriate for your specific situation.
 
 .. _data-modeling-decisions:
 
@@ -46,28 +49,29 @@ Embedding
 De-normalization of data involves embedding documents within other
 documents.
 
-Operations within a document are easy for the server to handle.
+Operations within a document are less expensive for the server than
+operations that involve multiple documents.
 
 In general, choose the embedded data model when:
 
 - you have "contains" relationships between entities. See
   :ref:`data-modeling-example-one-to-one`.
 
 - you have one-to-many relationships where the "many" objects always
-  appear with or are viewed in the context of their parents. See
-  :ref:`data-modeling-example-one-to-many`.
+  appear with or are viewed in the context of their parent documents.
+  See :ref:`data-modeling-example-one-to-many`.
 
 Embedding provides the following benefits:
 
 - Great for read performance
 
 - Single roundtrip to database to retrieve the complete object
 
-However, with embedding, write operations can be slow if you are adding
-objects frequently. Additionally, you cannot embed documents that will
-cause the containing document to exceed the :limit:`maximum BSON
-document size <BSON Document Size>`. For documents that exceed the
-maximum BSON document size, see :doc:`/applications/gridfs`.
+Keep in mind that embedding documents that have unbounded growth over
+time may slow write operations. Additionally, such documents may cause
+their containing documents to exceed the :limit:`maximum BSON document
+size <BSON Document Size>`. For documents that exceed the maximum BSON
+document size, see :doc:`/applications/gridfs`.
 
 For examples in accessing embedded documents, see
 :ref:`read-operations-subdocuments`.
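As an illustration of the embedded model this hunk describes (not part of the patch itself), here is a minimal sketch in plain JavaScript; the `post` and `comments` field names are hypothetical, chosen only to show a one-to-many relationship where the "many" side always appears with its parent:

```javascript
// Embedded (de-normalized) model: the "many" side lives inside its
// parent document, so one read retrieves the complete object in a
// single roundtrip -- the read-performance benefit listed above.
const post = {
  _id: "post1",
  title: "Schema design",
  comments: [ // embedded subdocuments
    { author: "ann", text: "Nice overview." },
    { author: "bob", text: "Watch out for unbounded growth." }
  ]
};

// Everything needed to render the post is already in one document.
function renderablePost(doc) {
  return { title: doc.title, commentCount: doc.comments.length };
}

console.log(renderablePost(post)); // { title: 'Schema design', commentCount: 2 }
```

If `comments` grew without bound, this is exactly the case the revised text warns about: writes slow down and the parent document approaches the BSON size limit.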
@@ -92,17 +96,19 @@ Normalization of data requires storing :doc:`references
 
 In general, choose the referenced data model when:
 
-- embedding would result in duplication of data.
-
+- embedding would result in duplication of data but would not
+  provide sufficient read performance advantages to outweigh the
+  implications of the duplication.
+
 - you have many-to-many relationships.
 
 - you are modeling large hierarchical data. See
   :ref:`data-modeling-trees`.
 
 Referencing provides more flexibility than embedding; however, to
 resolve the references, client-side applications must issue follow-up
-queries. Additionally, the referencing data model involves performing
-many seeks and random reads.
+queries. In other words, using references requires more roundtrips to
+the server.
 
 See :ref:`data-modeling-publisher-and-books` for an example of
 referencing.
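The follow-up query that referencing implies can be sketched in plain JavaScript (again an editor's illustration, not part of the patch; the `publishers` Map and field names are hypothetical, with an in-memory lookup standing in for the second server roundtrip):

```javascript
// Referenced (normalized) model: publisher data is stored once and each
// book points at it by id. Resolving the reference takes a follow-up
// query (an extra roundtrip); a Map stands in for that second query here.
const publishers = new Map([
  ["pub1", { _id: "pub1", name: "Example Press" }]
]);

const book = { _id: 1, title: "Sample Book", publisher_id: "pub1" };

// Client-side "join": fetch the book, then resolve its reference.
function resolveBook(bookDoc) {
  const publisher = publishers.get(bookDoc.publisher_id); // second lookup
  return { title: bookDoc.title, publisher: publisher.name };
}

console.log(resolveBook(book)); // { title: 'Sample Book', publisher: 'Example Press' }
```

Storing the publisher once avoids duplicating it in every book, at the cost of this extra resolution step on every read that needs publisher data.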
@@ -131,8 +137,8 @@ maintenance efforts.
 Data Lifecycle Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Data lifecycle management concerns contribute to the decision making
-process around data modeling.
+Data modeling decisions should also take data lifecycle management into
+consideration.
 
 The :doc:`Time to Live or TTL feature </tutorial/expire-data>` of
 collections expires documents after a period of time. Consider using
@@ -148,7 +154,7 @@ documents based on insertion order.
 Large Number of Collections
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In certain situation, you might choose to store information in several
+In certain situations, you might choose to store information in several
 collections instead of a single collection.
 
 Consider a sample collection ``logs`` that stores log documents for
@@ -208,7 +214,7 @@ you want an index in MongoDB. Indexes in MongoDB are needed for
 efficient query processing, and as such, you may want to think about
 the queries first and then build indexes based upon them. Generally,
 you would index the fields that you query by and the fields that you
-sort by. The ``_id`` field is automatically indexed.
+sort by. A unique index is automatically created on the ``_id`` field.
 
 As you create indexes, consider the following behaviors of indexes:
 
@@ -217,11 +223,11 @@ As you create indexes, consider the following behaviors of indexes:
 - Adding an index has some negative performance impact for write
   operations. For collections with high write-to-read ratio, indexes
   are expensive as each insert must add keys to each index.
-
-- Read operations supported by the index perform better, and read
-  operations not supported by the index have no performance impact from
-  the index. This allows for for collections with high read-to-write
-  ratio to have many indexes.
+
+- Collections with high read-to-write ratio benefit from having many
+  indexes. Read operations supported by the index have high
+  performance, and read operations not supported by the index are
+  unaffected by it.
 
 See :doc:`/applications/indexes` for more information on determining
 indexes. Additionally, MongoDB :wiki:`Database Profiler` provides
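The write-cost argument in this hunk ("each insert must add keys to each index") can be made concrete with a toy cost model; this is an editor's sketch with illustrative unit costs, not measured MongoDB numbers:

```javascript
// Toy cost model: every insert must add one key to every index, so
// write cost grows linearly with the number of indexes, while reads
// supported by an index stay cheap. Unit costs are illustrative only.
function insertCost(numIndexes, baseCost = 1, perIndexKeyCost = 1) {
  return baseCost + numIndexes * perIndexKeyCost;
}

console.log(insertCost(1)); // 2 -- only the automatic _id index
console.log(insertCost(5)); // 6 -- each extra index is paid on every insert
```

This is why the revised bullet reserves "many indexes" for collections with a high read-to-write ratio.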
@@ -337,7 +343,7 @@ the ``parent``.
    }
 
    {
-     patron_id = "joe",
+     patron_id: "joe",
      street: "123 Fake Street",
      city: "Faketon",
      state: "MA",
@@ -354,7 +360,7 @@ the ``parent``.
 
 If your application frequently retrieves the ``address`` data with the
 ``name`` information, then your application needs to issue multiple
-queries to resolve the references. The better data model would be to
+queries to resolve the references. A better schema would be to
 embed the ``address`` data entities in the ``patron`` data, as in the
 following document:
 
@@ -389,7 +395,7 @@ One-to-Many: Referencing
 
 Consider the following example that maps publisher and book
 relationships. The example illustrates the advantage of referencing
-over embedding to prevent the repetition of the publisher information.
+over embedding to avoid repetition of the publisher information.
 
 Embedding the publisher document inside the book document would lead to
 **repetition** of the publisher data, as the following documents show:

source/faq/developers.txt

Lines changed: 15 additions & 15 deletions
@@ -630,29 +630,29 @@ is frequently the choice for:
   with or are viewed in the context of their parents.
 
 You should also consider embedding for performance reasons if you have
-a collection with a large amount of small documents. If small, separate
-documents represent the natural model for the data, then you should
-maintain that model.
+a collection with a large number of small documents. Nevertheless, if
+small, separate documents represent the natural model for the data,
+then you should maintain that model.
 
 If, however, you can group these small documents by some logical
 relationship *and* you frequently retrieve the documents by this
 grouping, you might consider "rolling-up" the small documents into
-larger documents that contain an array of subdocuments. But if you
-often only need to retrieve a subset of the documents within the group,
-then "rolling-up" the documents may not provide better performance.
+larger documents that contain an array of subdocuments. Keep in mind
+that if you often only need to retrieve a subset of the documents
+within the group, then "rolling-up" the documents may not provide
+better performance.
 
-By "rolling up" these small documents into logical groupings, queries
-to retrieve the group of the documents involve sequential reads and
-less random disk accesses.
+"Rolling up" these small documents into logical groupings means that
+queries to retrieve a group of documents involve sequential reads and
+fewer random disk accesses.
 
 .. Will probably need to break up the following sentence:
 
-Additionally, if the individual documents were indexed on common
-fields, then by "rolling up" the documents and moving the common fields
-to the larger document, there would be fewer copies of the common
-fields *and* there would be fewer associated key entries in the
-corresponding index. See :doc:`/core/indexes` for more information on
-indexes.
+Additionally, "rolling up" documents and moving common fields to the
+larger document benefits the index on those fields. There would be
+fewer copies of the common fields *and* there would be fewer associated
+key entries in the corresponding index. See :doc:`/core/indexes` for
+more information on indexes.
 
 .. Commenting out.. If the data is too large to fit entirely in RAM,
    embedding provides better RAM cache utilization.
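The "rolling-up" pattern this hunk revises can be sketched in plain JavaScript (an editor's illustration; the per-hour log grouping and field names are hypothetical, not from the patch):

```javascript
// "Rolling up" small documents into larger ones that hold an array of
// subdocuments, grouped by a logical key -- here, a hypothetical
// per-hour grouping of log entries. One roll-up document per group
// replaces many small documents.
const smallDocs = [
  { hour: "10", minute: 1, msg: "start" },
  { hour: "10", minute: 2, msg: "ok" },
  { hour: "11", minute: 0, msg: "restart" }
];

function rollUp(docs) {
  const groups = new Map();
  for (const d of docs) {
    if (!groups.has(d.hour)) groups.set(d.hour, { _id: d.hour, entries: [] });
    groups.get(d.hour).entries.push({ minute: d.minute, msg: d.msg });
  }
  return [...groups.values()];
}

const rolled = rollUp(smallDocs);
// Retrieving hour "10" now reads one document containing two entries
// instead of two separate small documents.
console.log(rolled.length);            // 2
console.log(rolled[0].entries.length); // 2
```

As the text cautions, this only pays off when you usually retrieve whole groups; fetching a single entry out of a large roll-up document gains nothing.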
