From 885b07fe20deca2342fc9bb3fb7fee4f9e46f736 Mon Sep 17 00:00:00 2001 From: kay Date: Thu, 13 Dec 2012 10:48:26 -0500 Subject: [PATCH 1/3] DOCS-661 data modeling page --- source/applications.txt | 1 + source/core/data-modeling.txt | 761 +++++++++++++++++++++ source/faq/developers.txt | 45 ++ source/reference/configuration-options.txt | 7 +- source/reference/limits.txt | 5 +- source/reference/mongod.txt | 10 +- 6 files changed, 820 insertions(+), 9 deletions(-) create mode 100644 source/core/data-modeling.txt diff --git a/source/applications.txt b/source/applications.txt index 3dfd2f93165..3ad56429d05 100644 --- a/source/applications.txt +++ b/source/applications.txt @@ -31,6 +31,7 @@ The following documents outline basic application development topics: :maxdepth: 2 applications/drivers + core/data-modeling applications/database-references applications/gridfs diff --git a/source/core/data-modeling.txt b/source/core/data-modeling.txt new file mode 100644 index 00000000000..c888bd57dd4 --- /dev/null +++ b/source/core/data-modeling.txt @@ -0,0 +1,761 @@ +============= +Data Modeling +============= + +.. default-domain:: mongodb + +Overview +-------- + +In MongoDB, the schema design takes into consideration both the data to +be modeled *and* the use cases. Schema design determines the structure +of the documents, the number and types of collections, indexing and +sharding. + +.. _data-modeling-decisions: + +Data Modeling Decisions +----------------------- + +Data modeling decisions involve determining how to structure the +documents to effectively model the data. The primary decision is +whether to :ref:`embed ` or to :ref:`link +` related data. + +.. _data-modeling-embedding: + +Embedding +~~~~~~~~~ + +Embedding, or de-normalization of data, is a bit like "prejoined" data. +Operations within a document are easy for the server to handle. +Embedding allows for large sequential reads. 
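For instance (a hypothetical ``order`` document, not taken from the examples below), an embedded design keeps related data in one document, so a single read returns everything; the snippet is plain JavaScript rather than the :program:`mongo` shell:

```javascript
// Hypothetical "order" document with its line items embedded.
// One sequential read returns the order and all related items;
// no follow-up query is required.
var order = {
  _id: 1001,
  customer: "joe",
  items: [
    { sku: "B-1", qty: 2, price: 10 },
    { sku: "B-2", qty: 1, price: 25 }
  ]
};

// The related data is immediately available on the same object.
var total = order.items.reduce(function (sum, item) {
  return sum + item.qty * item.price;
}, 0);
console.log(total); // 45
```

The equivalent normalized design would require a second query against an ``order_items`` collection before the total could be computed.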
+ +Embedding is frequently the choice for: + +- "contains" relationships between entities. See + :ref:`data-modeling-example-one-to-one`. + +- one-to-many relationships when the "many" objects always appear with + or are viewed in the context of their parents. See + :ref:`data-modeling-example-many-addresses`. + +Embedding provides the following benefits: + +- Great for read performance + +- Single roundtrip to database to retrieve the complete object + +However, embedding presents some considerations: + +- Writes can be slow if adding to objects frequently. + +- You cannot embed documents that will cause the containing document to + exceed the :limit:`maximum BSON document size `. + For documents that exceed the maximum BSON document size, see + :doc:`/applications/gridfs`. + +- You cannot have nested embedded levels more than the specified + :limit:`limit on embedded levels `. + +For examples in accessing embedded documents, see +:ref:`read-operations-subdocuments`. + +.. _data-modeling-linking: + +Linking +~~~~~~~ + +Linking, or normalization of data, joins separate documents using +:doc:`references ` or links. Links +are processed client-side by the application; the application does this +by issuing a follow-up query. Linking involves performing many seeks +and random reads. + +Linking is frequently the choice for: + +- embedding would result in duplication of data. + +- many-to-many relationships. + +See :ref:`data-modeling-publisher-and-books` for example of linking. + +Linking provides more flexibility than embedding; however, linking +requires client-side processing to resolve the link by issuing +follow-up queries to retrieve the entire object. + +.. _data-modeling-atomicity: + +Atomicity +~~~~~~~~~ + +Atomicity influences the decision to embed or link. The modification of +a single document is atomic, even if the write operation modifies +multiple sub-documents *within* the single document. 
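As a sketch of why co-located fields matter, the following plain JavaScript only *simulates* the server-side ``$inc``/``$push`` semantics (the ``book`` fields here are hypothetical); because both fields live in one document, a single update changes them together:

```javascript
// Hypothetical document: "available" and "checkout" live together,
// so one document-level update can modify both atomically.
var book = { _id: 123456789, available: 3, checkout: [] };

// Simulation only: applies $inc and $push the way a single
// atomic server-side update would. Not a real driver call.
function applyUpdate(doc, update) {
  Object.keys(update.$inc || {}).forEach(function (field) {
    doc[field] += update.$inc[field];
  });
  Object.keys(update.$push || {}).forEach(function (field) {
    doc[field].push(update.$push[field]);
  });
  return doc;
}

applyUpdate(book, {
  $inc: { available: -1 },
  $push: { checkout: { by: "joe", date: "2012-10-15" } }
});

console.log(book.available);       // 2
console.log(book.checkout.length); // 1
```

Had ``available`` and ``checkout`` been split across two documents, the application would need two writes, with a window in which only one had applied.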
+
+Embed fields relevant to the atomic operation in the same document.
+
+Operational Considerations
+--------------------------
+
+Operational considerations involve the data lifecycle managements,
+determining the number of collections,
+
+Data Lifecycle Management
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Data lifecycle management determines the type of collection or
+collection feature to implement in your schema design.
+
+Some data may only need to persist in a database for a limited period
+of time. In these cases, you should consider using the :doc:`Time to
+Live or TTL feature` of collections.
+
+Additionally, if you will be only concerned with the most recent
+documents, you should consider :doc:`/core/capped-collections`. Capped
+collections provide *first-in-first-out* management of inserted
+documents and support high-throughput operations that insert, read,
+and delete documents based on insertion order.
+
+Large Number of Collections
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In certain situations, you might choose to store information in several
+collections instead of a single collection.
+
+Consider a sample collection ``logs`` that stores log documents for
+various environments and applications. The ``logs`` collection contains
+documents of the following form:
+
+.. code-block:: javascript
+
+   { log: "dev", ts: ..., info: ... }
+   { log: "debug", ts: ..., info: ... }
+
+If the number of different logs is not too high, you may decide to have
+separate log collections, such as ``logs.dev`` and ``logs.debug``. The
+``logs.dev`` collection would contain only the log documents related to
+the dev environment.
+
+Generally, having a large number of collections has no significant
+performance penalty and can result in very good performance.
+Independent collections are very important for high-throughput batch
+processing.
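A small sketch of the idea in plain JavaScript (the ``logs.dev`` and ``logs.debug`` names follow the example above): routing documents to per-environment collections means a reader of ``logs.dev`` never touches ``debug`` documents:

```javascript
// Sketch: route log documents to per-environment collection names
// instead of filtering one big "logs" collection on a "log" field.
var entries = [
  { log: "dev",   ts: 1, info: "a" },
  { log: "debug", ts: 2, info: "b" },
  { log: "dev",   ts: 3, info: "c" }
];

var collections = {};
entries.forEach(function (entry) {
  var name = "logs." + entry.log; // e.g. "logs.dev"
  (collections[name] = collections[name] || []).push(entry);
});

// A consumer of "logs.dev" scans only dev documents.
console.log(Object.keys(collections)); // [ 'logs.dev', 'logs.debug' ]
```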
+
+When creating large numbers of collections, consider the following
+items:
+
+- Each collection has a minimum overhead of a few kilobytes.
+
+- Each index requires at least 8KB of data space as the B-tree page
+  size is 8KB.
+
+- :limit:`Limits on the number of namespaces `
+  exist:
+
+  - Each index, as well as each collection, counts as a namespace.
+
+  - Each namespace is 628 bytes.
+
+  - Namespaces are stored per database in a ``.ns`` file:
+
+    - The ``.ns`` file defaults to 16 MB.
+
+    - To change the size of the ``.ns`` file, use
+      :option:`--nssize \ ` on server
+      startup.
+
+  .. note::
+
+     - The maximum size of the ``.ns`` file is 2047 MB.
+
+     - :option:`--nssize ` sets the size used for
+       *new* ``.ns`` files. For existing databases, after
+       starting up the server with :option:`--nssize `,
+       run the :method:`db.repairDatabase()` method from the
+       :program:`mongo` shell.
+
+You can check the number of namespaces by querying the
+``system.namespaces`` collection, as in the following example:
+
+.. code-block:: javascript
+
+   db.system.namespaces.count()
+
+Indexes
+~~~~~~~
+
+As a general rule, where you want an index in a relational database,
+you want an index in MongoDB. Indexes in MongoDB are needed for
+efficient query processing; as such, you may want to think about the
+queries first and then build indexes based upon them. Generally, you
+would index the fields that you query by and the fields that you sort
+by.
+
+As you create indexes, consider the following aspects of indexes:
+
+- Each index requires at least 8KB of data space as the B-tree page
+  size is 8KB.
+
+- The ``_id`` field is automatically indexed.
+
+- Adding an index slows write operations but not read operations:
+
+  - This allows collections with a high read-to-write ratio to have
+    many indexes.
+
+  - For collections with a high write-to-read ratio, indexes are
+    expensive as each insert must add keys to each index.
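To see why an index speeds reads but taxes writes, here is a toy model in plain JavaScript (not server internals) of an index as an ordered list of keys; every insert into the collection must also update every such list:

```javascript
// Toy model: a collection plus one "index" on publisher_id,
// kept as [key, _id] pairs in sorted order.
var docs = [];
var publisherIdIndex = [];

function insertDoc(doc) {
  docs.push(doc);
  // Write cost: every index on the collection must also be updated.
  publisherIdIndex.push([doc.publisher_id, doc._id]);
  publisherIdIndex.sort();
}

insertDoc({ _id: 1, publisher_id: "oreilly", pages: 216 });
insertDoc({ _id: 2, publisher_id: "apress", pages: 350 });
insertDoc({ _id: 3, publisher_id: "oreilly", pages: 68 });

// Read benefit: an ordered structure supports key lookups without
// scanning every document (a real B-tree does this in log time).
var matches = publisherIdIndex.filter(function (entry) {
  return entry[0] === "oreilly";
}).map(function (entry) { return entry[1]; });

console.log(matches); // [ 1, 3 ]
```

The second index you add doubles this per-insert bookkeeping, which is why write-heavy collections should carry few indexes.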
+
+See :doc:`/applications/indexes` for more information on determining
+indexes. Additionally, the MongoDB :wiki:`Database Profiler` provides
+information for determining if an index is needed.
+
+Sharding
+~~~~~~~~
+
+MongoDB's sharding system allows users to :term:`partition` a
+:term:`collection` within a database to distribute the collection's
+documents across a number of :program:`mongod` instances or
+:term:`shards `. A BSON document (which may have significant
+amounts of embedding) resides on one and only one shard.
+
+Sharding provides the following benefits:
+
+- increases write capacity,
+
+- provides the ability to support larger working sets, and
+
+- raises the limits of total data size beyond the physical resources of
+  a single node.
+
+When a collection is sharded, the shard key determines how the
+collection is partitioned among shards. Typically (but not always)
+queries on a sharded collection involve the shard key as part of the
+query expression.
+
+See :doc:`/core/sharding` for more information on sharding.
+
+Document Growth
+~~~~~~~~~~~~~~~
+
+When a document outgrows its allocated space, MongoDB must move the
+document on disk, which can be time and resource intensive relative to
+other operations.
+
+See :doc:`/use-cases/pre-aggregated-reports` for some approaches to
+handling document growth.
+
+Patterns and Examples
+---------------------
+
+.. _data-modeling-example-one-to-one:
+
+One-to-one: Patron and Address
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the following linking example with two data objects, a
+``patron`` and the corresponding ``address``, that have a one-to-one
+relationship.
+
+.. code-block:: javascript
+
+   patron = {
+      _id: "joe",
+      name: "Joe Bookreader"
+   }
+
+   address = {
+      patron_id: "joe",
+      street: "123 Fake Street",
+      city: "Faketon",
+      state: "MA",
+      zip: 12345
+   }
+
+Because ``address`` is owned by the ``patron``, this one-to-one
+relationship may be better modeled by embedding the ``address``
+document in the ``patron`` document, as in the following document:
+
+.. code-block:: javascript
+
+   patron = {
+      _id: "joe",
+      name: "Joe Bookreader",
+      address: {
+         street: "123 Fake Street",
+         city: "Faketon",
+         state: "MA",
+         zip: 12345
+      }
+   }
+
+.. _data-modeling-example-one-to-many:
+
+One-to-Many
+~~~~~~~~~~~
+
+.. _data-modeling-example-many-addresses:
+
+Patron with Many Addresses
+``````````````````````````
+
+Consider the following example with three data objects, a ``patron``
+and two separate ``address`` documents, that have a one-to-many
+relationship:
+
+.. code-block:: javascript
+
+   patron = {
+      _id: "joe",
+      name: "Joe Bookreader"
+   }
+
+   address1 = {
+      patron_id: "joe",
+      street: "123 Fake Street",
+      city: "Faketon",
+      state: "MA",
+      zip: 12345
+   }
+
+   address2 = {
+      patron_id: "joe",
+      street: "1 Some Other Street",
+      city: "Boston",
+      state: "MA",
+      zip: 12345
+   }
+
+This one-to-many relationship may be modeled by embedding the
+``address`` documents in the ``patron`` document, as in the following
+document:
+
+.. code-block:: javascript
+
+   patron = {
+      _id: "joe",
+      name: "Joe Bookreader",
+      addresses: [
+         {
+            street: "123 Fake Street",
+            city: "Faketon",
+            state: "MA",
+            zip: 12345
+         },
+         {
+            street: "1 Some Other Street",
+            city: "Boston",
+            state: "MA",
+            zip: 12345
+         }
+      ]
+   }
+
+.. _data-modeling-publisher-and-books:
+
+Modeling Publishers and Books
+`````````````````````````````
+
+Consider the following example that models the one-to-many
+relationship between publishers and books.
+
+.. code-block:: javascript
+
+   publisher = {
+      name: "O'Reilly Media",
+      founded: 1980,
+      location: "CA"
+   }
+
+   book1 = {
+      title: "MongoDB: The Definitive Guide",
+      author: [ "Kristina Chodorow", "Mike Dirolf" ],
+      published_date: ISODate("2010-09-24"),
+      pages: 216,
+      language: "English"
+   }
+
+   book2 = {
+      title: "50 Tips and Tricks for MongoDB Developers",
+      author: "Kristina Chodorow",
+      published_date: ISODate("2011-05-06"),
+      pages: 68,
+      language: "English"
+   }
+
+Embedding the publisher document inside the book document would lead to
+**repetition** of the publisher data in the books published by the
+particular publisher, as the following documents show:
+
+.. code-block:: javascript
+   :emphasize-lines: 7-11,20-24
+
+   book1 = {
+      title: "MongoDB: The Definitive Guide",
+      author: [ "Kristina Chodorow", "Mike Dirolf" ],
+      published_date: ISODate("2010-09-24"),
+      pages: 216,
+      language: "English",
+      publisher: {
+         name: "O'Reilly Media",
+         founded: 1980,
+         location: "CA"
+      }
+   }
+
+   book2 = {
+      title: "50 Tips and Tricks for MongoDB Developers",
+      author: "Kristina Chodorow",
+      published_date: ISODate("2011-05-06"),
+      pages: 68,
+      language: "English",
+      publisher: {
+         name: "O'Reilly Media",
+         founded: 1980,
+         location: "CA"
+      }
+   }
+
+To avoid repetition of the publisher data, use *linking* and keep the
+publisher information in a separate collection from the book
+collection.
+
+When linking the one-to-many relationship, the size and growth of the
+relationships determine where to store the reference. If the number
+of books per publisher is small with limited growth, storing the book
+reference inside the publisher document may be useful; otherwise, if
+the number of books per publisher is unbounded, this data model would
+lead to mutable, growing arrays, as in the following example:
+
+.. code-block:: javascript
+   :emphasize-lines: 5
+
+   publisher = {
+      name: "O'Reilly Media",
+      founded: 1980,
+      location: "CA",
+      books: [123456789, 234567890, ...]
+   }
+
+   book1 = {
+      _id: 123456789,
+      title: "MongoDB: The Definitive Guide",
+      author: [ "Kristina Chodorow", "Mike Dirolf" ],
+      published_date: ISODate("2010-09-24"),
+      pages: 216,
+      language: "English"
+   }
+
+   book2 = {
+      _id: 234567890,
+      title: "50 Tips and Tricks for MongoDB Developers",
+      author: "Kristina Chodorow",
+      published_date: ISODate("2011-05-06"),
+      pages: 68,
+      language: "English"
+   }
+
+Because the number of books per publisher may continue to grow,
+storing the publisher reference inside the book document is preferred:
+
+.. code-block:: javascript
+   :emphasize-lines: 15, 25
+
+   publisher = {
+      _id: "oreilly",
+      name: "O'Reilly Media",
+      founded: 1980,
+      location: "CA"
+   }
+
+   book1 = {
+      _id: 123456789,
+      title: "MongoDB: The Definitive Guide",
+      author: [ "Kristina Chodorow", "Mike Dirolf" ],
+      published_date: ISODate("2010-09-24"),
+      pages: 216,
+      language: "English",
+      publisher_id: "oreilly"
+   }
+
+   book2 = {
+      _id: 234567890,
+      title: "50 Tips and Tricks for MongoDB Developers",
+      author: "Kristina Chodorow",
+      published_date: ISODate("2011-05-06"),
+      pages: 68,
+      language: "English",
+      publisher_id: "oreilly"
+   }
+
+.. _data-modeling-trees:
+
+Trees
+~~~~~
+
+You can use several patterns to store *trees* or hierarchical data in
+MongoDB.
+
+Parent Links
+````````````
+
+The *Parent Links* pattern stores each tree node in a document; in
+addition to the tree node, the document stores the id of the node's
+parent.
+
+Consider the following example that models a tree of categories using
+*Parent Links*:
+
+.. code-block:: javascript
+
+   db.categories.insert( { _id: "MongoDB", parent: "Databases" } )
+   db.categories.insert( { _id: "Postgres", parent: "Databases" } )
+   db.categories.insert( { _id: "Databases", parent: "Programming" } )
+   db.categories.insert( { _id: "Languages", parent: "Programming" } )
+   db.categories.insert( { _id: "Programming", parent: "Books" } )
+   db.categories.insert( { _id: "Books", parent: null } )
+
+- The query to retrieve the parent of a node is fast and
+  straightforward:
+
+  .. code-block:: javascript
+
+     db.categories.findOne( { _id: "MongoDB" } ).parent
+
+- You can create an index on the field ``parent`` to enable fast search
+  by the parent node:
+
+  .. code-block:: javascript
+
+     db.categories.ensureIndex( { parent: 1 } )
+
+- You can query by the ``parent`` field to find a node's immediate
+  children:
+
+  .. code-block:: javascript
+
+     db.categories.find( { parent: "Databases" } )
+
+The *Parent Links* pattern provides a simple solution to tree storage,
+but requires successive queries to the database to retrieve subtrees.
+
+Child Links
+```````````
+
+The *Child Links* pattern stores each tree node in a document; in
+addition to the tree node, the document stores the id(s) of the node's
+children in an array.
+
+Consider the following example that models a tree of categories using
+*Child Links*:
+
+.. code-block:: javascript
+
+   db.categories.insert( { _id: "MongoDB", children: [] } )
+   db.categories.insert( { _id: "Postgres", children: [] } )
+   db.categories.insert( { _id: "Databases", children: [ "MongoDB", "Postgres" ] } )
+   db.categories.insert( { _id: "Languages", children: [] } )
+   db.categories.insert( { _id: "Programming", children: [ "Databases", "Languages" ] } )
+   db.categories.insert( { _id: "Books", children: [ "Programming" ] } )
+
+- The query to retrieve the immediate children of a node is fast and
+  straightforward:
+
+  .. code-block:: javascript
+
+     db.categories.findOne( { _id: "Databases" } ).children
+
+- You can create an index on the field ``children`` to enable fast
+  search by the child nodes:
+
+  .. code-block:: javascript
+
+     db.categories.ensureIndex( { children: 1 } )
+
+- You can query for a node in the ``children`` field to find its parent
+  node as well as its siblings:
+
+  .. code-block:: javascript
+
+     db.categories.find( { children: "MongoDB" } )
+
+The *Child Links* pattern provides a suitable solution to tree storage
+as long as no operations on subtrees are necessary. This pattern may
+also be a good choice for storing graphs where a node may have
+multiple parents.
+
+Array of Ancestors
+``````````````````
+
+The *Array of Ancestors* pattern stores each tree node in a document;
+in addition to the tree node, the document stores the id(s) of the
+node's ancestors, or path, in an array.
+
+Consider the following example that models a tree of categories using
+*Array of Ancestors*:
+
+.. code-block:: javascript
+
+   db.categories.insert( { _id: "MongoDB", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" } )
+   db.categories.insert( { _id: "Postgres", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" } )
+   db.categories.insert( { _id: "Databases", ancestors: [ "Books", "Programming" ], parent: "Programming" } )
+   db.categories.insert( { _id: "Languages", ancestors: [ "Books", "Programming" ], parent: "Programming" } )
+   db.categories.insert( { _id: "Programming", ancestors: [ "Books" ], parent: "Books" } )
+   db.categories.insert( { _id: "Books", ancestors: [ ], parent: null } )
+
+- The query to retrieve the ancestors or path of a node is fast and
+  straightforward:
+
+  .. code-block:: javascript
+
+     db.categories.findOne( { _id: "MongoDB" } ).ancestors
+
+- You can create an index on the field ``ancestors`` to enable fast
+  search by the ancestor nodes:
+
+  .. code-block:: javascript
+
+     db.categories.ensureIndex( { ancestors: 1 } )
+
+- You can query by the ``ancestors`` field to find all descendants of
+  a node:
+
+  .. code-block:: javascript
+
+     db.categories.find( { ancestors: "Programming" } )
+
+The *Array of Ancestors* pattern provides a fast and efficient solution
+to find the descendants and the ancestors of a node by creating an
+index on the elements of the ``ancestors`` field. This makes *Array of
+Ancestors* a good choice for working with subtrees.
+
+The *Array of Ancestors* pattern is slightly slower than the
+*Materialized Paths* pattern but is more straightforward to use.
+
+Materialized Paths
+``````````````````
+
+The *Materialized Paths* pattern stores each tree node in a document;
+in addition to the tree node, the document stores the id(s) of the
+node's ancestors, or path, as a string. Although the *Materialized
+Paths* pattern requires additional steps of working with strings and
+regular expressions, the pattern also provides more flexibility in
+working with the path, such as finding nodes by partial paths.
+
+Consider the following example that models a tree of categories using
+*Materialized Paths*; the path string uses the comma ``,`` as a
+delimiter:
+
+.. code-block:: javascript
+
+   db.categories.insert( { _id: "Books", path: null } )
+   db.categories.insert( { _id: "Programming", path: "Books," } )
+   db.categories.insert( { _id: "Databases", path: "Books,Programming," } )
+   db.categories.insert( { _id: "Languages", path: "Books,Programming," } )
+   db.categories.insert( { _id: "MongoDB", path: "Books,Programming,Databases," } )
+   db.categories.insert( { _id: "Postgres", path: "Books,Programming,Databases," } )
+
+- You can query to retrieve the whole tree, sorting by the ``path``:
+
+  .. code-block:: javascript
+
+     db.categories.find().sort( { path: 1 } )
+
+- You can create an index on the field ``path`` to enable fast search
+  by the path:
+
+  .. code-block:: javascript
+
+     db.categories.ensureIndex( { path: 1 } )
+
+- You can use regular expressions on the ``path`` field to find the
+  descendants of ``Programming``:
+
+  .. code-block:: javascript
+
+     db.categories.find( { path: /,Programming,/ } )
+
+Nested Sets
+```````````
+
+The *Nested Sets* pattern identifies each node in the tree as stops in
+a round-trip traversal of the tree. Each node is visited twice; first
+during the initial trip, and second during the return trip. The *Nested
+Sets* pattern stores each tree node in a document; in addition to the
+tree node, the document stores the id of the node's parent, the node's
+initial stop in the ``left`` field, and its return stop in the
+``right`` field.
+
+Consider the following example that models a tree of categories using
+*Nested Sets*:
+
+.. code-block:: javascript
+
+   db.categories.insert( { _id: "Books", parent: 0, left: 1, right: 12 } )
+   db.categories.insert( { _id: "Programming", parent: "Books", left: 2, right: 11 } )
+   db.categories.insert( { _id: "Languages", parent: "Programming", left: 3, right: 4 } )
+   db.categories.insert( { _id: "Databases", parent: "Programming", left: 5, right: 10 } )
+   db.categories.insert( { _id: "MongoDB", parent: "Databases", left: 6, right: 7 } )
+   db.categories.insert( { _id: "Postgres", parent: "Databases", left: 8, right: 9 } )
+
+You can query to retrieve the descendants of a node:
+
+.. code-block:: javascript
+
+   var databaseCategory = db.categories.findOne( { _id: "Databases" } );
+   db.categories.find( { left: { $gt: databaseCategory.left }, right: { $lt: databaseCategory.right } } );
+
+The *Nested Sets* pattern provides a fast and efficient solution for
+finding subtrees but is inefficient for modifying the tree structure.
+As such, this pattern is best for static trees that do not change.
+
+.. seealso::
+
+   - `Ruby Example of Materialized Paths `_
+
+   - `Sean Cribbs Blog Post `_ which was the
+     source for much of the :ref:`data-modeling-trees` content.
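The *Materialized Paths* descendant query above can be sanity-checked in plain JavaScript against an in-memory array; the regular expression is the same one passed to ``db.categories.find()``:

```javascript
// The Materialized Paths category documents, as an in-memory array.
var categories = [
  { _id: "Books",       path: null },
  { _id: "Programming", path: "Books," },
  { _id: "Databases",   path: "Books,Programming," },
  { _id: "Languages",   path: "Books,Programming," },
  { _id: "MongoDB",     path: "Books,Programming,Databases," },
  { _id: "Postgres",    path: "Books,Programming,Databases," }
];

// Equivalent of db.categories.find( { path: /,Programming,/ } ):
// every descendant's path string contains ",Programming,".
var descendants = categories.filter(function (c) {
  return c.path !== null && /,Programming,/.test(c.path);
});

var ids = descendants.map(function (c) { return c._id; });
console.log(ids); // [ 'Databases', 'Languages', 'MongoDB', 'Postgres' ]
```

Note that the leading comma in the pattern keeps ``Programming`` itself, whose path is ``"Books,"``, out of the result.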
+
+Queues
+~~~~~~
+
+Consider the following ``book`` document that stores the number of
+available copies:
+
+.. code-block:: javascript
+   :emphasize-lines: 9
+
+   book = {
+      _id: 123456789,
+      title: "MongoDB: The Definitive Guide",
+      author: [ "Kristina Chodorow", "Mike Dirolf" ],
+      published_date: ISODate("2010-09-24"),
+      pages: 216,
+      language: "English",
+      publisher_id: "oreilly",
+      available: 3
+   }
+
+Using the :method:`db.collection.findAndModify()` method, you can
+atomically find and decrement the ``available`` field as the book is
+checked out:
+
+.. code-block:: javascript
+
+   db.books.findAndModify( {
+      query: { _id: 123456789 },
+      update: { $inc: { available: -1 } }
+   } )
+
+Additional Resources
+--------------------
+
+.. seealso::
+
+   - `Schema Design by Example `_
+
+   - `Walkthrough MongoDB Data Modeling `_
+
+   - `Document Design for MongoDB `_
+
+   - `Dynamic Schema Blog Post `_
+
+   - :wiki:`MongoDB Data Modeling and Rails`
diff --git a/source/faq/developers.txt b/source/faq/developers.txt
index b784d9736e8..4b339e9ef29 100644
--- a/source/faq/developers.txt
+++ b/source/faq/developers.txt
@@ -615,3 +615,48 @@ explicitly force the query to use that index.
 .. [#id-is-immutable] MongoDB does not permit changes to the value of
    the ``_id`` field; it is not possible for a cursor that transverses
    this index to pass the same document more than once.
+
+.. _faq-developers-isolate-cursors:
+
+When should I embed documents?
+------------------------------
+
+During :doc:`/core/data-modeling`, embedding is frequently the choice
+for:
+
+- "contains" relationships between entities.
+
+- one-to-many relationships when the "many" objects *always* appear
+  with or are viewed in the context of their parents.
+
+You should also consider embedding for performance reasons if you have
+a collection with a huge number of small documents. If small, separate
+documents represent the natural model for the data, then you should
+maintain that model. If, however, you can group these small documents
+by some logical relationship [#embed-caveat-grouping]_ and you
+frequently retrieve the documents by this
+grouping [#embed-caveat-subset]_, you might consider "rolling-up" the
+small documents into larger documents that contain an array of
+subdocuments.
+
+Embedding these small documents provides the following benefits:
+
+- Returning the group of the small documents involves sequential reads
+  and fewer random disk accesses.
+
+- If the data is too large to fit entirely in RAM, embedding provides
+  better RAM cache utilization. [#embed-caveat-ram-cache]_
+
+- If the small documents contain common index keys, the common keys
+  are stored in fewer copies, with fewer associated key entries in
+  the corresponding index.
+
+.. [#embed-caveat-grouping] If grouping the small documents would be
+   awkward, don't do it.
+
+.. [#embed-caveat-subset] If you often need only a subset of the items
+   you would group, this approach could be inefficient compared to
+   alternatives.
+
+.. [#embed-caveat-ram-cache] If your small documents are approximately
+   the page cache unit size, there is no benefit for RAM cache
+   efficiency, although embedding will provide some benefit regarding
+   random disk I/O.
diff --git a/source/reference/configuration-options.txt b/source/reference/configuration-options.txt
index 5cc5dde58ac..6629cb53d56 100644
--- a/source/reference/configuration-options.txt
+++ b/source/reference/configuration-options.txt
@@ -440,14 +440,15 @@ Settings
 
    *Default:* 16
 
-   Specify this value in megabytes.
+   Specify this value in megabytes. The maximum size is 2047 megabytes.
 
    Use this setting to control the default size for all newly created
    namespace files (i.e ``.ns``). This option has no impact on the
    size of existing namespace files.
 
-   The default value is 16 megabytes, this provides for effectively
-   12,000 possible namespace. The maximum size is 2 gigabytes.
+ The default value is 16 megabytes; this provides for approximately + 24,000 namespaces. Each collection, as well as each index, counts as + a namespace. .. setting:: profile diff --git a/source/reference/limits.txt b/source/reference/limits.txt index f825fc9fe75..6a7cac98281 100644 --- a/source/reference/limits.txt +++ b/source/reference/limits.txt @@ -46,12 +46,13 @@ Namespaces The limitation on the number of namespaces is the size of the namespace file divided by 628. - A 16 megabyte namespace file can support approximately 24,000 namespaces. + A 16 megabyte namespace file can support approximately 24,000 + namespaces. Each index also counts as a namespace. .. _limit-size-of-namespace-file: .. limit:: Size of Namespace File - Namespace files can be no larger than 2 gigabytes. + Namespace files can be no larger than 2047 megabytes. By default namespace files are 16 megabytes. You can configure the size using the :setting:`nssize`. diff --git a/source/reference/mongod.txt b/source/reference/mongod.txt index e267694bc74..5d3c87892e1 100644 --- a/source/reference/mongod.txt +++ b/source/reference/mongod.txt @@ -310,11 +310,13 @@ Options .. option:: --nssize - Specifies the default value for namespace files (i.e ``.ns``). This - option has no impact on the size of existing namespace files. + Specifies the default size for namespace files (i.e ``.ns``). This + option has no impact on the size of existing namespace files. The + maximum size is 2047 megabytes. - The default value is 16 megabytes, this provides for effectively - 12,000 possible namespaces. The maximum size is 2 gigabytes. + The default value is 16 megabytes; this provides for approximately + 24,000 namespaces. Each collection, as well as each index, counts as + a namespace. .. 
option:: --profile

From 1dc25472c1ce329dd986ab3a0b752ca828e31feb Mon Sep 17 00:00:00 2001
From: kay
Date: Sun, 16 Dec 2012 23:02:32 -0500
Subject: [PATCH 2/3] DOCS-661 schema page incorporate comments by sam and ed

---
 source/core/data-modeling.txt              | 732 +++++++++++----------
 source/faq/developers.txt                  |  70 +-
 source/reference/configuration-options.txt |   4 +-
 3 files changed, 417 insertions(+), 389 deletions(-)

diff --git a/source/core/data-modeling.txt b/source/core/data-modeling.txt
index c888bd57dd4..bb2ba9f3988 100644
--- a/source/core/data-modeling.txt
+++ b/source/core/data-modeling.txt
@@ -7,10 +7,26 @@ Data Modeling
 Overview
 --------
 
-In MongoDB, the schema design takes into consideration both the data to
-be modeled *and* the use cases. Schema design determines the structure
-of the documents, the number and types of collections, indexing and
-sharding.
+Collections in MongoDB have a flexible schema; they do not define or
+enforce the fields of their documents. Each document can have only the
+fields that are relevant to that entity, although in practice you
+would generally choose to store similar documents in each collection.
+With this flexible schema, you can model your data to reflect the
+actual entity more closely rather than enforce a rigid data structure.
+
+In MongoDB, data modeling takes into consideration not only how pieces
+of data relate to each other, but also how the data is used and how it
+will grow and be maintained. These considerations involve decisions
+about whether to embed data within a single document or reference data
+among different documents, which fields to index, and whether to use
+special features.
+
+Choosing the correct data model can provide both performance and
+maintenance gains for your applications.
+
+This document provides some general guidelines for data modeling and
+possible options. These guidelines and options may not be appropriate
+for your situation.
+
.. _data-modeling-decisions:
 
@@ -18,27 +34,28 @@ Data Modeling Decisions
 -----------------------
 
 Data modeling decisions involve determining how to structure the
-documents to effectively model the data. The primary decision is
-whether to :ref:`embed ` or to :ref:`link
-` related data.
+documents to model the data effectively. The primary decision is
+whether to :ref:`embed ` or to :ref:`use
+references `.
 
 .. _data-modeling-embedding:
 
 Embedding
 ~~~~~~~~~
 
-Embedding, or de-normalization of data, is a bit like "prejoined" data.
+De-normalization of data involves embedding documents within other
+documents.
+
 Operations within a document are easy for the server to handle.
-Embedding allows for large sequential reads.
 
-Embedding is frequently the choice for:
+In general, choose the embedded data model when:
 
-- "contains" relationships between entities. See
+- you have "contains" relationships between entities. See
   :ref:`data-modeling-example-one-to-one`.
 
-- one-to-many relationships when the "many" objects always appear with
-  or are viewed in the context of their parents. See
-  :ref:`data-modeling-example-many-addresses`.
+- you have one-to-many relationships where the "many" objects always
+  appear with or are viewed in the context of their parents. See
+  :ref:`data-modeling-example-one-to-many`.
 
 Embedding provides the following benefits:
 
@@ -46,43 +63,49 @@ Embedding provides the following benefits:
 
 - Single roundtrip to database to retrieve the complete object
 
-However, embedding presents some considerations:
-
-- Writes can be slow if adding to objects frequently.
-
-- You cannot embed documents that will cause the containing document to
-  exceed the :limit:`maximum BSON document size `.
-  For documents that exceed the maximum BSON document size, see
-  :doc:`/applications/gridfs`.
-
-- You cannot have nested embedded levels more than the specified
-  :limit:`limit on embedded levels `.
+However, with embedding, write operations can be slow if you are adding +objects frequently. Additionally, you cannot embed documents that will +cause the containing document to exceed the :limit:`maximum BSON +document size `. For documents that exceed the +maximum BSON document size, see :doc:`/applications/gridfs`. For examples in accessing embedded documents, see :ref:`read-operations-subdocuments`. -.. _data-modeling-linking: +.. seealso:: -Linking -~~~~~~~ + - :term:`dot notation` for information on "reaching into" embedded + sub-documents. + + - :ref:`read-operations-arrays` for more examples on accessing arrays + + - :ref:`read-operations-subdocuments` for more examples on accessing + subdocuments + +.. _data-modeling-referencing: -Linking, or normalization of data, joins separate documents using -:doc:`references ` or links. Links -are processed client-side by the application; the application does this -by issuing a follow-up query. Linking involves performing many seeks -and random reads. +Referencing +~~~~~~~~~~~ + +Normalization of data requires storing :doc:`references +` from one document to another. + +In general, choose the referenced data model when: -Linking is frequently the choice for: +- embedding would result in duplication of data. -- embedding would result in duplication of data. +- you have many-to-many relationships. -- many-to-many relationships. +- you are modeling large hierarchical data. See + :ref:`data-modeling-trees`. -See :ref:`data-modeling-publisher-and-books` for example of linking. +Referencing provides more flexibility than embedding; however, to +resolve the references, client-side applications must issue follow-up +queries. Additionally, the referencing data model involves performing +many seeks and random reads. -Linking provides more flexibility than embedding; however, linking -requires client-side processing to resolve the link by issuing -follow-up queries to retrieve the entire object. 
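The follow-up-query cost of references can likewise be sketched in plain JavaScript (again an illustrative stand-in, not the mongo shell; the two arrays and the `findPatronWithAddresses` helper are hypothetical models of two collections):

```javascript
// Normalized data: patrons and addresses live in separate
// "collections"; each address stores a reference (patron_id).
const patrons = [{ _id: "joe", name: "Joe Bookreader" }];
const addresses = [
  { patron_id: "joe", street: "123 Fake Street", city: "Faketon", state: "MA", zip: 12345 }
];

// Resolving the reference takes two lookups: fetch the patron, then
// fetch the addresses that point back at it -- an extra roundtrip
// compared to the embedded model.
function findPatronWithAddresses(id) {
  const patron = patrons.find(function (doc) { return doc._id === id; });
  const addrs = addresses.filter(function (doc) { return doc.patron_id === id; });
  return { patron: patron, addresses: addrs };
}
```

This is the client-side processing the text refers to: the application, not the server, stitches the referenced documents together.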
+See :ref:`data-modeling-publisher-and-books` for an example of +referencing. .. _data-modeling-atomicity: @@ -93,29 +116,34 @@ Atomicity influences the decision to embed or link. The modification of a single document is atomic, even if the write operation modifies multiple sub-documents *within* the single document. -Embed fields relevant to the atomic operation in the same document. +Embed fields that need to be modified together atomically in the same +document. See :ref:`data-modeling-atomic-operation` for an example of +atomic updates within a single document. Operational Considerations -------------------------- -Operational considerations involve the data lifecycle managements, -determining the number of collections, +Operational considerations involve decisions related to data lifecycle +management, number of collections, indexing, sharding, and managing +document growth. These decisions can improve performance and facilitate +maintenance efforts. Data Lifecycle Management ~~~~~~~~~~~~~~~~~~~~~~~~~ -Data lifecycle management determines the type of collection or -collection feature to implement in your schema design. +Data lifecycle management concerns contribute to the decision making +process around data modeling. -Some data may only need to persist in a database for a limited period -of time. In these cases, you should consider using the :doc:`Time to -Live or TTL feature` of collections. +The :doc:`Time to Live or TTL feature ` of +collections expires documents after a period of time. Consider using +the TTL feature if your application requires some data to persist in +the database for a limited period of time. -Additionally, if you will be only concerned with the most recent -documents, you should consider :doc:`/core/capped-collections`. Capped -collections provide *first-in-first-out* management of inserted -documents and support high-throughput operations that insert, read, -and delete documents based on insertion order. 
+Additionally, if your application is concerned only with the most +recent documents, you might consider :doc:`/core/capped-collections`. +Capped collections provide *first-in-first-out* (FIFO) management of +inserted documents and support operations that insert, read, and delete +documents based on insertion order. Large Number of Collections ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -132,7 +160,7 @@ documents of the following form: { log: "dev", ts: ..., info: ... } { log: "debug", ts: ..., info: ...} -If the number of different logs is not too high, you may decide to have +If the number of different logs is not high, you may decide to have separate log collections, such as ``logs.dev`` and ``logs.debug``. The ``logs.dev`` collection would contain only the log documents related to the dev environment. @@ -142,142 +170,138 @@ performance penalty and results in very good performance. Independent collections are very important for high-throughput batch processing. When creating large numbers of collections, consider the following -items: - -- Each collection has a certain minimum overhead of a few kilobytes per - collection. - -- Each index requires at least 8KB of data space as the B-tree page - size is 8KB. - -- :limit:`Limits on the number of namespaces ` - exist: - - - Each index, as well as each collection, counts as a namespace. - - - Each namespace is 628 bytes. - - - Namespaces are stored per database in a ``.ns`` file: +behaviors: - - The ``.ns`` file defaults to 16 MB. +- Each collection has a certain minimum overhead of a few kilobytes. - - To change the size of the ``.ns`` file, use - :option:`--nssize \ ` on server - startup. +- Each index requires at least 8KB of data space. - .. note:: +Namespaces are stored per database in a ``.ns`` file. All +indexes and collections have their own entry in the namespace file, and +each namespace entry is 628 bytes. - - The maximum size of the ``.ns`` file is 2047 MB. 
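Given the 628-byte entry size, a back-of-the-envelope bound on namespace capacity follows from the `.ns` file size (an illustrative sketch; `approxNamespaceCapacity` is a hypothetical helper, and real capacity is somewhat lower than this raw quotient because the namespace file is a hash table with overhead):

```javascript
// Each collection and each index consumes one namespace entry.
const NAMESPACE_ENTRY_BYTES = 628;

// Raw upper bound on namespaces for a given .ns file size in megabytes.
function approxNamespaceCapacity(nsFileMB) {
  return Math.floor((nsFileMB * 1024 * 1024) / NAMESPACE_ENTRY_BYTES);
}

// The default 16 MB file yields a raw bound of 26715 entries; the
// figure documented elsewhere in this patch (~24,000 namespaces) is
// lower because of hash-table overhead.
const capacity = approxNamespaceCapacity(16);
```

Comparing this bound against `db.system.namespaces.count()` indicates roughly how much headroom remains before the `.ns` file must be resized.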
-   - :option:`--nssize ` sets the size used for
-     *new* ``.ns`` files. For existing databases, after
-     starting up the server with :option:`--nssize `, run the :dbcommand:`db.repairDatabase()` command
-     from the :program:`mongo` shell.
-
-You can check the namespaces by querying the ``system.namespaces``
-collection, as in the following example:
+Because of :limit:`limits on namespaces `, you
+may wish to know the current number of namespaces in order to determine
+how many additional namespaces the database can support, as in the
+following example:
 
 .. code-block:: javascript
 
    db.system.namespaces.count()
 
+The ``.ns`` file defaults to 16 MB. To change
+the size of the ``.ns`` file, pass a new size to
+the :option:`--nssize ` option on server
+startup.
+
+The :option:`--nssize ` option sets the size for *new*
+``.ns`` files. For existing databases, after starting up the
+server with :option:`--nssize `, run the
+:method:`db.repairDatabase()` method from the :program:`mongo`
+shell.
+
 Indexes
 ~~~~~~~
 
 As a general rule, where you want an index in a relational database,
-you want an index in Mongo. Indexes in MongoDB are needed for efficient
-query processing, as such, you may want to think about the queries
-first and then build indexes based upon them. Generally, you would
-index the fields that you query by and the fields that you sort by.
+you want an index in MongoDB. Indexes in MongoDB are needed for
+efficient query processing, and as such, you may want to think about
+the queries first and then build indexes based upon them. Generally,
+you would index the fields that you query by and the fields that you
+sort by.
 
 The ``_id`` field is automatically indexed.
 
-As you create indexes, consider the following aspects of indexes:
+As you create indexes, consider the following behaviors of indexes:
 
-- Each index requires at least 8KB of data space as the B-tree page
-  size is 8KB.
+- Each index requires at least 8KB of data space.
-- The ``_id`` field is automatically indexed.
+- Adding an index has some negative performance impact for write
+  operations. For collections with high write-to-read ratio, indexes
+  are expensive as each insert must add keys to each index.
 
-- Adding an index slows write operations but not read operations:
-
-  - This allows for for collections with high read-to-write ratio to
-    have lots of indexes.
-
-  - For collections with high write-to-read ratio, indexes are
-    expensive as each insert must add keys to each index.
+- Read operations supported by the index perform better, and read
+  operations not supported by the index have no performance impact from
+  the index. This allows for collections with a high read-to-write
+  ratio to have many indexes.
 
 See :doc:`/applications/indexes` for more information on determining
 indexes. Additionally, MongoDB :wiki:`Database Profiler` provides
 information for determining if an index is needed.
 
+.. TODO link to new database profiler manual page once migrated
+
 Sharding
 ~~~~~~~~
 
-MongoDB's sharding system allows users to :term:`partition` a
+:term:`Sharding ` allows users to :term:`partition` a
 :term:`collection` within a database to distribute the collection's
 documents across a number of :program:`mongod` instances or
-:term:`shards `. A BSON document (which may have significant
-amounts of embedding) resides on one and only one shard.
-
-Sharding provides the following benefits:
-
-- increases write capacity,
-
-- provides the ability to support larger working sets, and
-
-- raises the limits of total data size beyond the physical resources of
-  a single node.
+:term:`shards `.
 
 When a collection is sharded, the shard key determines how the
-collection is partitioned among shards. Typically (but not always)
-queries on a sharded collection involve the shard key as part of the
-query expression.
+collection is partitioned among shards. Selecting the proper
+:ref:`shard key ` can have a significant impact on
+performance.
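How a shard key partitions documents can be sketched with plain JavaScript (a hypothetical two-shard, range-based split; MongoDB's actual chunk splitting and balancing are more involved):

```javascript
// Two hypothetical shards, partitioned on the "username" field.
const shards = { shardA: [], shardB: [] };

// Hypothetical split point: keys below "m" go to shardA, the rest to
// shardB. Each document lands on exactly one shard.
function shardFor(keyValue) {
  return keyValue < "m" ? "shardA" : "shardB";
}

function insert(doc) {
  shards[shardFor(doc.username)].push(doc);
}

insert({ username: "alice", score: 10 });
insert({ username: "zoe", score: 7 });

// A query that includes the shard key can be routed to a single shard;
// a query without it would have to be broadcast to every shard.
function findByUsername(name) {
  return shards[shardFor(name)].filter(function (d) { return d.username === name; });
}
```

This is why shard key selection matters for performance: it determines both how evenly writes spread across shards and whether common queries can be targeted rather than broadcast.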
-See :doc:`/core/sharding` for more information on sharding.
+See :doc:`/core/sharding` for more information on sharding and
+the selection of the :ref:`shard key `.
 
 Document Growth
 ~~~~~~~~~~~~~~~
 
-Document growth forces MongoDB to move the document on disk, which can
-be time and resource consuming relative to other operations.
+Certain updates to documents can increase the document size, such as
+pushing elements to an array and adding new fields. If the document
+size exceeds the allocated space for that document, MongoDB relocates
+the document on disk. This internal relocation can be both time and
+resource consuming.
+
+Although MongoDB automatically provides padding to minimize the
+occurrence of relocations, you may still need to manually handle
+document growth. Refer to :doc:`/use-cases/pre-aggregated-reports` for
+an example of the *Pre-allocation* approach to handling document growth.
 
-See :doc:`/use-cases/pre-aggregated-reports/` for some approaches to
-handling document growth.
+.. TODO add link to padding factor page once migrated
 
 Patterns and Examples
 ---------------------
 
 .. _data-modeling-example-one-to-one:
 
-One-to-one: Patron and Address
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+One-to-One: Embedding
+~~~~~~~~~~~~~~~~~~~~~
 
-Consider the following linking example that has two data objects, ``patron``
-and the corresponding ``address`` information, that have a one-to-one
-relationship.
+Consider the following example that maps patron and address
+relationships. The example illustrates the advantage of embedding over
+referencing if you need to view one data entity in the context of the
+other. In this one-to-one relationship between ``patron`` and
+``address`` data, the ``address`` belongs to the ``patron``.
+
+In the normalized data model, the ``address`` document contains a
+reference to the ``patron`` document.
 
 .. code-block:: javascript
 
-   patron = {
-     _id: "joe",
-     name: "Joe Bookreader"
-   }
+   {
+     _id: "joe",
+     name: "Joe Bookreader"
+   }
 
-   address = {
-     patron_id: "joe",
-     street: "123 Fake Street",
-     city: "Faketon",
-     state: "MA"
-     zip: 12345
-   }
+   {
+     patron_id: "joe",
+     street: "123 Fake Street",
+     city: "Faketon",
+     state: "MA",
+     zip: 12345
+   }
 
-Because ``address`` is owned by the ``patron``, this one-to-one
-relationship may be better modeled by embedding the ``address``
-document in the ``patron`` document, as in the following document:
+If the ``address`` data is frequently retrieved with the ``name``
+information, then with referencing, your application needs to issue
+multiple queries to resolve the reference. The better data model would
+be to embed the ``address`` data in the ``patron`` data, as in the
+following document:
 
 .. code-block:: javascript
 
-   patron = {
+   {
      _id: "joe",
      name: "Joe Bookreader",
      address: {
@@ -288,221 +312,260 @@ document in the ``patron`` document, as in the following document:
              }
    }
 
-.. _data-modeling-example-one-to-many:
+With the embedded data model, your application can retrieve the
+complete patron information with one query.
 
-One-to-Many
-~~~~~~~~~~~
+.. _data-modeling-example-one-to-many:
 
-.. _data-modeling-example-many-addresses:
+One-to-Many: Embedding
+~~~~~~~~~~~~~~~~~~~~~~
 
-Patron with Many Addresses
-``````````````````````````
+Consider the following example that maps patron and multiple address
+relationships. The example illustrates the advantage of embedding over
+referencing if you need to view many data entities in the context of
+another. In this one-to-many relationship between ``patron`` and
+``address`` data, the ``patron`` has multiple ``address`` entities.
 
-Consider the following embedding example that has three data objects,
-``patron`` and two separate ``address`` information, that have a
-one-to-many relationship:
+In the normalized data model, the ``address`` documents contain a
+reference to the ``patron`` document.
 
 ..
code-block:: javascript
 
-   patron = {
-     _id: "joe",
-     name: "Joe Bookreader"
-   }
-
-   address1 = {
-     patron_id = "joe",
-     street: "123 Fake Street",
-     city: "Faketon",
-     state: "MA",
-     zip: 12345
-   }
-
-   address2 = {
-     patron_id: "joe",
-     street: "1 Some Other Street",
-     city: "Boston",
-     state: "MA",
-     zip: 12345
-   }
-
-This one-to-many relationship may be modeled by embedding the
-``address`` documents in the ``patron`` document, as in the following
-document:
+   {
+     _id: "joe",
+     name: "Joe Bookreader"
+   }
+
+   {
+     patron_id: "joe",
+     street: "123 Fake Street",
+     city: "Faketon",
+     state: "MA",
+     zip: 12345
+   }
+
+   {
+     patron_id: "joe",
+     street: "1 Some Other Street",
+     city: "Boston",
+     state: "MA",
+     zip: 12345
+   }
+
+If your application frequently retrieves the ``address`` data with the
+``name`` information, then your application needs to issue multiple
+queries to resolve the references. The better data model would be to
+embed the ``address`` data entities in the ``patron`` data, as in the
+following document:
 
 .. code-block:: javascript
 
-   patron = {
-     _id: "joe",
-     name: "Joe Bookreader",
-     addresses: [
-       {
-         street: "123 Fake Street",
-         city: "Faketon",
-         state: "MA",
-         zip: 12345
-       },
-       {
-         street: "1 Some Other Street",
-         city: "Boston",
-         state: "MA",
-         zip: 12345
-       }
-     ]
-   }
+   {
+     _id: "joe",
+     name: "Joe Bookreader",
+     addresses: [
+       {
+         street: "123 Fake Street",
+         city: "Faketon",
+         state: "MA",
+         zip: 12345
+       },
+       {
+         street: "1 Some Other Street",
+         city: "Boston",
+         state: "MA",
+         zip: 12345
+       }
+     ]
+   }
+
+With the embedded data model, your application can retrieve the
+complete patron information with one query.
 
 .. _data-modeling-publisher-and-books:
 
-Modeling Publishers and Books
-`````````````````````````````
+One-to-Many: Referencing
+````````````````````````
+
+Consider the following example that maps publisher and book
+relationships.
The example illustrates the advantage of referencing
+over embedding to prevent the repetition of the publisher information.
 
-Consider the following example that models the one-to-many
-relationship between publishers and books.
+Embedding the publisher document inside the book document would lead to
+**repetition** of the publisher data, as the following documents show:
 
 .. code-block:: javascript
+   :emphasize-lines: 7-11,20-24
 
-   publisher = {
+   {
+     title: "MongoDB: The Definitive Guide",
+     author: [ "Kristina Chodorow", "Mike Dirolf" ],
+     published_date: ISODate("2010-09-24"),
+     pages: 216,
+     language: "English",
+     publisher: {
       name: "O'Reilly Media",
       founded: 1980,
       location: "CA"
     }
+   }
 
-   book1 = {
-     title: "MongoDB: The Definitive Guide",
-     author: [ "Kristina Chodorow", "Mike Dirolf" ],
-     published_date: ISODate("2010-09-24"),
-     pages: 216,
-     language: "English"
-   }
+   {
+     title: "50 Tips and Tricks for MongoDB Developer",
+     author: "Kristina Chodorow",
+     published_date: ISODate("2011-05-06"),
+     pages: 68,
+     language: "English",
+     publisher: {
+       name: "O'Reilly Media",
+       founded: 1980,
+       location: "CA"
+     }
+   }
 
-   book2 = {
-     title: "50 Tips and Tricks for MongoDB Developer",
-     author: "Kristina Chodorow",
-     published_date: ISODate("2011-05-06"),
-     pages: 68,
-     language: "English"
-   }
+To avoid repetition of the publisher data, use *references* and keep
+the publisher information in a separate collection from the book
+collection.
 
-Embedding the publisher document inside the book document would lead to
-**repetition** of the publisher data in the books published by the
-particular publisher, as the following documents show:
+When using references, the growth of the relationships determines where
+to store the reference. If the number of books per publisher is small
+with limited growth, storing the book reference inside the publisher
+document may sometimes be useful. Otherwise, if the number of books per
+publisher is unbounded, this data model would lead to mutable, growing
+arrays, as in the following example:
 
 .. code-block:: javascript
-   :emphasize-lines: 7-11,20-24
+   :emphasize-lines: 5
 
-   book1 = {
-     title: "MongoDB: The Definitive Guide",
-     author: [ "Kristina Chodorow", "Mike Dirolf" ],
-     published_date: ISODate("2010-09-24"),
-     pages: 216,
-     language: "English",
-     publisher: {
-       name: "O'Reilly Media",
-       founded: 1980,
-       location: "CA"
-     }
-   }
+   {
+     name: "O'Reilly Media",
+     founded: 1980,
+     location: "CA",
+     books: [123456789, 234567890, ...]
+   }
 
-   book2 = {
-     title: "50 Tips and Tricks for MongoDB Developer",
-     author: "Kristina Chodorow",
-     published_date: ISODate("2011-05-06"),
-     pages: 68,
-     language: "English",
-     publisher: {
-       name: "O'Reilly Media",
-       founded: 1980,
-       location: "CA"
-     }
-   }
+   {
+     _id: 123456789,
+     title: "MongoDB: The Definitive Guide",
+     author: [ "Kristina Chodorow", "Mike Dirolf" ],
+     published_date: ISODate("2010-09-24"),
+     pages: 216,
+     language: "English"
+   }
 
-To avoid repetition of the publisher data, use *linking* and keep the
-publisher information in a separate collection from the book
-collection.
+   {
+     _id: 234567890,
+     title: "50 Tips and Tricks for MongoDB Developer",
+     author: "Kristina Chodorow",
+     published_date: ISODate("2011-05-06"),
+     pages: 68,
+     language: "English"
+   }
 
-When linking the one-to-many relationship, the size and growth of the
-relationships determine where to store the reference. If the number
-of books per publisher is small with limited growth, storing the book
-reference inside the publisher document may be useful; otherwise, if
-the number of books per publisher is unbounded, this data model would
-lead to mutable, growing arrays, as in the following example:
+To avoid mutable, growing arrays, store the publisher reference inside
+the book document:
 
 ..
code-block:: javascript - :emphasize-lines: 5 - - publisher = { - name: "O'Reilly Media", - founded: 1980, - location: "CA", - books: [12346789, 234567890, ...] - } + :emphasize-lines: 15, 25 - book1 = { - _id: 123456789, - title: "MongoDB: The Definitive Guide", - author: [ "Kristina Chodorow", "Mike Dirolf" ], - published_date: ISODate("2010-09-24"), - pages: 216, - language: "English" - } + { + _id: "oreilly", + name: "O'Reilly Media", + founded: 1980, + location: "CA" + } - book2 = { - _id: 234567890, - title: "50 Tips and Tricks for MongoDB Developer", - author: "Kristina Chodorow", - published_date: ISODate("2011-05-06"), - pages: 68, - language: "English" - } + { + _id: 123456789, + title: "MongoDB: The Definitive Guide", + author: [ "Kristina Chodorow", "Mike Dirolf" ], + published_date: ISODate("2010-09-24"), + pages: 216, + language: "English", + publisher_id: "oreilly" + } -Because the number of books per publisher may continue to grow, -embedding the publisher reference inside the book document is preferred: + { + _id: 234567890, + title: "50 Tips and Tricks for MongoDB Developer", + author: "Kristina Chodorow", + published_date: ISODate("2011-05-06"), + pages: 68, + language: "English", + publisher_id: "oreilly" + } + +.. Reworked the Queue slide from the presentation to Atomic Operation +.. TODO later, include a separate queue example for maybe checkout requests, + and possibly bucket example that is separate from the pre-allocation + example link above in the Document Growth section + +.. _data-modeling-atomic-operation: + +Atomic Operation +~~~~~~~~~~~~~~~~ + +Consider the following example that keeps a library book and its +checkout information. The example illustrates how embedding fields +related to an atomic update within the same document ensures that the +fields are in sync. + +Consider the following ``book`` document that stores the number of +available copies for checkout and the current checkout information: .. 
code-block:: javascript - :emphasize-lines: 15, 25 - - publisher = { - _id: "oreilly", - name: "O'Reilly Media", - founded: 1980, - location: "CA", - } + :emphasize-lines: 9 - book1 = { + book = { _id: 123456789, title: "MongoDB: The Definitive Guide", author: [ "Kristina Chodorow", "Mike Dirolf" ], published_date: ISODate("2010-09-24"), pages: 216, language: "English", - publisher_id: "oreilly" + publisher_id: "oreilly", + available: 3, + checkout: [ { by: "joe", date: ISODate("2012-10-15") } ] } - book2 = { - _id: 234567890, - title: "50 Tips and Tricks for MongoDB Developer", - author: "Kristina Chodorow", - published_date: ISODate("2011-05-06"), - pages: 68, - language: "English", - publisher_id: "oreilly" - } +You can use the :method:`db.collection.findAndModify()` method to +atomically determine if a book is available for checkout and update +with the new checkout information. Embedding the ``available`` field +and the ``checkout`` field within the same document ensures that the +updates to these fields are in sync: + +.. code-block:: javascript + + db.books.findAndModify ( { + query: { + _id: 123456789, + available: { $gt: 0 } + }, + update: { + $inc: { available: -1 }, + $push: { checkout: { by: "abc", date: new Date() } } + } + } ) .. _data-modeling-trees: Trees ~~~~~ -MongoDB provides various patterns to store *Trees* or hierarchical data. +To model hierarchical or nested data relationships, you can use +references to implement tree-like structures. The following *Tree* +pattern examples model book categories that have hierarchical +relationships. -Parent Links -```````````` +Parent References +````````````````` -The *Parent Links* pattern stores each tree node in a document; in +The *Parent References* pattern stores each tree node in a document; in addition to the tree node, the document stores the id of the node's parent. Consider the following example that models a tree of categories using -*Parent Links*: +*Parent References*: .. 
code-block:: javascript @@ -537,15 +600,15 @@ Consider the following example that models a tree of categories using The *Parent Links* pattern provides a simple solution to tree storage, but requires successive queries to the database to retrieve subtrees. -Child Links -``````````` +Child References +````````````````` -The *Child Links* pattern stores each tree node in a document; in +The *Child References* pattern stores each tree node in a document; in addition to the tree node, document stores in an array the id(s) of the node's children. Consider the following example that models a tree of categories using -*Child Links*: +*Child References*: .. code-block:: javascript @@ -577,7 +640,7 @@ Consider the following example that models a tree of categories using db.categories.find( { children: "MongoDB" } ) -The *Child Links* pattern provides a suitable solution to tree storage +The *Child References* pattern provides a suitable solution to tree storage as long as no operations on subtrees are necessary. This pattern may also provide a suitable solution for storing graphs where a node may have multiple parents. @@ -696,66 +759,33 @@ Consider the following example that models a tree of categories using You can query to retrieve the descendants of a node: - .. code-block:: javascript +.. code-block:: javascript - var databaseCategory = db.v.findOne( { _id: "Databases" } ); - db.categories.find( { left: { $gt: databaseCategory.left }, right: { $lt: databaseCategory.right } } ); + var databaseCategory = db.v.findOne( { _id: "Databases" } ); + db.categories.find( { left: { $gt: databaseCategory.left }, right: { $lt: databaseCategory.right } } ); The *Nested Sets* pattern provides a fast and efficient solution for finding subtrees but is inefficient for modifying the tree structure. As such, this pattern is best for static trees that do not change. -.. 
seealso:: - - - `Ruby Example of Materialized Paths - `_ - - - `Sean Cribs Blog Post - `_ - which was the source for much of the :ref:`data-modeling-trees` content. - -Queues -~~~~~~ - -Consider the following ``book`` document that stores the number of -available copies: - -.. code-block:: javascript - :emphasize-lines: 9 - - book = { - _id: 123456789, - title: "MongoDB: The Definitive Guide", - author: [ "Kristina Chodorow", "Mike Dirolf" ], - published_date: ISODate("2010-09-24"), - pages: 216, - language: "English", - publisher_id: "oreilly", - available: 3 - } - -Using the :method:`db.collection.findAndModify()` method, you can -atomically find and decrement the ``available`` field as the book is -checked out: +Additional Resources +-------------------- -.. code-block:: javascript +For more information, consider the following external resources: - db.books.findAndModify ( { - query: { _id: 123456789 } , - update: { $inc: { available: -1 } } - } ) +- `Schema Design by Example `_ -Additional Resources --------------------- +- `Walkthrough MongoDB Data Modeling `_ -.. seealso:: +- `Document Design for MongoDB `_ - - `Schema Design by Example `_ +- `Dynamic Schema Blog Post `_ - - `Walkthrough MongoDB Data Modeling `_ - - - `Document Design for MongoDB `_ +- :wiki:`MongoDB Data Modeling and Rails` - - `Dynamic Schema Blog Post `_ +- `Ruby Example of Materialized Paths + `_ - - :wiki:`MongoDB Data Modeling and Rails` +- `Sean Cribs Blog Post + `_ + which was the source for much of the :ref:`data-modeling-trees` content. diff --git a/source/faq/developers.txt b/source/faq/developers.txt index 4b339e9ef29..6eca53435b2 100644 --- a/source/faq/developers.txt +++ b/source/faq/developers.txt @@ -616,47 +616,47 @@ explicitly force the query to use that index. ``_id`` field; it is not possible for a cursor that transverses this index to pass the same document more than once. -.. _faq-developers-isolate-cursors: +.. 
_faq-developers-embed-documents:
 
-When should I embed documents?
-------------------------------
+When should I embed documents within other documents?
+-----------------------------------------------------
 
-During :doc:`/core/data-modeling`, embedding is frequently the choice
-for:
+When :doc:`modeling data in MongoDB `, embedding
+is frequently the choice for:
 
-- "contains" relationships between entities. 
+- "contains" relationships between entities.
 
 - one-to-many relationships when the "many" objects *always* appear
   with or are viewed in the context of their parents.
 
 You should also consider embedding for performance reasons if you have
-a collection with a huge amount of small documents. If small, separate
+a collection with a large number of small documents. If small, separate
 documents represent the natural model for the data, then you should
-maintain that model. If, however, you can group these small documents
-by some logical relationship [#embed-caveat-grouping]_ and you
-frequently retrieve the documents by this
-grouping[#embed-caveat-subset]_, you might consider "rolling-up" the
-small documents into larger documents that contain an array of
-subdocuments.
-
-Embedding these small documents provides the following benefits:
-
-- Returning the group of the small documents involves sequential reads
-  and less random disk accesses.
-
-- If the data is too large to fit entirely in RAM, embedding provides
-  better RAM cache utilization. [#embed-caveat-ram-cache]_
-
-- If the small documents contained common index keys, the common keys
-  would be stored in fewer copies with fewer associated key entries in
-  the corresponding index.
-
-.. [#embed-caveat-grouping] If grouping the small documents would be
-   awkward, don't do it.
-
-.. [#embed-caveat-subset] If you often only need a subset of the items you would
-   group, this approach could be inefficient compared to alternatives.
-
-.. [#embed-caveat-ram-cache] If your small documents are approximately
-   the page cache unit size, there is no benefit for ram cache efficiency,
-   although embedding will provide some benefit regarding random disk i/o.
+maintain that model.
+
+If, however, you can group these small documents by some logical
+relationship *and* you frequently retrieve the documents by this
+grouping, you might consider "rolling-up" the small documents into
+larger documents that contain an array of subdocuments. But if you
+often only need to retrieve a subset of the documents within the group,
+then "rolling-up" the documents may not provide better performance.
+
+By "rolling up" these small documents into logical groupings, queries
+to retrieve the group of the documents involve sequential reads and
+fewer random disk accesses.
+
+.. Will probably need to break up the following sentence:
+
+Additionally, if the individual documents were indexed on common
+fields, then by "rolling up" the documents and moving the common fields
+to the larger document, there would be fewer copies of the common
+fields *and* there would be fewer associated key entries in the
+corresponding index. See :doc:`/core/indexes` for more information on
+indexes.
+
+.. Commenting out.. If the data is too large to fit entirely in RAM,
+   embedding provides better RAM cache utilization.
+
+.. Commenting out.. If your small documents are approximately the page
+   cache unit size, there is no benefit for ram cache efficiency, although
+   embedding will provide some benefit regarding random disk i/o.
 
diff --git a/source/reference/configuration-options.txt b/source/reference/configuration-options.txt
index 6629cb53d56..69900c254b2 100644
--- a/source/reference/configuration-options.txt
+++ b/source/reference/configuration-options.txt
@@ -446,9 +446,7 @@ Settings
    namespace files (i.e ``.ns``). This option has no impact on the
    size of existing namespace files.
- The default value is 16 megabytes; this provides for approximately - 24,000 namespaces. Each collection, as well as each index, counts as - a namespace. + See :limit:`Limits on namespaces `. .. setting:: profile From 932cc58656bb2dc4dceff9e200c3d8bcd5c212c4 Mon Sep 17 00:00:00 2001 From: kay Date: Wed, 19 Dec 2012 11:19:20 -0500 Subject: [PATCH 3/3] DOCS-661 data modeling page --- source/core/data-modeling.txt | 88 +++++++++++++++++++---------------- source/faq/developers.txt | 30 ++++++------ 2 files changed, 62 insertions(+), 56 deletions(-) diff --git a/source/core/data-modeling.txt b/source/core/data-modeling.txt index bb2ba9f3988..59635d476de 100644 --- a/source/core/data-modeling.txt +++ b/source/core/data-modeling.txt @@ -10,23 +10,26 @@ Overview Collections in MongoDB have flexible schema; they do not define nor enforce the fields of its documents. Each document can have only the fields that are relevant to that entity, although in practice, you -would generally choose to store similar documents in each collection. -With this flexible schema, you can model your data to reflect more -closely the actual entity rather than enforce a rigid data structure. - -In MongoDB, data modeling takes into consideration not only how data -relates to each other, but also how the data is used, how the data will -grow and be maintained. These considerations involve decisions about -whether to embed data within a single document or reference data among -different documents, which fields to index, and whether to use special -features. - -Choosing the correct data model can provide both performance and -maintenance gains for your applications. - -This document provide some general guidelines for data modeling and -possible options. These guidelines and options may not be appropriate -for your situation. +would generally choose to maintain a consistent structure across +documents in each collection. 
With this flexible schema, you can model +your data to reflect more closely the actual application-level entity +rather than enforce a rigid data structure. + +In MongoDB, data modeling takes into consideration not only the +inherent properties of the data entities themselves and how they relate +to each other, but also how the data is used, how the data will grow +and possibly change over time, and how the data will be maintained. +These considerations involve decisions about whether to embed data +within a single document or to reference data in different documents, +which fields to index, and whether to take advantage of rich document +features, such as arrays. + +Choosing the best data model for your application can provide significant +performance and maintenance benefits. + +This document provides some general guidelines and principles for schema +design and highlights possible data modeling options. Not all guidelines +and options may be appropriate for your specific situation. .. _data-modeling-decisions: @@ -46,7 +49,8 @@ Embedding De-normalization of data involves embedding documents within other documents. -Operations within a document are easy for the server to handle. +Operations within a document are less expensive for the server than +operations that involve multiple documents. In general, choose the embedded data model when: @@ -54,8 +58,8 @@ In general, choose the embedded data model when: :ref:`data-modeling-example-one-to-one`. - you have one-to-many relationships where the "many" objects always - appear with or are viewed in the context of their parents. See - :ref:`data-modeling-example-one-to-many`. + appear with or are viewed in the context of their parent documents. + See :ref:`data-modeling-example-one-to-many`.
Embedding provides the following benefits: @@ -63,11 +67,11 @@ Embedding provides the following benefits: - Single roundtrip to database to retrieve the complete object -However, with embedding, write operations can be slow if you are adding -objects frequently. Additionally, you cannot embed documents that will -cause the containing document to exceed the :limit:`maximum BSON -document size `. For documents that exceed the -maximum BSON document size, see :doc:`/applications/gridfs`. +Keep in mind that embedding documents that have unbounded growth over +time may slow write operations. Additionally, such documents may cause +their containing documents to exceed the :limit:`maximum BSON document +size `. For documents that exceed the maximum BSON +document size, see :doc:`/applications/gridfs`. For examples in accessing embedded documents, see :ref:`read-operations-subdocuments`. @@ -92,8 +96,10 @@ Normalization of data requires storing :doc:`references In general, choose the referenced data model when: -- embedding would result in duplication of data. - +- embedding would result in duplication of data but would not + provide sufficient read performance advantages to outweigh the + implications of the duplication. + - you have many-to-many relationships. - you are modeling large hierarchical data. See @@ -101,8 +107,8 @@ In general, choose the referenced data model when: Referencing provides more flexibility than embedding; however, to resolve the references, client-side applications must issue follow-up -queries. Additionally, the referencing data model involves performing -many seeks and random reads. +queries. In other words, using references requires more roundtrips to +the server. See :ref:`data-modeling-publisher-and-books` for an example of referencing. @@ -131,8 +137,8 @@ maintenance efforts. Data Lifecycle Management ~~~~~~~~~~~~~~~~~~~~~~~~~ -Data lifecycle management concerns contribute to the decision making -process around data modeling.
+Data modeling decisions should also take data lifecycle management into +consideration. The :doc:`Time to Live or TTL feature ` of collections expires documents after a period of time. Consider using @@ -148,7 +154,7 @@ documents based on insertion order. Large Number of Collections ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In certain situation, you might choose to store information in several +In certain situations, you might choose to store information in several collections instead of a single collection. Consider a sample collection ``logs`` that stores log documents for @@ -208,7 +214,7 @@ you want an index in MongoDB. Indexes in MongoDB are needed for efficient query processing, and as such, you may want to think about the queries first and then build indexes based upon them. Generally, you would index the fields that you query by and the fields that you -sort by. The ``_id`` field is automatically indexed. +sort by. A unique index is automatically created on the ``_id`` field. As you create indexes, consider the following behaviors of indexes: @@ -217,11 +223,11 @@ As you create indexes, consider the following behaviors of indexes: - Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive as each insert must add keys to each index. - -- Read operations supported by the index perform better, and read - operations not supported by the index have no performance impact from - the index. This allows for for collections with high read-to-write - ratio to have many indexes. + +- Collections with high read-to-write ratio benefit from having many + indexes. Read operations supported by the index have high + performance, and read operations not supported by the index are + unaffected by it. See :doc:`/applications/indexes` for more information on determining indexes. Additionally, MongoDB :wiki:`Database Profiler` provides @@ -337,7 +343,7 @@ the ``parent``. 
} { - patron_id = "joe", + patron_id: "joe", street: "123 Fake Street", city: "Faketon", state: "MA", @@ -354,7 +360,7 @@ the ``parent``. If your application frequently retrieves the ``address`` data with the ``name`` information, then your application needs to issue multiple -queries to resolve the references. The better data model would be to +queries to resolve the references. A better schema would be to embed the ``address`` data entities in the ``patron`` data, as in the following document: @@ -389,7 +395,7 @@ One-to-Many: Referencing Consider the following example that maps publisher and book relationships. The example illustrates the advantage of referencing -over embedding to prevent the repetition of the publisher information. +over embedding to avoid repetition of the publisher information. Embedding the publisher document inside the book document would lead to **repetition** of the publisher data, as the following documents show: diff --git a/source/faq/developers.txt b/source/faq/developers.txt index 6eca53435b2..24585fc82b3 100644 --- a/source/faq/developers.txt +++ b/source/faq/developers.txt @@ -630,29 +630,29 @@ is frequently the choice for: with or are viewed in the context of their parents. You should also consider embedding for performance reasons if you have -a collection with a large amount of small documents. If small, separate -documents represent the natural model for the data, then you should -maintain that model. +a collection with a large number of small documents. Nevertheless, if +small, separate documents represent the natural model for the data, +then you should maintain that model. If, however, you can group these small documents by some logical relationship *and* you frequently retrieve the documents by this grouping, you might consider "rolling-up" the small documents into -larger documents that contain an array of subdocuments.
But if you -often only need to retrieve a subset of the documents within the group, -then "rolling-up" the documents may not provide better performance. +larger documents that contain an array of subdocuments. Keep in mind +that if you often only need to retrieve a subset of the documents +within the group, then "rolling-up" the documents may not provide +better performance. -By "rolling up" these small documents into logical groupings, queries -to retrieve the group of the documents involve sequential reads and -less random disk accesses. +"Rolling up" these small documents into logical groupings means that queries to +retrieve a group of documents involve sequential reads and fewer random disk +accesses. .. Will probably need to break up the following sentence: -Additionally, if the individual documents were indexed on common -fields, then by "rolling up" the documents and moving the common fields -to the larger document, there would be fewer copies of the common -fields *and* there would be fewer associated key entries in the -corresponding index. See :doc:`/core/indexes` for more information on -indexes. +Additionally, "rolling up" documents and moving common fields to the +larger document benefits the indexes on these fields. There would be fewer +copies of the common fields *and* there would be fewer associated key +entries in the corresponding index. See :doc:`/core/indexes` for more +information on indexes. .. Commenting out.. If the data is too large to fit entirely in RAM, embedding provides better RAM cache utilization.
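Reviewer note: the embedding-versus-referencing trade-off these patches describe can be sketched without a MongoDB server at all. The following is an illustrative sketch only, using plain in-memory arrays in place of collections; the lookup helpers stand in for driver queries, and any field not shown in the patch (e.g. ``name``, ``zip``) is an assumption, not part of the documented example.

```javascript
// Sketch of the patron/address example: referenced vs. embedded models.
// In-memory arrays simulate collections; .find() simulates a query
// (one call per "roundtrip" to the server).

// Referenced model: the address lives in a separate "collection",
// linked by patron_id. Resolving the reference costs a second lookup.
const patrons = [{ _id: "joe", name: "Joe" }]; // "name" is assumed
const addresses = [
  { patron_id: "joe", street: "123 Fake Street", city: "Faketon",
    state: "MA", zip: 12345 } // "zip" is assumed
];

function patronWithAddressReferenced(id) {
  const patron = patrons.find(p => p._id === id);          // roundtrip 1
  const address = addresses.find(a => a.patron_id === id); // roundtrip 2
  return Object.assign({}, patron, { address });
}

// Embedded model: one document, so a single lookup returns the
// complete object.
const patronsEmbedded = [{
  _id: "joe",
  name: "Joe",
  address: { street: "123 Fake Street", city: "Faketon",
             state: "MA", zip: 12345 }
}];

function patronEmbedded(id) {
  return patronsEmbedded.find(p => p._id === id);          // one roundtrip
}

// Both models yield the same logical object; they differ in the
// number of lookups needed to assemble it.
console.log(patronWithAddressReferenced("joe").address.city); // "Faketon"
console.log(patronEmbedded("joe").address.city);              // "Faketon"
```

The same shape applies to the publisher/book example: referencing trades extra roundtrips for avoiding duplication of the publisher data in every book document.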