DOCS-134 review #4

Merged
merged 1 commit into from
Feb 27, 2012
81 changes: 42 additions & 39 deletions source/applications/aggregation.rst
@@ -12,11 +12,11 @@ Overview
The MongoDB aggregation framework provides a means to calculate
aggregate values without having to use :doc:`map/reduce
</core/map-reduce>`. While map/reduce is powerful, using map/reduce is
more difficult than necessary for simple aggregation tasks, such as
more difficult than necessary for many simple aggregation tasks, such as
totaling or averaging field values.

If you're familiar with :term:`SQL`, the aggregation framework
provides similar functionality as "``GROUPBY``" and related SQL
provides similar functionality to "``GROUP BY``" and related SQL
operators as well as simple forms of "self joins." Additionally, the
aggregation framework provides projection capabilities to reshape the
returned data. Using projections and aggregation, you can add computed
@@ -38,23 +38,22 @@ underpin the aggregation framework: :term:`pipelines <pipeline>` and
Pipelines
~~~~~~~~~

A pipeline is process that applies a sequence of documents when using
the aggregation framework. For those familiar with UNIX-like shells
(e.g. bash,) the concept is analogous to the pipe (i.e. "``|``") used
to string operations together.
Conceptually, documents from a collection are passed through an
aggregation pipeline, and are transformed as they pass through it.
For those familiar with UNIX-like shells (e.g. bash), the concept is
analogous to the pipe (i.e. "``|``") used to string text filters together.

In a shell environment the pipe redirects a stream of characters from
the output of one process to the input of the next. The MongoDB
aggregation pipeline streams MongoDB documents from one :doc:`pipeline
operator </reference/aggregation>` to the next to process the
documents.

All pipeline operators processes a stream of documents, and the
All pipeline operators process a stream of documents, and the
pipeline behaves as if the operation scans a :term:`collection` and
passes all matching documents into the "top" of the pipeline. Then,
each operator in the pipleine transforms each document as it passes
through the pipeline. At the end of the pipeline, the aggregation
framework returns documents in the same manner as all other queries.
passes all matching documents into the "top" of the pipeline.
Each operator in the pipeline transforms each document as it passes
through the pipeline.
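
As a rough illustration of the pipeline idea described above, the following sketch models each stage as a plain function that takes an array of documents and returns a new array. It is a conceptual model only, not how MongoDB implements the aggregation framework. The helper names (``runPipeline``, ``matchAuthor``, ``projectTitle``) and the ``author``, ``title``, and ``pageViews`` fields are hypothetical and chosen for this example.

```javascript
// Conceptual model only: a stage is a function from documents to documents.
// This is not MongoDB's implementation.
function runPipeline(documents, stages) {
  // Feed the output of each stage into the next, like a shell pipe.
  return stages.reduce((docs, stage) => stage(docs), documents);
}

// Hypothetical stages, loosely analogous to $match and $project.
const matchAuthor = (author) => (docs) =>
  docs.filter((d) => d.author === author);
const projectTitle = (docs) => docs.map((d) => ({ title: d.title }));

// Hypothetical sample documents.
const sampleArticles = [
  { title: "a", author: "bob", pageViews: 5 },
  { title: "b", author: "dave", pageViews: 7 },
  { title: "c", author: "bob", pageViews: 2 },
];

const pipelineResult = runPipeline(sampleArticles, [
  matchAuthor("bob"),
  projectTitle,
]);
console.log(pipelineResult);
```

Each stage sees only the documents that the previous stage emitted, which is the property the shell-pipe analogy is meant to convey.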

.. note::

@@ -72,24 +71,26 @@ framework returns documents in the same manner as all other queries.
- :agg:pipeline:`$unwind`
- :agg:pipeline:`$group`
- :agg:pipeline:`$sort`
TODO I'd remove references to $out, since we don't have it yet
- :agg:pipeline:`$out`

.. _aggregation-expressions:

Expressions
~~~~~~~~~~~

Expressions calculate values based on inputs from the pipeline, and
return their results to the pipeline. The aggregation framework
defines expressions in :term:`JSON` using a prefix format.
Expressions calculate values based on documents passing through the pipeline,
and contribute their results to documents flowing through the pipeline.
The aggregation framework defines expressions in :term:`JSON` using a prefix
format.

Often, expressions are stateless and are only evaluated when seen by
the aggregation process. Stateless expressions perform operations such
as: adding the values of two fields together, or extracting the year
as adding the values of two fields together or extracting the year
from a date.
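
To make the prefix format more concrete, here is a toy evaluator written in plain JavaScript. It handles only ``$add`` and ``$year`` and reads only top-level fields, so it should be read as a sketch of the idea rather than a description of the server's expression engine. The ``evaluate`` function and the ``price``, ``tax``, and ``posted`` fields are assumptions made for this example.

```javascript
// Toy evaluator for prefix-format expressions.
// Supports only $add and $year; not MongoDB's implementation.
function evaluate(expr, doc) {
  // A string such as "$price" refers to a top-level field of the document.
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)];
  }
  // An object such as { $add: [...] } names an operator and its operands.
  if (expr !== null && typeof expr === "object" && !(expr instanceof Date)) {
    const [op] = Object.keys(expr);
    const operand = expr[op];
    if (op === "$add") {
      return operand
        .map((e) => evaluate(e, doc))
        .reduce((sum, value) => sum + value, 0);
    }
    if (op === "$year") {
      return evaluate(operand, doc).getUTCFullYear();
    }
    throw new Error("unsupported operator: " + op);
  }
  // Anything else is treated as a literal value.
  return expr;
}

// Hypothetical document.
const order = { price: 10, tax: 2, posted: new Date(Date.UTC(2012, 1, 27)) };

const orderTotal = evaluate({ $add: ["$price", "$tax"] }, order);
const postedYear = evaluate({ $year: "$posted" }, order);
console.log(orderTotal, postedYear);
```

Both expressions are stateless in the sense used above: the result depends only on the document being evaluated.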

The :term:`accumulator` expressions *do* retain state, and the
:agg:pipeline:`$group` operator uses maintains state (e.g. counts,
:agg:pipeline:`$group` operator maintains that state (e.g.
totals, maximums, minimums, and related data) as documents progress
through the :term:`pipeline`.
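
The difference between stateless expressions and accumulators can be sketched as follows. This toy grouping function keeps one state record per distinct key and updates it as each document arrives. The function name ``groupByField``, the sample fields, and the choice of count, total, and maximum are assumptions made for this example, and the code does not reflect the server's internals.

```javascript
// Toy $group-like stage: retains one state record per distinct key.
// Not MongoDB's implementation.
function groupByField(docs, keyField, valueField) {
  const state = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    // Look up the running state for this key, or start a new record.
    const acc =
      state.get(key) || { _id: key, count: 0, total: 0, max: -Infinity };
    acc.count += 1;
    acc.total += doc[valueField];
    acc.max = Math.max(acc.max, doc[valueField]);
    state.set(key, acc);
  }
  // Output is available only after every input document has been seen.
  return Array.from(state.values());
}

// Hypothetical sample documents.
const viewLog = [
  { author: "bob", pageViews: 5 },
  { author: "dave", pageViews: 7 },
  { author: "bob", pageViews: 2 },
];

const grouped = groupByField(viewLog, "author", "pageViews");
console.log(grouped);
```

Note that the state grows with the number of distinct keys, not with the number of input documents.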

@@ -104,17 +105,17 @@ Invocation
~~~~~~~~~~

Invoke an :term:`aggregation` operation with the :func:`aggregate`
wrapper in the :program:`mongo` shell for the :dbcommand:`aggregate`
wrapper in the :program:`mongo` shell or the :dbcommand:`aggregate`
:term:`database command`. Always call :func:`aggregate` on a
collection object, which will determine the documents that contribute
to the beginning of the aggregation :term:`pipeline`. The arguments to
the :func:`aggregate` function specify a sequence :ref:`pipeline
the :func:`aggregate` function specify a sequence of :ref:`pipeline
operators <aggregation-pipeline-operator-reference>`, where each
:ref:`pipeline operator <aggregation-pipeline-operator-reference>` may
have a number of operands.

First, consider a :term:`collection` of documents named "``article``"
using the following schema or and format:
using the following format:

.. code-block:: javascript

@@ -169,7 +170,10 @@ The aggregation operation in the previous section returns a
if there was an error

As a document, the result is subject to the current :ref:`BSON
Document size <limit-maximum-bson-document-size>`. If you expect the
Document size <limit-maximum-bson-document-size>`.

TODO $out is not going to be available in 2.2, so I'd eliminate this reference
If you expect the
aggregation framework to return a larger result, consider using the
:agg:pipeline:`$out` pipeline operator to write the output to a
collection.
@@ -181,22 +185,21 @@ Early Filtering
~~~~~~~~~~~~~~~

Because you will always call :func:`aggregate` on a
:term:`collection` object, which inserts the *entire* collection into
the aggregation pipeline, you may want to increase efficiency in some
situations by avoiding scanning an entire collection.
:term:`collection` object, which logically inserts the *entire* collection into
the aggregation pipeline, you may want to optimize the operation
by avoiding scanning the entire collection whenever possible.

If your aggregation operation requires only a subset of the data in a
collection, use the :agg:pipeline:`$match` to limit the items in the
pipeline, as in a query. These :agg:pipeline:`$match` operations will use
suitable indexes to access the matching element or elements in a
collection.

When :agg:pipeline:`$match` appears first in the :term:`pipeline`, the
:dbcommand:`pipeline` begins with results of a :term:`query` rather than
the entire contents of a collection.

collection, use the :agg:pipeline:`$match` operator to restrict which items go
into the top of the
pipeline, as in a query. When placed early in a pipeline, these
:agg:pipeline:`$match` operations will use
suitable indexes to scan only the matching documents in a collection.
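
The benefit of filtering early can be illustrated in plain JavaScript by counting how many documents a later, more expensive step has to handle. This sketch does not model index use, which is a server-side concern. It only shows that both orderings give the same result while the early filter does less work. The ``expensiveTransform`` function and the sample fields are hypothetical.

```javascript
// Counts how many documents the later transformation step processes.
let transformCalls = 0;

// Hypothetical per-document transformation standing in for later stages.
function expensiveTransform(doc) {
  transformCalls += 1;
  return { author: doc.author, score: doc.pageViews * 2 };
}

// Hypothetical sample documents.
const inputDocs = [
  { author: "bob", pageViews: 5 },
  { author: "dave", pageViews: 7 },
  { author: "bob", pageViews: 2 },
];

// Filter late: every document is transformed before filtering.
transformCalls = 0;
const filteredLate = inputDocs
  .map(expensiveTransform)
  .filter((d) => d.author === "bob");
const lateCalls = transformCalls;

// Filter early: only matching documents reach the transformation.
transformCalls = 0;
const filteredEarly = inputDocs
  .filter((d) => d.author === "bob")
  .map(expensiveTransform);
const earlyCalls = transformCalls;

console.log(lateCalls, earlyCalls);
```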

TODO we don't do the following yet, but there's a ticket for it. Should we
leave it out for now?
:term:`Aggregation` operations have an optimization phase, before
execution, attempts to re-arrange the pipeline by moving
execution, which attempts to re-arrange the pipeline by moving
:agg:pipeline:`$match` operators towards the beginning to the greatest
extent possible. For example, if a :term:`pipeline` begins with a
:agg:pipeline:`$project` that renames fields, followed by a
@@ -221,7 +224,7 @@ must fit in memory.

:agg:pipeline:`$group` has similar characteristics: Before any
:agg:pipeline:`$group` passes its output along the pipeline, it must
receive the entity of its input. For the case of :agg:pipeline:`$group`
receive the entirety of its input. For the case of :agg:pipeline:`$group`
this frequently does not require as much memory as
:agg:pipeline:`$sort`, because it only needs to retain one record for
each unique key in the grouping specification.
@@ -236,14 +239,14 @@ Sharded Operation

The aggregation framework is compatible with sharded collections.

When the operating on a sharded collection, the aggregation pipeline
splits into two parts. The aggregation framework pushes all of the
When operating on a sharded collection, the aggregation framework
splits the pipeline into two parts. It pushes all of the
operators up to and including the first :agg:pipeline:`$group` or
:agg:pipeline:`$sort` to each shard using the results received from the
shards. [#match-sharding]_ Then, a second pipeline on the
:agg:pipeline:`$sort` to each shard.
[#match-sharding]_ Then, a second pipeline on the
:program:`mongos` runs. This pipeline consists of the first
:agg:pipeline:`$group` or :agg:pipeline:`$sort` and any remaining pipeline
operators
operators; this is run on the results received from the shards.
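
The two-part split described above can be sketched, under simplifying assumptions, as a per-shard step that produces partial totals followed by a merge step that combines them. The functions ``partialTotals`` and ``mergeTotals``, the two in-memory "shards", and the field names are all hypothetical. The real division of work between the shards and :program:`mongos` is decided by the server.

```javascript
// Phase 1 (conceptually on each shard): compute partial totals per key.
function partialTotals(docs) {
  const totals = {};
  for (const d of docs) {
    totals[d.author] = (totals[d.author] || 0) + d.pageViews;
  }
  return totals;
}

// Phase 2 (conceptually on mongos): combine the partial totals.
function mergeTotals(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [key, value] of Object.entries(partial)) {
      merged[key] = (merged[key] || 0) + value;
    }
  }
  return merged;
}

// Hypothetical data held by two shards.
const shardA = [
  { author: "bob", pageViews: 5 },
  { author: "dave", pageViews: 7 },
];
const shardB = [{ author: "bob", pageViews: 2 }];

const combinedTotals = mergeTotals([shardA, shardB].map(partialTotals));
console.log(combinedTotals);
```

This works for totals because addition can be applied to partial results. Other accumulators may need different merge logic.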

The :program:`mongos` pipeline merges :agg:pipeline:`$sort` operations
from the shards. The :agg:pipeline:`$group` brings any “sub-totals”