From 25644a0237186141bfdcae8c515279b08a459a6f Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Sat, 17 Mar 2012 13:48:49 -0700 Subject: [PATCH 01/20] Add usecases (needs formatting cleanup) Signed-off-by: Rick Copeland --- .../cms-_metadata_and_asset_management.txt | 584 +++++++++++++ .../usecase/cms-_storing_comments.txt | 746 +++++++++++++++++ .../usecase/ecommerce-_category_hierarchy.txt | 272 +++++++ .../ecommerce-_inventory_management.txt | 537 ++++++++++++ .../usecase/ecommerce-_product_catalog.txt | 761 +++++++++++++++++ source/tutorial/usecase/index.txt | 14 + ...me_analytics-_hierarchical_aggregation.txt | 717 ++++++++++++++++ ..._time_analytics-_preaggregated_reports.txt | 768 ++++++++++++++++++ .../real_time_analytics-_storing_log_data.txt | 694 ++++++++++++++++ 9 files changed, 5093 insertions(+) create mode 100644 source/tutorial/usecase/cms-_metadata_and_asset_management.txt create mode 100644 source/tutorial/usecase/cms-_storing_comments.txt create mode 100644 source/tutorial/usecase/ecommerce-_category_hierarchy.txt create mode 100644 source/tutorial/usecase/ecommerce-_inventory_management.txt create mode 100644 source/tutorial/usecase/ecommerce-_product_catalog.txt create mode 100644 source/tutorial/usecase/index.txt create mode 100644 source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt create mode 100644 source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt create mode 100644 source/tutorial/usecase/real_time_analytics-_storing_log_data.txt diff --git a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt new file mode 100644 index 00000000000..5e712c5b3fe --- /dev/null +++ b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt @@ -0,0 +1,584 @@ +CMS: Metadata and Asset Management +================================== + +Problem[a] +---------- + +You are designing a content management system (CMS) and you want to use +MongoDB to store the content of your sites. + +Solution overview +----------------- + +Our approach in this solution is inspired by the design of Drupal, an +open source CMS written in PHP on relational databases that is available +at `http://www.drupal.org `_. In this case, we +will take advantage of MongoDB's dynamically typed collections to +*polymorphically* store all our content nodes in the same collection. +Our navigational information will be stored in its own collection since +it has relatively little in common with our content nodes. + +The main node types with which we are concerned here are: + +- **Basic page** : Basic pages are useful for displaying + infrequently-changing text such as an 'about' page. With a basic + page, the main information we are concerned with is the title and the + content. +- **Blog entry** : Blog entries record a "stream" of posts from users + on the CMS and store title, author, content, and date as relevant + information. +- **Photo** : Photos participate in photo galleries, and store title, + description, author, and date along with the actual photo binary + data. + +Schema design +------------- + +Our node collection will contain documents of various formats, but they +will all share a similar structure, with each document including an +\_id, type, section, slug, title, creation date, author, and tags. The +'section' property is used to identify groupings of items (grouped to a +particular blog or photo gallery, for instance). 
The 'slug' property is +a url-friendly representation of the node that is unique within its +section, and is used for mapping URLs to nodes. Each document also +contains a 'detail' field which will vary per document type: + +``{`` + +``_id: ObjectId(…),`` + +``nonce: ObjectId(…),`` + +``metadata: {`` + +``type: 'basic-page'`` + +``section: 'my-photos',`` + +``slug: 'about',`` + +``title: 'About Us',`` + +``created: ISODate(…),`` + +``author: { _id: ObjectId(…), name: 'Rick' },`` + +``tags: [ … ],`` + +``detail: { text: '# About Us\n…' }`` + +``}`` + +``}`` + +\` + +For the basic page above, the detail field might simply contain the text +of the page. In the case of a blog entry, the document might resemble +the following instead: + +``{`` + +``…`` + +``metadata: {`` + +``…`` + +``type: 'blog-entry',`` + +``section: 'my-blog',`` + +``slug: '2012-03-noticed-the-news',`` + +``…`` + +``detail: {`` + +``publish_on: ISODate(…),`` + +``text: 'I noticed the news from Washington today…'`` + +``}`` + +``}`` + +``}`` + +Photos present something of a special case. Since we will need to store +potentially very large photos, we would like separate our binary storage +of photo data from the metadata storage. GridFS provides just such a +mechanism, splitting a 'filesystem' of potentially very large files into +two collections, the 'files' collection and the 'chunks' collection. In +our case, we will call the two collections 'cms.assets.files' and +'cms.assets.chunks'. We will use documents in the 'assets.files' +collection to store the normal GridFS metadata as well as our node +metadata: + +``{`` + +``_id: ObjectId(…),`` + +``length: 123...,`` + +``chunkSize: 262144,`` + +``uploadDate: ISODate(…),`` + +``contentType: 'image/jpeg',`` + +``md5: 'ba49a...',`` + +``metadata: {`` + +``nonce: ObjectId(…),`` + +``slug: '2012-03-invisible-bicycle',`` + +``type: 'photo',`` + +``section: 'my-album',`` + +``title: 'Kitteh',`` + +``created: ISODate(…),`` + +``author: { _id: ObjectId(…), name: 'Jared' },`` + +``tags: [ … ],`` + +``detail: {`` + +``filename: 'kitteh_invisible_bike.jpg',`` + +``resolution: [ 1600, 1600 ], … }`` + +``}`` + +``}`` + +Here, we have embedded the schema for our 'normal' nodes so we can share +node- manipulation code among all types of nodes. + +Operations +---------- + +Here, we will describe common queries and updates used in our CMS, +paying particular attention to 'tweaks' we need to make for our various +node types. + +Create and edit content nodes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The content producers using our CMS will be creating and editing content +most of the time. Most content-creation activities are relatively +straightforward: + +``db.cms.nodes.insert({`` + +``'nonce': ObjectId(),`` + +``'metadata': {`` + +``'section': 'myblog',`` + +``'slug': '2012-03-noticed-the-news',`` + +``'type': 'blog-entry',`` + +``'title': 'Noticed in the News',`` + +``'created': datetime.utcnow(),`` + +``'author': { 'id': user_id, 'name': 'Rick' },`` + +``'tags': [ 'news', 'musings' ],`` + +``'detail': {`` + +``'publish_on': datetime.utcnow(),`` + +``'text': 'I noticed the news from Washington today…' }`` + +``}`` + +``})`` + +Once the node is in the database, we have a potential problem with +multiple editors. 
In order to support this, we use the special 'nonce' +value to detect when another editor may have modified the document and +allow the application to resolve any conflicts: + +``def update_text(section, slug, nonce, text):`` + +``result = db.cms.nodes.update(`` + +``{ 'metadata.section': section,`` + +``'metadata.slug': slug,`` + +``'nonce': nonce },`` + +``{ '$set':{'metadata.detail.text': text, 'nonce': ObjectId() } },`` + +``safe=True)`` + +``if not result['updatedExisting']:`` + +``raise ConflictError()`` + +\` + +We might also want to perform metadata edits to the item such as adding +tags: + +``db.cms.nodes.update(`` + +``{ 'metadata.section': section, 'metadata.slug': slug },`` + +``{ '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } })`` + +\` + +In this case, we don't actually need to supply the nonce (nor update it) +since we are using the atomic $addToSet modifier in MongoDB. + +Index support +^^^^^^^^^^^^^ + +Our updates in this case are based on equality queries containing the +(section, slug, and nonce) values. To support these queries, we might +use the following index: + +``>>> db.cms.nodes.ensure_index([`` + +``... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ])`` + +\` + +Also note, however, that we would like to ensure that two editors don't +create two documents with the same section and slug. To support this, we +will use a second index with a unique constraint: + +\` + +``>>> db.cms.nodes.ensure_index([`` + +``... ('metadata.section', 1), ('metadata.slug', 1)], unique=True)`` + +In fact, since we expect that most of the time (section, slug, nonce) is +going to be unique, we don't actually get much benefit from the first +index and can use only the second one to satisfy our update queries as +well. + +Upload a photo +~~~~~~~~~~~~~~ + +Uploading photos to our site shares some things in common with node +update, but it also has some extra nuances: + +\` + +``def upload_new_photo(`` + +``input_file, section, slug, title, author, tags, details):`` + +``fs = GridFS(db, 'cms.assets')`` + +``with fs.new_file(`` + +``content_type='image/jpeg',`` + +``metadata=dict(`` + +``type='photo',`` + +``locked=datetime.utcnow(),`` + +``section=section,`` + +``slug=slug,`` + +``title=title,`` + +``created=datetime.utcnow(),`` + +``author=author,`` + +``tags=tags,`` + +``detail=detail)) as upload_file:`` + +``while True:`` + +``chunk = input_file.read(upload_file.chunk_size)`` + +``if not chunk: break`` + +``upload_file.write(chunk)`` + +``# unlock the file`` + +``db.assets.files.update(`` + +``{'_id': upload_file._id},`` + +``{'$set': { 'locked': None } } )`` + +\` + +Here, since uploading the photo is a non-atomic operation, we have +locked the file during upload by writing the current datetime into the +record. This lets us detect when a file upload may be stalled, which is +helpful when working with multiple editors. 
In this case, we will assume +that the last update wins: + +\` + +``def update_photo_content(input_file, section, slug):`` + +``fs = GridFS(db, 'cms.assets')`` + +\` + +``# Delete the old version if it's unlocked or was locked more than 5`` + +``# minutes ago`` + +``file_obj = db.cms.assets.find_one(`` + +``{ 'metadata.section': section,`` + +``'metadata.slug': slug,`` + +``'metadata.locked': None })`` + +``if file_obj is None:`` + +``threshold = datetime.utcnow() - timedelta(seconds=300)`` + +``file_obj = db.cms.assets.find_one(`` + +``{ 'metadata.section': section,`` + +``'metadata.slug': slug,`` + +``'metadata.locked': { '$lt': threshold } })`` + +``if file_obj is None: raise FileDoesNotExist()`` + +``fs.delete(file_obj['_id'])`` + +\` + +``# update content, keep metadata unchanged`` + +``file_obj['locked'] = datetime.utcnow()`` + +``with fs.new_file(**file_obj):`` + +``while True:`` + +``chunk = input_file.read(upload_file.chunk_size)`` + +``if not chunk: break`` + +``upload_file.write(chunk)`` + +``# unlock the file`` + +``db.assets.files.update(`` + +``{'_id': upload_file._id},`` + +``{'$set': { 'locked': None } } )`` + +We can, of course, perform metadata edits to the item such as adding +tags without the extra complexity: + +``db.cms.assets.files.update(`` + +``{ 'metadata.section': section, 'metadata.slug': slug },`` + +``{ '$addToSet': {`` + +``'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } })`` + +Index support +^^^^^^^^^^^^^ + +Our updates here are also based on equality queries containing the +(section, slug) values, so we can use the same types of indexes as we +used in the 'regular' node case. Note in particular that we need a +unique constraint on (section, slug) to ensure that one of the calls to +GridFS.new\_file() will fail multiple editors try to create or update +the file simultaneously. + +\` + +``>>> db.cms.assets.files.ensure_index([`` + +``... ('metadata.section', 1), ('metadata.slug', 1)], unique=True)`` + +Locate and render a node +~~~~~~~~~~~~~~~~~~~~~~~~ + +We want to be able to locate a node based on its section and slug, which +we assume have been extracted from the page definition and URL by some +other technology. + +``node = db.nodes.find_one(`` + +``{'metadata.section': section, 'metadata.slug': slug })`` + +Index support +^^^^^^^^^^^^^ + +The same indexes we have defined above on (section, slug) would +efficiently render this node. + +Locate and render a file +~~~~~~~~~~~~~~~~~~~~~~~~ + +We want to be able to locate an image based on its section and slug, +which we assume have been extracted from the page definition and URL +just as with other nodes. + +``fs = GridFS(db, 'cms.assets')`` + +``with fs.get_version(`` + +``**{'metadata.section': section, 'metadata.slug': slug }) as img_fp:`` + +``# do something with our image file`` + +Index support +^^^^^^^^^^^^^ + +The same indexes we have defined above on (section, slug) would also +efficiently render this image. 
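As a concrete illustration of the lookup described above, the following
sketch retrieves the file document by (section, slug) and then streams its
content out of GridFS. It assumes ``db`` is a PyMongo database handle, and
``render_image_response`` stands in for whatever response logic the
application actually uses::

    from gridfs import GridFS

    def serve_photo(db, section, slug):
        # Find the GridFS file document using the same (section, slug)
        # pair used for regular nodes; the unique index defined above
        # guarantees at most one match.
        file_doc = db.cms.assets.files.find_one(
            {'metadata.section': section, 'metadata.slug': slug})
        if file_doc is None:
            raise FileDoesNotExist()
        # Read the binary content back out of the chunks collection.
        fs = GridFS(db, 'cms.assets')
        grid_out = fs.get(file_doc['_id'])
        return render_image_response(
            grid_out.read(), file_doc['contentType'])
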
+ +Search for nodes by tag +~~~~~~~~~~~~~~~~~~~~~~~ + +Here we would like to retrieve a list of nodes based on their tag: + +\` + +``nodes = db.nodes.find({'metadata.tags': tag })`` + +Index support +^^^^^^^^^^^^^ + +To support searching efficiently, we should define indexes on any fields +we intend on using in our query: + +\` + +``>>> db.cms.nodes.ensure_index('tags')`` + +\` + +Search for images by tag +~~~~~~~~~~~~~~~~~~~~~~~~ + +Here we would like to retrieve a list of images based on their tag: + +\` + +``image_file_objects = db.cms.assets.files.find({'metadata.tags': tag })`` + +``fs = GridFS(db, 'cms.assets')`` + +``for image_file_object in db.cms.assets.files.find(`` + +``{'metadata.tags': tag }):`` + +``image_file = fs.get(image_file_object['_id'])`` + +``# do something with the image file`` + +Index support +^^^^^^^^^^^^^ + +As above, in order to support searching efficiently, we should define +indexes on any fields we intend on using in our query: + +\` + +``>>> db.cms.assets.files.ensure_index('tags')`` + +Generate a feed of recently published blog articles +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here, we wish to generate an .rss or .atom feed for our recently +published blog articles, sorted by date descending: + +\` + +``articles = db.nodes.find({`` + +``'metadata.section': 'my-blog'`` + +``'metadata.published': { '$lt': datetime.utcnow() } })`` + +``articles = articles.sort({'metadata.published': -1})`` + +In order to support this operation, we will create an index on (section, +published) so the items are 'in order' for our query. Note that in cases +where we are sorting or using range queries, as here, the field on which +we're sorting or performing a range query must be the final field in our +index: + +\` + +``>>> db.cms.nodes.ensure_index(`` + +``... [ ('metadata.section', 1), ('metadata.published', -1) ])`` + +Sharding +-------- + +In a CMS system, our read performance is generally much more important +than our write performance. As such, we will optimize the sharding setup +for read performance. In order to achieve the best read performance, we +need to ensure that queries are *routeable* by the mongos process. + +A second consideration when sharding is that unique indexes do not span +shards. As such, our shard key must include the unique indexes we have +defined in order to get the same semantics as we have described. Given +these constraints, sharding the nodes and assets on (section, slug) +seems to be a reasonable approach: + +\` + +``>>> db.command('shardcollection', 'cms.nodes', {`` + +``... key : { 'metadata.section': 1, 'metadata.slug' : 1 } })`` + +``{ "collectionsharded" : "cms.nodes", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'cms.assets.files', {`` + +``... key : { 'metadata.section': 1, 'metadata.slug' : 1 } })`` + +``{ "collectionsharded" : "cms.assets.files", "ok" : 1 }`` + +\` + +If we wish to shard our 'cms.assets.chunks' collection, we need to shard +on the \_id field (none of our metadata is available on the chunks +collection in gridfs): + +\` + +``>>> db.command('shardcollection', 'cms.assets.chunks'`` + +``{ "collectionsharded" : "cms.assets.chunks", "ok" : 1 }`` + +This actually still maintains our query-routability constraint, since +all reads from gridfs must first look up the document in 'files' and +then look up the chunks separately (though the GridFS API sometimes +hides this detail from us.) 
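For completeness, a sketch of the chunks-sharding command described above,
assuming the ``_id``-based key just discussed and the same legacy PyMongo
``command`` style used in the other sharding examples (the exact invocation
depends on the driver version)::

    # Sketch: shard the GridFS chunks collection on _id, as described
    # in the text above.
    db.command('shardcollection', 'cms.assets.chunks',
               key={'_id': 1})
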
+ +Page of diff --git a/source/tutorial/usecase/cms-_storing_comments.txt b/source/tutorial/usecase/cms-_storing_comments.txt new file mode 100644 index 00000000000..9b685cd18c8 --- /dev/null +++ b/source/tutorial/usecase/cms-_storing_comments.txt @@ -0,0 +1,746 @@ +CMS: Storing Comments +===================== + +Problem[a] +---------- + +In your content management system (CMS) you would like to store +user-generated comments on the various types of content you generate. + +Solution overview +----------------- + +Rather than describing the One True Way to implement comments in this +solution, we will explore different options and the trade-offs with +each. The three major designs we will discuss here are: + +- **One document per comment** - This provides the greatest degree of + flexibility, as it is relatively straightforward to display the + comments as either threaded or chronological. There are also no + restrictions on the number of comments that can participate in a + discussion. +- **All comments embedded** - In this design, all the comments are + embedded in their parent document, whether that be a blog article, + news story, or forum topic. This can be the highest performance + design, but is also the most restrictive, as the display format of + the comments is tied to the embedded structure. There are also + potential problems with extremely active discussions where the total + data (topic data + comments) exceeds the 16MB limit of MongoDB + documents. +- **Hybrid design** - Here, we store comments separately from their + parent topic, but we aggregate comments together into a few + documents, each containing many comments. + +Another decision that needs to be considered in desniging a commenting +system is whether to support threaded commenting (explicit replies to a +parent comment). We will explore how this threaded comment support +decision affects our schema design and operations as well. + +Schema design: One Document Per Comment +--------------------------------------- + +A comment in the one document per comment format might have a structure +similar to the following: + +\` + +``{`` + +``_id: ObjectId(…),`` + +``discussion_id: ObjectId(…),`` + +``slug: '34db',`` + +``posted: ISODateTime(…),`` + +``author: { id: ObjectId(…), name: 'Rick' },`` + +``text: 'This is so bogus … '`` + +``}`` + +\` + +The format above is really only suitable for chronological display of +commentary. We maintain a reference to the discussion in which this +comment participates, a url-friendly 'slug' to identify it, posting time +and author, and the comment text. If we want to support threading in +this format, we need to maintain some notion of hierarchy in the comment +model as well: + +\` + +``{`` + +``_id: ObjectId(…),`` + +``discussion_id: ObjectId(…),`` + +``parent_id: ObjectId(…),`` + +``slug: '34db/8bda',`` + +``full_slug: '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16',`` + +``posted: ISODateTime(…),`` + +``author: { id: ObjectId(…), name: 'Rick' },`` + +``text: 'This is so bogus … '`` + +``}`` + +\` + +Here, we have stored some extra information into the document that +represents this document's position in the hierarchy. In addition to +maintaining the parent\_id for the comment, we have modified the slug +format and added a new field, full\_slug. The slug is now a path +consisting of the parent's slug plus the comment's unique slug portion. +The full\_slug is also included to facilitate sorting documents in a +threaded discussion by posting date. 
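To make the sorting behaviour concrete, here is a small illustrative sketch
(the slug values are hypothetical, not taken from the schema above) showing
that a plain lexicographic sort on full\_slug produces a threaded listing in
which replies follow their parents in posting order::

    # Hypothetical full_slug values: one root comment with two replies,
    # followed by a later root comment.
    full_slugs = [
        '34db:2012.02.08.12.21.08',
        '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16',
        '34db:2012.02.08.12.21.08/7acd:2012.02.09.10.02.41',
        '9f3a:2012.02.10.08.45.00',
    ]
    for slug in sorted(full_slugs):
        # Indent each comment by its depth in the thread.
        print('    ' * slug.count('/') + slug)
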
+ +Operations: One comment per document +------------------------------------ + +Here, we describe the various operations we might perform with the above +single comment per document schema. + +Post a new comment +~~~~~~~~~~~~~~~~~~ + +In order to post a new comment in a chronologically ordered (unthreaded) +system, all we need to do is the following: + +``slug = generate_psuedorandom_slug()`` + +``db.comments.insert({`` + +``'discussion_id': discussion_id,`` + +``'slug': slug,`` + +``'posted': datetime.utcnow(),`` + +``'author': author_info,`` + +``'text': comment_text })`` + +In the case of a threaded discussion, we have a bit more work to do in +order to generate a 'pathed' slug and full\_slug: + +``posted = datetime.utcnow()`` + +\` + +``# generate the unique portions of the slug and full_slug`` + +``slug_part = generate_psuedorandom_slug()`` + +``full_slug_part = slug_part + ':' + posted.strftime(`` + +``'%Y.%m.%d.%H.%M.%S')`` + +\` + +``# load the parent comment (if any)`` + +``if parent_slug:`` + +``parent = db.comments.find_one(`` + +``{'discussion_id': discussion_id, 'slug': parent_slug })`` + +``slug = parent['slug'] + '/' + slug_part`` + +``full_slug = parent['full_slug'] + '/' + full_slug_part`` + +``else:`` + +``slug = slug_part`` + +``full_slug = full_slug_part`` + +\` + +``# actually insert the comment`` + +``db.comments.insert({`` + +``'discussion_id': discussion_id,`` + +``'slug': slug, 'full_slug': full_slug,`` + +``'posted': posted,`` + +``'author': author_info,`` + +``'text': comment_text })`` + +View the (paginated) comments for a discussion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To actually view the comments in the non-threaded design, we need merely +to select all comments participating in a discussion, sorted by date: + +``cursor = db.comments.find({'discussion_id': discussion_id})`` + +``cursor = cursor.sort('posted')`` + +``cursor = cursor.skip(page_num * page_size)`` + +``cursor = cursor.limit(page_size)`` + +\` + +Since the full\_slug embeds both hierarchical information via the path +and chronological information, we can use a simple sort on the +full\_slug property to retrieve a threaded view: + +``cursor = db.comments.find({'discussion_id': discussion_id})`` + +``cursor = cursor.sort('full_slug')`` + +``cursor = cursor.skip(page_num * page_size)`` + +``cursor = cursor.limit(page_size)`` + +Index support +^^^^^^^^^^^^^ + +In order to efficiently support the queries above, we should maintain +two compound indexes, one on (discussion\_id, posted), and the other on +(discussion\_id, full\_slug): + +``>>> db.comments.ensure_index([`` + +``... ('discussion_id', 1), ('posted', 1)])`` + +``>>> db.comments.ensure_index([`` + +``... ('discussion_id', 1), ('full_slug', 1)])`` + +Note that we must ensure that the final element in a compound index is +the field by which we are sorting to ensure efficient performance of +these queries. + +Retrieve a comment via slug ("permalink") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here, we wish to directly retrieve a comment (e.g. *not* requiring +paging through all preceeding pages of commentary). 
In this case, we +simply use the slug: + +``comment = db.comments.find_one({`` + +``'discussion_id': discussion_id,`` + +``'slug': comment_slug})`` + +We can also retrieve a sub-discussion (a comment and all of its +descendants recursively) by performing a prefix query on the full\_slug +field: + +``subdiscussion = db.comments.find_one({`` + +``'discussion_id': discussion_id,`` + +``'full_slug': re.compile('^' + re.escape(parent_slug)) })`` + +``subdiscussion = subdiscussion.sort('full_slug')`` + +Index support +^^^^^^^^^^^^^ + +Since we already have indexes on (discussion\_id, full\_slug) to support +retrieval of subdiscussion, all we need is an index on (discussion\_id, +slug) to efficiently support retrieval of a comment by 'permalink': + +``>>> db.comments.ensure_index([`` + +``... ('discussion_id', 1), ('slug', 1)])`` + +Schema design: All comments embedded +------------------------------------ + +In this design, we wish to embed an entire discussion within its topic +document, be it a blog article, news story, or discussion thread. A +topic document, then, might look something like the following: + +\` + +``{`` + +``_id: ObjectId(…),`` + +``… lots of topic data …`` + +``comments: [`` + +``{ posted: ISODateTime(…),`` + +``author: { id: ObjectId(…), name: 'Rick' },`` + +``text: 'This is so bogus … ' },`` + +``… ]`` + +``}`` + +\` + +The format above is really only suitable for chronological display of +commentary. The comments are embedded in chronological order, with their +posting date, author, and text. Note that, since we are storing the +comments in sorted order, there is no need to maintain a slug per +comment. If we want to support threading in the embedded format, we need +to embed comments within comments: + +\` + +``{`` + +``_id: ObjectId(…),`` + +``… lots of topic data …`` + +``replies: [`` + +``{ posted: ISODateTime(…),`` + +``author: { id: ObjectId(…), name: 'Rick' },`` + +\` + +``text: 'This is so bogus … ',`` + +``replies: [`` + +``{ author: { … }, … },`` + +``… ]`` + +``}`` + +\` + +Here, we have added a 'replies' property to each comment which can hold +sub- comments and so on. One thing in particular to note about the +embedded document formats is we give up some flexibility when we embed +the documents, effectively 'baking in' the decisions we've made about +the proper display format. If we (or our users) someday wish to switch +from chronological or vice-versa, this schema makes such a migration +quite expensive. + +In popular discussions, we also have a potential issue with document +size. If we have a particularly avid discussion, for example, we may +outgrow the 16MB limit that MongoDB places on document size. We can also +run into scaling issues, particularly in the threaded design, as +documents need to be frequently moved on disk as they outgrow the space +allocated to them. + +Operations: All comments embedded +--------------------------------- + +Here, we describe the various operations we might perform with the above +single comment per document schema. Note that, in all the cases below, +we need no additional indexes since all our operations are +intra-document, and the document itself (the 'discussion') is retrieved +by its \_id field, which is automatically indexed by MongoDB. 
+ +Post a new comment +~~~~~~~~~~~~~~~~~~ + +In order to post a new comment in a chronologically ordered (unthreaded) +system, all we need to do is the following: + +``db.discussion.update(`` + +``{ 'discussion_id': discussion_id },`` + +``{ '$push': { 'comments': {`` + +``'posted': datetime.utcnow(),`` + +``'author': author_info,`` + +``'text': comment_text } } } )`` + +Note that since we use the $push operator, all the comments will be +inserted in their correct chronological order. In the case of a threaded +discussion, we have a good bit more work to do. In order to reply to a +comment, we will assume that we have the 'path' to the comment we are +replying to as a list of positions: + +``if path != []:`` + +``str_path = '.'.join('replies.%d' % part for part in path)`` + +``str_path += '.replies'`` + +``else:`` + +``str_path = 'replies'`` + +``db.discussion.update(`` + +``{ 'discussion_id': discussion_id },`` + +``{ '$push': {`` + +``str_path: {`` + +``'posted': datetime.utcnow(),`` + +``'author': author_info,`` + +``'text': comment_text } } } )`` + +\` + +Here, we first construct a field name of the form +'replies.0.replies.2...' as str\_path and then use that to $push the new +comment into its parent comment's 'replies' property. + +View the (paginated) comments for a discussion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To actually view the comments in the non-threaded design, we need to use +the $slice operator: + +``discussion = db.discussion.find_one(`` + +``{'discussion_id': discussion_id},`` + +``{ … some fields relevant to our page from the root discussion …,`` + +``'comments': { '$slice': [ page_num * page_size, page_size ] }`` + +``})`` + +\` + +If we wish to view paginated comments for the threaded design, we need +to do retrieve the whole document and paginate in our application: + +``discussion = db.discussion.find_one({'discussion_id': discussion_id})`` + +\` + +``def iter_comments(obj):`` + +``for reply in obj['replies']:`` + +``yield reply`` + +``for subreply in iter_comments(reply):`` + +``yield subreply`` + +\` + +``paginated_comments = itertools.slice(`` + +``iter_comments(discussion),`` + +``page_size * page_num,`` + +``page_size * (page_num + 1))`` + +Retrieve a comment via position or path ("permalink") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Instead of using slugs as above, here we retrieve comments by their +position in the comment list or tree. In the case of the chronological +(non-threaded) design, we need simply to use the $slice operator to +extract the correct comment: + +``discussion = db.discussion.find_one(`` + +``{'discussion_id': discussion_id},`` + +``{'comments': { '$slice': [ position, position ] } })`` + +``comment = discussion['comments'][0]`` + +\` + +In the case of the threaded design, we are faced with the task of +finding the correct path through the tree in our application: + +``discussion = db.discussion.find_one({'discussion_id': discussion_id})`` + +``current = discussion`` + +``for part in path:`` + +``current = current.replies[part]`` + +``comment = current`` + +Note that, since the replies to comments are embedded in their parents, +we have actually retrieved the entire sub-discussion rooted in the +comment we were looking for as well. 
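Since PyMongo returns plain dictionaries rather than attribute-style
objects, the path traversal above can be restated as the following sketch::

    def comment_by_path(discussion, path):
        # Walk the embedded reply tree one position at a time.
        current = discussion
        for part in path:
            current = current['replies'][part]
        return current

    discussion = db.discussion.find_one({'discussion_id': discussion_id})
    comment = comment_by_path(discussion, path)
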
+ +Schema design: Hybrid +--------------------- + +Comments in the hybrid format are stored in 'buckets' of about 100 +comments each: + +\` + +``{`` + +``_id: ObjectId(…),`` + +``discussion_id: ObjectId(…),`` + +``page: 1,`` + +``count: 42,`` + +``comments: [ {`` + +``slug: '34db',`` + +``posted: ISODateTime(…),`` + +``author: { id: ObjectId(…), name: 'Rick' },`` + +``text: 'This is so bogus … ' },`` + +``… ]`` + +``}`` + +\` + +Here, we have a 'page' of comment data, containing a bit of metadata +about the page (in particular, the page number and the comment count), +as well as the comment bodies themselves. Using a hybrid format actually +makes storing comments hierarchically quite complex, so we won't cover +it in this document. + +Note that in this design, 100 comments is a 'soft' limit to the number +of comments per page, chosen mainly for performance reasons and to +ensure that the comment page never grows beyond the 16MB limit MongoDB +imposes on document size. There may be occasions when the number of +comments is slightly larger than 100, but this does not affect the +correctness of the design. + +Operations: Hybrid +------------------ + +Here, we describe the various operations we might perform with the above +100-comment 'pages'. + +Post a new comment +~~~~~~~~~~~~~~~~~~ + +In order to post a new comment, we need to $push the comment onto the +last page and $inc its comment count. If the page has more than 100 +comments, we will insert a new page as well. For this operation, we +assume that we already have a reference to the discussion document, and +that the discussion document has a property that tracks the number of +pages: + +``page = db.comment_pages.find_and_modify(`` + +``{ 'discussion_id': discussion['_id'],`` + +``'page': discussion['num_pages'] },`` + +``{ '$inc': { 'count': 1 },`` + +``'$push': {`` + +``'comments': { 'slug': slug, … } } },`` + +``fields={'count':1},`` + +``upsert=True,`` + +``new=True )`` + +Note that we have written the find\_and\_modify above as an upsert +operation; if we don't find the page number, the find\_and\_modify will +create it for us, initialized with appropriate values for 'count' and +'comments'. Since we are limiting the number of comments per page, we +also need to create new pages as they become necessary: + +``if page['count'] > 100:`` + +``db.discussion.update(`` + +``{ 'discussion_id: discussion['_id'],`` + +``'num_pages': discussion['num_pages'] },`` + +``{ '$inc': { 'num_pages': 1 } } )`` + +Our update here includes the last know number of pages in the query to +ensure we don't have a race condition where the number of pages is +double- incremented, resulting in a nearly or totally empty page. If +some other process has incremented the number of pages in the +discussion, then update above simply does nothing. + +Index support +^^^^^^^^^^^^^ + +In order to efficiently support our find\_and\_modify and update +operations above, we need to maintain a compound index on +(discussion\_id, page) in the comment\_pages collection: + +``>>> db.comment_pages.ensure_index([`` + +``... 
('discussion_id', 1), ('page', 1)])`` + +View the (paginated) comments for a discussion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to paginate our comments with a fixed page size, we need to do +a bit of extra work in Python: + +``def find_comments(discussion_id, skip, limit):`` + +``result = []`` + +``page_query = db.comment_pages.find(`` + +``{ 'discussion_id': discussion_id },`` + +``{ 'count': 1, 'comments': { '$slice': [ skip, limit ] } })`` + +``page_query = page_query.sort('page')`` + +``for page in page_query:`` + +``result += page['comments']`` + +``skip = max(0, skip - page['count'])`` + +``limit -= len(page['comments'])`` + +``if limit == 0: break`` + +``return result`` + +\` + +Here, we use the $slice operator to pull out comments from each page, +but *only if we have satisfied our skip requirement* . An example will +help illustrate the logic here. Suppose we have 3 pages with 100, 102, +101, and 22 comments on each. respectively. We wish to retrieve comments +with skip=300 and limit=50. The algorithm proceeds as follows: + +Skip + +Limit + +Discussion + +300 + +50 + +{$slice: [ 300, 50 ] } matches no comments in page #1; subtract page +#1's count from 'skip' and continue + +200 + +50 + +{$slice: [ 200, 50 ] } matches no comments in page #2; subtract page +#2's count from 'skip' and continue + +98 + +50 + +{$slice: [ 98, 50 ] } matches 2 comments in page #3; subtract page #3's +count from 'skip' (saturating at 0), subtract 2 from limit, and continue + +0 + +48 + +{$slice: [ 0, 48 ] } matches all 22 comments in page #4; subtract 22 +from limit and continue + +0 + +26 + +There are no more pages; terminate loop + +Index support +^^^^^^^^^^^^^ + +SInce we already have an index on (discussion\_id, page) in our +comment\_pages collection, we will be able to satisfy these queries +efficiently. + +Retrieve a comment via slug ("permalink") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here, we wish to directly retrieve a comment (e.g. *not* requiring +paging through all preceeding pages of commentary). In this case, we can +use the slug to find the correct page, and then use our application to +find the correct comment: + +``page = db.comment_pages.find_one(`` + +``{ 'discussion_id': discussion_id,`` + +``'comments.slug': comment_slug},`` + +``{ 'comments': 1 })`` + +``for comment in page['comments']:`` + +``if comment['slug'] = comment_slug:`` + +``break`` + +Index support +^^^^^^^^^^^^^ + +Here, we need a new index on (discussion\_id, comments.slug) to +efficiently support retrieving the page number of the comment by slug: + +``>>> db.comment_pages.ensure_index([`` + +``... ('discussion_id', 1), ('comments.slug', 1)])`` + +Sharding +-------- + +In each of the cases above, it's likely that our discussion\_id will at +least participate in the shard key if we should choose to shard. + +In the case of the one document per comment approach, it would be nice +to use our slug (or full\_slug, in the case of threaded comments) as +part of the shard key to allow routing of requests by slug: + +``>>> db.command('shardcollection', 'comments', {`` + +``... key : { 'discussion_id' : 1, 'full_slug': 1 } })`` + +``{ "collectionsharded" : "comments", "ok" : 1 }`` + +In the case of the fully-embedded comments, of course, the discussion is +the only thing we need to shard, and its shard key will probably be +determined by concerns outside the scope of this document. 
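If nothing else dictates the choice, a sketch of sharding the
embedded-discussion collection might simply rely on MongoDB's default
``_id`` shard key, as some of the other use cases in this series do::

    # Sketch: shard the embedded-discussion collection on the default
    # _id key; an application-specific key would replace this.
    db.command('shardcollection', 'discussion')
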
+ +In the case of hybrid documents, we want to use the page number of the +comment page in our shard key: + +``>>> db.command('shardcollection', 'comment_pages', {`` + +``... key : { 'discussion_id' : 1, ``'page'``: 1 } })`` + +``{ "collectionsharded" : "comment_pages", "ok" : 1 }`` + +Page of diff --git a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt new file mode 100644 index 00000000000..2609defef2d --- /dev/null +++ b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt @@ -0,0 +1,272 @@ +E-Commerce: Category Hierarchy +============================== + +Problem +------- + +You have a product hierarchy for an e-commerce site that you want to +query frequently and update somewhat frequently. + +Solution overview +----------------- + +We will keep each category in its own document, along with a list of its +ancestors. The category hierarchy we will use in this solution will be +based on different categories of music: + +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sYoXu6LHwYVB_WXz1%20Y_k8XA&rev=27&h=250&w=443&ac=1 + :align: center + :alt: +Since categories change relatively infrequently, we will focus mostly in +this solution on the operations needed to keep the hierarchy up-to-date +and less on the performance aspects of updating the hierarchy. + +Schema design +------------- + +Each category in our hierarchy will be represented by a document. That +document will be identified by an ObjectId for internal +cross-referencing as well as a human-readable name and a url-friendly +'slug' property. Additionally, we will store an ancestors list along +with each document to facilitate displaying a category along with all +its ancestors in a single query. + +``{ "_id" : ObjectId("4f5ec858eb03303a11000002"),`` + +``"name" : "Modal Jazz",`` + +``"parent" : ObjectId("4f5ec858eb03303a11000001"),`` + +``"slug" : "modal-jazz",`` + +``"ancestors" : [`` + +``{ "_id" : ObjectId("4f5ec858eb03303a11000001"),`` + +``"slug" : "bop",`` + +``"name" : "Bop" },`` + +``{ "_id" : ObjectId("4f5ec858eb03303a11000000"),`` + +``"slug" : "ragtime",`` + +``"name" : "Ragtime" } ]`` + +``}`` + +Operations +---------- + +Here, we will describe the various queries and updates we will use +during the lifecycle of our hierarchy. + +Read and display a category +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The simplest operation is reading and displaying a hierarchy. In this +case, we might want to display a category along with a list of 'bread +crumbs' leading back up the hierarchy. In an E-commerce site, we will +most likely have the slug of the category available for our query. + +``category = db.categories.find(`` + +``{'slug':slug},`` + +``{'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 })`` + +\` + +Here, we use the slug to retrieve the category and retrieve only those +fields we wish to display. + +Index Support +^^^^^^^^^^^^^ + +In order to support this common operation efficiently, we need an index +on the 'slug' field. Since slug is also intended to be unique, we will +add that constraint to our index as well: + +``db.categories.ensure_index('slug', unique=True)`` + +Add a category to the hierarchy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Adding a category to a hierarchy is relatively simple. Suppose we wish +to add a new category 'Swing' as a child of 'Ragtime': + +.. 
figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sRXRjZMEZDN2azKBl%20sOoXoA&rev=7&h=250&w=443&ac=1 + :align: center + :alt: +In this case, the initial insert is simple enough, but after this +insert, we are still missing the ancestors array in the 'Swing' +category. To define this, we will add a helper function to build our +ancestor list: + +\` + +``def build_ancestors(_id, parent_id):`` + +``parent = db.categories.find_one(`` + +``{'_id': parent_id},`` + +``{'name': 1, 'slug': 1, 'ancestors':1})`` + +``parent_ancestors = parent.pop('ancestors')`` + +``ancestors = [ parent ] + parent_ancestors`` + +``db.categories.update(`` + +``{'_id': _id},`` + +``{'$set': { 'ancestors': ancestors } })`` + +Note that we only need to travel one level in our hierarchy to get the +ragtime's ancestors and build swing's entire ancestor list. Now we can +actually perform the insert and rebuild the ancestor list: + +``doc = dict(name='Swing', slug='swing', parent=ragtime_id)`` + +``swing_id = db.categories.insert(doc)`` + +``build_ancestors(swing_id, ragtime_id)`` + +\` + +Index Support +^^^^^^^^^^^^^ + +Since these queries and updates all selected based on \_id, we only need +the default MongoDB-supplied index on \_id to support this operation +efficiently. + +Change the ancestry of a category +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Our goal here is to reorganize the hierarchy by moving 'bop' under +'swing': + +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sFB8ph8n7c768f-%20MLTOkY-w&rev=6&h=354&w=443&ac=1 + :align: center + :alt: +The initial update is straightforward: + +``db.categories.update(`` + +``{'_id':bop_id}, {'$set': { 'parent': swing_id } } )`` + +Now, we need to update the ancestor list for bop and all its +descendants. In this case, we can't guarantee that the ancestor list of +the parent category is always correct, however (since we may be +processing the categories out-of- order), so we will need a new +ancestor-building function: + +\` + +``def build_ancestors_full(_id, parent_id):`` + +``ancestors = []`` + +``while parent_id is not None:`` + +``parent = db.categories.find_one(`` + +``{'_id': parent_id},`` + +``{'parent': 1, 'name': 1, 'slug': 1, 'ancestors':1})`` + +``parent_id = parent.pop('parent')`` + +``ancestors.append(parent)`` + +``db.categories.update(`` + +``{'_id': _id},`` + +``{'$set': { 'ancestors': ancestors } })`` + +\` + +Now, at the expense of a few more queries up the hierarchy, we can +easily reconstruct all the descendants of 'bop': + +``for cat in db.categories.find(`` + +``{'ancestors._id': bop_id},`` + +``{'parent_id': 1}):`` + +``build_ancestors_full(cat['_id'], cat['parent_id'])`` + +Index Support +^^^^^^^^^^^^^ + +In this case, an index on 'ancestors.\_id' would be helpful in +determining which descendants need to be updated: + +\` + +``db.categories.ensure_index('ancestors._id')`` + +Renaming a category +~~~~~~~~~~~~~~~~~~~ + +Renaming a category would normally be an extremely quick operation, but +in this case due to our denormalization, we also need to update the +descendants. Here, we will rename 'Bop' to 'BeBop': + +.. 
figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sqRIXKA2lGr_bm5ys%20M7KWQA&rev=3&h=354&w=443&ac=1 + :align: center + :alt: +First, we need to update the category name itself: + +\` + +``db.categories.update(`` + +``{'_id':bop_id}, {'$set': { 'name': 'BeBop' } } )`` + +Next, we need to update each descendant's ancestors list: + +``db.categories.update(`` + +``{'ancestors._id': bop_id},`` + +``{'$set': { 'ancestors.$.name': 'BeBop' } },`` + +``multi=True)`` + +\` + +Here, we use the positional operation '$' to match the exact 'ancestor' +entry that matches our query, as well as the 'multi' option on our +update to ensure the rename operation occurs in a single server +round-trip. + +Index Support +^^^^^^^^^^^^^ + +In this case, the index we have already defined on 'ancestors.\_id' is +sufficient to ensure good performance. + +Sharding +-------- + +In this solution, it is unlikely that we would want to shard the +collection since it's likely to be quite small. If we *should* decide to +shard, the use of an \_id field for most of our updates makes \_id an +ideal sharding candidate. The sharding commands we would use to shard +the category collection would then be the following: + +``>>> db.command('shardcollection', 'categories')`` + +``{ "collectionsharded" : "categories", "ok" : 1 }`` + +\` + +Note that there is no need to specify the shard key, as MongoDB will +default to using \_id as a shard key. + +Page of diff --git a/source/tutorial/usecase/ecommerce-_inventory_management.txt b/source/tutorial/usecase/ecommerce-_inventory_management.txt new file mode 100644 index 00000000000..fe3d6fb4375 --- /dev/null +++ b/source/tutorial/usecase/ecommerce-_inventory_management.txt @@ -0,0 +1,537 @@ +E-Commerce: Inventory Management +================================ + +Problem +------- + +You have a product catalog and you would like to maintain an accurate +inventory count as users shop your online store, adding and removing +things from their cart. + +Solution overview +----------------- + +In an ideal world, consumers would begin browsing an online store, add +items to their shopping cart, and proceed in a timely manner to checkout +where their credit cards would always be successfully validated and +charged. In the real world, however, customers often add or remove items +from their shopping cart, change quantities, abandon the cart, and have +problems at checkout time. + +In this solution, we will keep the metaphor of the shopping cart, but +the shopping cart will *age* . Once a shopping cart has not been active +for a certain period of time, all the items in the cart once again +become part of available inventory and the cart is cleared. The state +transition diagram for a shopping cart is below: + +.. 
figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sDw93URlN8GCsdNpA%20CSXCVA&rev=76&h=186&w=578&ac=1 + :align: center + :alt: +Schema design +------------- + +In our inventory collection, we will maintain the current available +inventory of each stock-keeping unit (SKU) as well as a list of 'carted' +items that may be released back to available inventory if their shopping +cart times out: + +``{`` + +``_id: '00e8da9b',`` + +``qty: 16,`` + +``carted: [`` + +``{ qty: 1, cart_id: 42,`` + +``timestamp: ISODate("2012-03-09T20:55:36Z"), },`` + +``{ qty: 2, cart_id: 43,`` + +``timestamp: ISODate("2012-03-09T21:55:36Z"), },`` + +``]`` + +``}`` + +(Note that, while in an actual implementation, we might choose to merge +this schema with the product catalog schema described in "E-Commerce: +Product Catalog", we've simplified the inventory schema here for +brevity.) If we continue the metaphor of the brick-and-mortar store, +then our SKU has 16 items on the shelf, 1 in one cart, and 2 in another +for a total of 19 unsold items of merchandise. + +For our shopping cart model, we will maintain a list of (sku, quantity, +price) line items: + +``{`` + +``_id: 42,`` + +``last_modified: ISODate("2012-03-09T20:55:36Z"),`` + +``status: 'active',`` + +``items: [`` + +``{ sku: '00e8da9b', qty: 1, item_details: {...} },`` + +``{ sku: '0ab42f88', qty: 4, item_details: {...} }`` + +``]`` + +``}`` + +Note in the cart model that we have included item details in each line +item. This allows us to display the contents of the cart to the user +without needing a second query back to the catalog collection to display +the details. + +Operations +---------- + +Here, we will describe the various inventory-related operations we will +perform during the course of operation. + +Add an item to a shopping cart +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Our most basic operation is moving an item off the 'shelf' in to the +'cart'. Our constraint is that we would like to guarantee that we never +move an unavailable item off the shelf into the cart. To solve this +problem, we will ensure that inventory is only updated if there is +sufficient inventory to satisfy the request: + +``def add_item_to_cart(cart_id, sku, qty, details):`` + +``now = datetime.utcnow()`` + +\` + +``# Make sure the cart is still active and add the line item`` + +``result = db.cart.update(`` + +``{'_id': cart_id, 'status': 'active' },`` + +``{ '$set': { 'last_modified': now },`` + +``'$push':`` + +``'items': {'sku': sku, 'qty':qty, 'details': details }`` + +``},`` + +``safe=True)`` + +``if not result['updatedExisting']:`` + +``raise CartInactive()`` + +\` + +``# Update the inventory`` + +``result = db.inventory.update(`` + +``{'_id':sku, 'qty': {'$gte': qty}},`` + +````{'$inc': {'qty': -qty}``[a]``,`` + +``'$push': {`` + +``'carted': { 'qty': qty, 'cart_id':cart_id,`` + +``'timestamp': now } } },`` + +``safe=True)`` + +``if not result['updatedExisting']:`` + +``# Roll back our cart update`` + +``db.cart.update(`` + +``{'_id': cart_id },`` + +``{ '$pull': { 'items': {'sku': sku } } }`` + +``)`` + +``raise InadequateInventory()`` + +Note here in particular that we do not trust that the request is +satisfiable. Our first check makes sure that the cart is still 'active' +(more on inactive carts below) before adding a line item. Our next check +verifies that sufficient inventory exists to satisfy the request before +decrementing inventory. 
In the case of inadequate inventory, we +*compensate* for the non- transactional nature of MongoDB by removing +our cart update. Using safe=True and checking the result in the case of +these two updates allows us to report back an error to the user if the +cart has become inactive or available quantity is insufficient to +satisfy the request. + +Index support +^^^^^^^^^^^^^ + +To support this query efficiently, all we really need is an index on +\_id, which MongoDB provides us by default. + +Modifying the quantity in the cart +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here, we want to allow the user to adjust the quantity of items in their +cart. We must make sure that when they adjust the quantity upward, there +is sufficient inventory to cover the quantity, as well as updating the +particular 'carted' entry for the user's cart. + +\` + +``def update_quantity(cart_id, sku, old_qty, new_qty):`` + +``now = datetime.utcnow()`` + +``delta_qty = new_qty - old_qty`` + +\` + +``# Make sure the cart is still active and add the line item`` + +``result = db.cart.update(`` + +``{'_id': cart_id, 'status': 'active', 'items.sku': sku },`` + +``{'$set': {`` + +``'last_modified': now,`` + +``'items.$.qty': new_qty },`` + +``},`` + +``safe=True)`` + +``if not result['updatedExisting']:`` + +``raise CartInactive()`` + +\` + +``# Update the inventory`` + +``result = db.inventory.update(`` + +``{'_id':sku,`` + +``'carted.cart_id': cart_id,`` + +``'qty': {'$gte': delta_qty} },`` + +``{'$inc': {'qty': -delta_qty },`` + +``'$set': { 'carted.$.qty': new_qty, 'timestamp': now } },`` + +``safe=True)`` + +``if not result['updatedExisting']:`` + +``# Roll back our cart update`` + +``db.cart.update(`` + +``{'_id': cart_id, 'items.sku': sku },`` + +``{'$set': { 'items.$.qty': old_qty }`` + +``})`` + +``raise InadequateInventory()`` + +Note in particular here that we are using the positional operator '$' to +update the particular 'carted' entry and line item that matched for our +query. This allows us to update the inventory and keep track of the data +we need to 'rollback' the cart in a single atomic operation. We will +also ensure the cart is active and timestamp it as in the case of adding +items to the cart. + +Index support +^^^^^^^^^^^^^ + +To support this query efficiently, all we really need is an index on +\_id, which MongoDB provides us by default. + +Checking out +~~~~~~~~~~~~ + +During checkout, we want to validate the method of payment and remove +the various 'carted' items after the transaction has succeeded. + +``def checkout(cart_id):`` + +``now = datetime.utcnow()`` + +``# Make sure the cart is still active and set to 'pending'. 
Also`` + +``# fetch the cart details so we can calculate the checkout price`` + +``cart = db.cart.find_and_modify(`` + +``{'_id': cart_id, 'status': 'active' },`` + +``update={'$set': { 'status': 'pending','last_modified': now } } )`` + +``if cart is None:`` + +``raise CartInactive()`` + +\` + +``# Validate payment details; collect payment`` + +``if payment_is_successful(cart):`` + +``db.cart.update(`` + +``{'_id': cart_id },`` + +``{'$set': { 'status': 'complete' } } )`` + +``db.inventory.update(`` + +``{'carted.cart_id': cart_id},`` + +``{'$pull': {'cart_id': cart_id} },`` + +``multi=True)`` + +``else:`` + +``db.cart.update(`` + +``{'_id': cart_id },`` + +``{'$set': { 'status': 'active' } } )`` + +``raise PaymentError()`` + +Here, we first 'lock' the cart by setting its status to 'pending' +(disabling any modifications) and then collect payment data, verifying +at the same time that the cart is still active. We use MongoDB's +'findAndModify' command to atomically update the cart and return its +details so we can capture payment information. If the payment is +successful, we remove the 'carted' items from individual items' +inventory and set the cart to 'complete'. If payment is unsuccessful, we +unlock the cart by setting its status back to 'active' and report a +payment error. + +Index support +^^^^^^^^^^^^^ + +To support this query efficiently, all we really need is an index on +\_id, which MongoDB provides us by default. + +Returning timed-out items to inventory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Periodically, we want to expire carts that have been inactive for a +given number of seconds, returning their line items to available +inventory: + +``def expire_carts(timeout):`` + +``now = datetime.utcnow()`` + +``threshold = now - timedelta(seconds=timeout)`` + +``# Lock and find all the expiring carts`` + +``db.cart.update(`` + +``{'status': 'active', 'last_modified': { '$lt': threshold } },`` + +``{'$set': { 'status': 'expiring' } },`` + +``multi=True )`` + +``# Actually expire each cart`` + +``for cart in db.cart.find({'status': 'expiring'}):`` + +``# Return all line items to inventory`` + +``for item in cart['items']:`` + +``db.inventory.update(`` + +``{ '_id': item['sku'],`` + +``'carted.cart_id': cart['id'],`` + +``'carted.qty': item['qty']`` + +``},`` + +``{'$inc': { 'qty': item['qty'] },`` + +``'$pull': { 'carted': { 'cart_id': cart['id'] } } })`` + +``db.cart.update(`` + +``{'_id': cart['id'] },`` + +``{'$set': { status': 'expired' })`` + +Here, we first find all carts to be expired and then, for each cart, +return its items to inventory. Once all items have been returned to +inventory, the cart is moved to the 'expired' state. + +Index support +^^^^^^^^^^^^^ + +In this case, we need to be able to efficiently query carts based on +their status and last\_modified values, so an index on these would help +the performance of our periodic expiration process: + +``>>> db.cart.ensure_index([('status', 1), ('last_modified', 1)])`` + +\` + +Note in particular the order in which we defined the index: in order to +efficiently support range queries ('$lt' in this case), the ranged item +must be the last item in the index. Also note that there is no need to +define an index on the 'status' field alone, as any queries for status +can use the compound index we have defined here. 
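A trivial driver for the expiration process might look like the following
sketch; the fifteen-minute cart timeout and one-minute polling interval are
illustrative values only::

    import time

    CART_TIMEOUT = 15 * 60   # seconds a cart may stay idle (assumed value)

    def run_expiration_worker():
        # Periodically expire idle carts using the expire_carts()
        # function defined above.
        while True:
            expire_carts(CART_TIMEOUT)
            time.sleep(60)
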
Error Handling
~~~~~~~~~~~~~~

There is one failure mode above that we have not handled adequately: an
exception that occurs after updating the inventory collection but before
updating the shopping cart. The result of this failure mode is a shopping
cart that may be absent or expired while the 'carted' items in the
inventory have not been returned to available inventory. To account for
this case, we will run a cleanup method periodically that finds old
'carted' items and checks the status of their carts::

    def cleanup_inventory(timeout):
        now = datetime.utcnow()
        threshold = now - timedelta(seconds=timeout)

        # Find all documents with expiring carted items
        for item in db.inventory.find(
                {'carted.timestamp': {'$lt': threshold}}):

            # Find all the carted items that matched
            carted = dict(
                (carted_item['cart_id'], carted_item)
                for carted_item in item['carted']
                if carted_item['timestamp'] < threshold)

            # Find any carts that are still active and refresh their
            # carted entries rather than returning them to inventory
            for cart in db.cart.find(
                    {'_id': {'$in': list(carted)},
                     'status': 'active'}):
                cart_id = cart['_id']
                db.inventory.update(
                    {'_id': item['_id'],
                     'carted.cart_id': cart_id},
                    {'$set': {'carted.$.timestamp': now}})
                del carted[cart_id]

            # All the carted items left in the dict belong to inactive
            # carts and must now be returned to inventory
            for cart_id, carted_item in carted.items():
                db.inventory.update(
                    {'_id': item['_id'],
                     'carted.cart_id': cart_id,
                     'carted.qty': carted_item['qty']},
                    {'$inc': {'qty': carted_item['qty']},
                     '$pull': {'carted': {'cart_id': cart_id}}})

Note that the function above is safe: it verifies that a cart is no longer
active before removing its items from the 'carted' list and returning them
to inventory. It can, however, be slow and can slow down other updates and
queries, so it should be run infrequently.

Sharding
--------

If we choose to shard this system, the use of an \_id field for most of our
updates makes \_id an ideal sharding candidate, for both the cart and
inventory collections. Using \_id as our shard key allows all updates that
query on \_id to be routed to a single mongod process. There are two
potential drawbacks to using \_id as a shard key, however:

- If the cart collection's \_id is generated in a generally increasing
  order, new carts will all initially be assigned to a single shard.
- Cart expiration and inventory adjustment require several broadcast
  queries and updates if \_id is used as a shard key.

We can mitigate the first pitfall by choosing a random value (perhaps the
sha-1 hash of an ObjectId) as the \_id of each cart as it is created. The
second objection is valid but relatively unimportant, as the expiration
process is infrequent and can in fact be slowed down by the judicious use
of sleep() calls in order to minimize server load.

The sharding commands we would use to shard the cart and inventory
collections, then, would be the following::

    >>> db.command('shardcollection', 'inventory')
    { "collectionsharded" : "inventory", "ok" : 1 }
    >>> db.command('shardcollection', 'cart')
    { "collectionsharded" : "cart", "ok" : 1 }

Note that there is no need to specify the shard key, as MongoDB will
default to using \_id as the shard key.
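As a sketch of the random-\_id mitigation mentioned above (hashing an
ObjectId with sha-1 is one possible choice, not a requirement)::

    import hashlib
    from datetime import datetime
    from bson import ObjectId

    def new_cart_id():
        # Hash a freshly generated ObjectId so that new cart _ids are
        # spread evenly across shards rather than increasing monotonically.
        return hashlib.sha1(str(ObjectId()).encode('utf-8')).hexdigest()

    db.cart.insert({
        '_id': new_cart_id(),
        'status': 'active',
        'last_modified': datetime.utcnow(),
        'items': []})
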
+ +Page of + +[a]jsr: + +Actually isn't a $dec command. Just $inc by a negative value. Some +drivers seem to have added $dec as a helper, but probably shouldn't :) + +-------------- + +rick446: + +fixed diff --git a/source/tutorial/usecase/ecommerce-_product_catalog.txt b/source/tutorial/usecase/ecommerce-_product_catalog.txt new file mode 100644 index 00000000000..76e2d2d4d04 --- /dev/null +++ b/source/tutorial/usecase/ecommerce-_product_catalog.txt @@ -0,0 +1,761 @@ +E-Commerce: Product Catalog +=========================== + +Problem +------- + +You have a product catalog that you would like to store in MongoDB with +products of various types and various relevant attributes. + +Solution overview +----------------- + +In the relational database world, there are several solutions of varying +performance characteristics used to solve this problem. In this section +we will examine a few options and then describe the solution that +MongoDB enables. + +One approach ("concrete table inheritance") to solving this problem is +to create a table for each product category: + +\` + +``CREATE TABLE``product\_audio\_album``(`` + +```sku`` char(8) NOT NULL,\` + +``…`` + +```artist`` varchar(255) DEFAULT NULL,\` + +```genre_0`` varchar(255) DEFAULT NULL,\` + +```genre_1`` varchar(255) DEFAULT NULL,\` + +``…,`` + +``PRIMARY KEY(``sku``))`` + +``…`` + +``CREATE TABLE``product\_film``(`` + +```sku`` char(8) NOT NULL,\` + +``…`` + +```title`` varchar(255) DEFAULT NULL,\` + +```rating`` char(8) DEFAULT NULL,\` + +``…,`` + +``PRIMARY KEY(``sku``))`` + +``…`` + +\` + +The main problem with this approach is a lack of flexibility. Each time +we add a new product category, we need to create a new table. +Furthermore, queries must be tailored to the exact type of product +expected. + +Another approach ("single table inheritance") would be to use a single +table for all products and add new columns each time we needed to store +a new type of product: + +\` + +``CREATE TABLE``product``(`` + +```sku`` char(8) NOT NULL,\` + +``…`` + +```artist`` varchar(255) DEFAULT NULL,\` + +```genre_0`` varchar(255) DEFAULT NULL,\` + +```genre_1`` varchar(255) DEFAULT NULL,\` + +``…`` + +```title`` varchar(255) DEFAULT NULL,\` + +```rating`` char(8) DEFAULT NULL,\` + +``…,`` + +``PRIMARY KEY(``sku``))`` + +\` + +This is more flexible, allowing us to query across different types of +product, but it's quite wasteful of space. One possible space +optimization would be to name our columns generically (str\_0, str\_1, +etc), but then we lose visibility into the meaning of the actual data in +the columns. 
+ +Multiple table inheritance is yet another approach where we represent +common attributes in a generic 'product' table and the variations in +individual category product tables: + +``CREATE TABLE``product``(`` + +```sku`` char(8) NOT NULL,\` + +```title`` varchar(255) DEFAULT NULL,\` + +```description`` varchar(255) DEFAULT NULL,\` + +```price`` …,\` + +``PRIMARY KEY(``sku``))`` + +\` + +``CREATE TABLE``product\_audio\_album``(`` + +```sku`` char(8) NOT NULL,\` + +``…`` + +```artist`` varchar(255) DEFAULT NULL,\` + +```genre_0`` varchar(255) DEFAULT NULL,\` + +```genre_1`` varchar(255) DEFAULT NULL,\` + +``…,`` + +``PRIMARY KEY(``sku``),`` + +``FOREIGN KEY(``sku``) REFERENCES``product``(``sku``))`` + +``…`` + +``CREATE TABLE``product\_film``(`` + +```sku`` char(8) NOT NULL,\` + +``…`` + +```title`` varchar(255) DEFAULT NULL,\` + +```rating`` char(8) DEFAULT NULL,\` + +``…,`` + +``PRIMARY KEY(``sku``),`` + +``FOREIGN KEY(``sku``) REFERENCES``product``(``sku``))`` + +``…`` + +This is more space-efficient than single-table inheritance and somewhat +more flexible than concrete-table inheritance, but it does require a +minimum of one join to actually obtain all the attributes relevant to a +product. + +Entity-attribute-value schemas are yet another solution, basically +creating a meta-model for your product data. In this approach, you +maintain a table with (entity\_id, attribute\_id, value) triples that +describe your product. For instance, suppose you are describing an audio +album. In that case you might have a series of rows representing the +following relationships: + +**Entity** +**Attribute** +**Value** + +sku\_00e8da9b + +type + +Audio Album + +sku\_00e8da9b + +title + +A Love Supreme + +sku\_00e8da9b + +… + +… + +sku\_00e8da9b + +artist + +John Coltrane + +sku\_00e8da9b + +genre + +Jazz + +sku\_00e8da9b + +genre + +General + +… + +… + +… + +This schema has the advantage of being completely flexible; any entity +can have any set of any attributes. New product categories do not +require *any* changes in the DDL for your database. The downside to this +schema is that any nontrivial query requires large numbers of join +operations, which results in a large performance penalty. + +One other approach that has been used in relational world is to "punt" +so to speak on the product details and serialize them all into a BLOB +column. The problem with this approach is that the details become +difficult to search and sort by. (One exception is with Oracle's XMLTYPE +columns, which actually resemble a NoSQL document database.) + +Our approach in MongoDB will be to use a single collection to store all +the product data, similar to single-table inheritance. Due to MongoDB's +dynamic schema, however, we need not conform each document to the same +schema. This allows us to tailor each product's document to only contain +attributes relevant to that product category. + +Schema design +------------- + +Our schema will contain general product information that needs to be +searchable across all products at the beginning of each document, with +properties that vary from category to category encapsulated in a +'details' property. 
Thus an audio album might look like the following: + +\` + +``{`` + +``sku: "00e8da9b",`` + +``type: "Audio Album",`` + +``title: "A Love Supreme",`` + +``description: "by John Coltrane",`` + +``asin: "B0000A118M",`` + +\` + +``shipping: {`` + +``weight: 6,`` + +``dimensions: {`` + +``width: 10,`` + +``height: 10,`` + +``depth: 1`` + +``},`` + +``},`` + +\` + +``pricing: {`` + +``list: 1200,`` + +``retail: 1100,`` + +``savings: 100,`` + +``pct_savings: 8`` + +``},`` + +\` + +``details: {`` + +``title: "A Love Supreme [Original Recording Reissued]",`` + +``artist: "John Coltrane",`` + +``genre: [ "Jazz", "General" ],`` + +``…`` + +``tracks: [`` + +``"A Love Supreme Part I: Acknowledgement",`` + +``"A Love Supreme Part II - Resolution",`` + +``"A Love Supreme, Part III: Pursuance",`` + +``"A Love Supreme, Part IV-Psalm"`` + +``],`` + +``},`` + +``}`` + +\` + +A movie title would have the same fields stored for general product +information, shipping, and pricing, but have quite a different details +attribute: + +``{`` + +``sku: "00e8da9d",`` + +``type: "Film",`` + +``…`` + +``asin: "B000P0J0AQ",`` + +\` + +``shipping: { … },`` + +\` + +``pricing: { … },`` + +\` + +``details: {`` + +``title: "The Matrix",`` + +``director: [ "Andy Wachowski", "Larry Wachowski" ],`` + +``writer: [ "Andy Wachowski", "Larry Wachowski" ],`` + +``…`` + +``aspect_ratio: "1.66:1"`` + +``},`` + +``}`` + +\` + +Another thing to note in the MongoDB schema is that we can have +multi-valued attributes without any arbitrary restriction on the number +of attributes (as we might have if we had ``genre_0`` and ``genre_1`` +columns in a relational database, for instance) and without the need for +a join (as we might have if we normalize the many-to-many "genre" +relation). + +Operations +---------- + +We will be using the product catalog mainly to perform search +operations. Thus our focus in this section will be on the various types +of queries we might want to support in an e-commerce site. These +examples will be written in the Python programming language using the +pymongo driver, but other language/driver combinations should be +similar. + +Find all jazz albums, sorted by year produced[a] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here, we would like to see a group of products with a particular genre, +sorted by the year in which they were produced: + +``query = db.products.find({'type':'Audio Album',`` + +``'details.genre': 'jazz'})`` + +``query = query.sort([('details.issue_date', -1)])`` + +Index support +^^^^^^^^^^^^^ + +In order to efficiently support this type of query, we need to create a +compound index on all the properties used in the filter and in the sort: + +\` + +``db.products.ensure_index([`` + +``('type', 1),`` + +``('details.genre', 1),`` + +``('details.issue_date', -1)])`` + +\` + +Again, notice that the final component of our index is the sort field. + +Find all products sorted by percentage discount descending +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +While most searches would be for a particular type of product (audio +album or movie, for instance), there may be cases where we would like to +find all products in a certain price range, perhaps for a 'best daily +deals' of our website. 
In this case, we will use the pricing information +that exists in all products to find the products with the highest +percentage discount: + +\` + +``query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 })`` + +``query = query.sort([('pricing.pct_savings', -1)])`` + +Index support +^^^^^^^^^^^^^ + +In order to efficiently support this type of query, we need to have an +index on the percentage savings: + +\` + +\`db.products.ensure\_index('pricing.pct\_savings') + +\` + +Since the index is only on a single key, it does not matter in which +order the index is sorted. Note that, had we wanted to perform a range +query (say all products over $25 retail) and sort by another property +(perhaps percentage savings), MongoDB would not have been able to use an +index as effectively. Range queries or sorts must always be the *last* +property in a compound index in order to avoid scanning entirely. Thus +using a different property for a range query and a sort requires some +degree of scanning, slowing down your query. + +Find all movies in which Keanu Reeves acted +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this case, we want to search inside the details of a particular type +of product (a movie) to find all movies containing Keanu Reeves, sorted +by date descending: + +\` + +``query = db.products.find({'type': 'Film',`` + +``'details.actor': 'Keanu Reeves'})`` + +``query = query.sort([('details.issue_date', -1)])`` + +Index support +^^^^^^^^^^^^^ + +Here, we wish to once again index by type first, followed the details +we're interested in: + +\` + +``db.products.ensure_index([`` + +``('type', 1),`` + +``('details.actor', 1),`` + +``('details.issue_date', -1)])`` + +\` + +And once again, the final component of our index is the sort field. + +\*\* +**Find all movies with the word "hacker" in the title** [b] + +Those experienced with relational databases may shudder at this +operation, since it implies an inefficient LIKE query. In fact, without +a full-text search engine, some scanning will always be required to +satisfy this query. In the case of MongoDB, we will use a regular +expression. First, we will see how we might do this using Python's re +module: + +``import re`` + +``re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE)`` + +\` + +``query = db.products.find({'type': 'Film', 'title': re_hacker})`` + +``query = query.sort([('details.issue_date', -1)])`` + +\` + +Although this is fairly convenient, MongoDB also gives us the option to +use a special syntax in our query instead of importing the Python re +module: + +\` + +``query = db.products.find({`` + +``'type': 'Film',`` + +``'title': {'$regex': '.*hacker.*', '$options':'i'}})`` + +``query = query.sort([('details.issue_date', -1)])`` + +Index support +^^^^^^^^^^^^^ + +Here, we will diverge a bit from our typical index order: + +\` + +``db.products.ensure_index([`` + +``('type', 1),`` + +``('details.issue_date', -1),`` + +``('title', 1)])`` + +\` + +You may be wondering why we are including the title field in the index +if we have to scan anyway. The reason is that there are two types of +scans: index scans and document scans. Document scans require entire +documents to be loaded into memory, while index scans only require index +entries to be loaded. So while an index scan on title isn't as efficient +as a direct lookup, it is certainly faster than a document scan. + +The order in which we include our index keys is also different than what +you might expect. This is once again due to the fact that we are +scanning. 
Since our results need to be in sorted order by +'details.issue\_date', we should make sure that's the order in which +we're scanning titles. You can observe the difference looking at the +query plans we get for different orderings. If we use the (type, title, +details.issue\_date) index, we get the following plan: + +``{u'allPlans': [...],`` + +``u'cursor': u'BtreeCursor type_1_title_1_details.issue_date_-1 multi',`` + +``u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1},`` + +``{u'$minElement': 1}]],`` + +``u'title': [[u'', {}],`` + +``[<_sre.SRE_Pattern object at 0x2147cd8>,`` + +``<_sre.SRE_Pattern object at 0x2147cd8>]],`` + +``u'type': [[u'Film', u'Film']]},`` + +``u'indexOnly': False,`` + +``u'isMultiKey': False,`` + +``u'millis': 208,`` + +``u'n': 0,`` + +``u'nChunkSkips': 0,`` + +``u'nYields': 0,`` + +``u'nscanned': 10000,`` + +``u'nscannedObjects': 0,`` + +``u'scanAndOrder': True}`` + +\` + +If, however, we use the (type, details.issue\_date, title) index, we get +the following plan: + +\` + +``{u'allPlans': [...],`` + +``u'cursor': u'BtreeCursor type_1_details.issue_date_-1_title_1 multi',`` + +``u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1},`` + +``{u'$minElement': 1}]],`` + +``u'title': [[u'', {}],`` + +``[<_sre.SRE_Pattern object at 0x2147cd8>,`` + +``<_sre.SRE_Pattern object at 0x2147cd8>]],`` + +``u'type': [[u'Film', u'Film']]},`` + +``u'indexOnly': False,`` + +``u'isMultiKey': False,`` + +``u'millis': 157,`` + +``u'n': 0,`` + +``u'nChunkSkips': 0,`` + +``u'nYields': 0,`` + +``u'nscanned': 10000,`` + +``u'nscannedObjects': 0}`` + +\` + +The two salient features to note are a) the absence of the +'scanAndOrder: True' in the optmal query and b) the difference in time +(208ms for the suboptimal query versus 157ms for the optimal [c]one). +The lesson learned here is that if you absolutely have to scan, you +should make the elements you're scanning the *least* significant part of +the index (even after the sort). + +Sharding[d] +----------- + +Though our performance in this system is highly dependent on the indexes +we maintain, sharding can enhance that performance further by allowing +us to keep larger portions of those indexes in RAM. In order to maximize +our read scaling, we would also like to choose a shard key that allows +mongos to route queries to only one or a few shards rather than all the +shards globally. + +Since most of the queries in our system include type, we should probably +also include that in our shard key. You may note that most of the +queries also included 'details.issue\_date', so there may be a +temptation to include it in our shard key, but this actually wouldn't +help us much since none of the queries were *selective* by date. + +Since our schema is so flexible, it's hard to say *a priori* what the +ideal shard key would be, but a reasonable guess would be to include the +'type' field, one or more detail fields that are commonly queried, and +one final random-ish field to ensure we don't get large unsplittable +chunks. For this example, we will assume that 'details.genre' is our +second-most queried field after 'type', and thus our sharding setup +would be as follows: + +``>>> db.command('shardcollection', 'product', {`` + +``... 
key : { 'type': 1, 'details.genre' : 1, 'sku':1 } })`` + +``{ "collectionsharded" : "details.genre", "ok" : 1 }`` + +One important note here is that, even if we choose a shard key that +requires all queries to be broadcast to all shards, we still get some +benefits from sharding due to a) the larger amount of memory available +to store our indexes and b) the fact that searches will be parallelized +across shards, reducing search latency. + +Scaling Queries with read\_preference +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Although sharding is the best way to scale reads and writes, it's not +always possible to partition our data so that the queries can be routed +by mongos to a subset of shards. In this case, mongos will broadcast the +query to all shards and then accumulate the results before returning to +the client. In cases like this, we can still scale our query performance +by allowing mongos to read from the secondary servers in a replica set. +This is achieved via the 'read\_preference' argument, and can be set at +the connection or individual query level. For instance, to allow all +reads on a connection to go to a secondary, the syntax is: + +``conn = pymongo.Connection(read_preference=pymongo.SECONDARY)`` + +or + +``conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY)`` + +\` + +In the first instance, reads will be distributed among all the +secondaries and the primary, whereas in the second reads will only be +sent to the secondary. To allow queries to go to a secondary on a +per-query basis, we can also specify a read\_preference: + +``results = db.product.find(..., read_preference=pymongo.SECONDARY)`` + +or + +``results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY)`` + +\` + +It is important to note that reading from a secondary can introduce a +lag between when inserts and updates occur and when they become visible +to queries. In the case of a product catalog, however, where queries +happen frequently and updates happen infrequently, such eventual +consistency (updates visible within a few seconds but not immediately) +is usually tolerable. + +Page of + +[a]jsr: + +This might make more sense as the first operation. The "sorted by +discount" feels like a secondary use case.. still include it, but maybe +lower down in the TOC. + +-------------- + +rick446: + +there you go + +[b]jsr: + +See note below about scatter gather queries. Might want to add slaveOk +flag on these and talk about why it's okay with this model. Don't always +need consistent reads. Better to do slaveOk so you can get some more +scale out of scatter gather queries. + +-------------- + +rick446: + +See response below. + +[c]jsr: + +This doesn't seem like a big difference. I think that the scanAndOrder +is more important. + +-------------- + +rick446: + +Hm, well it's not a big *absolute* difference, but I'd expect it to grow +as your data size increased. The query time is the only thing we're +*actually* interested in IMO (the scanAndOrder being interesting because +it's the cause of the slow query) + +[d]jsr: + +With this model, another consideration might be parallelized searches. +For example, if you're sharded on, say genre, but you want to query for +all albums by coltrane, you'll do a scatter gather query. Maybe include +a discussion of scatter gather queries and the fact that you can add +replicas to scale these. 
+ +-------------- + +rick446: + +I added a section at the end about read\_preference and a clause about +still getting a benefit from sharding due to parallelized searches. +Anything else you wanted here? diff --git a/source/tutorial/usecase/index.txt b/source/tutorial/usecase/index.txt new file mode 100644 index 00000000000..ef59866bcb5 --- /dev/null +++ b/source/tutorial/usecase/index.txt @@ -0,0 +1,14 @@ +Use Cases +============ + +.. toctree:: + :maxdepth: 1 + + real_time_analytics-_storing_log_data + real_time_analytics-_preaggregated_reports + real_time_analytics-_hierarchical_aggregation + ecommerce-_product_catalog + ecommerce-_inventory_management + ecommerce-_category_hierarchy + cms-_metadata_and_asset_management + cms-_storing_comments diff --git a/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt b/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt new file mode 100644 index 00000000000..f8032e939da --- /dev/null +++ b/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt @@ -0,0 +1,717 @@ +Real Time Analytics: Hierarchical Aggregation +============================================= + +Problem +------- + +You have a large amount of event data that you want to analyze at +multiple levels of aggregation. + +Solution overview +----------------- + +For this solution we will assume that the incoming event data is already +stored in an incoming 'events' collection. For details on how we could +get the event data into the events collection, please see "Real Time +Analytics: Storing Log Data." + +Once the event data is in the events collection, we need to aggregate +event data to the finest time granularity we're interested in. Once that +data is aggregated, we will use it to aggregate up to the next level of +the hierarchy, and so on. To perform the aggregations, we will use +MongoDB's mapreduce command. Our schema will use several collections: +the raw data (event) logs and collections for statistics aggregated +hourly, daily, weekly, monthly, and yearly. We will use a hierarchical +approach to running our map-reduce jobs. The input and output of each +job is illustrated below: + +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=syuQgkoNVdeOo7UC4%20WepaPQ&rev=1&h=208&w=268&ac=1 + :align: center + :alt: +Aside: Map-Reduce Algorithm +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Map/reduce is a popular aggregation algorithm that is optimized for +embarrassingly parallel problems. The psuedocode (in Python) of the +map/reduce algorithm appears below. Note that we are providing the +psuedocode for a particular type of map/reduce where the results of the +map/reduce operation are *reduced* into the result collection, allowing +us to perform incremental aggregation which we'll need in this case. 
+ +\` + +``def map_reduce(icollection, query,`` + +``mapf, reducef, finalizef, ocollection):`` + +``'''Psuedocode for map/reduce with output type="reduce" in MongoDB'''`` + +``map_results = defaultdict(list)`` + +``def emit(key, value):`` + +``'''helper function used inside mapf'''`` + +``map_results[key].append(value)`` + +\` + +``# The map phase`` + +``for doc in icollection.find(query):`` + +``mapf(doc)`` + +\` + +``# Pull in documents from the output collection for`` + +``# output type='reduce'`` + +``for doc in ocollection.find({'_id': {'$in': map_results.keys() } }):`` + +``map_results[doc['_id']].append(doc['value'])`` + +\` + +``# The reduce phase`` + +``for key, values in map_results.items():`` + +``reduce_results[key] = reducef(key, values)`` + +\` + +``# Finalize and save the results back`` + +``for key, value in reduce_results.items():`` + +``final_value = finalizef(key, value)`` + +``ocollection.save({'_id': key, 'value': final_value})``[a] + +\` \` + +The embarrassingly parallel part of the map/reduce algorithm lies in the +fact that each invocation of mapf, reducef, and finalizef are +independent of each other and can, in fact, be distributed to different +servers. In the case of MongoDB, this parallelism can be achieved by +using sharding on the collection on which we are performing map/reduce. + +Schema design[b] +---------------- + +When designing the schema for event storage, we need to keep in mind the +necessity to differentiate between events which have been included in +our aggregations and events which have not yet been included. A simple +approach in a relational database would be to use an auto-increment +integer primary key, but this introduces a big performance penalty to +our event logging process as it has to fetch event keys one-by one. + +If we are able to batch up our inserts into the event table, we can +still use an auto-increment primary key by using the find\_and\_modify +command to generate our \_id values: + +``>>> obj = db.my_sequence.find_and_modify(`` + +``... query={'_id':0},`` + +``... update={'$inc': {'inc': 50}}`` + +``... upsert=True,`` + +``... new=True)`` + +``>>> batch_of_ids = range(obj['inc']-50, obj['inc'])`` + +In most cases, however, it is sufficient to include a timestamp with +each event that we can use as a marker of which events have been +processed and which ones remain to be processed. For this use case, +we'll assume that we are calculating average session length for +logged-in users on a website. Our event format will thus be the +following: + +``{`` + +``"userid": "rick",`` + +``"ts": ISODate('2010-10-10T14:17:22Z'),`` + +``"length":95`` + +``}`` + +\` + +We want to calculate total and average session times for each user at +the hour, day, week, month, and year. In each case, we will also store +the number of sessions to enable us to incrementally recompute the +average session times. Each of our aggregate documents, then, looks like +the following: + +``{`` + +``_id: { u: "rick", d: ISODate("2010-10-10T14:00:00Z") },`` + +``value: {`` + +``ts: ISODate('2010-10-10T15:01:00Z'),`` + +``total: 254,`` + +``count: 10,`` + +``mean: 25.4 }`` + +``}`` + +\` + +Note in particular that we have added a timestamp to the aggregate +document. This will help us as we incrementally update the various +levels of the hierarchy. 
Operations
----------

In the discussion below, we will assume that all the events have been
inserted and appropriately timestamped, so our main operations are
aggregating from events into the smallest aggregate (the hourly totals)
and aggregating from smaller granularity to larger granularity. In each
case, we will assume that the last time the particular aggregation was
run is stored in a last\_run variable. (This variable might be loaded
from MongoDB or another persistence mechanism.)

Aggregate from events to the hourly level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here, we want to load all the events since our last run until one minute
ago (to allow for some lag in logging events). The first thing we need
to do is create our map function. Even though we will be using Python
and PyMongo to interface with the MongoDB server, note that the various
functions (map, reduce, and finalize) that we pass to the mapreduce
command must be Javascript functions. The map function appears below::

    mapf_hour = bson.Code('''function() {
        var key = {
            u: this.userid,
            d: new Date(
                this.ts.getFullYear(),
                this.ts.getMonth(),
                this.ts.getDate(),
                this.ts.getHours(),
                0, 0, 0) };
        emit(
            key,
            {
                total: this.length,
                count: 1,
                mean: 0,
                ts: new Date() });
    }''')

In this case, we are emitting key, value pairs which contain the
statistics we want to aggregate as you'd expect, but we are also
emitting a 'ts' value. This will be used in the cascaded aggregations
(hour to day, etc.) to determine when a particular hourly aggregation
was performed.

Our reduce function is also fairly straightforward::

    reducef = bson.Code('''function(key, values) {
        var r = { total: 0, count: 0, mean: 0, ts: null };
        values.forEach(function(v) {
            r.total += v.total;
            r.count += v.count;
        });
        return r;
    }''')

A few things are notable here. First of all, note that the returned
document from our reduce function has the same format as the result of
our map. This is a characteristic of our map/reduce that we would like
to maintain, as differences in structure between map, reduce, and
finalize results can lead to difficult-to-debug errors. Also note that
we are ignoring the 'mean' and 'ts' values. These will be provided in
the 'finalize' step::

    finalizef = bson.Code('''function(key, value) {
        if(value.count > 0) {
            value.mean = value.total / value.count;
        }
        value.ts = new Date();
        return value;
    }''')

Here, we compute the mean value as well as the timestamp we will use to
write back to the output collection. Now, to bind it all together, here
is our Python code to invoke the mapreduce command::

    cutoff = datetime.utcnow() - timedelta(seconds=60)
    query = { 'ts': { '$gt': last_run, '$lt': cutoff } }

    db.events.map_reduce(
        map=mapf_hour,
        reduce=reducef,
        finalize=finalizef,
        query=query,
        out={ 'reduce': 'stats.hourly' })

    last_run = cutoff

Because we used the 'reduce' option on our output, we are able to run
this aggregation as often as we like as long as we update the last\_run
variable.
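Since the aggregations are only correct if last\_run is carried over
between runs, one simple approach (a sketch only; the 'agg_state'
collection and its layout are illustrative choices, not part of the
schema above) is to keep it in a small document of its own::

    from datetime import datetime

    def load_last_run(db):
        # Fall back to a date older than any event on the first run
        doc = db.agg_state.find_one({'_id': 'hourly'})
        return doc['last_run'] if doc else datetime(1970, 1, 1)

    def save_last_run(db, last_run):
        db.agg_state.save({'_id': 'hourly', 'last_run': last_run})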
Index support
^^^^^^^^^^^^^

Since we are going to be running the initial query on the input events
frequently, we would benefit significantly from an index on the
timestamp of incoming events::

    >>> db.events.ensure_index('ts')

Since we are always reading and writing the most recent events, this
index has the advantage of being right-aligned, which basically means
we only need a thin slice of the index (the most recent values) in RAM
to achieve good performance.

Aggregate from hour to day
~~~~~~~~~~~~~~~~~~~~~~~~~~

In calculating the daily statistics, we will use the hourly statistics
as input. Our map function looks quite similar to our hourly map
function::

    mapf_day = bson.Code('''function() {
        var key = {
            u: this._id.u,
            d: new Date(
                this._id.d.getFullYear(),
                this._id.d.getMonth(),
                this._id.d.getDate(),
                0, 0, 0, 0) };
        emit(
            key,
            {
                total: this.value.total,
                count: this.value.count,
                mean: 0,
                ts: null });
    }''')

There are a few differences to note here. First of all, the key to which
we aggregate is the (userid, date) rather than (userid, hour) to allow
for daily aggregation. Secondly, note that the keys and values we emit
are actually the total and count values from our hourly aggregates
rather than properties from event documents. This will be the case in
all our higher-level hierarchical aggregations.

Since we are using the same format for map output as we used in the
hourly aggregations, we can, in fact, use the same reduce and finalize
functions. The actual Python code driving this level of aggregation is
as follows::

    cutoff = datetime.utcnow() - timedelta(seconds=60)
    query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } }

    db.stats.hourly.map_reduce(
        map=mapf_day,
        reduce=reducef,
        finalize=finalizef,
        query=query,
        out={ 'reduce': 'stats.daily' })

    last_run = cutoff

There are a couple of things to note here. First of all, our query is
not on 'ts' now, but 'value.ts', the timestamp we wrote during the
finalization of our hourly aggregates. Also note that we are, in fact,
aggregating from the stats.hourly collection into the stats.daily
collection.

Index support
^^^^^^^^^^^^^

Since we are going to be running the initial query on the hourly
statistics collection frequently, an index on 'value.ts' would be nice
to have::

    >>> db.stats.hourly.ensure_index('value.ts')

Once again, this is a right-aligned index that will use very little RAM
for efficient operation.

Other aggregations
~~~~~~~~~~~~~~~~~~

Once we have our daily statistics, we can use them to calculate our
weekly and monthly statistics. Our weekly map function is as follows::

    mapf_week = bson.Code('''function() {
        var key = {
            u: this._id.u,
            d: new Date(
                this._id.d.valueOf()
                - this._id.d.getDay()*24*60*60*1000) };
        emit(
            key,
            {
                total: this.value.total,
                count: this.value.count,
                mean: 0,
                ts: null });
    }''')

Here, in order to get our group key, we are simply taking the date and
subtracting days until we get to the beginning of the week.
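For example, with a daily \_id.d of ISODate("2010-10-13T00:00:00Z") (a
Wednesday), getDay() returns 3, so we subtract three days' worth of
milliseconds and the document is grouped under Sunday,
ISODate("2010-10-10T00:00:00Z"), the first day of that week
(Javascript's getDay() numbers Sunday as 0).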
In our monthly map function, we will choose the first day of the month
as our group key::

    mapf_month = bson.Code('''function() {
        var key = {
            u: this._id.u,
            d: new Date(
                this._id.d.getFullYear(),
                this._id.d.getMonth(),
                1, 0, 0, 0, 0) };
        emit(
            key,
            {
                total: this.value.total,
                count: this.value.count,
                mean: 0,
                ts: null });
    }''')

One thing in particular to notice about these map functions is that they
are identical to one another except for the date calculation. We can use
Python's string interpolation to refactor our map function definitions
as follows::

    mapf_hierarchical = '''function() {
        var key = {
            u: this._id.u,
            d: %s };
        emit(
            key,
            {
                total: this.value.total,
                count: this.value.count,
                mean: 0,
                ts: null });
    }'''

    mapf_day = bson.Code(
        mapf_hierarchical % '''new Date(
            this._id.d.getFullYear(),
            this._id.d.getMonth(),
            this._id.d.getDate(),
            0, 0, 0, 0)''')

    mapf_week = bson.Code(
        mapf_hierarchical % '''new Date(
            this._id.d.valueOf()
            - this._id.d.getDay()*24*60*60*1000)''')

    mapf_month = bson.Code(
        mapf_hierarchical % '''new Date(
            this._id.d.getFullYear(),
            this._id.d.getMonth(),
            1, 0, 0, 0, 0)''')

    mapf_year = bson.Code(
        mapf_hierarchical % '''new Date(
            this._id.d.getFullYear(),
            0, 1, 0, 0, 0, 0)''')

Our Python driver can also be refactored so we have much less code
duplication::

    def aggregate(icollection, ocollection, mapf, cutoff, last_run,
                  date_field='value.ts'):
        query = { date_field: { '$gt': last_run, '$lt': cutoff } }
        icollection.map_reduce(
            map=mapf,
            reduce=reducef,
            finalize=finalizef,
            query=query,
            out={ 'reduce': ocollection.name })

Once this is defined, we can perform all our aggregations as follows::

    cutoff = datetime.utcnow() - timedelta(seconds=60)
    aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run,
              date_field='ts')
    aggregate(db.stats.hourly, db.stats.daily, mapf_day, cutoff, last_run)
    aggregate(db.stats.daily, db.stats.weekly, mapf_week, cutoff, last_run)
    aggregate(db.stats.daily, db.stats.monthly, mapf_month, cutoff,
              last_run)
    aggregate(db.stats.monthly, db.stats.yearly, mapf_year, cutoff,
              last_run)
    last_run = cutoff

(Note that the events collection stores its timestamp in 'ts' rather
than 'value.ts', so the first, event-to-hourly aggregation passes its
own date\_field.) So long as we save/restore our 'last\_run' variable
between aggregations, we can run these aggregations as often as we like
since each aggregation individually is incremental.

Index support
^^^^^^^^^^^^^

Our indexes will continue to be on the value's timestamp to ensure
efficient operation of the next level of the aggregation (and they
continue to be right-aligned)::

    >>> db.stats.daily.ensure_index('value.ts')
    >>> db.stats.monthly.ensure_index('value.ts')

Sharding
--------

To take advantage of distinct shards when performing map/reduce, our
input collections should be sharded. In order to achieve good balancing
between nodes, we should make sure that the shard key we use is not
simply the incoming timestamp, but rather something that varies
significantly in the most recent documents. In this case, the username
makes sense as the most significant part of the shard key.
+ +In order to prevent a single, active user from creating a large, +unsplittable chunk, we will use a compound shard key with (username, +timestamp) on each of our collections: + +``>>> db.command('shardcollection', 'events', {`` + +``... key : { 'userid': 1, 'ts' : 1} } )`` + +``{ "collectionsharded" : "events", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.daily', {`` + +``... key : { '_id': 1} } )`` + +``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.weekly', {`` + +``... key : { '_id': 1} } )`` + +``{ "collectionsharded" : "stats.weekly", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.monthly', {`` + +``... key : { '_id': 1} } )`` + +``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.yearly', {`` + +``... key : { '_id': 1} } )`` + +``{ "collectionsharded" : "stats.yearly", "ok" : 1 }`` + +We should also update our map/reduce driver so that it notes the output +should be sharded. This is accomplished by adding 'sharded':True to the +output argument: + +… + +``out={ 'reduce': ocollection.name, 'sharded': True })`` + +… + +Note that the output collection of a mapreduce command, if sharded, must +be sharded using \_id as the shard key. + +Page of + +[a]jsr: + +It's a little weird to have the code sample in python since we'll +actually be doing the map reduce in javascript. Is it significantly more +code to do this as javascript? + +-------------- + +rick446: + +I could rewrite all the code examples in this doc as Javascript if +that's what you want. I don't think we should do some of the snippets in +Python and some in JS, however. + +Also, the docs focus on JS, so it might be nice to see how you do this +in a non-JS environment (Answering the question of how *do* you send the +mapf and reducef from non-JS) + +[b]jsr: + +It's worth describing the set of collections we'll have. 1) the raw data +logs, 2) hourly data, 3) daily data. And show that there's a map reduce +job between each collection. E.g. job1 takes raw data to hourly. job2 +takes hourly data to daily data. + +-------------- + +rick446: + +I added an illustration above in the solution overview; is that +sufficient? diff --git a/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt b/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt new file mode 100644 index 00000000000..11066f030bf --- /dev/null +++ b/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt @@ -0,0 +1,768 @@ +Real Time Analytics: Pre-Aggregated Reports +=========================================== + +Problem +------- + +You have one or more servers generating events for which you want +real-time statistical information in a MongoDB collection. + +Solution overview +----------------- + +For this solution we will make a few assumptions: + +- There is no need to retain transactional event data in MongoDB, or + that retention is handled outside the scope of this use case +- We need statistical data to be up-to-the minute (or up-to-the-second, + if possible) +- The queries to retrieve time series of statistical data need to be as + fast as possible. + +Our general approach is to use upserts and increments to generate the +statistics and simple range-based queries and filters to draw the time +series charts of the aggregated data. 
+ +To help anchor the solution, we will examine a simple scenario where we +want to count the number of hits to a collection of web site at various +levels of time-granularity (by minute, hour, day, week, and month) as +well as by path. We will assume that either you have some code that can +run as part of your web app when it is rendering the page, or you have +some set of logfile post- processors that can run in order to integrate +the statistics. + +Schema design +------------- + +There are two important considerations when designing the schema for a +real- time analytics system: the ease & speed of updates and the ease & +speed of queries[a]. In particular, we want to avoid the following +performance-killing circumstances: + +- documents changing in size significantly, causing reallocations on + disk +- queries that require large numbers of disk seeks to be satisfied +- document structures that make accessing a particular field slow + +One approach we *could* use to make updates easier would be to keep our +hit counts in individual documents, one document per +minute/hour/day/etc. This approach, however, requires us to query +several documents for nontrivial time range queries, slowing down our +queries significantly. In order to keep our queries fast, we will +instead use somewhat more complex documents, keeping several aggregate +values in each document. + +In order to illustrate some of the other issues we might encounter, we +will consider several schema designs that yield suboptimal performance +and discuss the problems with them before finally describing the +solution we would like to go with. + +Design 0: one document per page/day +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Our initial approach will be to simply put all the statistics in which +we're interested into a single document per page: + +``{`` + +``_id: "20101010/site-1/apache_pb.gif",`` + +``metadata: {`` + +``date: ISODate("2000-10-10T00:00:00Z"),`` + +``site: "site-1",`` + +``page: "/apache_pb.gif" },`` + +``daily: 5468426,`` + +``hourly: {`` + +``"0": 227850,`` + +``"1": 210231,`` + +``…`` + +``"23": 20457 },`` + +``minute: {`` + +``"0": 3612,`` + +``"1": 3241,`` + +``…`` + +``"1439": 2819 }`` + +``}``[b] + +This approach has a couple of advantages: a) it only requires a single +update per hit to the website, b) intra-day reports for a single page +require fetching only a single document. There are, however, significant +problems with this approach. The biggest problem is that, as we upsert +data into the 'hy' and 'mn' properties, the document grows. Although +MongoDB attempts to pad the space required for documents, it will still +end up needing to reallocate these documents multiple times throughout +the day, copying the documents to areas with more space. + +Design #0.5: Preallocate documents +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to mitigate the repeated copying of documents, we can tweak our +approach slightly by adding a process which will preallocate a document +with initial zeros during the previous day. In order to avoid a +situation where we preallocate documents *en masse* at midnight, we will +(with a low probability) randomly upsert the next day's document each +time we update the current day's statistics. This requires some tuning; +we'd like to have almost all the documents preallocated by the end of +the day, without spending much time on extraneous upserts (preallocating +a document that's already there). 
A reasonable first guess would be to +look at our average number of hits per day (call it *hits* ) and +preallocate with a probability of *1/hits* . + +Preallocating helps us mainly by ensuring that all the various 'buckets' +are initialized with 0 hits. Once the document is initialized, then, it +will never dynamically grow, meaning a) there is no need to perform the +reallocations that could slow us down in design #0 and b) MongoDB +doesn't need to pad the records, leading to a more compact +representation and better usage of our memory. + +Design #1: Add intra-document hierarchy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +One thing to be aware of with BSON is that documents are stored as a +sequence of (key, value) pairs, *not* as a hash table. What this means +for us is that writing to stats.mn.0 is *much* faster than writing to +stats.mn.1439. [c] + +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyv%20pgRg8w&rev=1&h=82&w=410&ac=1 + :align: center + :alt: +In order to speed this up, we can introduce some intra-document +hierarchy. In particular, we can split the 'mn' field up into 24 hourly +fields: + +\` + +``{`` + +``_id: "20101010/site-1/apache_pb.gif",`` + +``metadata: {`` + +``date: ISODate("2000-10-10T00:00:00Z"),`` + +``site: "site-1",`` + +``page: "/apache_pb.gif" },`` + +``daily: 5468426,`` + +``hourly: {`` + +``"0": 227850,`` + +``"1": 210231,`` + +``…`` + +``"23": 20457 },`` + +``minute: {`` + +``"0": {`` + +``"0": 3612,`` + +``"1": 3241,`` + +``…`` + +``"59": 2130 },`` + +\` "1": { + +:: + + "60": … , + +\` + +``},`` + +``…`` + +``"23": {`` + +``…`` + +``"1439": 2819 }`` + +``}`` + +``}`` + +This allows MongoDB to "skip forward" when updating the minute +statistics later in the day, making our performance more uniform and +generally faster. + +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNP%20Vyojcg&rev=21&h=148&w=410&ac=1 + :align: center + :alt: +Design #2: Create separate documents for different granularities +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Design #1 is certainly a reasonable design for storing intraday +statistics, but what happens when we want to draw a historical chart +over a month or two? In that case, we need to fetch 30+ individual +documents containing or daily statistics. A better approach would be to +store daily statistics in a separate document, aggregated to the month. +This does introduce a second upsert to the statistics generation side of +our system, but the reduction in disk seeks on the query side should +more than make up for it. 
At this point, our document structure is as +follows: + +Daily Statistics +^^^^^^^^^^^^^^^^ + +``{`` + +``_id: "20101010/site-1/apache_pb.gif",`` + +``metadata: {`` + +``date: ISODate("2000-10-10T00:00:00Z"),`` + +``site: "site-1",`` + +``page: "/apache_pb.gif" },`` + +``hourly: {`` + +``"0": 227850,`` + +``"1": 210231,`` + +``…`` + +``"23": 20457 },`` + +``minute: {`` + +``"0": {`` + +``"0": 3612,`` + +``"1": 3241,`` + +``…`` + +``"59": 2130 },`` + +\` "1": { + +:: + + "0": … , + +\` + +``},`` + +``…`` + +``"23": {`` + +``…`` + +``"59": 2819 }`` + +``}`` + +``}`` + +Monthly Statistics +^^^^^^^^^^^^^^^^^^ + +\` + +``{`` + +``_id: "201010/site-1/apache_pb.gif",`` + +``metadata: {`` + +``date: ISODate("2000-10-00T00:00:00Z"),`` + +``site: "site-1",`` + +``page: "/apache_pb.gif" },`` + +``daily: {`` + +``"1": 5445326,`` + +``"2": 5214121,`` + +``… }`` + +``}`` + +Operations +---------- + +In this system, we want balance between read performance and write +(upsert) performance. This section will describe each of the major +operations we perform, using the Python programming language and the +pymongo MongoDB driver. These operations would be similar in other +languages as well. + +Log a hit to a page +~~~~~~~~~~~~~~~~~~~ + +Logging a hit to a page in our website is the main 'write' activity in +our system. In order to maximize performance, we will be doing in-place +updates with the upsert operation: + +``from datetime import datetime, time`` + +\` + +``def log_hit(db, dt_utc, site, page):`` + +\` + +``# Update daily stats doc`` + +``id_daily = dt_utc.strftime('%Y%m%d/') + site + page`` + +``hour = dt_utc.hour`` + +``minute = dt_utc.minute`` + +\` + +``# Get a datetime that only includes date info`` + +``d = datetime.combine(dt_utc.date(), time.min)`` + +``query = {`` + +``'_id': id_daily,`` + +``'metadata': { 'date': d, 'site': site, 'page': page } }`` + +``update = { '$inc': {`` + +``'hourly.%d' % (hour,): 1,`` + +``'minute.%d.%d' % (hour,minute): 1 } }`` + +``db.stats.daily.update(query, update, upsert=True)`` + +\` + +``# Update monthly stats document`` + +``id_monthly = dt_utc.strftime('%Y%m/') + site + page`` + +``day_of_month = dt_utc.day`` + +``query = {`` + +``'_id': id_monthly,`` + +``'metadata': {`` + +``'date': d.replace(day=1),`` + +``'site': site,`` + +``'page': page } }`` + +``update = { '$inc': {`` + +``'daily.%d' % day_of_month: 1} }`` + +``db.stats.monthly.update(query, update, upsert=True)`` + +Since we are using the upsert operation, this function will perform +correctly whether the document is already present or not, which is +important, as our preallocation (the next operation) will only +preallocate documents with a high probability. Note however, that +without preallocation, we end up with a dynamically growing document, +slowing down our upserts significantly as documents are moved in order +to grow them. + +Preallocate[d] +~~~~~~~~~~~~~~ + +In order to keep our documents from growing, we can preallocate them +before they are needed. When preallocating, we set all the statistics to +zero for all time periods so that later, the document doesn't need to +grow to accomodate the upserts. 
Here, we add this preallocation as its own function::

    def preallocate(db, dt_utc, site, page):

        # Get id values
        id_daily = dt_utc.strftime('%Y%m%d/') + site + page
        id_monthly = dt_utc.strftime('%Y%m/') + site + page

        # Get daily metadata
        daily_metadata = {
            'date': datetime.combine(dt_utc.date(), time.min),
            'site': site,
            'page': page }
        # Get monthly metadata
        monthly_metadata = {
            'date': daily_metadata['date'].replace(day=1),
            'site': site,
            'page': page }

        # Initial zeros for statistics
        hourly = dict((str(i), 0) for i in range(24))
        minute = dict(
            (str(i), dict((str(j), 0) for j in range(60)))
            for i in range(24))
        daily = dict((str(i), 0) for i in range(1, 32))

        # Perform upserts, setting metadata
        db.stats.daily.update(
            {
                '_id': id_daily,
                'hourly': hourly,
                'minute': minute},
            { '$set': { 'metadata': daily_metadata }},
            upsert=True)
        db.stats.monthly.update(
            {
                '_id': id_monthly,
                'daily': daily },
            { '$set': { 'metadata': monthly_metadata }},
            upsert=True)

In this case, note that we went ahead and preallocated the monthly
document while we were preallocating the daily document. While we could
have split this into its own function and preallocated monthly documents
less frequently than daily documents, the performance difference is
negligible, so we opted to simply combine monthly preallocation with
daily preallocation.

The next question we must answer is when we should preallocate. We would
like to have a high likelihood of the document being preallocated before
it is needed, but we don't want to preallocate all at once (say at
midnight) to ensure we don't create a spike in activity and a
corresponding increase in latency. Our solution here is to
probabilistically preallocate each time we log a hit, with a probability
tuned to make preallocation likely without performing too many
unnecessary calls to preallocate::

    from random import random
    from datetime import datetime, timedelta, time

    # Example probability based on 500k hits per day per page
    prob_preallocate = 1.0 / 500000

    def log_hit(db, dt_utc, site, page):
        if random() < prob_preallocate:
            preallocate(db, dt_utc + timedelta(days=1), site, page)
        # Update daily stats doc
        …

Now with a high probability, we will preallocate each document before
it's used, preventing the midnight spike as well as eliminating the
movement of dynamically growing documents.

Get data for a real-time chart
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One chart that we may be interested in seeing would be the number of
hits to a particular page over the last hour. In that case, our query is
fairly straightforward::

    >>> db.stats.daily.find_one(
    ...     {'metadata': {'date': dt, 'site': 'site-1', 'page': '/foo.gif'}},
    ...     { 'minute': 1 })

Likewise, we can get the number of hits to a page over the last day,
with hourly granularity::

    >>> db.stats.daily.find_one(
    ...     {'metadata': {'date': dt, 'site': 'site-1', 'page': '/foo.gif'}},
    ...     { 'hourly': 1 })

If we want a few days' worth of hourly data, we can get it using the
following query:

``>>> db.stats.daily.find(``

``... {``

``... 'metadata.date': { '$gte': dt1, '$lte': dt2 },``

``... 'metadata.site': 'site-1',``

``... 
'metadata.page': '/foo.gif'},`` + +``... { 'metadata.date': 1, 'hourly': 1 } },`` + +``... sort=[('metadata.date', 1)])`` + +In this case, we are retrieving the date along with the statistics since +it's possible (though highly unlikely) that we could have a gap of one +day where a) we didn't happen to preallocate that day and b) there were +no hits to the document on that day. + +Index support +^^^^^^^^^^^^^ + +These operations would benefit significantly from indexes on the +metadata of the daily statistics: + +``>>> db.stats.daily.ensure_index([`` + +``... ('metadata.site', 1),`` + +``... ('metadata.page', 1),`` + +``... ('metadata.date', 1)])`` + +Note in particular that we indexed on the page first, date second. This +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +allows us to perform the third query above (a single page over a range +of days) quite efficiently. Having any compound index on page and date, +of course, allows us to look up a single day's statistics efficiently. + +Get data for a historical chart +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to retrieve daily data for a single month, we can perform the +following query: + +``>>> db.stats.monthly.find_one(`` + +``... {'metadata':`` + +``... {'date':dt,`` + +``... 'site': 'site-1',`` + +``... 'page':'/foo.gif'}},`` + +``... { 'daily': 1 })`` + +\` + +If we want several months' worth of daily data, of course, we can do the +same trick as above: + +\` + +``>>> db.stats.monthly.find(`` + +``... {`` + +``... 'metadata.date': { '$gte': dt1, '$lte': dt2 },`` + +``... 'metadata.site': 'site-1',`` + +``... 'metadata.page': '/foo.gif'},`` + +``... { 'metadata.date': 1, 'hourly': 1 } },`` + +``... sort=[('metadata.date', 1)])`` + +Index support +^^^^^^^^^^^^^ + +Once again, these operations would benefit significantly from indexes on +the metadata of the monthly statistics: + +``>>> db.stats.monthly.ensure_index([`` + +``... ('metadata.site', 1),`` + +``... ('metadata.page', 1),`` + +``... ('metadata.date', 1)])`` + +The order of our index is once again designed to efficiently support +range +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +queries for a single page over several months, as above. + +Sharding +-------- + +Our performance in this system will be limited by the number of shards +in our cluster as well as the choice of our shard key. Our ideal shard +key will balance upserts between our shards evenly while routing any +individual query to a single shard (or a small number of shards). A +reasonable shard key for us would thus be ('metadata.site', +'metadata.page'), the site-page combination for which we are calculating +statistics: + +``>>> db.command('shardcollection', 'stats.daily', {`` + +``... key : { 'metadata.site': 1, 'metadata.page' : 1 } })`` + +``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.monthly', {`` + +``... key : { 'metadata.site': 1, 'metadata.page' : 1 } })`` + +``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` + +One downside to using ('metadata.site', 'metadata.page') as our shard +key is that, if one page dominates all our traffic, all updates to that +page will go to a single shard. The problem, however, is largely +unavoidable, since all update for a single page are going to a single +*document.* + +We also have the problem using only ('metadata.site', 'metadata.page') +shard key that, if a high percentage of our queries go to the same page, +these will all be handled by the same shard. 
A (slightly) better shard +key would the include the date as well as the site/page so that we could +serve different historical ranges with different shards: + +``>>> db.command('shardcollection', 'stats.daily', {`` + +``... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}})`` + +``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` + +``>>> db.command('shardcollection', 'stats.monthly', {`` + +``... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}})`` + +``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` + +It is worth noting in this discussion of sharding that, depending on the +number of sites/pages you are tracking and the number of hits per page, +we are talking about a fairly small set of data with modest performance +requirements, so sharding may be overkill. In the case of the MongoDB +Monitoring Service (MMS), a single shard is able to keep up with the +totality of traffic generated by all the customers using this (free) +service. + +Page of + +[a]jsr: + +It's worth mentioning that if we organize events as we did in the "log +collection" use case, then queries need to hit lots of documents. the +appeal of this approach is that queries are fast because data is +pre-aggregated. + +-------------- + +rick446: + +Added a paragraph below to talk about why we don't want individual docs +for each 'tick' + +[b]jsr: + +Let's expand the attribute names to full words so it's more readable. + +-------------- + +rick446: + +done + +[c]jsr: + +Similar to table-scan vs. tree. Perhaps a diagram showing the difference +between iterating through 1439 elements in a flat array vs. traversing a +tree. + +-------------- + +rick446: + +Added some diagrams to illustrate the BSON layout and # of 'skip +forward' operations needed to do with each schema + +[d]jsr: + +It's a little unclear why we need pre-allocation. Ryan had a graph that +showed big spikes in insert latency at the beginning of the day, and +then another chart showing smoother performance when pre-allocation was +added. Maybe recreate this chart. + +-------------- + +rick446: + +I could possibly recreate the chart (though that would take several +hours to actually implement). Is the verbiage I added sufficient to get +across the reasoning? diff --git a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt new file mode 100644 index 00000000000..3232331607e --- /dev/null +++ b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt @@ -0,0 +1,694 @@ +Real Time Analytics: Storing Log Data +===================================== + +Problem +------- + +You have one or more servers generating events that you would like to +persist to a MongoDB collection. + +Solution overview +----------------- + +For this solution, we will assume that each server generating events has +access to the MongoDB server(s). We will also assume that the consumer +of the event data has access to the MongoDB server(s) and that the query +rate is (substantially) lower than the insert rate (as is most often the +case when logging a high-bandwidth event stream). + +Schema design +------------- + +\*\* +The schema design in this case will depend largely on the particular +format of the event data you want to store. For a simple example, let's +take standard request logs from the Apache web server using the combined +log format. For this example we will assume you're using an uncapped +collection to store the event data. 
A line from such a log file might +look like the following: + +``127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "``[http://www.example.com/start.html](http://www.example.com/start.html)``" "Mozilla/4.08 [en] (Win98; I ;Nav)"`` + +\` + +The simplest approach to storing the log data would be putting the exact +text of the log record into a document: + +``{`` + +``_id: ObjectId('4f442120eb03305789000000'),`` + +``line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "``[http://www.example.com/start.html](http://www.example.com/start.html)``" "Mozilla/4.08 [en] (Win98; I ;Nav)"'`` + +``}`` + +While this is a possible solution, it's not likely to be the optimal +solution. For instance, if we decided we wanted to find events that hit +the same page, we would need to use a regular expression query, which +would require a full collection scan. A better approach would be to +extract the relevant fields into individual properties. When doing the +extraction, we should pay attention to the choice of data types for the +various fields. For instance, the date field in the log line +``[10/Oct/2000:13:55:36 -0700]``is 28 bytes long. If we instead store +this as a UTC timestamp, it shrinks to 8 bytes. Storing the date as a +timestamp also gives us the advantage of being able to make date range +queries, whereas comparing two date *strings* is nearly useless. A +similar argument applies to numeric fields; storing them as strings is +suboptimal, taking up more space and making the appropriate types of +queries much more difficult. + +We should also consider what information we might want to omit from the +log record. For instance, if we wanted to record exactly what was in the +log record, we might create a document like the following: + +``{`` + +``_id: ObjectId('4f442120eb03305789000000'),`` + +``host: "127.0.0.1",`` + +``logname: null,`` + +``user: 'frank',`` + +``time: ,`` + +``request: "GET /apache_pb.gif HTTP/1.0",`` + +``status: 200,`` + +``response_size: 2326,`` + +``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` + +``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` + +``}`` + +\` + +In most cases, however, we probably are only interested in a subset of +the data about the request. Here, we may want to keep the host, time, +path, user agent, and referer for a web analytics application: + +``{`` + +``_id: ObjectId('4f442120eb03305789000000'),`` + +``host: "127.0.0.1",`` + +``time: ISODate("2000-10-10T20:55:36Z"),`` + +``path: "/apache_pb.gif",`` + +``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` + +``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` + +``}`` + +\` + +It might even be possible to remove the time, since ObjectIds embed +their the time they are created: + +``{`` + +``_id: ObjectId('4f442120eb03305789000000'),`` + +``host: "127.0.0.1",`` + +``time: ISODate("2000-10-10T20:55:36Z"),`` + +``path: "/apache_pb.gif",`` + +``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` + +``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` + +``}`` + +\` + +**System Architecture** + +\*\* +For an event logging system, we are mainly concerned with two +performance considerations: 1) how many inserts per second can we +perform (this will limit our event throughput) and 2) how will we manage +the growth of event data. Concerning insert performance, the best way to +scale the architecture is via sharding. 
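Before moving on to operations, here is a brief sketch of how a raw
combined-format line might be turned into the trimmed document described
above. This is illustrative only: the regular expression, the
``parse_log_line`` helper, and the use of Python 3's timezone-aware
``datetime`` parsing are assumptions of this sketch rather than part of
any particular log-collection tool.

::

    import re
    from datetime import datetime, timezone

    # Combined log format: host, logname, user, [time], "request",
    # status, size, "referer", "user agent".
    COMBINED_LOG = re.compile(
        r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

    def parse_log_line(line):
        """Convert one combined-format line into the trimmed event document."""
        match = COMBINED_LOG.match(line)
        if match is None:
            return None
        # Parse '10/Oct/2000:13:55:36 -0700' and normalize it to UTC so that
        # date-range queries behave as described above.
        when = datetime.strptime(match.group('time'), '%d/%b/%Y:%H:%M:%S %z')
        # 'GET /apache_pb.gif HTTP/1.0' -> '/apache_pb.gif'
        parts = match.group('request').split()
        path = parts[1] if len(parts) == 3 else match.group('request')
        return {
            'host': match.group('host'),
            'time': when.astimezone(timezone.utc),
            'path': path,
            'referer': match.group('referer'),
            'user_agent': match.group('user_agent'),
        }

    event = parse_log_line(
        '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
    # event['time'] is the UTC datetime 2000-10-10 20:55:36, matching the
    # ISODate shown in the example document above.

The resulting dictionary can be handed directly to the insert operations
discussed in the next section.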
+ +Operations +---------- + +The main performance-critical operation we're concerned with in storing +an event log is the insertion speed. However, we also need to be able to +query the event data for relevant statistics. This section will describe +each of these operations, using the Python programming language and the +pymongo MongoDB driver. These operations would be similar in other +languages as well. + +Inserting a log record +~~~~~~~~~~~~~~~~~~~~~~ + +In many event logging applications, we can accept some degree of risk +when it comes to dropping events. In others, we need to be absolutely +sure we don't drop any events. MongoDB supports both models. In the case +where we can tolerate a risk of loss, we can insert records +*asynchronously* using a fire- and-forget model: + +``>>> import bson`` + +``>>> import pymongo`` + +``>>> from datetime import datetime`` + +``>>> conn = pymongo.Connection()`` + +``>>> db = conn.event_db`` + +``>>> event = {`` + +``... _id: bson.ObjectId(),`` + +``... host: "127.0.0.1",`` + +``... time: datetime(2000,10,10,20,55,36),`` + +``... path: "/apache_pb.gif",`` + +``... referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` + +``... user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` + +``...}`` + +``>>> db.events.insert(event, safe=False)`` + +This is the fastest approach, as our code doesn't even require a +round-trip to the MongoDB server to ensure that the insert was received. +It is thus also the riskiest approach, as we will not detect network +failures nor server errors (such as DuplicateKeyErrors on a unique +index). If we want to make sure we have an acknowledgement from the +server that our insertion succeeded (for some definition of success), we +can pass safe=True: + +``>>> db.events.insert(event, safe=True)`` + +\` + +If our tolerance for data loss risk is somewhat less, we can require +that the server to which we write the data has committed the event to +the on-disk journal before we continue operation (``safe=True`` is +implied by all the following options): + +``>>> db.events.insert(event, j=True)`` + +\` + +Finally, if we have *extremely low* tolerance for event data loss, we +can require the data to be replicated to multiple secondary servers +before returning: + +``>>> db.events.insert(event, w=2)`` + +In this case, we have requested acknowledgement that the data has been +replicated to 2 replicas. We can combine options as well: + +``>>> db.events.insert(event, j=True, w=2)`` + +In this case, we are waiting on both a journal commit *and* a +replication acknowledgement. Although this is the safest option, it is +also the slowest, so you should be aware of the trade-off when +performing your inserts. + +Aside: Bulk Inserts +^^^^^^^^^^^^^^^^^^^ + +If at all possible in our application architecture, we should consider +using bulk inserts to insert event data. All the options discussed above +apply to bulk inserts, but you can actually pass multiple events as the +first parameter to .insert(). By passing multiple documents into a +single insert() call, we are able to amortize the performance penalty we +incur by using the 'safe' options such as j=True or w=2. + +Finding all the events for a particular page +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For a web analytics-type operation, getting the logs for a particular +web page might be a common operation that we would want to optimize for. 
+In this case, the query would be as follows: + +``>>> q_events = db.events.find({'path': '/apache_pb.gif'})`` + +\` + +Note that the sharding setup we use (should we decide to shard this +collection) has performance implications for this operation. For +instance, if we shard on the 'path' property, then this query will be +handled by a single shard, whereas if we shard on some other property or +combination of properties, the mongos instance will be forced to do a +scatter/gather operation which involves *all* the shards. + +Index support +^^^^^^^^^^^^^ + +This operation would benefit significantly from an index on the 'path' +attribute: + +\` + +``>>> db.events.ensure_index('path')`` + +\` + +One potential downside to this index is that it is relatively randomly +distributed, meaning that for efficient operation the entire index +should be resident in RAM. Since there is likely to be a relatively +small number of distinct paths in the index, however, this will probably +not be a problem. + +Finding all the events for a particular date +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We may also want to find all the events for a particular date. In this +case, we would perform the following query: + +``>>> q_events = db.events.find('time':`` + +``... { '$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)})`` + +\` + +Index support +^^^^^^^^^^^^^ + +In this case, an index on 'time' would provide optimal performance: + +\` + +``>>> db.events.ensure_index('time')`` + +One of the nice things about this index is that it is *right-aligned.* +Since we are always inserting events in ascending time order, the +right-most slice of the B-tree will always be resident in RAM. So long +as our queries focus mainly on recent events, the *only* part of the +index that needs to be resident in RAM is the right-most slice of the +B-tree, allowing us to keep quite a large index without using up much of +our system memory. + +Finding all the events for a particular host/date +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +\` + +We might also want to analyze the behavior of a particular host on a +particular day, perhaps for analyzing suspicious behavior by a +particular IP address. In that case, we would write a query such as: + +``>>> q_events = db.events.find({`` + +``... 'host': '127.0.0.1',`` + +``... 'time': {'$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)}`` + +``... })`` + +Index support +^^^^^^^^^^^^^ + +Once again, our choice of indexes affects the performance +characteristics of this query significantly. For instance, suppose we +create a compound index on (time, host): + +``>>> db.events.ensure_index([('time', 1), ('host', 1)])`` + +In this case, the query plan would be the following (retrieved via +q\_events.explain()): + +``{`` + +``…`` + +``u'cursor': u'BtreeCursor time_1_host_1',`` + +``u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],`` + +``u'time': [[datetime.datetime(2000, 10, 10, 0, 0),`` + +``datetime.datetime(2000, 10, 11, 0, 0)]]},`` + +``…`` + +``u'millis': 4,`` + +``u'n': 11,`` + +``…`` + +``u'nscanned': 1296,`` + +``u'nscannedObjects': 11,`` + +``…`` + +``}`` + +\` + +If, however, we create a compound index on (host, time)... 
+ +\` + +``>>> db.events.ensure_index([('host', 1), ('time', 1)])`` + +\` + +We get a much more efficient query plan and much better performance: + +\` + +``{`` + +``…`` + +``u'cursor': u'BtreeCursor host_1_time_1',`` + +``u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],`` + +``u'time': [[datetime.datetime(2000, 10, 10, 0, 0),`` + +``datetime.datetime(2000, 10, 11, 0, 0)]]},`` + +``…`` + +``u'millis': 0,`` + +``u'n': 11,`` + +``…`` + +``u'nscanned': 11,`` + +``u'nscannedObjects': 11,`` + +``…`` + +``}`` + +\` + +In this case, MongoDB is able to visit just 11 entries in the index to +satisfy the query, whereas in the first it needed to visit 1296 entries. +This is because the query using (host, time) needs to search the index +range from ('127.0.0.1', datetime(2000,10,10)) to ('127.0.0.1', +datetime(2000,10,11)) to satisfy the above query, whereas if we used +(time, host), the index range would be (datetime(2000,10,10), MIN\_KEY) +to (datetime(2000,10,10), MAX\_KEY), a much larger range (in this case, +1296 entries) which will yield a correspondingly slower performance. + +Although the index order has an impact on the performance of the query, +one thing to keep in mind is that an index scan is *much* faster than a +collection scan. So using a (time, host) index would still be much +faster than an index on (time) alone. There is also the issue of +right-alignedness to consider, as the (time, host) index will be +right-aligned but the (host, time) index will not, and it's possible +that the right-alignedness of a (time, host) index will make up for the +increased number of index entries that need to be visited to satisfy +this query. + +Counting the number of requests by day and page +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +MongoDB 2.1 introduced a new aggregation framework that allows us to +perform queries that aggregate large numbers of documents significantly +faster than the old 'mapreduce' and 'group' commands in prior versions +of MongoDB. Suppose we want to find out how many requests there were for +each day and page over the last month, for instance. In this case, we +could build up the following aggregation pipeline: + +\` + +``>>> result = db.command('aggregate', 'events', pipeline=[`` + +``... { '$match': {`` + +``... 'time': {`` + +``... '$gte': datetime(2000,10,1),`` + +``... '$lt': datetime(2000,11,1) } } },`` + +``... { '$project': {`` + +``... 'path': 1,`` + +``... 'date': {`` + +``... 'y': { '$year': '$time' },`` + +``... 'm': { '$month': '$time' },`` + +``... 'd': { '$dayOfMonth': '$time' } } } },`` + +``... { '$group': {`` + +``... '_id': {`` + +``... 'p':'$path',`` + +``... 'y': '$date.y',`` + +``... 'm': '$date.m',`` + +``... 'd': '$date.d' },`` + +``... 'hits': { '$sum': 1 } } },`` + +``... ])`` + +\` + +The performance of this aggregation is dependent, of course, on our +choice of shard key if we're sharding. What we'd like to ensure is that +all the items in a particular 'group' are on the same server, which we +can do by sharding on date (probably not wise, as we discuss below) or +path (possibly a good idea). + +Index support +^^^^^^^^^^^^^ + +In this case, we want to make sure we have an index on the initial +$match query: + +\` + +``>>> db.events.ensure_index('time')`` + +\` + +If we already have an index on ('time', 'host') as discussed above, +however, there is no need to create a separate index on 'time' alone, +since the ('time', 'host') index can be used to satisfy range queries on +'time' alone. 
+ +Sharding +-------- + +Our insertion rate is going to be limited by the number of shards we +maintain in our cluster as well as by our choice of a shard key. The +choice of a shard key is important because MongoDB uses *range-based +sharding* . What we *want* to happen is for the insertions to be +balanced equally among the shards, so we want to avoid using something +like a timestamp, sequence number, or ObjectId as a shard key, as new +inserts would tend to cluster around the same values (and thus the same +shard). But what we also *want* to happen is for each of our queries to +be routed to a single shard. Here, we discuss the pros and cons of each +approach. + +Option 0: Shard on time +~~~~~~~~~~~~~~~~~~~~~~~ + +Although an ObjectId or timestamp might seem to be an attractive +sharding key at first, particularly given the right-alignedness of the +index, it turns out to provide the worst of all worlds when it comes to +read and write performance. In this case, all of our inserts will always +flow to the same shard, providing no performance benefit write-side from +sharding. Our reads will also tend to cluster in the same shard, so we +would get no performance benefit read-side either. + +Option 1: Shard on a random(ish) key +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Supose instead that we decided to shard on a key with a random +distribution, say the md5 or sha1 hash of the '\_id' field: + +\` + +``>>> from bson import Binary`` + +``>>> from hashlib import sha1`` + +``>>>`` + +``>>> # Introduce the synthetic shard key (this should actually be done at`` + +``>>> # event insertion time)`` + +``>>>`` + +``>>> for ev in db.events.find({}, {'_id':1}):`` + +``... ssk = Binary(sha1(str(ev._id))).digest())`` + +``... db.events.update({'_id':ev['_id']}, {'$set': {'ssk': ssk} })`` + +``...`` + +``>>> db.command('shardcollection', 'events', {`` + +``... key : { 'ssk' : 1 } })`` + +``{ "collectionsharded" : "events", "ok" : 1 }`` + +This does introduce some complexity into our application in order to +generate the random key, but it provides us linear scaling on our +inserts, so 5 shards should yield a 5x speedup in inserting. The +downsides to using a random shard key are the following: a) the shard +key's index will tend to take up more space (and we need an index to +determine where to place each new insert), and b) queries (unless they +include our synthetic, random-ish shard key) will need to be distributed +to all our shards in parallel. This may be acceptable, since in our +scenario our write performance is much more important than our read +performance, but we should be aware of the downsides to using a random +key distribution. + +Option 2: Shard on a naturally evenly-distributed key +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this case, we might choose to shard on the 'path' attribute, since it +seems to be relatively evenly distributed: + +``>>> db.command('shardcollection', 'events', {`` + +``... key : { 'path' : 1 } })`` + +``{ "collectionsharded" : "events", "ok" : 1 }`` + +This has a couple of advantages: a) writes tend to be evenly balanced, +and b) reads tend to be selective (assuming they include the 'path' +attribute in the query). There is a potential downside to this approach, +however, particularly in the case where there are a limited number of +distinct values for the path. In that case, you can end up with large +shard 'chunks' that cannot be split or rebalanced because they contain +only a single shard key. 
The rule of thumb here is that we should not +pick a shard key which allows large numbers of documents to have the +same shard key since this prevents rebalancing. + +Option 3: Combine a natural and synthetic key +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This approach is perhaps the best combination of read and write +performance for our application. We can define the shard key to be +(path, sha1(\_id)): + +\` + +``>>> db.command('shardcollection', 'events', {`` + +``... key : { 'path' : 1, 'ssk': 1 } })`` + +``{ "collectionsharded" : "events", "ok" : 1 }`` + +We still need to calculate a synthetic key in the application client, +but in return we get good write balancing as well as good read +selectivity. + +Sharding conclusion: Test with your own data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Picking a good shard key is unfortunately still one of those decisions +that is simultaneously difficult to make, high-impact, and difficult to +change once made. The particular mix of reading and writing, as well as +the particular queries used, all have a large impact on the performance +of different sharding configurations. Although you can choose a +reasonable shard key based on the considerations above, the best +approach is to analyze the actual insertions and queries you are using +in your own application. + +Variation: Capped Collections +----------------------------- + +One variation that you may want to consider based on your data retention +requirements is whether you might be able to use a `capped +collection `_ to +store your events. Capped collections might be a good choice if you know +you will process the event documents in a timely manner and you don't +have exacting data retention requirements on the event data. Capped +collections have the advantage of never growing beyond a certain size +(they are allocated as a circular buffer on the disk) and having +documents 'fall out' of the buffer in their insertion order. Uncapped +collections (the default) will persist documents until they are +explicitly removed from the collection or the collection is dropped. + +Appendix: Managing Event Data Growth +------------------------------------ + +MongoDB databases, in the course of normal operation, never relinquish +disk space back to the file system. This can create difficulties if you +don't manage the size of your databases up front. For event data, we +have a few options for managing the data growth: + +Single Collection +~~~~~~~~~~~~~~~~~ + +This is the simplest option: keep all events in a single collection, +periodically removing documents that we don't need any more. The +advantage of simplicity, however, is offset by some performance +considerations. First, when we execute our remove, MongoDB will actually +bring the documents being removed into memory. Since these are documents +that presumably we haven't touched in a while (that's why we're deleting +them), this will force more relevant data to be flushed out to disk. +Second, in order to do a reasonably fast remove operation, we probably +want to keep an index on a timestamp field. This will tend to slow down +our inserts, as the inserts have to update the index as well as write +the event data. Finally, removing data periodically will also be the +option that has the most potential for fragmenting the database, as +MongoDB attempts to reuse the space freed by the remove operations for +new events. 
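To make the single-collection approach concrete, here is a minimal sketch
of the periodic remove described above. It assumes the ``events``
collection and ``time`` field used throughout this document, the legacy
``safe=True`` write style shown earlier, and an arbitrary 30-day retention
window chosen purely for illustration.

::

    from datetime import datetime, timedelta
    import pymongo

    db = pymongo.Connection().event_db

    # The periodic remove is a range query on 'time', so it needs this index --
    # which, as noted above, adds some cost to every insert.
    db.events.ensure_index('time')

    def prune_events(days_to_keep=30):
        """Remove events older than the retention window; return how many went away."""
        cutoff = datetime.utcnow() - timedelta(days=days_to_keep)
        # safe=True asks the server to acknowledge the remove so we can read
        # the affected-document count ('n') from the getLastError response.
        result = db.events.remove({'time': {'$lt': cutoff}}, safe=True)
        return result.get('n', 0)

Running something like this from a scheduled job keeps the collection
bounded, at the cost of the caching, indexing, and fragmentation caveats
just described.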
+ +Multiple Collections, Single Database +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Our next option is to periodically *rename* our event collection, +rotating collections in much the same way we might rotate log files. We +would then drop the oldest collection from the database. This has +several advantages over the single collection approach. First off, +collection renames are both fast and atomic. Secondly, we don't actually +have to touch any of the documents to drop a collection. Finally, since +MongoDB allocates storage in *extents* that are owned by collections, +dropping a collection will free up entire extents, mitigating the +fragmentation risk. The downside to using multiple collections is +increased complexity, since you will probably need to query both the +current event collection and the previous event collection for any data +analysis you perform. + +Multiple Databases +~~~~~~~~~~~~~~~~~~ + +In the multiple database option, we take the multiple collection option +a step further. Now, rather than rotating our collections, we will +rotate our databases. At the cost of rather increased complexity both in +insertions and queries, we do gain one benefit: as our databases get +dropped, disk space gets returned to the operating system. This option +would only really make sense if you had extremely variable event +insertion rates or if you had variable data retention requirements. For +instance, if you are performing a large backfill of event data and want +to make sure that the entire set of event data for 90 days is available +during the backfill, but can be reduced to 30 days in ongoing +operations, you might consider using multiple databases. The complexity +cost for multiple databases, however, is significant, so this option +should only be taken after thorough analysis. + +Page of From b0c2725d6e02023d1daa055222174db670686920 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Sat, 17 Mar 2012 19:18:52 -0700 Subject: [PATCH 02/20] Begin formatting cleanup of usecases Signed-off-by: Rick Copeland --- .../cms-_metadata_and_asset_management.txt | 542 +++++--------- .../usecase/cms-_storing_comments.txt | 614 ++++++--------- .../usecase/ecommerce-_category_hierarchy.txt | 193 ++--- .../ecommerce-_inventory_management.txt | 506 +++++-------- .../usecase/ecommerce-_product_catalog.txt | 703 ++++++------------ ...me_analytics-_hierarchical_aggregation.txt | 688 ++++++----------- ..._time_analytics-_preaggregated_reports.txt | 691 ++++++----------- .../real_time_analytics-_storing_log_data.txt | 429 ++++------- 8 files changed, 1539 insertions(+), 2827 deletions(-) diff --git a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt index 5e712c5b3fe..f03ec0fc448 100644 --- a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt +++ b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt @@ -43,67 +43,43 @@ a url-friendly representation of the node that is unique within its section, and is used for mapping URLs to nodes. 
Each document also contains a 'detail' field which will vary per document type: -``{`` - -``_id: ObjectId(…),`` - -``nonce: ObjectId(…),`` - -``metadata: {`` - -``type: 'basic-page'`` - -``section: 'my-photos',`` - -``slug: 'about',`` - -``title: 'About Us',`` - -``created: ISODate(…),`` - -``author: { _id: ObjectId(…), name: 'Rick' },`` - -``tags: [ … ],`` - -``detail: { text: '# About Us\n…' }`` - -``}`` - -``}`` - -\` +:: + + { + _id: ObjectId(…), + nonce: ObjectId(…), + metadata: { + type: 'basic-page' + section: 'my-photos', + slug: 'about', + title: 'About Us', + created: ISODate(…), + author: { _id: ObjectId(…), name: 'Rick' }, + tags: [ … ], + detail: { text: '# About Us\n…' } + } + } For the basic page above, the detail field might simply contain the text of the page. In the case of a blog entry, the document might resemble the following instead: -``{`` - -``…`` - -``metadata: {`` - -``…`` - -``type: 'blog-entry',`` - -``section: 'my-blog',`` - -``slug: '2012-03-noticed-the-news',`` - -``…`` - -``detail: {`` - -``publish_on: ISODate(…),`` - -``text: 'I noticed the news from Washington today…'`` - -``}`` - -``}`` - -``}`` +:: + + { + … + metadata: { + … + type: 'blog-entry', + section: 'my-blog', + slug: '2012-03-noticed-the-news', + … + detail: { + publish_on: ISODate(…), + text: 'I noticed the news from Washington today…' + } + } + } Photos present something of a special case. Since we will need to store potentially very large photos, we would like separate our binary storage @@ -115,50 +91,32 @@ our case, we will call the two collections 'cms.assets.files' and collection to store the normal GridFS metadata as well as our node metadata: -``{`` - -``_id: ObjectId(…),`` - -``length: 123...,`` - -``chunkSize: 262144,`` - -``uploadDate: ISODate(…),`` - -``contentType: 'image/jpeg',`` - -``md5: 'ba49a...',`` - -``metadata: {`` - -``nonce: ObjectId(…),`` - -``slug: '2012-03-invisible-bicycle',`` - -``type: 'photo',`` - -``section: 'my-album',`` - -``title: 'Kitteh',`` - -``created: ISODate(…),`` - -``author: { _id: ObjectId(…), name: 'Jared' },`` - -``tags: [ … ],`` - -``detail: {`` - -``filename: 'kitteh_invisible_bike.jpg',`` - -``resolution: [ 1600, 1600 ], … }`` - -``}`` - -``}`` +:: + + { + _id: ObjectId(…), + length: 123..., + chunkSize: 262144, + uploadDate: ISODate(…), + contentType: 'image/jpeg', + md5: 'ba49a...', + metadata: { + nonce: ObjectId(…), + slug: '2012-03-invisible-bicycle', + type: 'photo', + section: 'my-album', + title: 'Kitteh', + created: ISODate(…), + author: { _id: ObjectId(…), name: 'Jared' }, + tags: [ … ], + detail: { + filename: 'kitteh_invisible_bike.jpg', + resolution: [ 1600, 1600 ], … } + } + } Here, we have embedded the schema for our 'normal' nodes so we can share -node- manipulation code among all types of nodes. +node-manipulation code among all types of nodes. Operations ---------- @@ -174,71 +132,49 @@ The content producers using our CMS will be creating and editing content most of the time. 
Most content-creation activities are relatively straightforward: -``db.cms.nodes.insert({`` - -``'nonce': ObjectId(),`` - -``'metadata': {`` - -``'section': 'myblog',`` - -``'slug': '2012-03-noticed-the-news',`` - -``'type': 'blog-entry',`` - -``'title': 'Noticed in the News',`` - -``'created': datetime.utcnow(),`` - -``'author': { 'id': user_id, 'name': 'Rick' },`` - -``'tags': [ 'news', 'musings' ],`` - -``'detail': {`` - -``'publish_on': datetime.utcnow(),`` - -``'text': 'I noticed the news from Washington today…' }`` - -``}`` - -``})`` +:: + + db.cms.nodes.insert({ + 'nonce': ObjectId(), + 'metadata': { + 'section': 'myblog', + 'slug': '2012-03-noticed-the-news', + 'type': 'blog-entry', + 'title': 'Noticed in the News', + 'created': datetime.utcnow(), + 'author': { 'id': user_id, 'name': 'Rick' }, + 'tags': [ 'news', 'musings' ], + 'detail': { + 'publish_on': datetime.utcnow(), + 'text': 'I noticed the news from Washington today…' } + } + }) Once the node is in the database, we have a potential problem with multiple editors. In order to support this, we use the special 'nonce' value to detect when another editor may have modified the document and allow the application to resolve any conflicts: -``def update_text(section, slug, nonce, text):`` - -``result = db.cms.nodes.update(`` - -``{ 'metadata.section': section,`` - -``'metadata.slug': slug,`` - -``'nonce': nonce },`` - -``{ '$set':{'metadata.detail.text': text, 'nonce': ObjectId() } },`` +:: -``safe=True)`` - -``if not result['updatedExisting']:`` - -``raise ConflictError()`` - -\` + def update_text(section, slug, nonce, text): + result = db.cms.nodes.update( + { 'metadata.section': section, + 'metadata.slug': slug, + 'nonce': nonce }, + { '$set':{'metadata.detail.text': text, 'nonce': ObjectId() } }, + safe=True) + if not result['updatedExisting']: + raise ConflictError() We might also want to perform metadata edits to the item such as adding tags: -``db.cms.nodes.update(`` - -``{ 'metadata.section': section, 'metadata.slug': slug },`` - -``{ '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } })`` +:: -\` + db.cms.nodes.update( + { 'metadata.section': section, 'metadata.slug': slug }, + { '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } }) In this case, we don't actually need to supply the nonce (nor update it) since we are using the atomic $addToSet modifier in MongoDB. @@ -250,21 +186,19 @@ Our updates in this case are based on equality queries containing the (section, slug, and nonce) values. To support these queries, we might use the following index: -``>>> db.cms.nodes.ensure_index([`` +:: -``... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ])`` - -\` + >>> db.cms.nodes.ensure_index([ + ... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ]) Also note, however, that we would like to ensure that two editors don't create two documents with the same section and slug. To support this, we will use a second index with a unique constraint: -\` - -``>>> db.cms.nodes.ensure_index([`` +:: -``... ('metadata.section', 1), ('metadata.slug', 1)], unique=True)`` + >>> db.cms.nodes.ensure_index([ + ... 
('metadata.section', 1), ('metadata.slug', 1)], unique=True) In fact, since we expect that most of the time (section, slug, nonce) is going to be unique, we don't actually get much benefit from the first @@ -277,55 +211,31 @@ Upload a photo Uploading photos to our site shares some things in common with node update, but it also has some extra nuances: -\` - -``def upload_new_photo(`` - -``input_file, section, slug, title, author, tags, details):`` - -``fs = GridFS(db, 'cms.assets')`` - -``with fs.new_file(`` - -``content_type='image/jpeg',`` - -``metadata=dict(`` - -``type='photo',`` - -``locked=datetime.utcnow(),`` - -``section=section,`` - -``slug=slug,`` - -``title=title,`` - -``created=datetime.utcnow(),`` - -``author=author,`` - -``tags=tags,`` - -``detail=detail)) as upload_file:`` - -``while True:`` - -``chunk = input_file.read(upload_file.chunk_size)`` - -``if not chunk: break`` - -``upload_file.write(chunk)`` - -``# unlock the file`` - -``db.assets.files.update(`` - -``{'_id': upload_file._id},`` - -``{'$set': { 'locked': None } } )`` - -\` +:: + + def upload_new_photo( + input_file, section, slug, title, author, tags, details): + fs = GridFS(db, 'cms.assets') + with fs.new_file( + content_type='image/jpeg', + metadata=dict( + type='photo', + locked=datetime.utcnow(), + section=section, + slug=slug, + title=title, + created=datetime.utcnow(), + author=author, + tags=tags, + detail=detail)) as upload_file: + while True: + chunk = input_file.read(upload_file.chunk_size) + if not chunk: break + upload_file.write(chunk) + # unlock the file + db.assets.files.update( + {'_id': upload_file._id}, + {'$set': { 'locked': None } } ) Here, since uploading the photo is a non-atomic operation, we have locked the file during upload by writing the current datetime into the @@ -333,76 +243,49 @@ record. This lets us detect when a file upload may be stalled, which is helpful when working with multiple editors. 
In this case, we will assume that the last update wins: -\` - -``def update_photo_content(input_file, section, slug):`` - -``fs = GridFS(db, 'cms.assets')`` - -\` - -``# Delete the old version if it's unlocked or was locked more than 5`` - -``# minutes ago`` - -``file_obj = db.cms.assets.find_one(`` - -``{ 'metadata.section': section,`` - -``'metadata.slug': slug,`` - -``'metadata.locked': None })`` - -``if file_obj is None:`` - -``threshold = datetime.utcnow() - timedelta(seconds=300)`` - -``file_obj = db.cms.assets.find_one(`` - -``{ 'metadata.section': section,`` - -``'metadata.slug': slug,`` - -``'metadata.locked': { '$lt': threshold } })`` - -``if file_obj is None: raise FileDoesNotExist()`` - -``fs.delete(file_obj['_id'])`` - -\` - -``# update content, keep metadata unchanged`` - -``file_obj['locked'] = datetime.utcnow()`` - -``with fs.new_file(**file_obj):`` - -``while True:`` - -``chunk = input_file.read(upload_file.chunk_size)`` - -``if not chunk: break`` - -``upload_file.write(chunk)`` - -``# unlock the file`` - -``db.assets.files.update(`` - -``{'_id': upload_file._id},`` - -``{'$set': { 'locked': None } } )`` +:: + + def update_photo_content(input_file, section, slug): + fs = GridFS(db, 'cms.assets') + + + # Delete the old version if it's unlocked or was locked more than 5 + # minutes ago + file_obj = db.cms.assets.find_one( + { 'metadata.section': section, + 'metadata.slug': slug, + 'metadata.locked': None }) + if file_obj is None: + threshold = datetime.utcnow() - timedelta(seconds=300) + file_obj = db.cms.assets.find_one( + { 'metadata.section': section, + 'metadata.slug': slug, + 'metadata.locked': { '$lt': threshold } }) + if file_obj is None: raise FileDoesNotExist() + fs.delete(file_obj['_id']) + + + # update content, keep metadata unchanged + file_obj['locked'] = datetime.utcnow() + with fs.new_file(**file_obj): + while True: + chunk = input_file.read(upload_file.chunk_size) + if not chunk: break + upload_file.write(chunk) + # unlock the file + db.assets.files.update( + {'_id': upload_file._id}, + {'$set': { 'locked': None } } ) We can, of course, perform metadata edits to the item such as adding tags without the extra complexity: -``db.cms.assets.files.update(`` - -``{ 'metadata.section': section, 'metadata.slug': slug },`` - -``{ '$addToSet': {`` +:: -``'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } })`` + db.cms.assets.files.update( + { 'metadata.section': section, 'metadata.slug': slug }, + { '$addToSet': { + 'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } }) Index support ^^^^^^^^^^^^^ @@ -414,11 +297,10 @@ unique constraint on (section, slug) to ensure that one of the calls to GridFS.new\_file() will fail multiple editors try to create or update the file simultaneously. -\` +:: -``>>> db.cms.assets.files.ensure_index([`` - -``... ('metadata.section', 1), ('metadata.slug', 1)], unique=True)`` + >>> db.cms.assets.files.ensure_index([ + ... ('metadata.section', 1), ('metadata.slug', 1)], unique=True) Locate and render a node ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -427,9 +309,10 @@ We want to be able to locate a node based on its section and slug, which we assume have been extracted from the page definition and URL by some other technology. 
-``node = db.nodes.find_one(`` +:: -``{'metadata.section': section, 'metadata.slug': slug })`` + node = db.nodes.find_one( + {'metadata.section': section, 'metadata.slug': slug }) Index support ^^^^^^^^^^^^^ @@ -444,13 +327,12 @@ We want to be able to locate an image based on its section and slug, which we assume have been extracted from the page definition and URL just as with other nodes. -``fs = GridFS(db, 'cms.assets')`` - -``with fs.get_version(`` +:: -``**{'metadata.section': section, 'metadata.slug': slug }) as img_fp:`` - -``# do something with our image file`` + fs = GridFS(db, 'cms.assets') + with fs.get_version( + **{'metadata.section': section, 'metadata.slug': slug }) as img_fp: + # do something with our image file Index support ^^^^^^^^^^^^^ @@ -463,9 +345,9 @@ Search for nodes by tag Here we would like to retrieve a list of nodes based on their tag: -\` +:: -``nodes = db.nodes.find({'metadata.tags': tag })`` + nodes = db.nodes.find({'metadata.tags': tag }) Index support ^^^^^^^^^^^^^ @@ -473,30 +355,23 @@ Index support To support searching efficiently, we should define indexes on any fields we intend on using in our query: -\` - -``>>> db.cms.nodes.ensure_index('tags')`` +:: -\` + >>> db.cms.nodes.ensure_index('tags') Search for images by tag ~~~~~~~~~~~~~~~~~~~~~~~~ Here we would like to retrieve a list of images based on their tag: -\` +:: -``image_file_objects = db.cms.assets.files.find({'metadata.tags': tag })`` - -``fs = GridFS(db, 'cms.assets')`` - -``for image_file_object in db.cms.assets.files.find(`` - -``{'metadata.tags': tag }):`` - -``image_file = fs.get(image_file_object['_id'])`` - -``# do something with the image file`` + image_file_objects = db.cms.assets.files.find({'metadata.tags': tag }) + fs = GridFS(db, 'cms.assets') + for image_file_object in db.cms.assets.files.find( + {'metadata.tags': tag }): + image_file = fs.get(image_file_object['_id']) + # do something with the image file Index support ^^^^^^^^^^^^^ @@ -504,9 +379,9 @@ Index support As above, in order to support searching efficiently, we should define indexes on any fields we intend on using in our query: -\` +:: -``>>> db.cms.assets.files.ensure_index('tags')`` + >>> db.cms.assets.files.ensure_index('tags') Generate a feed of recently published blog articles ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -514,15 +389,12 @@ Generate a feed of recently published blog articles Here, we wish to generate an .rss or .atom feed for our recently published blog articles, sorted by date descending: -\` +:: -``articles = db.nodes.find({`` - -``'metadata.section': 'my-blog'`` - -``'metadata.published': { '$lt': datetime.utcnow() } })`` - -``articles = articles.sort({'metadata.published': -1})`` + articles = db.nodes.find({ + 'metadata.section': 'my-blog' + 'metadata.published': { '$lt': datetime.utcnow() } }) + articles = articles.sort({'metadata.published': -1}) In order to support this operation, we will create an index on (section, published) so the items are 'in order' for our query. Note that in cases @@ -530,11 +402,10 @@ where we are sorting or using range queries, as here, the field on which we're sorting or performing a range query must be the final field in our index: -\` - -``>>> db.cms.nodes.ensure_index(`` +:: -``... [ ('metadata.section', 1), ('metadata.published', -1) ])`` + >>> db.cms.nodes.ensure_index( + ... 
[ ('metadata.section', 1), ('metadata.published', -1) ]) Sharding -------- @@ -542,43 +413,32 @@ Sharding In a CMS system, our read performance is generally much more important than our write performance. As such, we will optimize the sharding setup for read performance. In order to achieve the best read performance, we -need to ensure that queries are *routeable* by the mongos process. - -A second consideration when sharding is that unique indexes do not span +need to ensure that queries are *routeable* by the mongos process. A +second consideration when sharding is that unique indexes do not span shards. As such, our shard key must include the unique indexes we have defined in order to get the same semantics as we have described. Given these constraints, sharding the nodes and assets on (section, slug) seems to be a reasonable approach: -\` - -``>>> db.command('shardcollection', 'cms.nodes', {`` - -``... key : { 'metadata.section': 1, 'metadata.slug' : 1 } })`` - -``{ "collectionsharded" : "cms.nodes", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'cms.assets.files', {`` +:: -``... key : { 'metadata.section': 1, 'metadata.slug' : 1 } })`` - -``{ "collectionsharded" : "cms.assets.files", "ok" : 1 }`` - -\` + >>> db.command('shardcollection', 'cms.nodes', { + ... key : { 'metadata.section': 1, 'metadata.slug' : 1 } }) + { "collectionsharded" : "cms.nodes", "ok" : 1 } + >>> db.command('shardcollection', 'cms.assets.files', { + ... key : { 'metadata.section': 1, 'metadata.slug' : 1 } }) + { "collectionsharded" : "cms.assets.files", "ok" : 1 } If we wish to shard our 'cms.assets.chunks' collection, we need to shard on the \_id field (none of our metadata is available on the chunks collection in gridfs): -\` +:: -``>>> db.command('shardcollection', 'cms.assets.chunks'`` - -``{ "collectionsharded" : "cms.assets.chunks", "ok" : 1 }`` + >>> db.command('shardcollection', 'cms.assets.chunks' + { "collectionsharded" : "cms.assets.chunks", "ok" : 1 } This actually still maintains our query-routability constraint, since all reads from gridfs must first look up the document in 'files' and then look up the chunks separately (though the GridFS API sometimes -hides this detail from us.) - -Page of +hides this detail from us.) Page of diff --git a/source/tutorial/usecase/cms-_storing_comments.txt b/source/tutorial/usecase/cms-_storing_comments.txt index 9b685cd18c8..1a3bccb61e8 100644 --- a/source/tutorial/usecase/cms-_storing_comments.txt +++ b/source/tutorial/usecase/cms-_storing_comments.txt @@ -42,25 +42,16 @@ Schema design: One Document Per Comment A comment in the one document per comment format might have a structure similar to the following: -\` +:: -``{`` - -``_id: ObjectId(…),`` - -``discussion_id: ObjectId(…),`` - -``slug: '34db',`` - -``posted: ISODateTime(…),`` - -``author: { id: ObjectId(…), name: 'Rick' },`` - -``text: 'This is so bogus … '`` - -``}`` - -\` + { + _id: ObjectId(…), + discussion_id: ObjectId(…), + slug: '34db', + posted: ISODateTime(…), + author: { id: ObjectId(…), name: 'Rick' }, + text: 'This is so bogus … ' + } The format above is really only suitable for chronological display of commentary. We maintain a reference to the discussion in which this @@ -69,29 +60,18 @@ and author, and the comment text. 
If we want to support threading in this format, we need to maintain some notion of hierarchy in the comment model as well: -\` - -``{`` - -``_id: ObjectId(…),`` +:: -``discussion_id: ObjectId(…),`` - -``parent_id: ObjectId(…),`` - -``slug: '34db/8bda',`` - -``full_slug: '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16',`` - -``posted: ISODateTime(…),`` - -``author: { id: ObjectId(…), name: 'Rick' },`` - -``text: 'This is so bogus … '`` - -``}`` - -\` + { + _id: ObjectId(…), + discussion_id: ObjectId(…), + parent_id: ObjectId(…), + slug: '34db/8bda', + full_slug: '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16', + posted: ISODateTime(…), + author: { id: ObjectId(…), name: 'Rick' }, + text: 'This is so bogus … ' + } Here, we have stored some extra information into the document that represents this document's position in the hierarchy. In addition to @@ -113,70 +93,48 @@ Post a new comment In order to post a new comment in a chronologically ordered (unthreaded) system, all we need to do is the following: -``slug = generate_psuedorandom_slug()`` - -``db.comments.insert({`` - -``'discussion_id': discussion_id,`` +:: -``'slug': slug,`` - -``'posted': datetime.utcnow(),`` - -``'author': author_info,`` - -``'text': comment_text })`` + slug = generate_psuedorandom_slug() + db.comments.insert({ + 'discussion_id': discussion_id, + 'slug': slug, + 'posted': datetime.utcnow(), + 'author': author_info, + 'text': comment_text }) In the case of a threaded discussion, we have a bit more work to do in order to generate a 'pathed' slug and full\_slug: -``posted = datetime.utcnow()`` - -\` - -``# generate the unique portions of the slug and full_slug`` - -``slug_part = generate_psuedorandom_slug()`` - -``full_slug_part = slug_part + ':' + posted.strftime(`` - -``'%Y.%m.%d.%H.%M.%S')`` - -\` - -``# load the parent comment (if any)`` - -``if parent_slug:`` - -``parent = db.comments.find_one(`` - -``{'discussion_id': discussion_id, 'slug': parent_slug })`` - -``slug = parent['slug'] + '/' + slug_part`` - -``full_slug = parent['full_slug'] + '/' + full_slug_part`` - -``else:`` - -``slug = slug_part`` - -``full_slug = full_slug_part`` - -\` +:: -``# actually insert the comment`` + posted = datetime.utcnow() -``db.comments.insert({`` -``'discussion_id': discussion_id,`` + # generate the unique portions of the slug and full_slug + slug_part = generate_psuedorandom_slug() + full_slug_part = slug_part + ':' + posted.strftime( + '%Y.%m.%d.%H.%M.%S') -``'slug': slug, 'full_slug': full_slug,`` -``'posted': posted,`` + # load the parent comment (if any) + if parent_slug: + parent = db.comments.find_one( + {'discussion_id': discussion_id, 'slug': parent_slug }) + slug = parent['slug'] + '/' + slug_part + full_slug = parent['full_slug'] + '/' + full_slug_part + else: + slug = slug_part + full_slug = full_slug_part -``'author': author_info,`` -``'text': comment_text })`` + # actually insert the comment + db.comments.insert({ + 'discussion_id': discussion_id, + 'slug': slug, 'full_slug': full_slug, + 'posted': posted, + 'author': author_info, + 'text': comment_text }) View the (paginated) comments for a discussion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -184,27 +142,23 @@ View the (paginated) comments for a discussion To actually view the comments in the non-threaded design, we need merely to select all comments participating in a discussion, sorted by date: -``cursor = db.comments.find({'discussion_id': discussion_id})`` +:: -``cursor = cursor.sort('posted')`` - -``cursor = cursor.skip(page_num * page_size)`` 
- -``cursor = cursor.limit(page_size)`` - -\` + cursor = db.comments.find({'discussion_id': discussion_id}) + cursor = cursor.sort('posted') + cursor = cursor.skip(page_num * page_size) + cursor = cursor.limit(page_size) Since the full\_slug embeds both hierarchical information via the path and chronological information, we can use a simple sort on the full\_slug property to retrieve a threaded view: -``cursor = db.comments.find({'discussion_id': discussion_id})`` - -``cursor = cursor.sort('full_slug')`` - -``cursor = cursor.skip(page_num * page_size)`` +:: -``cursor = cursor.limit(page_size)`` + cursor = db.comments.find({'discussion_id': discussion_id}) + cursor = cursor.sort('full_slug') + cursor = cursor.skip(page_num * page_size) + cursor = cursor.limit(page_size) Index support ^^^^^^^^^^^^^ @@ -213,13 +167,12 @@ In order to efficiently support the queries above, we should maintain two compound indexes, one on (discussion\_id, posted), and the other on (discussion\_id, full\_slug): -``>>> db.comments.ensure_index([`` +:: -``... ('discussion_id', 1), ('posted', 1)])`` - -``>>> db.comments.ensure_index([`` - -``... ('discussion_id', 1), ('full_slug', 1)])`` + >>> db.comments.ensure_index([ + ... ('discussion_id', 1), ('posted', 1)]) + >>> db.comments.ensure_index([ + ... ('discussion_id', 1), ('full_slug', 1)]) Note that we must ensure that the final element in a compound index is the field by which we are sorting to ensure efficient performance of @@ -232,23 +185,22 @@ Here, we wish to directly retrieve a comment (e.g. *not* requiring paging through all preceeding pages of commentary). In this case, we simply use the slug: -``comment = db.comments.find_one({`` - -``'discussion_id': discussion_id,`` +:: -``'slug': comment_slug})`` + comment = db.comments.find_one({ + 'discussion_id': discussion_id, + 'slug': comment_slug}) We can also retrieve a sub-discussion (a comment and all of its descendants recursively) by performing a prefix query on the full\_slug field: -``subdiscussion = db.comments.find_one({`` - -``'discussion_id': discussion_id,`` - -``'full_slug': re.compile('^' + re.escape(parent_slug)) })`` +:: -``subdiscussion = subdiscussion.sort('full_slug')`` + subdiscussion = db.comments.find_one({ + 'discussion_id': discussion_id, + 'full_slug': re.compile('^' + re.escape(parent_slug)) }) + subdiscussion = subdiscussion.sort('full_slug') Index support ^^^^^^^^^^^^^ @@ -257,9 +209,10 @@ Since we already have indexes on (discussion\_id, full\_slug) to support retrieval of subdiscussion, all we need is an index on (discussion\_id, slug) to efficiently support retrieval of a comment by 'permalink': -``>>> db.comments.ensure_index([`` +:: -``... ('discussion_id', 1), ('slug', 1)])`` + >>> db.comments.ensure_index([ + ... ('discussion_id', 1), ('slug', 1)]) Schema design: All comments embedded ------------------------------------ @@ -268,27 +221,17 @@ In this design, we wish to embed an entire discussion within its topic document, be it a blog article, news story, or discussion thread. 
A topic document, then, might look something like the following: -\` - -``{`` - -``_id: ObjectId(…),`` - -``… lots of topic data …`` - -``comments: [`` - -``{ posted: ISODateTime(…),`` - -``author: { id: ObjectId(…), name: 'Rick' },`` - -``text: 'This is so bogus … ' },`` - -``… ]`` - -``}`` +:: -\` + { + _id: ObjectId(…), + … lots of topic data … + comments: [ + { posted: ISODateTime(…), + author: { id: ObjectId(…), name: 'Rick' }, + text: 'This is so bogus … ' }, + … ] + } The format above is really only suitable for chronological display of commentary. The comments are embedded in chronological order, with their @@ -297,36 +240,24 @@ comments in sorted order, there is no need to maintain a slug per comment. If we want to support threading in the embedded format, we need to embed comments within comments: -\` +:: -``{`` + { + _id: ObjectId(…), + … lots of topic data … + replies: [ + { posted: ISODateTime(…), + author: { id: ObjectId(…), name: 'Rick' }, -``_id: ObjectId(…),`` -``… lots of topic data …`` - -``replies: [`` - -``{ posted: ISODateTime(…),`` - -``author: { id: ObjectId(…), name: 'Rick' },`` - -\` - -``text: 'This is so bogus … ',`` - -``replies: [`` - -``{ author: { … }, … },`` - -``… ]`` - -``}`` - -\` + text: 'This is so bogus … ', + replies: [ + { author: { … }, … }, + … ] + } Here, we have added a 'replies' property to each comment which can hold -sub- comments and so on. One thing in particular to note about the +sub-comments and so on. One thing in particular to note about the embedded document formats is we give up some flexibility when we embed the documents, effectively 'baking in' the decisions we've made about the proper display format. If we (or our users) someday wish to switch @@ -355,17 +286,14 @@ Post a new comment In order to post a new comment in a chronologically ordered (unthreaded) system, all we need to do is the following: -``db.discussion.update(`` +:: -``{ 'discussion_id': discussion_id },`` - -``{ '$push': { 'comments': {`` - -``'posted': datetime.utcnow(),`` - -``'author': author_info,`` - -``'text': comment_text } } } )`` + db.discussion.update( + { 'discussion_id': discussion_id }, + { '$push': { 'comments': { + 'posted': datetime.utcnow(), + 'author': author_info, + 'text': comment_text } } } ) Note that since we use the $push operator, all the comments will be inserted in their correct chronological order. In the case of a threaded @@ -373,31 +301,20 @@ discussion, we have a good bit more work to do. In order to reply to a comment, we will assume that we have the 'path' to the comment we are replying to as a list of positions: -``if path != []:`` - -``str_path = '.'.join('replies.%d' % part for part in path)`` - -``str_path += '.replies'`` - -``else:`` - -``str_path = 'replies'`` - -``db.discussion.update(`` - -``{ 'discussion_id': discussion_id },`` - -``{ '$push': {`` - -``str_path: {`` - -``'posted': datetime.utcnow(),`` - -``'author': author_info,`` - -``'text': comment_text } } } )`` - -\` +:: + + if path != []: + str_path = '.'.join('replies.%d' % part for part in path) + str_path += '.replies' + else: + str_path = 'replies' + db.discussion.update( + { 'discussion_id': discussion_id }, + { '$push': { + str_path: { + 'posted': datetime.utcnow(), + 'author': author_info, + 'text': comment_text } } } ) Here, we first construct a field name of the form 'replies.0.replies.2...' 
as str\_path and then use that to $push the new @@ -409,44 +326,33 @@ View the (paginated) comments for a discussion To actually view the comments in the non-threaded design, we need to use the $slice operator: -``discussion = db.discussion.find_one(`` - -``{'discussion_id': discussion_id},`` - -``{ … some fields relevant to our page from the root discussion …,`` - -``'comments': { '$slice': [ page_num * page_size, page_size ] }`` +:: -``})`` - -\` + discussion = db.discussion.find_one( + {'discussion_id': discussion_id}, + { … some fields relevant to our page from the root discussion …, + 'comments': { '$slice': [ page_num * page_size, page_size ] } + }) If we wish to view paginated comments for the threaded design, we need to do retrieve the whole document and paginate in our application: -``discussion = db.discussion.find_one({'discussion_id': discussion_id})`` - -\` - -``def iter_comments(obj):`` - -``for reply in obj['replies']:`` - -``yield reply`` +:: -``for subreply in iter_comments(reply):`` + discussion = db.discussion.find_one({'discussion_id': discussion_id}) -``yield subreply`` -\` + def iter_comments(obj): + for reply in obj['replies']: + yield reply + for subreply in iter_comments(reply): + yield subreply -``paginated_comments = itertools.slice(`` -``iter_comments(discussion),`` - -``page_size * page_num,`` - -``page_size * (page_num + 1))`` + paginated_comments = itertools.slice( + iter_comments(discussion), + page_size * page_num, + page_size * (page_num + 1)) Retrieve a comment via position or path ("permalink") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -456,28 +362,23 @@ position in the comment list or tree. In the case of the chronological (non-threaded) design, we need simply to use the $slice operator to extract the correct comment: -``discussion = db.discussion.find_one(`` - -``{'discussion_id': discussion_id},`` - -``{'comments': { '$slice': [ position, position ] } })`` - -``comment = discussion['comments'][0]`` +:: -\` + discussion = db.discussion.find_one( + {'discussion_id': discussion_id}, + {'comments': { '$slice': [ position, position ] } }) + comment = discussion['comments'][0] In the case of the threaded design, we are faced with the task of finding the correct path through the tree in our application: -``discussion = db.discussion.find_one({'discussion_id': discussion_id})`` +:: -``current = discussion`` - -``for part in path:`` - -``current = current.replies[part]`` - -``comment = current`` + discussion = db.discussion.find_one({'discussion_id': discussion_id}) + current = discussion + for part in path: + current = current.replies[part] + comment = current Note that, since the replies to comments are embedded in their parents, we have actually retrieved the entire sub-discussion rooted in the @@ -489,33 +390,20 @@ Schema design: Hybrid Comments in the hybrid format are stored in 'buckets' of about 100 comments each: -\` - -``{`` - -``_id: ObjectId(…),`` - -``discussion_id: ObjectId(…),`` - -``page: 1,`` - -``count: 42,`` - -``comments: [ {`` - -``slug: '34db',`` - -``posted: ISODateTime(…),`` - -``author: { id: ObjectId(…), name: 'Rick' },`` - -``text: 'This is so bogus … ' },`` - -``… ]`` - -``}`` - -\` +:: + + { + _id: ObjectId(…), + discussion_id: ObjectId(…), + page: 1, + count: 42, + comments: [ { + slug: '34db', + posted: ISODateTime(…), + author: { id: ObjectId(…), name: 'Rick' }, + text: 'This is so bogus … ' }, + … ] + } Here, we have a 'page' of comment data, containing a bit of metadata about the page (in particular, the page 
number and the comment count), @@ -546,23 +434,17 @@ assume that we already have a reference to the discussion document, and that the discussion document has a property that tracks the number of pages: -``page = db.comment_pages.find_and_modify(`` - -``{ 'discussion_id': discussion['_id'],`` - -``'page': discussion['num_pages'] },`` - -``{ '$inc': { 'count': 1 },`` +:: -``'$push': {`` - -``'comments': { 'slug': slug, … } } },`` - -``fields={'count':1},`` - -``upsert=True,`` - -``new=True )`` + page = db.comment_pages.find_and_modify( + { 'discussion_id': discussion['_id'], + 'page': discussion['num_pages'] }, + { '$inc': { 'count': 1 }, + '$push': { + 'comments': { 'slug': slug, … } } }, + fields={'count':1}, + upsert=True, + new=True ) Note that we have written the find\_and\_modify above as an upsert operation; if we don't find the page number, the find\_and\_modify will @@ -570,21 +452,19 @@ create it for us, initialized with appropriate values for 'count' and 'comments'. Since we are limiting the number of comments per page, we also need to create new pages as they become necessary: -``if page['count'] > 100:`` - -``db.discussion.update(`` +:: -``{ 'discussion_id: discussion['_id'],`` - -``'num_pages': discussion['num_pages'] },`` - -``{ '$inc': { 'num_pages': 1 } } )`` + if page['count'] > 100: + db.discussion.update( + { 'discussion_id: discussion['_id'], + 'num_pages': discussion['num_pages'] }, + { '$inc': { 'num_pages': 1 } } ) Our update here includes the last know number of pages in the query to ensure we don't have a race condition where the number of pages is -double- incremented, resulting in a nearly or totally empty page. If -some other process has incremented the number of pages in the -discussion, then update above simply does nothing. +double-incremented, resulting in a nearly or totally empty page. If some +other process has incremented the number of pages in the discussion, +then update above simply does nothing. Index support ^^^^^^^^^^^^^ @@ -593,9 +473,10 @@ In order to efficiently support our find\_and\_modify and update operations above, we need to maintain a compound index on (discussion\_id, page) in the comment\_pages collection: -``>>> db.comment_pages.ensure_index([`` +:: -``... ('discussion_id', 1), ('page', 1)])`` + >>> db.comment_pages.ensure_index([ + ... 
('discussion_id', 1), ('page', 1)]) View the (paginated) comments for a discussion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -603,31 +484,20 @@ View the (paginated) comments for a discussion In order to paginate our comments with a fixed page size, we need to do a bit of extra work in Python: -``def find_comments(discussion_id, skip, limit):`` - -``result = []`` - -``page_query = db.comment_pages.find(`` - -``{ 'discussion_id': discussion_id },`` - -``{ 'count': 1, 'comments': { '$slice': [ skip, limit ] } })`` - -``page_query = page_query.sort('page')`` - -``for page in page_query:`` - -``result += page['comments']`` - -``skip = max(0, skip - page['count'])`` - -``limit -= len(page['comments'])`` - -``if limit == 0: break`` - -``return result`` - -\` +:: + + def find_comments(discussion_id, skip, limit): + result = [] + page_query = db.comment_pages.find( + { 'discussion_id': discussion_id }, + { 'count': 1, 'comments': { '$slice': [ skip, limit ] } }) + page_query = page_query.sort('page') + for page in page_query: + result += page['comments'] + skip = max(0, skip - page['count']) + limit -= len(page['comments']) + if limit == 0: break + return result Here, we use the $slice operator to pull out comments from each page, but *only if we have satisfied our skip requirement* . An example will @@ -635,45 +505,22 @@ help illustrate the logic here. Suppose we have 3 pages with 100, 102, 101, and 22 comments on each. respectively. We wish to retrieve comments with skip=300 and limit=50. The algorithm proceeds as follows: -Skip +Skip Limit Discussion -Limit +300 50 {$slice: [ 300, 50 ] } matches no comments in page #1; subtract +page #1's count from 'skip' and continue -Discussion +200 50 {$slice: [ 200, 50 ] } matches no comments in page #2; subtract +page #2's count from 'skip' and continue -300 +98 50 {$slice: [ 98, 50 ] } matches 2 comments in page #3; subtract page +#3's count from 'skip' (saturating at 0), subtract 2 from limit, and +continue -50 +0 48 {$slice: [ 0, 48 ] } matches all 22 comments in page #4; subtract +22 from limit and continue -{$slice: [ 300, 50 ] } matches no comments in page #1; subtract page -#1's count from 'skip' and continue - -200 - -50 - -{$slice: [ 200, 50 ] } matches no comments in page #2; subtract page -#2's count from 'skip' and continue - -98 - -50 - -{$slice: [ 98, 50 ] } matches 2 comments in page #3; subtract page #3's -count from 'skip' (saturating at 0), subtract 2 from limit, and continue - -0 - -48 - -{$slice: [ 0, 48 ] } matches all 22 comments in page #4; subtract 22 -from limit and continue - -0 - -26 - -There are no more pages; terminate loop +0 26 There are no more pages; terminate loop Index support ^^^^^^^^^^^^^ @@ -690,19 +537,15 @@ paging through all preceeding pages of commentary). 
In this case, we can use the slug to find the correct page, and then use our application to find the correct comment: -``page = db.comment_pages.find_one(`` - -``{ 'discussion_id': discussion_id,`` - -``'comments.slug': comment_slug},`` - -``{ 'comments': 1 })`` +:: -``for comment in page['comments']:`` - -``if comment['slug'] = comment_slug:`` - -``break`` + page = db.comment_pages.find_one( + { 'discussion_id': discussion_id, + 'comments.slug': comment_slug}, + { 'comments': 1 }) + for comment in page['comments']: + if comment['slug'] = comment_slug: + break Index support ^^^^^^^^^^^^^ @@ -710,9 +553,10 @@ Index support Here, we need a new index on (discussion\_id, comments.slug) to efficiently support retrieving the page number of the comment by slug: -``>>> db.comment_pages.ensure_index([`` +:: -``... ('discussion_id', 1), ('comments.slug', 1)])`` + >>> db.comment_pages.ensure_index([ + ... ('discussion_id', 1), ('comments.slug', 1)]) Sharding -------- @@ -724,11 +568,11 @@ In the case of the one document per comment approach, it would be nice to use our slug (or full\_slug, in the case of threaded comments) as part of the shard key to allow routing of requests by slug: -``>>> db.command('shardcollection', 'comments', {`` +:: -``... key : { 'discussion_id' : 1, 'full_slug': 1 } })`` - -``{ "collectionsharded" : "comments", "ok" : 1 }`` + >>> db.command('shardcollection', 'comments', { + ... key : { 'discussion_id' : 1, 'full_slug': 1 } }) + { "collectionsharded" : "comments", "ok" : 1 } In the case of the fully-embedded comments, of course, the discussion is the only thing we need to shard, and its shard key will probably be @@ -737,10 +581,10 @@ determined by concerns outside the scope of this document. In the case of hybrid documents, we want to use the page number of the comment page in our shard key: -``>>> db.command('shardcollection', 'comment_pages', {`` - -``... key : { 'discussion_id' : 1, ``'page'``: 1 } })`` +:: -``{ "collectionsharded" : "comment_pages", "ok" : 1 }`` + >>> db.command('shardcollection', 'comment_pages', { + ... key : { 'discussion_id' : 1, ``'page'``: 1 } }) + { "collectionsharded" : "comment_pages", "ok" : 1 } Page of diff --git a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt index 2609defef2d..7d28d45b13d 100644 --- a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt +++ b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt @@ -14,7 +14,7 @@ We will keep each category in its own document, along with a list of its ancestors. The category hierarchy we will use in this solution will be based on different categories of music: -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sYoXu6LHwYVB_WXz1%20Y_k8XA&rev=27&h=250&w=443&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sYoXu6LHwYVB_WXz1Y_k8XA&rev=27&h=250&w=443&ac=1 :align: center :alt: Since categories change relatively infrequently, we will focus mostly in @@ -31,29 +31,20 @@ cross-referencing as well as a human-readable name and a url-friendly with each document to facilitate displaying a category along with all its ancestors in a single query. 
-``{ "_id" : ObjectId("4f5ec858eb03303a11000002"),`` - -``"name" : "Modal Jazz",`` - -``"parent" : ObjectId("4f5ec858eb03303a11000001"),`` - -``"slug" : "modal-jazz",`` - -``"ancestors" : [`` - -``{ "_id" : ObjectId("4f5ec858eb03303a11000001"),`` - -``"slug" : "bop",`` - -``"name" : "Bop" },`` - -``{ "_id" : ObjectId("4f5ec858eb03303a11000000"),`` - -``"slug" : "ragtime",`` - -``"name" : "Ragtime" } ]`` - -``}`` +:: + + { "_id" : ObjectId("4f5ec858eb03303a11000002"), + "name" : "Modal Jazz", + "parent" : ObjectId("4f5ec858eb03303a11000001"), + "slug" : "modal-jazz", + "ancestors" : [ + { "_id" : ObjectId("4f5ec858eb03303a11000001"), + "slug" : "bop", + "name" : "Bop" }, + { "_id" : ObjectId("4f5ec858eb03303a11000000"), + "slug" : "ragtime", + "name" : "Ragtime" } ] + } Operations ---------- @@ -69,13 +60,11 @@ case, we might want to display a category along with a list of 'bread crumbs' leading back up the hierarchy. In an E-commerce site, we will most likely have the slug of the category available for our query. -``category = db.categories.find(`` - -``{'slug':slug},`` +:: -``{'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 })`` - -\` + category = db.categories.find( + {'slug':slug}, + {'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 }) Here, we use the slug to retrieve the category and retrieve only those fields we wish to display. @@ -87,53 +76,42 @@ In order to support this common operation efficiently, we need an index on the 'slug' field. Since slug is also intended to be unique, we will add that constraint to our index as well: -``db.categories.ensure_index('slug', unique=True)`` +:: + + db.categories.ensure_index('slug', unique=True) Add a category to the hierarchy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Adding a category to a hierarchy is relatively simple. Suppose we wish -to add a new category 'Swing' as a child of 'Ragtime': +to add a new category 'Swing' as a child of 'Ragtime': |image0| -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sRXRjZMEZDN2azKBl%20sOoXoA&rev=7&h=250&w=443&ac=1 - :align: center - :alt: In this case, the initial insert is simple enough, but after this insert, we are still missing the ancestors array in the 'Swing' category. To define this, we will add a helper function to build our ancestor list: -\` - -``def build_ancestors(_id, parent_id):`` - -``parent = db.categories.find_one(`` - -``{'_id': parent_id},`` - -``{'name': 1, 'slug': 1, 'ancestors':1})`` - -``parent_ancestors = parent.pop('ancestors')`` - -``ancestors = [ parent ] + parent_ancestors`` - -``db.categories.update(`` +:: -``{'_id': _id},`` - -``{'$set': { 'ancestors': ancestors } })`` + def build_ancestors(_id, parent_id): + parent = db.categories.find_one( + {'_id': parent_id}, + {'name': 1, 'slug': 1, 'ancestors':1}) + parent_ancestors = parent.pop('ancestors') + ancestors = [ parent ] + parent_ancestors + db.categories.update( + {'_id': _id}, + {'$set': { 'ancestors': ancestors } }) Note that we only need to travel one level in our hierarchy to get the ragtime's ancestors and build swing's entire ancestor list. 
Now we can actually perform the insert and rebuild the ancestor list: -``doc = dict(name='Swing', slug='swing', parent=ragtime_id)`` - -``swing_id = db.categories.insert(doc)`` - -``build_ancestors(swing_id, ragtime_id)`` +:: -\` + doc = dict(name='Swing', slug='swing', parent=ragtime_id) + swing_id = db.categories.insert(doc) + build_ancestors(swing_id, ragtime_id) Index Support ^^^^^^^^^^^^^ @@ -148,57 +126,45 @@ Change the ancestry of a category Our goal here is to reorganize the hierarchy by moving 'bop' under 'swing': -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sFB8ph8n7c768f-%20MLTOkY-w&rev=6&h=354&w=443&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sFB8ph8n7c768f-MLTOkY-w&rev=6&h=354&w=443&ac=1 :align: center :alt: The initial update is straightforward: -``db.categories.update(`` +:: -``{'_id':bop_id}, {'$set': { 'parent': swing_id } } )`` + db.categories.update( + {'_id':bop_id}, {'$set': { 'parent': swing_id } } ) Now, we need to update the ancestor list for bop and all its descendants. In this case, we can't guarantee that the ancestor list of the parent category is always correct, however (since we may be -processing the categories out-of- order), so we will need a new +processing the categories out-of-order), so we will need a new ancestor-building function: -\` - -``def build_ancestors_full(_id, parent_id):`` - -``ancestors = []`` - -``while parent_id is not None:`` - -``parent = db.categories.find_one(`` - -``{'_id': parent_id},`` - -``{'parent': 1, 'name': 1, 'slug': 1, 'ancestors':1})`` - -``parent_id = parent.pop('parent')`` - -``ancestors.append(parent)`` - -``db.categories.update(`` +:: -``{'_id': _id},`` - -``{'$set': { 'ancestors': ancestors } })`` - -\` + def build_ancestors_full(_id, parent_id): + ancestors = [] + while parent_id is not None: + parent = db.categories.find_one( + {'_id': parent_id}, + {'parent': 1, 'name': 1, 'slug': 1, 'ancestors':1}) + parent_id = parent.pop('parent') + ancestors.append(parent) + db.categories.update( + {'_id': _id}, + {'$set': { 'ancestors': ancestors } }) Now, at the expense of a few more queries up the hierarchy, we can easily reconstruct all the descendants of 'bop': -``for cat in db.categories.find(`` - -``{'ancestors._id': bop_id},`` +:: -``{'parent_id': 1}):`` - -``build_ancestors_full(cat['_id'], cat['parent_id'])`` + for cat in db.categories.find( + {'ancestors._id': bop_id}, + {'parent_id': 1}): + build_ancestors_full(cat['_id'], cat['parent_id']) Index Support ^^^^^^^^^^^^^ @@ -206,9 +172,9 @@ Index Support In this case, an index on 'ancestors.\_id' would be helpful in determining which descendants need to be updated: -\` +:: -``db.categories.ensure_index('ancestors._id')`` + db.categories.ensure_index('ancestors._id') Renaming a category ~~~~~~~~~~~~~~~~~~~ @@ -217,28 +183,21 @@ Renaming a category would normally be an extremely quick operation, but in this case due to our denormalization, we also need to update the descendants. Here, we will rename 'Bop' to 'BeBop': -.. 
figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sqRIXKA2lGr_bm5ys%20M7KWQA&rev=3&h=354&w=443&ac=1 - :align: center - :alt: -First, we need to update the category name itself: +|image1| First, we need to update the category name itself: -\` +:: -``db.categories.update(`` - -``{'_id':bop_id}, {'$set': { 'name': 'BeBop' } } )`` + db.categories.update( + {'_id':bop_id}, {'$set': { 'name': 'BeBop' } } ) Next, we need to update each descendant's ancestors list: -``db.categories.update(`` - -``{'ancestors._id': bop_id},`` - -``{'$set': { 'ancestors.$.name': 'BeBop' } },`` +:: -``multi=True)`` - -\` + db.categories.update( + {'ancestors._id': bop_id}, + {'$set': { 'ancestors.$.name': 'BeBop' } }, + multi=True) Here, we use the positional operation '$' to match the exact 'ancestor' entry that matches our query, as well as the 'multi' option on our @@ -260,13 +219,13 @@ shard, the use of an \_id field for most of our updates makes \_id an ideal sharding candidate. The sharding commands we would use to shard the category collection would then be the following: -``>>> db.command('shardcollection', 'categories')`` - -``{ "collectionsharded" : "categories", "ok" : 1 }`` +:: -\` + >>> db.command('shardcollection', 'categories') + { "collectionsharded" : "categories", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. +default to using \_id as a shard key. Page of -Page of +.. |image0| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sRXRjZMEZDN2azKBlsOoXoA&rev=7&h=250&w=443&ac=1 +.. |image1| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sqRIXKA2lGr_bm5ysM7KWQA&rev=3&h=354&w=443&ac=1 diff --git a/source/tutorial/usecase/ecommerce-_inventory_management.txt b/source/tutorial/usecase/ecommerce-_inventory_management.txt index fe3d6fb4375..71016e619b9 100644 --- a/source/tutorial/usecase/ecommerce-_inventory_management.txt +++ b/source/tutorial/usecase/ecommerce-_inventory_management.txt @@ -24,7 +24,7 @@ for a certain period of time, all the items in the cart once again become part of available inventory and the cart is cleared. The state transition diagram for a shopping cart is below: -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sDw93URlN8GCsdNpA%20CSXCVA&rev=76&h=186&w=578&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sDw93URlN8GCsdNpACSXCVA&rev=76&h=186&w=578&ac=1 :align: center :alt: Schema design @@ -35,25 +35,18 @@ inventory of each stock-keeping unit (SKU) as well as a list of 'carted' items that may be released back to available inventory if their shopping cart times out: -``{`` +:: -``_id: '00e8da9b',`` - -``qty: 16,`` - -``carted: [`` - -``{ qty: 1, cart_id: 42,`` - -``timestamp: ISODate("2012-03-09T20:55:36Z"), },`` - -``{ qty: 2, cart_id: 43,`` - -``timestamp: ISODate("2012-03-09T21:55:36Z"), },`` - -``]`` - -``}`` + { + _id: '00e8da9b', + qty: 16, + carted: [ + { qty: 1, cart_id: 42, + timestamp: ISODate("2012-03-09T20:55:36Z"), }, + { qty: 2, cart_id: 43, + timestamp: ISODate("2012-03-09T21:55:36Z"), }, + ] + } (Note that, while in an actual implementation, we might choose to merge this schema with the product catalog schema described in "E-Commerce: @@ -65,23 +58,17 @@ for a total of 19 unsold items of merchandise. 
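+For illustration, the arithmetic behind the unsold-item count above can
+be expressed as a small helper. This is only a sketch: it assumes a
+pymongo ``db`` handle and the inventory document shown above, and
+``total_unsold`` is a hypothetical function rather than part of the
+schema:
+
+::
+
+    def total_unsold(sku):
+        # Unsold inventory is what remains on the shelf ('qty') plus
+        # whatever is currently reserved in shoppers' carts ('carted').
+        item = db.inventory.find_one({'_id': sku})
+        return item['qty'] + sum(c['qty'] for c in item['carted'])
+
+    total_unsold('00e8da9b')    # 16 + 1 + 2 == 19
+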
For our shopping cart model, we will maintain a list of (sku, quantity, price) line items: -``{`` - -``_id: 42,`` - -``last_modified: ISODate("2012-03-09T20:55:36Z"),`` +:: -``status: 'active',`` - -``items: [`` - -``{ sku: '00e8da9b', qty: 1, item_details: {...} },`` - -``{ sku: '0ab42f88', qty: 4, item_details: {...} }`` - -``]`` - -``}`` + { + _id: 42, + last_modified: ISODate("2012-03-09T20:55:36Z"), + status: 'active', + items: [ + { sku: '00e8da9b', qty: 1, item_details: {...} }, + { sku: '0ab42f88', qty: 4, item_details: {...} } + ] + } Note in the cart model that we have included item details in each line item. This allows us to display the contents of the cart to the user @@ -103,71 +90,47 @@ move an unavailable item off the shelf into the cart. To solve this problem, we will ensure that inventory is only updated if there is sufficient inventory to satisfy the request: -``def add_item_to_cart(cart_id, sku, qty, details):`` - -``now = datetime.utcnow()`` - -\` - -``# Make sure the cart is still active and add the line item`` - -``result = db.cart.update(`` - -``{'_id': cart_id, 'status': 'active' },`` - -``{ '$set': { 'last_modified': now },`` - -``'$push':`` - -``'items': {'sku': sku, 'qty':qty, 'details': details }`` - -``},`` - -``safe=True)`` - -``if not result['updatedExisting']:`` - -``raise CartInactive()`` - -\` - -``# Update the inventory`` - -``result = db.inventory.update(`` - -``{'_id':sku, 'qty': {'$gte': qty}},`` - -````{'$inc': {'qty': -qty}``[a]``,`` - -``'$push': {`` - -``'carted': { 'qty': qty, 'cart_id':cart_id,`` - -``'timestamp': now } } },`` - -``safe=True)`` - -``if not result['updatedExisting']:`` - -``# Roll back our cart update`` - -``db.cart.update(`` - -``{'_id': cart_id },`` - -``{ '$pull': { 'items': {'sku': sku } } }`` - -``)`` - -``raise InadequateInventory()`` +:: + + def add_item_to_cart(cart_id, sku, qty, details): + now = datetime.utcnow() + + + # Make sure the cart is still active and add the line item + result = db.cart.update( + {'_id': cart_id, 'status': 'active' }, + { '$set': { 'last_modified': now }, + '$push': + 'items': {'sku': sku, 'qty':qty, 'details': details } + }, + safe=True) + if not result['updatedExisting']: + raise CartInactive() + + + # Update the inventory + result = db.inventory.update( + {'_id':sku, 'qty': {'$gte': qty}}, + ``{'$inc': {'qty': -qty}`[a]`, + '$push': { + 'carted': { 'qty': qty, 'cart_id':cart_id, + 'timestamp': now } } }, + safe=True) + if not result['updatedExisting']: + # Roll back our cart update + db.cart.update( + {'_id': cart_id }, + { '$pull': { 'items': {'sku': sku } } } + ) + raise InadequateInventory() Note here in particular that we do not trust that the request is satisfiable. Our first check makes sure that the cart is still 'active' (more on inactive carts below) before adding a line item. Our next check verifies that sufficient inventory exists to satisfy the request before decrementing inventory. In the case of inadequate inventory, we -*compensate* for the non- transactional nature of MongoDB by removing -our cart update. Using safe=True and checking the result in the case of +*compensate* for the non-transactional nature of MongoDB by removing our +cart update. Using safe=True and checking the result in the case of these two updates allows us to report back an error to the user if the cart has become inactive or available quantity is insufficient to satisfy the request. @@ -186,67 +149,40 @@ cart. 
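+As a usage sketch, the application's web tier might invoke the function
+above as follows. The SKU and details shown are placeholders, and
+``CartInactive`` and ``InadequateInventory`` are assumed to be simple
+application-defined exception classes:
+
+::
+
+    try:
+        add_item_to_cart(cart_id=42, sku='00e8da9b', qty=1,
+                         details={'title': 'A Love Supreme'})
+    except CartInactive:
+        pass    # e.g. tell the user the cart has expired
+    except InadequateInventory:
+        pass    # e.g. tell the user the item is out of stock
+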
We must make sure that when they adjust the quantity upward, there is sufficient inventory to cover the quantity, as well as updating the particular 'carted' entry for the user's cart. -\` - -``def update_quantity(cart_id, sku, old_qty, new_qty):`` - -``now = datetime.utcnow()`` - -``delta_qty = new_qty - old_qty`` - -\` - -``# Make sure the cart is still active and add the line item`` - -``result = db.cart.update(`` - -``{'_id': cart_id, 'status': 'active', 'items.sku': sku },`` - -``{'$set': {`` - -``'last_modified': now,`` - -``'items.$.qty': new_qty },`` - -``},`` - -``safe=True)`` - -``if not result['updatedExisting']:`` - -``raise CartInactive()`` - -\` - -``# Update the inventory`` - -``result = db.inventory.update(`` - -``{'_id':sku,`` - -``'carted.cart_id': cart_id,`` - -``'qty': {'$gte': delta_qty} },`` - -``{'$inc': {'qty': -delta_qty },`` - -``'$set': { 'carted.$.qty': new_qty, 'timestamp': now } },`` - -``safe=True)`` - -``if not result['updatedExisting']:`` - -``# Roll back our cart update`` - -``db.cart.update(`` - -``{'_id': cart_id, 'items.sku': sku },`` - -``{'$set': { 'items.$.qty': old_qty }`` - -``})`` - -``raise InadequateInventory()`` +:: + + def update_quantity(cart_id, sku, old_qty, new_qty): + now = datetime.utcnow() + delta_qty = new_qty - old_qty + + + # Make sure the cart is still active and add the line item + result = db.cart.update( + {'_id': cart_id, 'status': 'active', 'items.sku': sku }, + {'$set': { + 'last_modified': now, + 'items.$.qty': new_qty }, + }, + safe=True) + if not result['updatedExisting']: + raise CartInactive() + + + # Update the inventory + result = db.inventory.update( + {'_id':sku, + 'carted.cart_id': cart_id, + 'qty': {'$gte': delta_qty} }, + {'$inc': {'qty': -delta_qty }, + '$set': { 'carted.$.qty': new_qty, 'timestamp': now } }, + safe=True) + if not result['updatedExisting']: + # Roll back our cart update + db.cart.update( + {'_id': cart_id, 'items.sku': sku }, + {'$set': { 'items.$.qty': old_qty } + }) + raise InadequateInventory() Note in particular here that we are using the positional operator '$' to update the particular 'carted' entry and line item that matched for our @@ -267,53 +203,33 @@ Checking out During checkout, we want to validate the method of payment and remove the various 'carted' items after the transaction has succeeded. -``def checkout(cart_id):`` - -``now = datetime.utcnow()`` - -``# Make sure the cart is still active and set to 'pending'. Also`` - -``# fetch the cart details so we can calculate the checkout price`` - -``cart = db.cart.find_and_modify(`` - -``{'_id': cart_id, 'status': 'active' },`` - -``update={'$set': { 'status': 'pending','last_modified': now } } )`` - -``if cart is None:`` - -``raise CartInactive()`` - -\` - -``# Validate payment details; collect payment`` - -``if payment_is_successful(cart):`` - -``db.cart.update(`` - -``{'_id': cart_id },`` - -``{'$set': { 'status': 'complete' } } )`` - -``db.inventory.update(`` - -``{'carted.cart_id': cart_id},`` - -``{'$pull': {'cart_id': cart_id} },`` - -``multi=True)`` - -``else:`` - -``db.cart.update(`` - -``{'_id': cart_id },`` - -``{'$set': { 'status': 'active' } } )`` - -``raise PaymentError()`` +:: + + def checkout(cart_id): + now = datetime.utcnow() + # Make sure the cart is still active and set to 'pending'. 
Also + # fetch the cart details so we can calculate the checkout price + cart = db.cart.find_and_modify( + {'_id': cart_id, 'status': 'active' }, + update={'$set': { 'status': 'pending','last_modified': now } } ) + if cart is None: + raise CartInactive() + + + # Validate payment details; collect payment + if payment_is_successful(cart): + db.cart.update( + {'_id': cart_id }, + {'$set': { 'status': 'complete' } } ) + db.inventory.update( + {'carted.cart_id': cart_id}, + {'$pull': {'cart_id': cart_id} }, + multi=True) + else: + db.cart.update( + {'_id': cart_id }, + {'$set': { 'status': 'active' } } ) + raise PaymentError() Here, we first 'lock' the cart by setting its status to 'pending' (disabling any modifications) and then collect payment data, verifying @@ -338,49 +254,30 @@ Periodically, we want to expire carts that have been inactive for a given number of seconds, returning their line items to available inventory: -``def expire_carts(timeout):`` - -``now = datetime.utcnow()`` - -``threshold = now - timedelta(seconds=timeout)`` - -``# Lock and find all the expiring carts`` - -``db.cart.update(`` - -``{'status': 'active', 'last_modified': { '$lt': threshold } },`` - -``{'$set': { 'status': 'expiring' } },`` - -``multi=True )`` - -``# Actually expire each cart`` - -``for cart in db.cart.find({'status': 'expiring'}):`` - -``# Return all line items to inventory`` - -``for item in cart['items']:`` - -``db.inventory.update(`` - -``{ '_id': item['sku'],`` - -``'carted.cart_id': cart['id'],`` - -``'carted.qty': item['qty']`` - -``},`` - -``{'$inc': { 'qty': item['qty'] },`` - -``'$pull': { 'carted': { 'cart_id': cart['id'] } } })`` - -``db.cart.update(`` - -``{'_id': cart['id'] },`` - -``{'$set': { status': 'expired' })`` +:: + + def expire_carts(timeout): + now = datetime.utcnow() + threshold = now - timedelta(seconds=timeout) + # Lock and find all the expiring carts + db.cart.update( + {'status': 'active', 'last_modified': { '$lt': threshold } }, + {'$set': { 'status': 'expiring' } }, + multi=True ) + # Actually expire each cart + for cart in db.cart.find({'status': 'expiring'}): + # Return all line items to inventory + for item in cart['items']: + db.inventory.update( + { '_id': item['sku'], + 'carted.cart_id': cart['id'], + 'carted.qty': item['qty'] + }, + {'$inc': { 'qty': item['qty'] }, + '$pull': { 'carted': { 'cart_id': cart['id'] } } }) + db.cart.update( + {'_id': cart['id'] }, + {'$set': { status': 'expired' }) Here, we first find all carts to be expired and then, for each cart, return its items to inventory. Once all items have been returned to @@ -393,9 +290,9 @@ In this case, we need to be able to efficiently query carts based on their status and last\_modified values, so an index on these would help the performance of our periodic expiration process: -``>>> db.cart.ensure_index([('status', 1), ('last_modified', 1)])`` +:: -\` + >>> db.cart.ensure_index([('status', 1), ('last_modified', 1)]) Note in particular the order in which we defined the index: in order to efficiently support range queries ('$lt' in this case), the ranged item @@ -414,73 +311,46 @@ items in the inventory have not been returned to available inventory. 
To account for this case, we will run a cleanup method periodically that will find old 'carted' items and check the status of their cart: -``def cleanup_inventory(timeout):`` - -``now = datetime.utcnow()`` - -``threshold = now - timedelta(seconds=timeout)`` - -\` - -``# Find all the expiring carted items`` - -``for item in db.inventory.find(`` - -``{'carted.timestamp': {'$lt': threshold }}):`` - -\` - -``# Find all the carted items that matched`` - -``carted = dict(`` - -``(carted_item['cart_id'], carted_item)`` - -``for carted_item in item['carted']`` - -``if carted_item['timestamp'] < threshold)`` +:: -\` + def cleanup_inventory(timeout): + now = datetime.utcnow() + threshold = now - timedelta(seconds=timeout) -``# Find any carts that are active and refresh the carted items`` -``for cart in db.cart.find(`` + # Find all the expiring carted items + for item in db.inventory.find( + {'carted.timestamp': {'$lt': threshold }}): -``{ '_id': {'$in': carted.keys() },`` -``'status':'active'}):`` + # Find all the carted items that matched + carted = dict( + (carted_item['cart_id'], carted_item) + for carted_item in item['carted'] + if carted_item['timestamp'] < threshold) -``cart = carted[cart['_id']]`` -``db.inventory.update(`` + # Find any carts that are active and refresh the carted items + for cart in db.cart.find( + { '_id': {'$in': carted.keys() }, + 'status':'active'}): + cart = carted[cart['_id']] + db.inventory.update( + { '_id': item['_id'], + 'carted.cart_id': cart['_id'] }, + { '$set': {'carted.$.timestamp': now } }) + del carted[cart['_id']] -``{ '_id': item['_id'],`` -``'carted.cart_id': cart['_id'] },`` - -``{ '$set': {'carted.$.timestamp': now } })`` - -``del carted[cart['_id']]`` - -\` - -``# All the carted items left in the dict need to now be`` - -``# returned to inventory`` - -``for cart_id, carted_item in carted.items():`` - -``db.inventory.update(`` - -``{ '_id': item['_id'],`` - -``'carted.cart_id': cart_id,`` - -``'carted.qty': carted_item['qty'] },`` - -``{ '$inc': { 'qty': carted_item['qty'] },`` - -``'$pull': { 'carted': { 'cart_id': cart_id } } })`` + # All the carted items left in the dict need to now be + # returned to inventory + for cart_id, carted_item in carted.items(): + db.inventory.update( + { '_id': item['_id'], + 'carted.cart_id': cart_id, + 'carted.qty': carted_item['qty'] }, + { '$inc': { 'qty': carted_item['qty'] }, + '$pull': { 'carted': { 'cart_id': cart_id } } }) Note that the function above is safe, as it checks to be sure the cart is expired or expiring before removing items from the cart and returning @@ -512,26 +382,18 @@ minimize server load. The sharding commands we would use to shard the cart and inventory collections, then, would be the following: -``>>> db.command('shardcollection', 'inventory')`` - -``{ "collectionsharded" : "inventory", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'cart')`` +:: -``{ "collectionsharded" : "cart", "ok" : 1 }`` + >>> db.command('shardcollection', 'inventory') + { "collectionsharded" : "inventory", "ok" : 1 } + >>> db.command('shardcollection', 'cart') + { "collectionsharded" : "cart", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. - -Page of - -[a]jsr: - -Actually isn't a $dec command. Just $inc by a negative value. Some -drivers seem to have added $dec as a helper, but probably shouldn't :) +default to using \_id as a shard key. Page of [a]jsr: Actually isn't a +$dec command. Just $inc by a negative value. 
Some drivers seem to have +added $dec as a helper, but probably shouldn't :) -------------- -rick446: - -fixed +rick446: fixed diff --git a/source/tutorial/usecase/ecommerce-_product_catalog.txt b/source/tutorial/usecase/ecommerce-_product_catalog.txt index 76e2d2d4d04..d18c3f9553c 100644 --- a/source/tutorial/usecase/ecommerce-_product_catalog.txt +++ b/source/tutorial/usecase/ecommerce-_product_catalog.txt @@ -18,43 +18,25 @@ MongoDB enables. One approach ("concrete table inheritance") to solving this problem is to create a table for each product category: -\` - -``CREATE TABLE``product\_audio\_album``(`` - -```sku`` char(8) NOT NULL,\` - -``…`` - -```artist`` varchar(255) DEFAULT NULL,\` - -```genre_0`` varchar(255) DEFAULT NULL,\` - -```genre_1`` varchar(255) DEFAULT NULL,\` - -``…,`` - -``PRIMARY KEY(``sku``))`` - -``…`` - -``CREATE TABLE``product\_film``(`` - -```sku`` char(8) NOT NULL,\` - -``…`` - -```title`` varchar(255) DEFAULT NULL,\` - -```rating`` char(8) DEFAULT NULL,\` - -``…,`` - -``PRIMARY KEY(``sku``))`` - -``…`` - -\` +:: + + CREATE TABLE `product_audio_album` ( + `sku` char(8) NOT NULL, + … + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + …, + PRIMARY KEY(`sku`)) + … + CREATE TABLE `product_film` ( + `sku` char(8) NOT NULL, + … + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + …, + PRIMARY KEY(`sku`)) + … The main problem with this approach is a lack of flexibility. Each time we add a new product category, we need to create a new table. @@ -65,31 +47,19 @@ Another approach ("single table inheritance") would be to use a single table for all products and add new columns each time we needed to store a new type of product: -\` - -``CREATE TABLE``product``(`` - -```sku`` char(8) NOT NULL,\` - -``…`` - -```artist`` varchar(255) DEFAULT NULL,\` - -```genre_0`` varchar(255) DEFAULT NULL,\` - -```genre_1`` varchar(255) DEFAULT NULL,\` - -``…`` - -```title`` varchar(255) DEFAULT NULL,\` +:: -```rating`` char(8) DEFAULT NULL,\` - -``…,`` - -``PRIMARY KEY(``sku``))`` - -\` + CREATE TABLE `product` ( + `sku` char(8) NOT NULL, + … + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + … + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + …, + PRIMARY KEY(`sku`)) This is more flexible, allowing us to query across different types of product, but it's quite wasteful of space. 
One possible space @@ -101,57 +71,35 @@ Multiple table inheritance is yet another approach where we represent common attributes in a generic 'product' table and the variations in individual category product tables: -``CREATE TABLE``product``(`` - -```sku`` char(8) NOT NULL,\` - -```title`` varchar(255) DEFAULT NULL,\` - -```description`` varchar(255) DEFAULT NULL,\` - -```price`` …,\` - -``PRIMARY KEY(``sku``))`` - -\` - -``CREATE TABLE``product\_audio\_album``(`` - -```sku`` char(8) NOT NULL,\` - -``…`` - -```artist`` varchar(255) DEFAULT NULL,\` - -```genre_0`` varchar(255) DEFAULT NULL,\` - -```genre_1`` varchar(255) DEFAULT NULL,\` - -``…,`` - -``PRIMARY KEY(``sku``),`` - -``FOREIGN KEY(``sku``) REFERENCES``product``(``sku``))`` - -``…`` - -``CREATE TABLE``product\_film``(`` - -```sku`` char(8) NOT NULL,\` - -``…`` - -```title`` varchar(255) DEFAULT NULL,\` - -```rating`` char(8) DEFAULT NULL,\` - -``…,`` - -``PRIMARY KEY(``sku``),`` - -``FOREIGN KEY(``sku``) REFERENCES``product``(``sku``))`` - -``…`` +:: + + CREATE TABLE `product` ( + `sku` char(8) NOT NULL, + `title` varchar(255) DEFAULT NULL, + `description` varchar(255) DEFAULT NULL, + `price` …, + PRIMARY KEY(`sku`)) + + + CREATE TABLE `product_audio_album` ( + `sku` char(8) NOT NULL, + … + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + …, + PRIMARY KEY(`sku`), + FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) + … + CREATE TABLE `product_film` ( + `sku` char(8) NOT NULL, + … + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + …, + PRIMARY KEY(`sku`), + FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) + … This is more space-efficient than single-table inheritance and somewhat more flexible than concrete-table inheritance, but it does require a @@ -165,51 +113,23 @@ describe your product. For instance, suppose you are describing an audio album. In that case you might have a series of rows representing the following relationships: -**Entity** -**Attribute** -**Value** - -sku\_00e8da9b - -type - -Audio Album - -sku\_00e8da9b - -title - -A Love Supreme - -sku\_00e8da9b - -… - -… - -sku\_00e8da9b - -artist - -John Coltrane - -sku\_00e8da9b - -genre - -Jazz - -sku\_00e8da9b - -genre - -General - -… - -… - -… ++-----------------+-------------+------------------+ +| Entity | Attribute | Value | ++=================+=============+==================+ +| sku\_00e8da9b | type | Audio Album | ++-----------------+-------------+------------------+ +| sku\_00e8da9b | title | A Love Supreme | ++-----------------+-------------+------------------+ +| sku\_00e8da9b | … | … | ++-----------------+-------------+------------------+ +| sku\_00e8da9b | artist | John Coltrane | ++-----------------+-------------+------------------+ +| sku\_00e8da9b | genre | Jazz | ++-----------------+-------------+------------------+ +| sku\_00e8da9b | genre | General | ++-----------------+-------------+------------------+ +| … | … | … | ++-----------------+-------------+------------------+ This schema has the advantage of being completely flexible; any entity can have any set of any attributes. New product categories do not @@ -237,123 +157,68 @@ searchable across all products at the beginning of each document, with properties that vary from category to category encapsulated in a 'details' property. 
Thus an audio album might look like the following: -\` - -``{`` - -``sku: "00e8da9b",`` - -``type: "Audio Album",`` - -``title: "A Love Supreme",`` - -``description: "by John Coltrane",`` - -``asin: "B0000A118M",`` - -\` - -``shipping: {`` - -``weight: 6,`` - -``dimensions: {`` - -``width: 10,`` - -``height: 10,`` - -``depth: 1`` - -``},`` - -``},`` - -\` - -``pricing: {`` - -``list: 1200,`` - -``retail: 1100,`` - -``savings: 100,`` - -``pct_savings: 8`` - -``},`` - -\` - -``details: {`` - -``title: "A Love Supreme [Original Recording Reissued]",`` - -``artist: "John Coltrane",`` - -``genre: [ "Jazz", "General" ],`` - -``…`` - -``tracks: [`` - -``"A Love Supreme Part I: Acknowledgement",`` - -``"A Love Supreme Part II - Resolution",`` - -``"A Love Supreme, Part III: Pursuance",`` - -``"A Love Supreme, Part IV-Psalm"`` - -``],`` - -``},`` - -``}`` - -\` +:: + + { + sku: "00e8da9b", + type: "Audio Album", + title: "A Love Supreme", + description: "by John Coltrane", + asin: "B0000A118M", + + + shipping: { + weight: 6, + dimensions: { + width: 10, + height: 10, + depth: 1 + }, + }, + + + pricing: { + list: 1200, + retail: 1100, + savings: 100, + pct_savings: 8 + }, + + + details: { + title: "A Love Supreme [Original Recording Reissued]", + artist: "John Coltrane", + genre: [ "Jazz", "General" ], + … + tracks: [ + "A Love Supreme Part I: Acknowledgement", + "A Love Supreme Part II - Resolution", + "A Love Supreme, Part III: Pursuance", + "A Love Supreme, Part IV-Psalm" + ], + }, + } A movie title would have the same fields stored for general product information, shipping, and pricing, but have quite a different details -attribute: - -``{`` - -``sku: "00e8da9d",`` - -``type: "Film",`` - -``…`` +attribute: { sku: "00e8da9d", type: "Film", … asin: "B000P0J0AQ", -``asin: "B000P0J0AQ",`` +:: -\` + shipping: { … }, -``shipping: { … },`` -\` + pricing: { … }, -``pricing: { … },`` -\` - -``details: {`` - -``title: "The Matrix",`` - -``director: [ "Andy Wachowski", "Larry Wachowski" ],`` - -``writer: [ "Andy Wachowski", "Larry Wachowski" ],`` - -``…`` - -``aspect_ratio: "1.66:1"`` - -``},`` - -``}`` - -\` + details: { + title: "The Matrix", + director: [ "Andy Wachowski", "Larry Wachowski" ], + writer: [ "Andy Wachowski", "Larry Wachowski" ], + … + aspect_ratio: "1.66:1" + }, + } Another thing to note in the MongoDB schema is that we can have multi-valued attributes without any arbitrary restriction on the number @@ -372,17 +237,17 @@ examples will be written in the Python programming language using the pymongo driver, but other language/driver combinations should be similar. 
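+The examples below assume that a ``db`` handle has already been
+obtained. A minimal setup sketch follows; the host, port, and database
+name here are placeholders for whatever your deployment uses:
+
+::
+
+    import pymongo
+
+    # Connect to a (locally running) mongod and select the catalog database
+    conn = pymongo.Connection('localhost', 27017)
+    db = conn.ecommerce
+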
-Find all jazz albums, sorted by year produced[a] -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Find all jazz albums, sorted by year produced +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here, we would like to see a group of products with a particular genre, sorted by the year in which they were produced: -``query = db.products.find({'type':'Audio Album',`` - -``'details.genre': 'jazz'})`` +:: -``query = query.sort([('details.issue_date', -1)])`` + query = db.products.find({'type':'Audio Album', + 'details.genre': 'jazz'}) + query = query.sort([('details.issue_date', -1)]) Index support ^^^^^^^^^^^^^ @@ -390,17 +255,12 @@ Index support In order to efficiently support this type of query, we need to create a compound index on all the properties used in the filter and in the sort: -\` - -``db.products.ensure_index([`` - -``('type', 1),`` - -``('details.genre', 1),`` +:: -``('details.issue_date', -1)])`` - -\` + db.products.ensure_index([ + ('type', 1), + ('details.genre', 1), + ('details.issue_date', -1)]) Again, notice that the final component of our index is the sort field. @@ -414,11 +274,10 @@ deals' of our website. In this case, we will use the pricing information that exists in all products to find the products with the highest percentage discount: -\` - -``query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 })`` +:: -``query = query.sort([('pricing.pct_savings', -1)])`` + query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 }) + query = query.sort([('pricing.pct_savings', -1)]) Index support ^^^^^^^^^^^^^ @@ -426,12 +285,8 @@ Index support In order to efficiently support this type of query, we need to have an index on the percentage savings: -\` - \`db.products.ensure\_index('pricing.pct\_savings') -\` - Since the index is only on a single key, it does not matter in which order the index is sorted. Note that, had we wanted to perform a range query (say all products over $25 retail) and sort by another property @@ -448,13 +303,11 @@ In this case, we want to search inside the details of a particular type of product (a movie) to find all movies containing Keanu Reeves, sorted by date descending: -\` +:: -``query = db.products.find({'type': 'Film',`` - -``'details.actor': 'Keanu Reeves'})`` - -``query = query.sort([('details.issue_date', -1)])`` + query = db.products.find({'type': 'Film', + 'details.actor': 'Keanu Reeves'}) + query = query.sort([('details.issue_date', -1)]) Index support ^^^^^^^^^^^^^ @@ -462,22 +315,17 @@ Index support Here, we wish to once again index by type first, followed the details we're interested in: -\` - -``db.products.ensure_index([`` +:: -``('type', 1),`` - -``('details.actor', 1),`` - -``('details.issue_date', -1)])`` - -\` + db.products.ensure_index([ + ('type', 1), + ('details.actor', 1), + ('details.issue_date', -1)]) And once again, the final component of our index is the sort field. -\*\* -**Find all movies with the word "hacker" in the title** [b] +Find all movies with the word "hacker" in the title +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Those experienced with relational databases may shudder at this operation, since it implies an inefficient LIKE query. In fact, without @@ -486,48 +334,37 @@ satisfy this query. In the case of MongoDB, we will use a regular expression. 
First, we will see how we might do this using Python's re module: -``import re`` +:: -``re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE)`` + import re + re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE) -\` -``query = db.products.find({'type': 'Film', 'title': re_hacker})`` - -``query = query.sort([('details.issue_date', -1)])`` - -\` + query = db.products.find({'type': 'Film', 'title': re_hacker}) + query = query.sort([('details.issue_date', -1)]) Although this is fairly convenient, MongoDB also gives us the option to use a special syntax in our query instead of importing the Python re module: -\` - -``query = db.products.find({`` +:: -``'type': 'Film',`` - -``'title': {'$regex': '.*hacker.*', '$options':'i'}})`` - -``query = query.sort([('details.issue_date', -1)])`` + query = db.products.find({ + 'type': 'Film', + 'title': {'$regex': '.*hacker.*', '$options':'i'}}) + query = query.sort([('details.issue_date', -1)]) Index support ^^^^^^^^^^^^^ Here, we will diverge a bit from our typical index order: -\` - -``db.products.ensure_index([`` - -``('type', 1),`` +:: -``('details.issue_date', -1),`` - -``('title', 1)])`` - -\` + db.products.ensure_index([ + ('type', 1), + ('details.issue_date', -1), + ('title', 1)]) You may be wondering why we are including the title field in the index if we have to scan anyway. The reason is that there are two types of @@ -544,90 +381,57 @@ we're scanning titles. You can observe the difference looking at the query plans we get for different orderings. If we use the (type, title, details.issue\_date) index, we get the following plan: -``{u'allPlans': [...],`` - -``u'cursor': u'BtreeCursor type_1_title_1_details.issue_date_-1 multi',`` - -``u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1},`` - -``{u'$minElement': 1}]],`` - -``u'title': [[u'', {}],`` - -``[<_sre.SRE_Pattern object at 0x2147cd8>,`` - -``<_sre.SRE_Pattern object at 0x2147cd8>]],`` - -``u'type': [[u'Film', u'Film']]},`` - -``u'indexOnly': False,`` - -``u'isMultiKey': False,`` - -``u'millis': 208,`` - -``u'n': 0,`` - -``u'nChunkSkips': 0,`` - -``u'nYields': 0,`` - -``u'nscanned': 10000,`` - -``u'nscannedObjects': 0,`` - -``u'scanAndOrder': True}`` - -\` +:: + + {u'allPlans': [...], + u'cursor': u'BtreeCursor type_1_title_1_details.issue_date_-1 multi', + u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1}, + {u'$minElement': 1}]], + u'title': [[u'', {}], + [<_sre.SRE_Pattern object at 0x2147cd8>, + <_sre.SRE_Pattern object at 0x2147cd8>]], + u'type': [[u'Film', u'Film']]}, + u'indexOnly': False, + u'isMultiKey': False, + u'millis': 208, + u'n': 0, + u'nChunkSkips': 0, + u'nYields': 0, + u'nscanned': 10000, + u'nscannedObjects': 0, + u'scanAndOrder': True} If, however, we use the (type, details.issue\_date, title) index, we get the following plan: -\` - -``{u'allPlans': [...],`` - -``u'cursor': u'BtreeCursor type_1_details.issue_date_-1_title_1 multi',`` - -``u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1},`` - -``{u'$minElement': 1}]],`` - -``u'title': [[u'', {}],`` - -``[<_sre.SRE_Pattern object at 0x2147cd8>,`` - -``<_sre.SRE_Pattern object at 0x2147cd8>]],`` - -``u'type': [[u'Film', u'Film']]},`` - -``u'indexOnly': False,`` - -``u'isMultiKey': False,`` - -``u'millis': 157,`` - -``u'n': 0,`` - -``u'nChunkSkips': 0,`` - -``u'nYields': 0,`` - -``u'nscanned': 10000,`` - -``u'nscannedObjects': 0}`` - -\` +:: + + {u'allPlans': [...], + u'cursor': u'BtreeCursor type_1_details.issue_date_-1_title_1 multi', + u'indexBounds': {u'details.issue_date': 
[[{u'$maxElement': 1}, + {u'$minElement': 1}]], + u'title': [[u'', {}], + [<_sre.SRE_Pattern object at 0x2147cd8>, + <_sre.SRE_Pattern object at 0x2147cd8>]], + u'type': [[u'Film', u'Film']]}, + u'indexOnly': False, + u'isMultiKey': False, + u'millis': 157, + u'n': 0, + u'nChunkSkips': 0, + u'nYields': 0, + u'nscanned': 10000, + u'nscannedObjects': 0} The two salient features to note are a) the absence of the 'scanAndOrder: True' in the optmal query and b) the difference in time -(208ms for the suboptimal query versus 157ms for the optimal [c]one). -The lesson learned here is that if you absolutely have to scan, you -should make the elements you're scanning the *least* significant part of -the index (even after the sort). +(208ms for the suboptimal query versus 157ms for the optimal one). The +lesson learned here is that if you absolutely have to scan, you should +make the elements you're scanning the *least* significant part of the +index (even after the sort). -Sharding[d] ------------ +Sharding +-------- Though our performance in this system is highly dependent on the indexes we maintain, sharding can enhance that performance further by allowing @@ -650,11 +454,11 @@ chunks. For this example, we will assume that 'details.genre' is our second-most queried field after 'type', and thus our sharding setup would be as follows: -``>>> db.command('shardcollection', 'product', {`` - -``... key : { 'type': 1, 'details.genre' : 1, 'sku':1 } })`` +:: -``{ "collectionsharded" : "details.genre", "ok" : 1 }`` + >>> db.command('shardcollection', 'product', { + ... key : { 'type': 1, 'details.genre' : 1, 'sku':1 } }) + { "collectionsharded" : "details.genre", "ok" : 1 } One important note here is that, even if we choose a shard key that requires all queries to be broadcast to all shards, we still get some @@ -675,26 +479,30 @@ This is achieved via the 'read\_preference' argument, and can be set at the connection or individual query level. For instance, to allow all reads on a connection to go to a secondary, the syntax is: -``conn = pymongo.Connection(read_preference=pymongo.SECONDARY)`` +:: + + conn = pymongo.Connection(read_preference=pymongo.SECONDARY) or -``conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY)`` +:: -\` + conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY) In the first instance, reads will be distributed among all the secondaries and the primary, whereas in the second reads will only be sent to the secondary. To allow queries to go to a secondary on a per-query basis, we can also specify a read\_preference: -``results = db.product.find(..., read_preference=pymongo.SECONDARY)`` +:: + + results = db.product.find(..., read_preference=pymongo.SECONDARY) or -``results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY)`` +:: -\` + results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY) It is important to note that reading from a secondary can introduce a lag between when inserts and updates occur and when they become visible @@ -702,60 +510,3 @@ to queries. In the case of a product catalog, however, where queries happen frequently and updates happen infrequently, such eventual consistency (updates visible within a few seconds but not immediately) is usually tolerable. - -Page of - -[a]jsr: - -This might make more sense as the first operation. The "sorted by -discount" feels like a secondary use case.. still include it, but maybe -lower down in the TOC. 
- --------------- - -rick446: - -there you go - -[b]jsr: - -See note below about scatter gather queries. Might want to add slaveOk -flag on these and talk about why it's okay with this model. Don't always -need consistent reads. Better to do slaveOk so you can get some more -scale out of scatter gather queries. - --------------- - -rick446: - -See response below. - -[c]jsr: - -This doesn't seem like a big difference. I think that the scanAndOrder -is more important. - --------------- - -rick446: - -Hm, well it's not a big *absolute* difference, but I'd expect it to grow -as your data size increased. The query time is the only thing we're -*actually* interested in IMO (the scanAndOrder being interesting because -it's the cause of the slow query) - -[d]jsr: - -With this model, another consideration might be parallelized searches. -For example, if you're sharded on, say genre, but you want to query for -all albums by coltrane, you'll do a scatter gather query. Maybe include -a discussion of scatter gather queries and the fact that you can add -replicas to scale these. - --------------- - -rick446: - -I added a section at the end about read\_preference and a clause about -still getting a benefit from sharding due to parallelized searches. -Anything else you wanted here? diff --git a/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt b/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt index f8032e939da..0624b67fa59 100644 --- a/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt +++ b/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt @@ -25,11 +25,16 @@ hourly, daily, weekly, monthly, and yearly. We will use a hierarchical approach to running our map-reduce jobs. The input and output of each job is illustrated below: -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=syuQgkoNVdeOo7UC4%20WepaPQ&rev=1&h=208&w=268&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=syuQgkoNVdeOo7UC4WepaPQ&rev=1&h=208&w=268&ac=1 :align: center - :alt: + :alt: Hierarchy + + Hierarchy +Note that the events rolling into the hourly collection is qualitatively +different than the hourly statistics rolling into the daily collection. + Aside: Map-Reduce Algorithm -^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~ Map/reduce is a popular aggregation algorithm that is optimized for embarrassingly parallel problems. The psuedocode (in Python) of the @@ -38,59 +43,37 @@ psuedocode for a particular type of map/reduce where the results of the map/reduce operation are *reduced* into the result collection, allowing us to perform incremental aggregation which we'll need in this case. 
-\` - -``def map_reduce(icollection, query,`` - -``mapf, reducef, finalizef, ocollection):`` - -``'''Psuedocode for map/reduce with output type="reduce" in MongoDB'''`` - -``map_results = defaultdict(list)`` - -``def emit(key, value):`` - -``'''helper function used inside mapf'''`` - -``map_results[key].append(value)`` - -\` - -``# The map phase`` - -``for doc in icollection.find(query):`` +:: -``mapf(doc)`` + def map_reduce(icollection, query, + mapf, reducef, finalizef, ocollection): + '''Psuedocode for map/reduce with output type="reduce" in MongoDB''' + map_results = defaultdict(list) + def emit(key, value): + '''helper function used inside mapf''' + map_results[key].append(value) -\` -``# Pull in documents from the output collection for`` + # The map phase + for doc in icollection.find(query): + mapf(doc) -``# output type='reduce'`` -``for doc in ocollection.find({'_id': {'$in': map_results.keys() } }):`` + # Pull in documents from the output collection for + # output type='reduce' + for doc in ocollection.find({'_id': {'$in': map_results.keys() } }): + map_results[doc['_id']].append(doc['value']) -``map_results[doc['_id']].append(doc['value'])`` -\` + # The reduce phase + for key, values in map_results.items(): + reduce_results[key] = reducef(key, values) -``# The reduce phase`` -``for key, values in map_results.items():`` - -``reduce_results[key] = reducef(key, values)`` - -\` - -``# Finalize and save the results back`` - -``for key, value in reduce_results.items():`` - -``final_value = finalizef(key, value)`` - -``ocollection.save({'_id': key, 'value': final_value})``[a] - -\` \` + # Finalize and save the results back + for key, value in reduce_results.items(): + final_value = finalizef(key, value) + ocollection.save({'_id': key, 'value': final_value}) The embarrassingly parallel part of the map/reduce algorithm lies in the fact that each invocation of mapf, reducef, and finalizef are @@ -98,8 +81,8 @@ independent of each other and can, in fact, be distributed to different servers. In the case of MongoDB, this parallelism can be achieved by using sharding on the collection on which we are performing map/reduce. -Schema design[b] ----------------- +Schema design +------------- When designing the schema for event storage, we need to keep in mind the necessity to differentiate between events which have been included in @@ -112,17 +95,14 @@ If we are able to batch up our inserts into the event table, we can still use an auto-increment primary key by using the find\_and\_modify command to generate our \_id values: -``>>> obj = db.my_sequence.find_and_modify(`` - -``... query={'_id':0},`` - -``... update={'$inc': {'inc': 50}}`` +:: -``... upsert=True,`` - -``... new=True)`` - -``>>> batch_of_ids = range(obj['inc']-50, obj['inc'])`` + >>> obj = db.my_sequence.find_and_modify( + ... query={'_id':0}, + ... update={'$inc': {'inc': 50}} + ... upsert=True, + ... new=True) + >>> batch_of_ids = range(obj['inc']-50, obj['inc']) In most cases, however, it is sufficient to include a timestamp with each event that we can use as a marker of which events have been @@ -131,17 +111,13 @@ we'll assume that we are calculating average session length for logged-in users on a website. 
Our event format will thus be the following: -``{`` - -``"userid": "rick",`` +:: -``"ts": ISODate('2010-10-10T14:17:22Z'),`` - -``"length":95`` - -``}`` - -\` + { + "userid": "rick", + "ts": ISODate('2010-10-10T14:17:22Z'), + "length":95 + } We want to calculate total and average session times for each user at the hour, day, week, month, and year. In each case, we will also store @@ -149,23 +125,16 @@ the number of sessions to enable us to incrementally recompute the average session times. Each of our aggregate documents, then, looks like the following: -``{`` - -``_id: { u: "rick", d: ISODate("2010-10-10T14:00:00Z") },`` - -``value: {`` - -``ts: ISODate('2010-10-10T15:01:00Z'),`` +:: -``total: 254,`` - -``count: 10,`` - -``mean: 25.4 }`` - -``}`` - -\` + { + _id: { u: "rick", d: ISODate("2010-10-10T14:00:00Z") }, + value: { + ts: ISODate('2010-10-10T15:01:00Z'), + total: 254, + count: 10, + mean: 25.4 } + } Note in particular that we have added a timestamp to the aggregate document. This will help us as we incrementally update the various @@ -192,67 +161,44 @@ and PyMongo to interface with the MongoDB server, note that the various functions (map, reduce, and finalize) that we pass to the mapreduce command must be Javascript functions. The map function appears below: -``mapf_hour = bson.Code('''function() {`` - -``var key = {`` - -``u: this.userid,`` - -``d: new Date(`` - -``this.ts.getFullYear(),`` - -``this.ts.getMonth(),`` - -``this.ts.getDate(),`` - -``this.ts.getHours(),`` - -``0, 0, 0);`` - -``emit(`` - -``key,`` - -``{`` - -``total: this.length,`` - -``count: 1,`` - -``mean: 0,`` - -``ts: new Date(); });`` - -``}''')`` - -\` +:: + + mapf_hour = bson.Code('''function() { + var key = { + u: this.userid, + d: new Date( + this.ts.getFullYear(), + this.ts.getMonth(), + this.ts.getDate(), + this.ts.getHours(), + 0, 0, 0); + emit( + key, + { + total: this.length, + count: 1, + mean: 0, + ts: new Date(); }); + }''') In this case, we are emitting key, value pairs which contain the statistics we want to aggregate as you'd expect, but we are also emitting 'ts' value. This will be used in the cascaded aggregations (hour to day, etc.) to determine when a particular hourly aggregation -was performed.\` \` - -\` +was performed. Our reduce function is also fairly straightforward: -``reducef = bson.Code('''function(key, values) {`` - -``var r = { total: 0, count: 0, mean: 0, ts: null };`` - -``values.forEach(function(v) {`` +:: -``r.total += v.total;`` - -``r.count += v.count;`` - -``});`` - -``return r;`` - -``}''')`` + reducef = bson.Code('''function(key, values) { + var r = { total: 0, count: 0, mean: 0, ts: null }; + values.forEach(function(v) { + r.total += v.total; + r.count += v.count; + }); + return r; + }''') A few things are notable here. First of all, note that the returned document from our reduce function has the same format as the result of @@ -262,49 +208,35 @@ finalize results can lead to difficult-to-debug errors. Also note that we are ignoring the 'mean' and 'ts' values. 
These will be provided in the 'finalize' step: -\` - -``finalizef = bson.Code('''function(key, value) {`` - -``if(value.count > 0) {`` +:: -``value.mean = value.total / value.count;`` - -``}`` - -``value.ts = new Date();`` - -``return value;`` - -``}''')`` - -\` + finalizef = bson.Code('''function(key, value) { + if(value.count > 0) { + value.mean = value.total / value.count; + } + value.ts = new Date(); + return value; + }''') Here, we compute the mean value as well as the timestamp we will use to write back to the output collection. Now, to bind it all together, here is our Python code to invoke the mapreduce command: -``cutoff = datetime.utcnow() - timedelta(seconds=60)`` - -``query = { 'ts': { '$gt': last_run, '$lt': cutoff } }`` +:: -\` + cutoff = datetime.utcnow() - timedelta(seconds=60) + query = { 'ts': { '$gt': last_run, '$lt': cutoff } } -``db.events.map_reduce(`` -``map=mapf_hour,`` + db.events.map_reduce( + map=mapf_hour, + reduce=reducef, + finalize=finalizef, + query=query, + out={ 'reduce': 'stats.hourly' }) -``reduce=reducef,`` -``finalize=finalizef,`` - -``query=query,`` - -``out={ 'reduce': 'stats.hourly' })`` - -\` - -``last_run = cutoff`` + last_run = cutoff Because we used the 'reduce' option on our output, we are able to run this aggregation as often as we like as long as we update the last\_run @@ -317,14 +249,13 @@ Since we are going to be running the initial query on the input events frequently, we would benefit significantly from and index on the timestamp of incoming events: -``>>> db.stats.hourly.ensure_index('ts')`` +:: -Since we are always reading and writing the most recent events, this -index -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + >>> db.stats.hourly.ensure_index('ts') -has the advantage of being right-aligned, which basically means we only -need a thin slice of the index (the most recent values) in RAM to +Since we are always reading and writing the most recent events, this +index has the advantage of being right-aligned, which basically means we +only need a thin slice of the index (the most recent values) in RAM to achieve good performance. Aggregate from hour to day @@ -334,37 +265,24 @@ In calculating the daily statistics, we will use the hourly statistics as input. Our map function looks quite similar to our hourly map function: -``mapf_day = bson.Code('''function() {`` - -``var key = {`` - -``u: this._id.u,`` - -``d: new Date(`` - -``this._id.d.getFullYear(),`` - -``this._id.d.getMonth(),`` - -``this._id.d.getDate(),`` - -``0, 0, 0, 0) };`` - -``emit(`` - -``key,`` - -``{`` - -``total: this.value.total,`` - -``count: this.value.count,`` - -``mean: 0,`` - -``ts: null });`` - -``}''')`` +:: + + mapf_day = bson.Code('''function() { + var key = { + u: this._id.u, + d: new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + this._id.d.getDate(), + 0, 0, 0, 0) }; + emit( + key, + { + total: this.value.total, + count: this.value.count, + mean: 0, + ts: null }); + }''') There are a few differences to note here. First of all, the key to which we aggregate is the (userid, date) rather than (userid, hour) to allow @@ -378,27 +296,21 @@ hourly aggregations, we can, in fact, use the same reduce and finalize functions. 
The actual Python code driving this level of aggregation is as follows: -``cutoff = datetime.utcnow() - timedelta(seconds=60)`` - -``query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } }`` - -\` - -``db.stats.hourly.map_reduce(`` +:: -``map=mapf_day,`` + cutoff = datetime.utcnow() - timedelta(seconds=60) + query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } } -``reduce=reducef,`` -``finalize=finalizef,`` + db.stats.hourly.map_reduce( + map=mapf_day, + reduce=reducef, + finalize=finalizef, + query=query, + out={ 'reduce': 'stats.daily' }) -``query=query,`` -``out={ 'reduce': 'stats.daily' })`` - -\` - -``last_run = cutoff`` + last_run = cutoff There are a couple of things to note here. First of all, our query is not on 'ts' now, but 'value.ts', the timestamp we wrote during the @@ -413,13 +325,13 @@ Since we are going to be running the initial query on the hourly statistics collection frequently, an index on 'value.ts' would be nice to have: -``>>> db.stats.hourly.ensure_index('value.ts')`` +:: -Once again, this is a right-aligned index that will use very little RAM -for -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + >>> db.stats.hourly.ensure_index('value.ts') -efficient operation. +Once again, this is a right-aligned index that will use very little RAM +for efficient operation. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Other aggregations ~~~~~~~~~~~~~~~~~~ @@ -427,188 +339,118 @@ Other aggregations Once we have our daily statistics, we can use them to calculate our weekly and monthly statistics. Our weekly map function is as follows: -``mapf_week = bson.Code('''function() {`` - -``var key = {`` - -``u: this._id.u,`` - -``d: new Date(`` - -``this._id.d.valueOf()`` - -``- dt.getDay()*24*60*60*1000) };`` - -``emit(`` - -``key,`` - -``{`` - -``total: this.value.total,`` - -``count: this.value.count,`` - -``mean: 0,`` - -``ts: null });`` - -``}''')`` - -\` +:: + + mapf_week = bson.Code('''function() { + var key = { + u: this._id.u, + d: new Date( + this._id.d.valueOf() + - dt.getDay()*24*60*60*1000) }; + emit( + key, + { + total: this.value.total, + count: this.value.count, + mean: 0, + ts: null }); + }''') Here, in order to get our group key, we are simply taking the date and subtracting days until we get to the beginning of the week. In our weekly map function, we will choose the first day of the month as our group key: -``mapf_month = bson.Code('''function() {`` - -``d: new Date(`` - -``this._id.d.getFullYear(),`` - -``this._id.d.getMonth(),`` - -``1, 0, 0, 0, 0) };`` - -``emit(`` - -``key,`` - -``{`` - -``total: this.value.total,`` - -``count: this.value.count,`` - -``mean: 0,`` - -``ts: null });`` - -``}''')`` +:: + + mapf_month = bson.Code('''function() { + d: new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + 1, 0, 0, 0, 0) }; + emit( + key, + { + total: this.value.total, + count: this.value.count, + mean: 0, + ts: null }); + }''') One thing in particular to notice about these map functions is that they are identical to one another except for the date calculation. 
We can use Python's string interpolation to refactor our map function definitions as follows: -\` - -``mapf_hierarchical = '''function() {`` - -``var key = {`` - -``u: this._id.u,`` - -``d: %s };`` - -``emit(`` - -``key,`` - -``{`` - -``total: this.value.total,`` - -``count: this.value.count,`` - -``mean: 0,`` - -``ts: null });`` - -``}'''`` - -\` - -``mapf_day = bson.Code(`` - -``mapf_hierarchical % '''new Date(`` - -``this._id.d.getFullYear(),`` - -``this._id.d.getMonth(),`` - -``this._id.d.getDate(),`` - -``0, 0, 0, 0)''')`` - -\` - -``mapf_week = bson.Code(`` - -``mapf_hierarchical % '''new Date(`` - -``this._id.d.valueOf()`` - -``- dt.getDay()*24*60*60*1000)''')`` +:: -\` + mapf_hierarchical = '''function() { + var key = { + u: this._id.u, + d: %s }; + emit( + key, + { + total: this.value.total, + count: this.value.count, + mean: 0, + ts: null }); + }''' -``mapf_month = bson.Code(`` -``mapf_hierarchical % '''new Date(`` + mapf_day = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + this._id.d.getDate(), + 0, 0, 0, 0)''') -``this._id.d.getFullYear(),`` -``this._id.d.getMonth(),`` + mapf_week = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.valueOf() + - dt.getDay()*24*60*60*1000)''') -``1, 0, 0, 0, 0)''')`` -\` + mapf_month = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + 1, 0, 0, 0, 0)''') -``mapf_year = bson.Code(`` -``mapf_hierarchical % '''new Date(`` - -``this._id.d.getFullYear(),`` - -``1, 1, 0, 0, 0, 0)''')`` - -\` + mapf_year = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + 1, 1, 0, 0, 0, 0)''') Our Python driver can also be refactored so we have much less code duplication: -``def aggregate(icollection, ocollection, mapf, cutoff, last_run):`` - -``query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } }`` - -``icollection.map_reduce(`` - -``map=mapf,`` +:: -``reduce=reducef,`` - -``finalize=finalizef,`` - -``query=query,`` - -``out={ 'reduce': ocollection.name })`` - -\` + def aggregate(icollection, ocollection, mapf, cutoff, last_run): + query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } } + icollection.map_reduce( + map=mapf, + reduce=reducef, + finalize=finalizef, + query=query, + out={ 'reduce': ocollection.name }) Once this is defined, we can perform all our aggregations as follows: -``cutoff = datetime.utcnow() - timedelta(seconds=60)`` - -``aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run)`` - -``aggregate(db.stats.hourly, db.stats.daily, mapf_day, cutoff, last_run)`` +:: -``aggregate(db.stats.daily, db.stats.weekly, mapf_week, cutoff, last_run)`` - -``aggregate(db.stats.daily, db.stats.monthly, mapf_month, cutoff,`` - -``last_run)`` - -``aggregate(db.stats.monthly, db.stats.yearly, mapf_year, cutoff,`` - -``last_run)`` - -``last_run = cutoff`` - -\` + cutoff = datetime.utcnow() - timedelta(seconds=60) + aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run) + aggregate(db.stats.hourly, db.stats.daily, mapf_day, cutoff, last_run) + aggregate(db.stats.daily, db.stats.weekly, mapf_week, cutoff, last_run) + aggregate(db.stats.daily, db.stats.monthly, mapf_month, cutoff, + last_run) + aggregate(db.stats.monthly, db.stats.yearly, mapf_year, cutoff, + last_run) + last_run = cutoff So long as we save/restore our 'last\_run' variable between aggregations, we can run these aggregations as often as we like since @@ -619,11 +461,12 @@ Index support Our indexes will continue to be on the value's 
timestamp to ensure efficient operation of the next level of the aggregation (and they -continue to be right- aligned): +continue to be right-aligned): -``>>> db.stats.daily.ensure_index('value.ts')`` +:: -``>>> db.stats.monthly.ensure_index('value.ts')`` + >>> db.stats.daily.ensure_index('value.ts') + >>> db.stats.monthly.ensure_index('value.ts') Sharding -------- @@ -637,81 +480,22 @@ makes sense as the most significant part of the shard key. In order to prevent a single, active user from creating a large, unsplittable chunk, we will use a compound shard key with (username, -timestamp) on each of our collections: - -``>>> db.command('shardcollection', 'events', {`` - -``... key : { 'userid': 1, 'ts' : 1} } )`` - -``{ "collectionsharded" : "events", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.daily', {`` - -``... key : { '_id': 1} } )`` - -``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.weekly', {`` - -``... key : { '_id': 1} } )`` - -``{ "collectionsharded" : "stats.weekly", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.monthly', {`` - -``... key : { '_id': 1} } )`` - -``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.yearly', {`` - -``... key : { '_id': 1} } )`` - -``{ "collectionsharded" : "stats.yearly", "ok" : 1 }`` +timestamp) on each of our collections: >>> db.command('shardcollection', +'events', { ... key : { 'userid': 1, 'ts' : 1} } ) { "collectionsharded" +: "events", "ok" : 1 } >>> db.command('shardcollection', 'stats.daily', +{ ... key : { '\_id': 1} } ) { "collectionsharded" : "stats.daily", "ok" +: 1 } >>> db.command('shardcollection', 'stats.weekly', { ... key : { +'\_id': 1} } ) { "collectionsharded" : "stats.weekly", "ok" : 1 } >>> +db.command('shardcollection', 'stats.monthly', { ... key : { '\_id': 1} +} ) { "collectionsharded" : "stats.monthly", "ok" : 1 } >>> +db.command('shardcollection', 'stats.yearly', { ... key : { '\_id': 1} } +) { "collectionsharded" : "stats.yearly", "ok" : 1 } We should also update our map/reduce driver so that it notes the output should be sharded. This is accomplished by adding 'sharded':True to the output argument: -… - -``out={ 'reduce': ocollection.name, 'sharded': True })`` - -… +… out={ 'reduce': ocollection.name, 'sharded': True }) … Note that the output collection of a mapreduce command, if sharded, must be sharded using \_id as the shard key. - -Page of - -[a]jsr: - -It's a little weird to have the code sample in python since we'll -actually be doing the map reduce in javascript. Is it significantly more -code to do this as javascript? - --------------- - -rick446: - -I could rewrite all the code examples in this doc as Javascript if -that's what you want. I don't think we should do some of the snippets in -Python and some in JS, however. - -Also, the docs focus on JS, so it might be nice to see how you do this -in a non-JS environment (Answering the question of how *do* you send the -mapf and reducef from non-JS) - -[b]jsr: - -It's worth describing the set of collections we'll have. 1) the raw data -logs, 2) hourly data, 3) daily data. And show that there's a map reduce -job between each collection. E.g. job1 takes raw data to hourly. job2 -takes hourly data to daily data. - --------------- - -rick446: - -I added an illustration above in the solution overview; is that -sufficient? 
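As a final note on this use case, the cascade of aggregations above only
works if the 'last\_run' value is saved somewhere between invocations.
The following is a minimal driver sketch, not part of the schema above:
it assumes a small bookmark collection (called 'stats.meta' here), a
database named 'analytics', and the aggregate(), mapf\_\*, reducef, and
finalizef definitions from earlier in this document:

::

    import time
    from datetime import datetime, timedelta

    import pymongo

    db = pymongo.Connection().analytics    # assumed database name

    def load_last_run():
        # On the very first run, fall back to a date far in the past
        doc = db.stats.meta.find_one({'_id': 'last_run'})
        return doc['value'] if doc else datetime(1970, 1, 1)

    def save_last_run(cutoff):
        # Upsert the bookmark document by _id
        db.stats.meta.save({'_id': 'last_run', 'value': cutoff})

    while True:
        last_run = load_last_run()
        cutoff = datetime.utcnow() - timedelta(seconds=60)
        aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run)
        aggregate(db.stats.hourly, db.stats.daily, mapf_day, cutoff, last_run)
        aggregate(db.stats.daily, db.stats.weekly, mapf_week, cutoff, last_run)
        aggregate(db.stats.daily, db.stats.monthly, mapf_month, cutoff, last_run)
        aggregate(db.stats.monthly, db.stats.yearly, mapf_year, cutoff, last_run)
        save_last_run(cutoff)
        time.sleep(60)    # re-run the cascade roughly once a minute

In production this loop would more likely be a cron job or a task in a
work queue; the important detail is simply that the cutoff written at
the end of one pass becomes the last\_run of the next.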
diff --git a/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt b/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt index 11066f030bf..08ea26ffc8d 100644 --- a/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt +++ b/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt @@ -28,15 +28,15 @@ want to count the number of hits to a collection of web site at various levels of time-granularity (by minute, hour, day, week, and month) as well as by path. We will assume that either you have some code that can run as part of your web app when it is rendering the page, or you have -some set of logfile post- processors that can run in order to integrate +some set of logfile post-processors that can run in order to integrate the statistics. Schema design ------------- There are two important considerations when designing the schema for a -real- time analytics system: the ease & speed of updates and the ease & -speed of queries[a]. In particular, we want to avoid the following +real-time analytics system: the ease & speed of updates and the ease & +speed of queries. In particular, we want to avoid the following performance-killing circumstances: - documents changing in size significantly, causing reallocations on @@ -63,41 +63,26 @@ Design 0: one document per page/day Our initial approach will be to simply put all the statistics in which we're interested into a single document per page: -``{`` - -``_id: "20101010/site-1/apache_pb.gif",`` - -``metadata: {`` - -``date: ISODate("2000-10-10T00:00:00Z"),`` - -``site: "site-1",`` - -``page: "/apache_pb.gif" },`` - -``daily: 5468426,`` - -``hourly: {`` - -``"0": 227850,`` - -``"1": 210231,`` - -``…`` - -``"23": 20457 },`` - -``minute: {`` - -``"0": 3612,`` - -``"1": 3241,`` - -``…`` - -``"1439": 2819 }`` +:: -``}``[b] + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: 5468426, + hourly: { + "0": 227850, + "1": 210231, + … + "23": 20457 }, + minute: { + "0": 3612, + "1": 3241, + … + "1439": 2819 } + } This approach has a couple of advantages: a) it only requires a single update per hit to the website, b) intra-day reports for a single page @@ -138,78 +123,49 @@ sequence of (key, value) pairs, *not* as a hash table. What this means for us is that writing to stats.mn.0 is *much* faster than writing to stats.mn.1439. [c] -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyv%20pgRg8w&rev=1&h=82&w=410&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyvpgRg8w&rev=1&h=82&w=410&ac=1 :align: center :alt: In order to speed this up, we can introduce some intra-document hierarchy. 
In particular, we can split the 'mn' field up into 24 hourly fields: -\` - -``{`` - -``_id: "20101010/site-1/apache_pb.gif",`` - -``metadata: {`` - -``date: ISODate("2000-10-10T00:00:00Z"),`` - -``site: "site-1",`` - -``page: "/apache_pb.gif" },`` - -``daily: 5468426,`` - -``hourly: {`` - -``"0": 227850,`` - -``"1": 210231,`` - -``…`` - -``"23": 20457 },`` - -``minute: {`` - -``"0": {`` - -``"0": 3612,`` - -``"1": 3241,`` - -``…`` - -``"59": 2130 },`` - -\` "1": { - :: - "60": … , - -\` - -``},`` - -``…`` - -``"23": {`` - -``…`` - -``"1439": 2819 }`` - -``}`` - -``}`` + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: 5468426, + hourly: { + "0": 227850, + "1": 210231, + … + "23": 20457 }, + minute: { + "0": { + "0": 3612, + "1": 3241, + … + "59": 2130 }, + "1": { + "60": … , + + }, + … + "23": { + … + "1439": 2819 } + } + } This allows MongoDB to "skip forward" when updating the minute statistics later in the day, making our performance more uniform and generally faster. -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNP%20Vyojcg&rev=21&h=148&w=410&ac=1 +.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNPVyojcg&rev=21&h=148&w=410&ac=1 :align: center :alt: Design #2: Create separate documents for different granularities @@ -228,88 +184,52 @@ follows: Daily Statistics ^^^^^^^^^^^^^^^^ -``{`` - -``_id: "20101010/site-1/apache_pb.gif",`` - -``metadata: {`` - -``date: ISODate("2000-10-10T00:00:00Z"),`` - -``site: "site-1",`` - -``page: "/apache_pb.gif" },`` - -``hourly: {`` - -``"0": 227850,`` - -``"1": 210231,`` - -``…`` - -``"23": 20457 },`` - -``minute: {`` - -``"0": {`` - -``"0": 3612,`` - -``"1": 3241,`` - -``…`` - -``"59": 2130 },`` - -\` "1": { - :: - "0": … , - -\` - -``},`` - -``…`` - -``"23": {`` - -``…`` - -``"59": 2819 }`` - -``}`` - -``}`` + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + hourly: { + "0": 227850, + "1": 210231, + … + "23": 20457 }, + minute: { + "0": { + "0": 3612, + "1": 3241, + … + "59": 2130 }, + "1": { + "0": … , + + }, + … + "23": { + … + "59": 2819 } + } + } Monthly Statistics ^^^^^^^^^^^^^^^^^^ -\` - -``{`` - -``_id: "201010/site-1/apache_pb.gif",`` - -``metadata: {`` - -``date: ISODate("2000-10-00T00:00:00Z"),`` - -``site: "site-1",`` - -``page: "/apache_pb.gif" },`` - -``daily: {`` - -``"1": 5445326,`` - -``"2": 5214121,`` - -``… }`` +:: -``}`` + { + _id: "201010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-00T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: { + "1": 5445326, + "2": 5214121, + … } + } Operations ---------- @@ -327,67 +247,43 @@ Logging a hit to a page in our website is the main 'write' activity in our system. 
In order to maximize performance, we will be doing in-place updates with the upsert operation: -``from datetime import datetime, time`` - -\` - -``def log_hit(db, dt_utc, site, page):`` - -\` - -``# Update daily stats doc`` - -``id_daily = dt_utc.strftime('%Y%m%d/') + site + page`` - -``hour = dt_utc.hour`` - -``minute = dt_utc.minute`` - -\` - -``# Get a datetime that only includes date info`` - -``d = datetime.combine(dt_utc.date(), time.min)`` - -``query = {`` - -``'_id': id_daily,`` - -``'metadata': { 'date': d, 'site': site, 'page': page } }`` - -``update = { '$inc': {`` - -``'hourly.%d' % (hour,): 1,`` - -``'minute.%d.%d' % (hour,minute): 1 } }`` - -``db.stats.daily.update(query, update, upsert=True)`` - -\` - -``# Update monthly stats document`` - -``id_monthly = dt_utc.strftime('%Y%m/') + site + page`` - -``day_of_month = dt_utc.day`` +:: -``query = {`` + from datetime import datetime, time -``'_id': id_monthly,`` -``'metadata': {`` + def log_hit(db, dt_utc, site, page): -``'date': d.replace(day=1),`` -``'site': site,`` + # Update daily stats doc + id_daily = dt_utc.strftime('%Y%m%d/') + site + page + hour = dt_utc.hour + minute = dt_utc.minute -``'page': page } }`` -``update = { '$inc': {`` + # Get a datetime that only includes date info + d = datetime.combine(dt_utc.date(), time.min) + query = { + '_id': id_daily, + 'metadata': { 'date': d, 'site': site, 'page': page } } + update = { '$inc': { + 'hourly.%d' % (hour,): 1, + 'minute.%d.%d' % (hour,minute): 1 } } + db.stats.daily.update(query, update, upsert=True) -``'daily.%d' % day_of_month: 1} }`` -``db.stats.monthly.update(query, update, upsert=True)`` + # Update monthly stats document + id_monthly = dt_utc.strftime('%Y%m/') + site + page + day_of_month = dt_utc.day + query = { + '_id': id_monthly, + 'metadata': { + 'date': d.replace(day=1), + 'site': site, + 'page': page } } + update = { '$inc': { + 'daily.%d' % day_of_month: 1} } + db.stats.monthly.update(query, update, upsert=True) Since we are using the upsert operation, this function will perform correctly whether the document is already present or not, which is @@ -406,83 +302,50 @@ zero for all time periods so that later, the document doesn't need to grow to accomodate the upserts. 
Here, we add this preallocation as its own function: -``def preallocate(db, dt_utc, site, page):`` - -\` - -``# Get id values`` - -``id_daily = dt_utc.strftime('%Y%m%d/') + site + page`` - -``id_monthly = dt_utc.strftime('%Y%m/') + site + page`` - -\` - -``# Get daily metadata`` - -``daily_metadata = {`` - -``'date': datetime.combine(dt_utc.date(), time.min),`` - -``'site': site,`` - -``'page': page }`` - -``# Get monthly metadata`` - -``monthly_metadata = {`` - -``'date': daily_m['d'].replace(day=1),`` - -``'site': site,`` - -``'page': page }`` - -\` - -``# Initial zeros for statistics`` - -``hourly = dict((str(i), 0) for i in range(24))`` - -``minute = dict(`` - -``(str(i), dict((str(j), 0) for j in range(60)))`` - -``for i in range(24))`` - -``daily = dict((str(i), 0) for i in range(1, 32))`` - -\` - -``# Perform upserts, setting metadata`` - -``db.stats.daily.update(`` - -``{`` - -``'_id': id_daily,`` - -``'hourly': hourly,`` - -``'minute': minute},`` - -``{ '$set': { 'metadata': daily_metadata }},`` - -``upsert=True)`` - -``db.stats.monthly.update(`` - -``{`` - -``'_id': id_monthly,`` - -``'daily': daily },`` - -``{ '$set': { 'm': monthly_metadata }},`` - -``upsert=True)`` +:: -\` + def preallocate(db, dt_utc, site, page): + + + # Get id values + id_daily = dt_utc.strftime('%Y%m%d/') + site + page + id_monthly = dt_utc.strftime('%Y%m/') + site + page + + + # Get daily metadata + daily_metadata = { + 'date': datetime.combine(dt_utc.date(), time.min), + 'site': site, + 'page': page } + # Get monthly metadata + monthly_metadata = { + 'date': daily_m['d'].replace(day=1), + 'site': site, + 'page': page } + + + # Initial zeros for statistics + hourly = dict((str(i), 0) for i in range(24)) + minute = dict( + (str(i), dict((str(j), 0) for j in range(60))) + for i in range(24)) + daily = dict((str(i), 0) for i in range(1, 32)) + + + # Perform upserts, setting metadata + db.stats.daily.update( + { + '_id': id_daily, + 'hourly': hourly, + 'minute': minute}, + { '$set': { 'metadata': daily_metadata }}, + upsert=True) + db.stats.monthly.update( + { + '_id': id_monthly, + 'daily': daily }, + { '$set': { 'm': monthly_metadata }}, + upsert=True) In this case, note that we went ahead and preallocated the monthly document while we were preallocating the daily document. While we could @@ -500,29 +363,21 @@ probabilistically preallocate each time we log a hit, with a probability tuned to make preallocation likely without performing too many unnecessary calls to preallocate: -\_ - -``from random import random`` - -``from datetime import datetime, timedelta, time`` - -\` - -``# Example probability based on 500k hits per day per page`` - -``prob_preallocate = 1.0 / 500000`` - -\` +:: -``def log_hit(db, dt_utc, site, page):`` + from random import random + from datetime import datetime, timedelta, time -``if random.random() < prob_preallocate:`` -``preallocate(db, dt_utc + timedelta(days=1), site_page)`` + # Example probability based on 500k hits per day per page + prob_preallocate = 1.0 / 500000 -``# Update daily stats doc`` -``…`` + def log_hit(db, dt_utc, site, page): + if random.random() < prob_preallocate: + preallocate(db, dt_utc + timedelta(days=1), site_page) + # Update daily stats doc + … Now with a high probability, we will preallocate each document before it's used, preventing the midnight spike as well as eliminating the @@ -535,43 +390,33 @@ One chart that we may be interested in seeing would be the number of hits to a particular page over the last hour. 
In that case, our query is fairly straightforward: -``>>>``db.stats.daily.find_one(`` - -``... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}},`` +:: -``... { 'minute': 1 })`` + >>>``db.stats.daily.find_one( + ... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}}, + ... { 'minute': 1 }) Likewise, we can get the number of hits to a page over the last day, with hourly granularity: -\` - -``>>> db.stats.daily.find_one(`` - -``... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}},`` - -``... { 'hy': 1 })`` +:: -\` + >>> db.stats.daily.find_one( + ... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}}, + ... { 'hy': 1 }) If we want a few days' worth of hourly data, we can get it using the following query: -\` - -``>>> db.stats.daily.find(`` - -``... {`` - -``... 'metadata.date': { '$gte': dt1, '$lte': dt2 },`` - -``... 'metadata.site': 'site-1',`` - -``... 'metadata.page': '/foo.gif'},`` - -``... { 'metadata.date': 1, 'hourly': 1 } },`` +:: -``... sort=[('metadata.date', 1)])`` + >>> db.stats.daily.find( + ... { + ... 'metadata.date': { '$gte': dt1, '$lte': dt2 }, + ... 'metadata.site': 'site-1', + ... 'metadata.page': '/foo.gif'}, + ... { 'metadata.date': 1, 'hourly': 1 } }, + ... sort=[('metadata.date', 1)]) In this case, we are retrieving the date along with the statistics since it's possible (though highly unlikely) that we could have a gap of one @@ -584,20 +429,18 @@ Index support These operations would benefit significantly from indexes on the metadata of the daily statistics: -``>>> db.stats.daily.ensure_index([`` - -``... ('metadata.site', 1),`` - -``... ('metadata.page', 1),`` +:: -``... ('metadata.date', 1)])`` + >>> db.stats.daily.ensure_index([ + ... ('metadata.site', 1), + ... ('metadata.page', 1), + ... ('metadata.date', 1)]) Note in particular that we indexed on the page first, date second. This -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - allows us to perform the third query above (a single page over a range of days) quite efficiently. Having any compound index on page and date, of course, allows us to look up a single day's statistics efficiently. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Get data for a historical chart ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -605,38 +448,27 @@ Get data for a historical chart In order to retrieve daily data for a single month, we can perform the following query: -``>>> db.stats.monthly.find_one(`` - -``... {'metadata':`` - -``... {'date':dt,`` - -``... 'site': 'site-1',`` - -``... 'page':'/foo.gif'}},`` - -``... { 'daily': 1 })`` +:: -\` + >>> db.stats.monthly.find_one( + ... {'metadata': + ... {'date':dt, + ... 'site': 'site-1', + ... 'page':'/foo.gif'}}, + ... { 'daily': 1 }) If we want several months' worth of daily data, of course, we can do the same trick as above: -\` - -``>>> db.stats.monthly.find(`` - -``... {`` - -``... 'metadata.date': { '$gte': dt1, '$lte': dt2 },`` - -``... 'metadata.site': 'site-1',`` - -``... 'metadata.page': '/foo.gif'},`` - -``... { 'metadata.date': 1, 'hourly': 1 } },`` +:: -``... sort=[('metadata.date', 1)])`` + >>> db.stats.monthly.find( + ... { + ... 'metadata.date': { '$gte': dt1, '$lte': dt2 }, + ... 'metadata.site': 'site-1', + ... 'metadata.page': '/foo.gif'}, + ... 
{ 'metadata.date': 1, 'hourly': 1 } }, + ... sort=[('metadata.date', 1)]) Index support ^^^^^^^^^^^^^ @@ -644,19 +476,16 @@ Index support Once again, these operations would benefit significantly from indexes on the metadata of the monthly statistics: -``>>> db.stats.monthly.ensure_index([`` - -``... ('metadata.site', 1),`` - -``... ('metadata.page', 1),`` +:: -``... ('metadata.date', 1)])`` + >>> db.stats.monthly.ensure_index([ + ... ('metadata.site', 1), + ... ('metadata.page', 1), + ... ('metadata.date', 1)]) The order of our index is once again designed to efficiently support -range -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -queries for a single page over several months, as above. +range queries for a single page over several months, as above. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sharding -------- @@ -669,17 +498,14 @@ reasonable shard key for us would thus be ('metadata.site', 'metadata.page'), the site-page combination for which we are calculating statistics: -``>>> db.command('shardcollection', 'stats.daily', {`` - -``... key : { 'metadata.site': 1, 'metadata.page' : 1 } })`` - -``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.monthly', {`` - -``... key : { 'metadata.site': 1, 'metadata.page' : 1 } })`` +:: -``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` + >>> db.command('shardcollection', 'stats.daily', { + ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) + { "collectionsharded" : "stats.daily", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly', { + ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) + { "collectionsharded" : "stats.monthly", "ok" : 1 } One downside to using ('metadata.site', 'metadata.page') as our shard key is that, if one page dominates all our traffic, all updates to that @@ -693,17 +519,14 @@ these will all be handled by the same shard. A (slightly) better shard key would the include the date as well as the site/page so that we could serve different historical ranges with different shards: -``>>> db.command('shardcollection', 'stats.daily', {`` - -``... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}})`` - -``{ "collectionsharded" : "stats.daily", "ok" : 1 }`` - -``>>> db.command('shardcollection', 'stats.monthly', {`` - -``... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}})`` +:: -``{ "collectionsharded" : "stats.monthly", "ok" : 1 }`` + >>> db.command('shardcollection', 'stats.daily', { + ... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}}) + { "collectionsharded" : "stats.daily", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly', { + ... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}}) + { "collectionsharded" : "stats.monthly", "ok" : 1 } It is worth noting in this discussion of sharding that, depending on the number of sites/pages you are tracking and the number of hits per page, @@ -712,57 +535,3 @@ requirements, so sharding may be overkill. In the case of the MongoDB Monitoring Service (MMS), a single shard is able to keep up with the totality of traffic generated by all the customers using this (free) service. - -Page of - -[a]jsr: - -It's worth mentioning that if we organize events as we did in the "log -collection" use case, then queries need to hit lots of documents. the -appeal of this approach is that queries are fast because data is -pre-aggregated. 
- --------------- - -rick446: - -Added a paragraph below to talk about why we don't want individual docs -for each 'tick' - -[b]jsr: - -Let's expand the attribute names to full words so it's more readable. - --------------- - -rick446: - -done - -[c]jsr: - -Similar to table-scan vs. tree. Perhaps a diagram showing the difference -between iterating through 1439 elements in a flat array vs. traversing a -tree. - --------------- - -rick446: - -Added some diagrams to illustrate the BSON layout and # of 'skip -forward' operations needed to do with each schema - -[d]jsr: - -It's a little unclear why we need pre-allocation. Ryan had a graph that -showed big spikes in insert latency at the beginning of the day, and -then another chart showing smoother performance when pre-allocation was -added. Maybe recreate this chart. - --------------- - -rick446: - -I could possibly recreate the chart (though that would take several -hours to actually implement). Is the verbiage I added sufficient to get -across the reasoning? diff --git a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt index 3232331607e..4f342e59b66 100644 --- a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt +++ b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt @@ -19,7 +19,6 @@ case when logging a high-bandwidth event stream). Schema design ------------- -\*\* The schema design in this case will depend largely on the particular format of the event data you want to store. For a simple example, let's take standard request logs from the Apache web server using the combined @@ -27,20 +26,19 @@ log format. For this example we will assume you're using an uncapped collection to store the event data. A line from such a log file might look like the following: -``127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "``[http://www.example.com/start.html](http://www.example.com/start.html)``" "Mozilla/4.08 [en] (Win98; I ;Nav)"`` +:: -\` + 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)" The simplest approach to storing the log data would be putting the exact text of the log record into a document: -``{`` +:: -``_id: ObjectId('4f442120eb03305789000000'),`` - -``line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "``[http://www.example.com/start.html](http://www.example.com/start.html)``" "Mozilla/4.08 [en] (Win98; I ;Nav)"'`` - -``}`` + { + _id: ObjectId('4f442120eb03305789000000'), + line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)"' + } While this is a possible solution, it's not likely to be the optimal solution. For instance, if we decided we wanted to find events that hit @@ -49,7 +47,7 @@ would require a full collection scan. A better approach would be to extract the relevant fields into individual properties. When doing the extraction, we should pay attention to the choice of data types for the various fields. For instance, the date field in the log line -``[10/Oct/2000:13:55:36 -0700]``is 28 bytes long. If we instead store +``[10/Oct/2000:13:55:36 -0700]`` is 28 bytes long. If we instead store this as a UTC timestamp, it shrinks to 8 bytes. 
Storing the date as a timestamp also gives us the advantage of being able to make date range queries, whereas comparing two date *strings* is nearly useless. A @@ -61,78 +59,53 @@ We should also consider what information we might want to omit from the log record. For instance, if we wanted to record exactly what was in the log record, we might create a document like the following: -``{`` - -``_id: ObjectId('4f442120eb03305789000000'),`` - -``host: "127.0.0.1",`` - -``logname: null,`` - -``user: 'frank',`` - -``time: ,`` - -``request: "GET /apache_pb.gif HTTP/1.0",`` - -``status: 200,`` - -``response_size: 2326,`` - -``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` - -``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` - -``}`` - -\` +:: + + { + _id: ObjectId('4f442120eb03305789000000'), + host: "127.0.0.1", + logname: null, + user: 'frank', + time: , + request: "GET /apache_pb.gif HTTP/1.0", + status: 200, + response_size: 2326, + referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + } In most cases, however, we probably are only interested in a subset of the data about the request. Here, we may want to keep the host, time, path, user agent, and referer for a web analytics application: -``{`` - -``_id: ObjectId('4f442120eb03305789000000'),`` +:: -``host: "127.0.0.1",`` - -``time: ISODate("2000-10-10T20:55:36Z"),`` - -``path: "/apache_pb.gif",`` - -``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` - -``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` - -``}`` - -\` + { + _id: ObjectId('4f442120eb03305789000000'), + host: "127.0.0.1", + time: ISODate("2000-10-10T20:55:36Z"), + path: "/apache_pb.gif", + referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + } It might even be possible to remove the time, since ObjectIds embed their the time they are created: -``{`` - -``_id: ObjectId('4f442120eb03305789000000'),`` - -``host: "127.0.0.1",`` +:: -``time: ISODate("2000-10-10T20:55:36Z"),`` + { + _id: ObjectId('4f442120eb03305789000000'), + host: "127.0.0.1", + time: ISODate("2000-10-10T20:55:36Z"), + path: "/apache_pb.gif", + referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + } -``path: "/apache_pb.gif",`` +System Architecture +~~~~~~~~~~~~~~~~~~~ -``referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` - -``user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` - -``}`` - -\` - -**System Architecture** - -\*\* For an event logging system, we are mainly concerned with two performance considerations: 1) how many inserts per second can we perform (this will limit our event throughput) and 2) how will we manage @@ -156,35 +129,24 @@ In many event logging applications, we can accept some degree of risk when it comes to dropping events. In others, we need to be absolutely sure we don't drop any events. MongoDB supports both models. In the case where we can tolerate a risk of loss, we can insert records -*asynchronously* using a fire- and-forget model: - -``>>> import bson`` - -``>>> import pymongo`` - -``>>> from datetime import datetime`` - -``>>> conn = pymongo.Connection()`` - -``>>> db = conn.event_db`` - -``>>> event = {`` - -``... _id: bson.ObjectId(),`` - -``... host: "127.0.0.1",`` - -``... time: datetime(2000,10,10,20,55,36),`` - -``... 
path: "/apache_pb.gif",`` - -``... referer: "``[http://www.example.com/start.html](http://www.example.com/start.html)``",`` - -``... user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"`` - -``...}`` - -``>>> db.events.insert(event, safe=False)`` +*asynchronously* using a fire-and-forget model: + +:: + + >>> import bson + >>> import pymongo + >>> from datetime import datetime + >>> conn = pymongo.Connection() + >>> db = conn.event_db + >>> event = { + ... _id: bson.ObjectId(), + ... host: "127.0.0.1", + ... time: datetime(2000,10,10,20,55,36), + ... path: "/apache_pb.gif", + ... referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + ... user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + ...} + >>> db.events.insert(event, safe=False) This is the fastest approach, as our code doesn't even require a round-trip to the MongoDB server to ensure that the insert was received. @@ -194,29 +156,33 @@ index). If we want to make sure we have an acknowledgement from the server that our insertion succeeded (for some definition of success), we can pass safe=True: -``>>> db.events.insert(event, safe=True)`` +:: -\` + >>> db.events.insert(event, safe=True) If our tolerance for data loss risk is somewhat less, we can require that the server to which we write the data has committed the event to the on-disk journal before we continue operation (``safe=True`` is implied by all the following options): -``>>> db.events.insert(event, j=True)`` +:: -\` + >>> db.events.insert(event, j=True) Finally, if we have *extremely low* tolerance for event data loss, we can require the data to be replicated to multiple secondary servers before returning: -``>>> db.events.insert(event, w=2)`` +:: + + >>> db.events.insert(event, w=2) In this case, we have requested acknowledgement that the data has been replicated to 2 replicas. We can combine options as well: -``>>> db.events.insert(event, j=True, w=2)`` +:: + + >>> db.events.insert(event, j=True, w=2) In this case, we are waiting on both a journal commit *and* a replication acknowledgement. Although this is the safest option, it is @@ -240,9 +206,9 @@ For a web analytics-type operation, getting the logs for a particular web page might be a common operation that we would want to optimize for. In this case, the query would be as follows: -``>>> q_events = db.events.find({'path': '/apache_pb.gif'})`` +:: -\` + >>> q_events = db.events.find({'path': '/apache_pb.gif'}) Note that the sharding setup we use (should we decide to shard this collection) has performance implications for this operation. For @@ -257,11 +223,9 @@ Index support This operation would benefit significantly from an index on the 'path' attribute: -\` - -``>>> db.events.ensure_index('path')`` +:: -\` + >>> db.events.ensure_index('path') One potential downside to this index is that it is relatively randomly distributed, meaning that for efficient operation the entire index @@ -275,20 +239,19 @@ Finding all the events for a particular date We may also want to find all the events for a particular date. In this case, we would perform the following query: -``>>> q_events = db.events.find('time':`` +:: -``... { '$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)})`` - -\` + >>> q_events = db.events.find('time': + ... 
{ '$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)}) Index support ^^^^^^^^^^^^^ In this case, an index on 'time' would provide optimal performance: -\` +:: -``>>> db.events.ensure_index('time')`` + >>> db.events.ensure_index('time') One of the nice things about this index is that it is *right-aligned.* Since we are always inserting events in ascending time order, the @@ -301,19 +264,16 @@ our system memory. Finding all the events for a particular host/date ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -\` - We might also want to analyze the behavior of a particular host on a particular day, perhaps for analyzing suspicious behavior by a particular IP address. In that case, we would write a query such as: -``>>> q_events = db.events.find({`` - -``... 'host': '127.0.0.1',`` - -``... 'time': {'$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)}`` +:: -``... })`` + >>> q_events = db.events.find({ + ... 'host': '127.0.0.1', + ... 'time': {'$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)} + ... }) Index support ^^^^^^^^^^^^^ @@ -322,82 +282,41 @@ Once again, our choice of indexes affects the performance characteristics of this query significantly. For instance, suppose we create a compound index on (time, host): -``>>> db.events.ensure_index([('time', 1), ('host', 1)])`` - -In this case, the query plan would be the following (retrieved via -q\_events.explain()): - -``{`` - -``…`` - -``u'cursor': u'BtreeCursor time_1_host_1',`` - -``u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],`` - -``u'time': [[datetime.datetime(2000, 10, 10, 0, 0),`` - -``datetime.datetime(2000, 10, 11, 0, 0)]]},`` - -``…`` - -``u'millis': 4,`` +:: -``u'n': 11,`` + >>> db.events.ensure_index([('time', 1), ('host', 1)]) -``…`` - -``u'nscanned': 1296,`` - -``u'nscannedObjects': 11,`` - -``…`` - -``}`` - -\` +In this case, the query plan would be the following (retrieved via +q\_events.explain()): { … u'cursor': u'BtreeCursor time\_1\_host\_1', +u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], u'time': +[[datetime.datetime(2000, 10, 10, 0, 0), datetime.datetime(2000, 10, 11, +0, 0)]]}, … u'millis': 4, u'n': 11, … u'nscanned': 1296, +u'nscannedObjects': 11, … } If, however, we create a compound index on (host, time)... -\` +:: -``>>> db.events.ensure_index([('host', 1), ('time', 1)])`` - -\` + >>> db.events.ensure_index([('host', 1), ('time', 1)]) We get a much more efficient query plan and much better performance: -\` - -``{`` - -``…`` - -``u'cursor': u'BtreeCursor host_1_time_1',`` - -``u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],`` - -``u'time': [[datetime.datetime(2000, 10, 10, 0, 0),`` - -``datetime.datetime(2000, 10, 11, 0, 0)]]},`` - -``…`` - -``u'millis': 0,`` - -``u'n': 11,`` - -``…`` - -``u'nscanned': 11,`` - -``u'nscannedObjects': 11,`` - -``…`` - -``}`` - -\` +:: + + { + … + u'cursor': u'BtreeCursor host_1_time_1', + u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], + u'time': [[datetime.datetime(2000, 10, 10, 0, 0), + datetime.datetime(2000, 10, 11, 0, 0)]]}, + … + u'millis': 0, + u'n': 11, + … + u'nscanned': 11, + u'nscannedObjects': 11, + … + } In this case, MongoDB is able to visit just 11 entries in the index to satisfy the query, whereas in the first it needed to visit 1296 entries. @@ -428,47 +347,27 @@ of MongoDB. Suppose we want to find out how many requests there were for each day and page over the last month, for instance. 
In this case, we could build up the following aggregation pipeline: -\` - -``>>> result = db.command('aggregate', 'events', pipeline=[`` - -``... { '$match': {`` - -``... 'time': {`` - -``... '$gte': datetime(2000,10,1),`` - -``... '$lt': datetime(2000,11,1) } } },`` - -``... { '$project': {`` - -``... 'path': 1,`` - -``... 'date': {`` - -``... 'y': { '$year': '$time' },`` - -``... 'm': { '$month': '$time' },`` - -``... 'd': { '$dayOfMonth': '$time' } } } },`` - -``... { '$group': {`` - -``... '_id': {`` - -``... 'p':'$path',`` - -``... 'y': '$date.y',`` - -``... 'm': '$date.m',`` - -``... 'd': '$date.d' },`` - -``... 'hits': { '$sum': 1 } } },`` - -``... ])`` - -\` +:: + + >>> result = db.command('aggregate', 'events', pipeline=[ + ... { '$match': { + ... 'time': { + ... '$gte': datetime(2000,10,1), + ... '$lt': datetime(2000,11,1) } } }, + ... { '$project': { + ... 'path': 1, + ... 'date': { + ... 'y': { '$year': '$time' }, + ... 'm': { '$month': '$time' }, + ... 'd': { '$dayOfMonth': '$time' } } } }, + ... { '$group': { + ... '_id': { + ... 'p':'$path', + ... 'y': '$date.y', + ... 'm': '$date.m', + ... 'd': '$date.d' }, + ... 'hits': { '$sum': 1 } } }, + ... ]) The performance of this aggregation is dependent, of course, on our choice of shard key if we're sharding. What we'd like to ensure is that @@ -482,11 +381,9 @@ Index support In this case, we want to make sure we have an index on the initial $match query: -\` +:: -``>>> db.events.ensure_index('time')`` - -\` + >>> db.events.ensure_index('time') If we already have an index on ('time', 'host') as discussed above, however, there is no need to create a separate index on 'time' alone, @@ -524,33 +421,21 @@ Option 1: Shard on a random(ish) key Supose instead that we decided to shard on a key with a random distribution, say the md5 or sha1 hash of the '\_id' field: -\` - -``>>> from bson import Binary`` - -``>>> from hashlib import sha1`` - -``>>>`` - -``>>> # Introduce the synthetic shard key (this should actually be done at`` - -``>>> # event insertion time)`` - -``>>>`` - -``>>> for ev in db.events.find({}, {'_id':1}):`` - -``... ssk = Binary(sha1(str(ev._id))).digest())`` - -``... db.events.update({'_id':ev['_id']}, {'$set': {'ssk': ssk} })`` - -``...`` - -``>>> db.command('shardcollection', 'events', {`` - -``... key : { 'ssk' : 1 } })`` - -``{ "collectionsharded" : "events", "ok" : 1 }`` +:: + + >>> from bson import Binary + >>> from hashlib import sha1 + >>> + >>> # Introduce the synthetic shard key (this should actually be done at + >>> # event insertion time) + >>> + >>> for ev in db.events.find({}, {'_id':1}): + ... ssk = Binary(sha1(str(ev._id))).digest()) + ... db.events.update({'_id':ev['_id']}, {'$set': {'ssk': ssk} }) + ... + >>> db.command('shardcollection', 'events', { + ... key : { 'ssk' : 1 } }) + { "collectionsharded" : "events", "ok" : 1 } This does introduce some complexity into our application in order to generate the random key, but it provides us linear scaling on our @@ -570,11 +455,11 @@ Option 2: Shard on a naturally evenly-distributed key In this case, we might choose to shard on the 'path' attribute, since it seems to be relatively evenly distributed: -``>>> db.command('shardcollection', 'events', {`` +:: -``... key : { 'path' : 1 } })`` - -``{ "collectionsharded" : "events", "ok" : 1 }`` + >>> db.command('shardcollection', 'events', { + ... 
key : { 'path' : 1 } }) + { "collectionsharded" : "events", "ok" : 1 } This has a couple of advantages: a) writes tend to be evenly balanced, and b) reads tend to be selective (assuming they include the 'path' @@ -593,13 +478,11 @@ This approach is perhaps the best combination of read and write performance for our application. We can define the shard key to be (path, sha1(\_id)): -\` - -``>>> db.command('shardcollection', 'events', {`` - -``... key : { 'path' : 1, 'ssk': 1 } })`` +:: -``{ "collectionsharded" : "events", "ok" : 1 }`` + >>> db.command('shardcollection', 'events', { + ... key : { 'path' : 1, 'ssk': 1 } }) + { "collectionsharded" : "events", "ok" : 1 } We still need to calculate a synthetic key in the application client, but in return we get good write balancing as well as good read From 3785a2f7523eb736b1783cefaceed4ce53ad69cb Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 07:43:37 -0700 Subject: [PATCH 03/20] Conversion from Google docs complete Signed-off-by: Rick Copeland --- .../usecase/cms-_metadata_and_asset_management.txt | 6 +++--- source/tutorial/usecase/cms-_storing_comments.txt | 5 ++--- .../tutorial/usecase/ecommerce-_category_hierarchy.txt | 2 +- .../usecase/ecommerce-_inventory_management.txt | 10 ++-------- source/tutorial/usecase/ecommerce-_product_catalog.txt | 2 +- .../usecase/real_time_analytics-_storing_log_data.txt | 2 -- 6 files changed, 9 insertions(+), 18 deletions(-) diff --git a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt index f03ec0fc448..e8c719426c3 100644 --- a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt +++ b/source/tutorial/usecase/cms-_metadata_and_asset_management.txt @@ -1,8 +1,8 @@ CMS: Metadata and Asset Management ================================== -Problem[a] ----------- +Problem +------- You are designing a content management system (CMS) and you want to use MongoDB to store the content of your sites. @@ -441,4 +441,4 @@ collection in gridfs): This actually still maintains our query-routability constraint, since all reads from gridfs must first look up the document in 'files' and then look up the chunks separately (though the GridFS API sometimes -hides this detail from us.) Page of +hides this detail from us.) diff --git a/source/tutorial/usecase/cms-_storing_comments.txt b/source/tutorial/usecase/cms-_storing_comments.txt index 1a3bccb61e8..4b750244e6f 100644 --- a/source/tutorial/usecase/cms-_storing_comments.txt +++ b/source/tutorial/usecase/cms-_storing_comments.txt @@ -1,8 +1,8 @@ CMS: Storing Comments ===================== -Problem[a] ----------- +Problem +------- In your content management system (CMS) you would like to store user-generated comments on the various types of content you generate. @@ -587,4 +587,3 @@ comment page in our shard key: ... key : { 'discussion_id' : 1, ``'page'``: 1 } }) { "collectionsharded" : "comment_pages", "ok" : 1 } -Page of diff --git a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt index 7d28d45b13d..a45fb4b48b8 100644 --- a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt +++ b/source/tutorial/usecase/ecommerce-_category_hierarchy.txt @@ -225,7 +225,7 @@ the category collection would then be the following: { "collectionsharded" : "categories", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. 
Page of +default to using \_id as a shard key. .. |image0| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sRXRjZMEZDN2azKBlsOoXoA&rev=7&h=250&w=443&ac=1 .. |image1| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sqRIXKA2lGr_bm5ysM7KWQA&rev=3&h=354&w=443&ac=1 diff --git a/source/tutorial/usecase/ecommerce-_inventory_management.txt b/source/tutorial/usecase/ecommerce-_inventory_management.txt index 71016e619b9..7a53bb88222 100644 --- a/source/tutorial/usecase/ecommerce-_inventory_management.txt +++ b/source/tutorial/usecase/ecommerce-_inventory_management.txt @@ -111,7 +111,7 @@ sufficient inventory to satisfy the request: # Update the inventory result = db.inventory.update( {'_id':sku, 'qty': {'$gte': qty}}, - ``{'$inc': {'qty': -qty}`[a]`, + {'$inc': {'qty': -qty}, '$push': { 'carted': { 'qty': qty, 'cart_id':cart_id, 'timestamp': now } } }, @@ -390,10 +390,4 @@ collections, then, would be the following: { "collectionsharded" : "cart", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. Page of [a]jsr: Actually isn't a -$dec command. Just $inc by a negative value. Some drivers seem to have -added $dec as a helper, but probably shouldn't :) - --------------- - -rick446: fixed +default to using \_id as a shard key. diff --git a/source/tutorial/usecase/ecommerce-_product_catalog.txt b/source/tutorial/usecase/ecommerce-_product_catalog.txt index d18c3f9553c..c04de86184a 100644 --- a/source/tutorial/usecase/ecommerce-_product_catalog.txt +++ b/source/tutorial/usecase/ecommerce-_product_catalog.txt @@ -466,7 +466,7 @@ benefits from sharding due to a) the larger amount of memory available to store our indexes and b) the fact that searches will be parallelized across shards, reducing search latency. -Scaling Queries with read\_preference +Scaling queries with read\_preference ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Although sharding is the best way to scale reads and writes, it's not diff --git a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt index 4f342e59b66..3ba087296ae 100644 --- a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt +++ b/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt @@ -573,5 +573,3 @@ during the backfill, but can be reduced to 30 days in ongoing operations, you might consider using multiple databases. The complexity cost for multiple databases, however, is significant, so this option should only be taken after thorough analysis. - -Page of From 5484dc8560d9067b0915b3f058aea330d6534dc9 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 08:59:55 -0700 Subject: [PATCH 04/20] Renaming files, updating to match style guide Signed-off-by: Rick Copeland --- ... => cms-metadata-and-asset-management.txt} | 58 +-- ..._comments.txt => cms-storing-comments.txt} | 1 - ...y.txt => ecommerce-category-hierarchy.txt} | 0 ...txt => ecommerce-inventory-management.txt} | 0 ...alog.txt => ecommerce-product-catalog.txt} | 0 source/tutorial/usecase/index.txt | 16 +- ...me-analytics-hierarchical-aggregation.txt} | 0 ...-time-analytics-preaggregated-reports.txt} | 0 ... 
real-time-analytics-storing-log-data.txt} | 408 +++++++++--------- 9 files changed, 249 insertions(+), 234 deletions(-) rename source/tutorial/usecase/{cms-_metadata_and_asset_management.txt => cms-metadata-and-asset-management.txt} (91%) rename source/tutorial/usecase/{cms-_storing_comments.txt => cms-storing-comments.txt} (99%) rename source/tutorial/usecase/{ecommerce-_category_hierarchy.txt => ecommerce-category-hierarchy.txt} (100%) rename source/tutorial/usecase/{ecommerce-_inventory_management.txt => ecommerce-inventory-management.txt} (100%) rename source/tutorial/usecase/{ecommerce-_product_catalog.txt => ecommerce-product-catalog.txt} (100%) rename source/tutorial/usecase/{real_time_analytics-_hierarchical_aggregation.txt => real-time-analytics-hierarchical-aggregation.txt} (100%) rename source/tutorial/usecase/{real_time_analytics-_preaggregated_reports.txt => real-time-analytics-preaggregated-reports.txt} (100%) rename source/tutorial/usecase/{real_time_analytics-_storing_log_data.txt => real-time-analytics-storing-log-data.txt} (58%) diff --git a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt b/source/tutorial/usecase/cms-metadata-and-asset-management.txt similarity index 91% rename from source/tutorial/usecase/cms-_metadata_and_asset_management.txt rename to source/tutorial/usecase/cms-metadata-and-asset-management.txt index e8c719426c3..3ab74be575f 100644 --- a/source/tutorial/usecase/cms-_metadata_and_asset_management.txt +++ b/source/tutorial/usecase/cms-metadata-and-asset-management.txt @@ -1,49 +1,53 @@ +================================== CMS: Metadata and Asset Management ================================== Problem -------- +======= You are designing a content management system (CMS) and you want to use MongoDB to store the content of your sites. -Solution overview ------------------ +Solution Overview +================= -Our approach in this solution is inspired by the design of Drupal, an +The approach in this solution is inspired by the design of Drupal, an open source CMS written in PHP on relational databases that is available -at `http://www.drupal.org `_. In this case, we +at `http://www.drupal.org `_. In this case, you will take advantage of MongoDB's dynamically typed collections to -*polymorphically* store all our content nodes in the same collection. -Our navigational information will be stored in its own collection since +*polymorphically* store all your content nodes in the same collection. +Your navigational information will be stored in its own collection since it has relatively little in common with our content nodes. -The main node types with which we are concerned here are: - -- **Basic page** : Basic pages are useful for displaying - infrequently-changing text such as an 'about' page. With a basic - page, the main information we are concerned with is the title and the - content. -- **Blog entry** : Blog entries record a "stream" of posts from users - on the CMS and store title, author, content, and date as relevant - information. -- **Photo** : Photos participate in photo galleries, and store title, - description, author, and date along with the actual photo binary - data. +The main node types with which this use case is concerned are: + +Basic page + Basic pages are useful for displaying + infrequently-changing text such as an 'about' page. With a basic + page, the salient information is the title and the + content. 
+Blog entry + Blog entries record a "stream" of posts from users + on the CMS and store title, author, content, and date as relevant + information. +Photo + Photos participate in photo galleries, and store title, + description, author, and date along with the actual photo binary + data. Schema design -------------- +============= -Our node collection will contain documents of various formats, but they +Your node collection will contain documents of various formats, but they will all share a similar structure, with each document including an \_id, type, section, slug, title, creation date, author, and tags. The -'section' property is used to identify groupings of items (grouped to a -particular blog or photo gallery, for instance). The 'slug' property is +`section` property is used to identify groupings of items (grouped to a +particular blog or photo gallery, for instance). The `slug` property is a url-friendly representation of the node that is unique within its section, and is used for mapping URLs to nodes. Each document also -contains a 'detail' field which will vary per document type: +contains a `detail` field which will vary per document type: -:: +.. code-block:: javascript { _id: ObjectId(…), @@ -53,9 +57,9 @@ contains a 'detail' field which will vary per document type: section: 'my-photos', slug: 'about', title: 'About Us', - created: ISODate(…), + created: ISODate(...), author: { _id: ObjectId(…), name: 'Rick' }, - tags: [ … ], + tags: [ ... ], detail: { text: '# About Us\n…' } } } diff --git a/source/tutorial/usecase/cms-_storing_comments.txt b/source/tutorial/usecase/cms-storing-comments.txt similarity index 99% rename from source/tutorial/usecase/cms-_storing_comments.txt rename to source/tutorial/usecase/cms-storing-comments.txt index 4b750244e6f..0062111cb4c 100644 --- a/source/tutorial/usecase/cms-_storing_comments.txt +++ b/source/tutorial/usecase/cms-storing-comments.txt @@ -586,4 +586,3 @@ comment page in our shard key: >>> db.command('shardcollection', 'comment_pages', { ... key : { 'discussion_id' : 1, ``'page'``: 1 } }) { "collectionsharded" : "comment_pages", "ok" : 1 } - diff --git a/source/tutorial/usecase/ecommerce-_category_hierarchy.txt b/source/tutorial/usecase/ecommerce-category-hierarchy.txt similarity index 100% rename from source/tutorial/usecase/ecommerce-_category_hierarchy.txt rename to source/tutorial/usecase/ecommerce-category-hierarchy.txt diff --git a/source/tutorial/usecase/ecommerce-_inventory_management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt similarity index 100% rename from source/tutorial/usecase/ecommerce-_inventory_management.txt rename to source/tutorial/usecase/ecommerce-inventory-management.txt diff --git a/source/tutorial/usecase/ecommerce-_product_catalog.txt b/source/tutorial/usecase/ecommerce-product-catalog.txt similarity index 100% rename from source/tutorial/usecase/ecommerce-_product_catalog.txt rename to source/tutorial/usecase/ecommerce-product-catalog.txt diff --git a/source/tutorial/usecase/index.txt b/source/tutorial/usecase/index.txt index ef59866bcb5..b517cf60214 100644 --- a/source/tutorial/usecase/index.txt +++ b/source/tutorial/usecase/index.txt @@ -4,11 +4,11 @@ Use Cases .. 
toctree:: :maxdepth: 1 - real_time_analytics-_storing_log_data - real_time_analytics-_preaggregated_reports - real_time_analytics-_hierarchical_aggregation - ecommerce-_product_catalog - ecommerce-_inventory_management - ecommerce-_category_hierarchy - cms-_metadata_and_asset_management - cms-_storing_comments + real-time-analytics-storing-log-data + real-time-analytics-preaggregated-reports + real-time-analytics-hierarchical-aggregation + ecommerce-product-catalog + ecommerce-inventory-management + ecommerce-category-hierarchy + cms-metadata-and-asset-management + cms-storing-comments diff --git a/source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt similarity index 100% rename from source/tutorial/usecase/real_time_analytics-_hierarchical_aggregation.txt rename to source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt diff --git a/source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt similarity index 100% rename from source/tutorial/usecase/real_time_analytics-_preaggregated_reports.txt rename to source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt diff --git a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt b/source/tutorial/usecase/real-time-analytics-storing-log-data.txt similarity index 58% rename from source/tutorial/usecase/real_time_analytics-_storing_log_data.txt rename to source/tutorial/usecase/real-time-analytics-storing-log-data.txt index 3ba087296ae..9e24872de2c 100644 --- a/source/tutorial/usecase/real_time_analytics-_storing_log_data.txt +++ b/source/tutorial/usecase/real-time-analytics-storing-log-data.txt @@ -1,39 +1,40 @@ +===================================== Real Time Analytics: Storing Log Data ===================================== Problem -------- +======= You have one or more servers generating events that you would like to persist to a MongoDB collection. -Solution overview ------------------ +Solution Overview +================= -For this solution, we will assume that each server generating events has -access to the MongoDB server(s). We will also assume that the consumer -of the event data has access to the MongoDB server(s) and that the query +This solution will assume that each server generating events, as well as the +consumer of the event data, has access to the MongoDB server(s). Furthermore, +this design will optimize based on the assumption that the query rate is (substantially) lower than the insert rate (as is most often the case when logging a high-bandwidth event stream). -Schema design -------------- +Schema Design +============= The schema design in this case will depend largely on the particular format of the event data you want to store. For a simple example, let's take standard request logs from the Apache web server using the combined -log format. For this example we will assume you're using an uncapped +log format. This example assumes you're using an uncapped collection to store the event data. A line from such a log file might look like the following: -:: +.. code-block:: text 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)" The simplest approach to storing the log data would be putting the exact text of the log record into a document: -:: +.. 
code-block:: javascript { _id: ObjectId('4f442120eb03305789000000'), @@ -41,25 +42,25 @@ text of the log record into a document: } While this is a possible solution, it's not likely to be the optimal -solution. For instance, if we decided we wanted to find events that hit -the same page, we would need to use a regular expression query, which +solution. For instance, if you decided you wanted to find events that hit +the same page, you'd need to use a regular expression query, which would require a full collection scan. A better approach would be to extract the relevant fields into individual properties. When doing the -extraction, we should pay attention to the choice of data types for the +extraction, it's important to pay attention to the choice of data types for the various fields. For instance, the date field in the log line -``[10/Oct/2000:13:55:36 -0700]`` is 28 bytes long. If we instead store +``[10/Oct/2000:13:55:36 -0700]`` is 28 bytes long. If you instead store this as a UTC timestamp, it shrinks to 8 bytes. Storing the date as a -timestamp also gives us the advantage of being able to make date range +timestamp also allows you to make date range queries, whereas comparing two date *strings* is nearly useless. A similar argument applies to numeric fields; storing them as strings is suboptimal, taking up more space and making the appropriate types of queries much more difficult. -We should also consider what information we might want to omit from the -log record. For instance, if we wanted to record exactly what was in the -log record, we might create a document like the following: +It's also important to consider what information you might want to omit from the +log record. For instance, if you wanted to record exactly what was in the +log record, you might create a document like the following: -:: +.. code-block:: javascript { _id: ObjectId('4f442120eb03305789000000'), @@ -74,11 +75,12 @@ log record, we might create a document like the following: user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" } -In most cases, however, we probably are only interested in a subset of -the data about the request. Here, we may want to keep the host, time, + +In most cases, however, you're probably only interested in a subset of +the data about the request. Here, you may want to keep the host, time, path, user agent, and referer for a web analytics application: -:: +.. code-block:: javascript { _id: ObjectId('4f442120eb03305789000000'), @@ -89,10 +91,10 @@ path, user agent, and referer for a web analytics application: user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" } -It might even be possible to remove the time, since ObjectIds embed -their the time they are created: +It might even be possible to remove the time, since ``ObjectId``\ s embed +the time they are created: -:: +.. code-block:: javascript { _id: ObjectId('4f442120eb03305789000000'), @@ -104,34 +106,34 @@ their the time they are created: } System Architecture -~~~~~~~~~~~~~~~~~~~ +------------------- -For an event logging system, we are mainly concerned with two -performance considerations: 1) how many inserts per second can we -perform (this will limit our event throughput) and 2) how will we manage +An event logging system is mainly concerned with two +performance considerations: 1) how many inserts per second can it +perform (this will limit its event throughput) and 2) how will the system manage the growth of event data. Concerning insert performance, the best way to scale the architecture is via sharding.
Operations ----------- +========== -The main performance-critical operation we're concerned with in storing -an event log is the insertion speed. However, we also need to be able to +The main performance-critical operation in storing +an event log is the insertion speed. However, you also need to be able to query the event data for relevant statistics. This section will describe each of these operations, using the Python programming language and the -pymongo MongoDB driver. These operations would be similar in other +``pymongo`` MongoDB driver. These operations would be similar in other languages as well. -Inserting a log record -~~~~~~~~~~~~~~~~~~~~~~ +Inserting a Log Record +---------------------- -In many event logging applications, we can accept some degree of risk -when it comes to dropping events. In others, we need to be absolutely -sure we don't drop any events. MongoDB supports both models. In the case -where we can tolerate a risk of loss, we can insert records +In many event logging applications, you might accept some degree of risk +when it comes to dropping events. In others, you need to be absolutely +sure you don't drop any events. MongoDB supports both models. In the case +where you can tolerate a risk of loss, you can insert records *asynchronously* using a fire-and-forget model: -:: +.. code-block:: python >>> import bson >>> import pymongo @@ -148,82 +150,81 @@ where we can tolerate a risk of loss, we can insert records ...} >>> db.events.insert(event, safe=False) -This is the fastest approach, as our code doesn't even require a +This is the fastest approach, as this code doesn't even require a round-trip to the MongoDB server to ensure that the insert was received. -It is thus also the riskiest approach, as we will not detect network -failures nor server errors (such as DuplicateKeyErrors on a unique -index). If we want to make sure we have an acknowledgement from the -server that our insertion succeeded (for some definition of success), we +It is thus also the riskiest approach, as network and server failures (such as +DuplicateKeyErrors on a unique index) will go undetected. If you want to make +sure you have an acknowledgement from the +server that your insertion succeeded (for some definition of success), you can pass safe=True: -:: +.. code-block:: python >>> db.events.insert(event, safe=True) -If our tolerance for data loss risk is somewhat less, we can require -that the server to which we write the data has committed the event to -the on-disk journal before we continue operation (``safe=True`` is -implied by all the following options): +If your tolerance for data loss risk is somewhat less, you can require +that the server to which you write the data has committed the event to +the on-disk journal before you continue operation (``safe=True`` is +implied by all the following options:) -:: +.. code-block:: python >>> db.events.insert(event, j=True) -Finally, if we have *extremely low* tolerance for event data loss, we +Finally, if you have *extremely low* tolerance for event data loss, you can require the data to be replicated to multiple secondary servers before returning: -:: +.. code-block:: python >>> db.events.insert(event, w=2) -In this case, we have requested acknowledgement that the data has been -replicated to 2 replicas. We can combine options as well: +In this case, you will get acknowledgement that the data has been +replicated to 2 replicas. You can combine options as well: -:: +.. 
code-block:: python >>> db.events.insert(event, j=True, w=2) -In this case, we are waiting on both a journal commit *and* a +In this case, the insert waits on both a journal commit *and* a replication acknowledgement. Although this is the safest option, it is also the slowest, so you should be aware of the trade-off when performing your inserts. -Aside: Bulk Inserts -^^^^^^^^^^^^^^^^^^^ +.. note:: -If at all possible in our application architecture, we should consider -using bulk inserts to insert event data. All the options discussed above -apply to bulk inserts, but you can actually pass multiple events as the -first parameter to .insert(). By passing multiple documents into a -single insert() call, we are able to amortize the performance penalty we -incur by using the 'safe' options such as j=True or w=2. + If at all possible in your application architecture, you should consider + using bulk inserts to insert event data. All the options discussed above + apply to bulk inserts, but you can actually pass multiple events as the + first parameter to .insert(). By passing multiple documents into a + single insert() call, you are able to amortize the performance penalty you + incur by using the 'safe' options such as j=True or w=2. -Finding all the events for a particular page -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Finding All the Events for a Particular Page +-------------------------------------------- For a web analytics-type operation, getting the logs for a particular -web page might be a common operation that we would want to optimize for. +web page might be a common operation for which you would want to optimize. In this case, the query would be as follows: -:: +.. code-block:: python >>> q_events = db.events.find({'path': '/apache_pb.gif'}) -Note that the sharding setup we use (should we decide to shard this +Note that the sharding setup you use (should you decide to shard this collection) has performance implications for this operation. For -instance, if we shard on the 'path' property, then this query will be -handled by a single shard, whereas if we shard on some other property or +instance, if you shard on the 'path' property, then this query will be +handled by a single shard, whereas if you shard on some other property or combination of properties, the mongos instance will be forced to do a scatter/gather operation which involves *all* the shards. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ This operation would benefit significantly from an index on the 'path' attribute: -:: +.. code-block:: python >>> db.events.ensure_index('path') @@ -233,102 +234,113 @@ should be resident in RAM. Since there is likely to be a relatively small number of distinct paths in the index, however, this will probably not be a problem. -Finding all the events for a particular date -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Finding All the Events for a Particular Date +-------------------------------------------- -We may also want to find all the events for a particular date. In this -case, we would perform the following query: +You may also want to find all the events for a particular date. In this +case, you'd perform the following query: -:: +.. code-block:: python >>> q_events = db.events.find('time': ... { '$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)}) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ In this case, an index on 'time' would provide optimal performance: -:: +..
code-block:: python >>> db.events.ensure_index('time') -One of the nice things about this index is that it is *right-aligned.* -Since we are always inserting events in ascending time order, the +One of the nice things about this index is that it is *right-aligned*. +Since you are always inserting events in ascending time order, the right-most slice of the B-tree will always be resident in RAM. So long -as our queries focus mainly on recent events, the *only* part of the +as your queries focus mainly on recent events, the *only* part of the index that needs to be resident in RAM is the right-most slice of the -B-tree, allowing us to keep quite a large index without using up much of -our system memory. +B-tree, allowing MongoDB to keep quite a large index without using up much of +the system memory. -Finding all the events for a particular host/date -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Finding All the Events for a Particular Host/Date +------------------------------------------------- -We might also want to analyze the behavior of a particular host on a +You might also want to analyze the behavior of a particular host on a particular day, perhaps for analyzing suspicious behavior by a -particular IP address. In that case, we would write a query such as: +particular IP address. In that case, you'd write a query such as: -:: +.. code-block:: python >>> q_events = db.events.find({ ... 'host': '127.0.0.1', ... 'time': {'$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)} ... }) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -Once again, our choice of indexes affects the performance -characteristics of this query significantly. For instance, suppose we +Once again, your choice of indexes affects the performance +characteristics of this query significantly. For instance, suppose you create a compound index on (time, host): -:: +.. code-block:: python >>> db.events.ensure_index([('time', 1), ('host', 1)]) In this case, the query plan would be the following (retrieved via -q\_events.explain()): { … u'cursor': u'BtreeCursor time\_1\_host\_1', -u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], u'time': -[[datetime.datetime(2000, 10, 10, 0, 0), datetime.datetime(2000, 10, 11, -0, 0)]]}, … u'millis': 4, u'n': 11, … u'nscanned': 1296, -u'nscannedObjects': 11, … } +``q_events.explain()``): -If, however, we create a compound index on (host, time)... +.. code-block:: python -:: + { ... + u'cursor': u'BtreeCursor time_1_host_1', + u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], + u'time': [ + [ datetime.datetime(2000, 10, 10, 0, 0), + datetime.datetime(2000, 10, 11, 0, 0)]] + }, + ... + u'millis': 4, + u'n': 11, + u'nscanned': 1296, + u'nscannedObjects': 11, + ... } - >>> db.events.ensure_index([('host', 1), ('time', 1)]) +If, however, you create a compound index on (host, time)... -We get a much more efficient query plan and much better performance: +.. code-block:: python -:: + >>> db.events.ensure_index([('host', 1), ('time', 1)]) - { - … - u'cursor': u'BtreeCursor host_1_time_1', - u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], - u'time': [[datetime.datetime(2000, 10, 10, 0, 0), - datetime.datetime(2000, 10, 11, 0, 0)]]}, - … - u'millis': 0, - u'n': 11, - … - u'nscanned': 11, - u'nscannedObjects': 11, - … +you get a much more efficient query plan and much better performance: + +.. code-block:: python + + { ...
+ u'cursor': u'BtreeCursor host_1_time_1', + u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']], + u'time': [[datetime.datetime(2000, 10, 10, 0, 0), + datetime.datetime(2000, 10, 11, 0, 0)]]}, + ... + u'millis': 0, + u'n': 11, + ... + u'nscanned': 11, + u'nscannedObjects': 11, + ... } In this case, MongoDB is able to visit just 11 entries in the index to satisfy the query, whereas in the first it needed to visit 1296 entries. This is because the query using (host, time) needs to search the index -range from ('127.0.0.1', datetime(2000,10,10)) to ('127.0.0.1', -datetime(2000,10,11)) to satisfy the above query, whereas if we used -(time, host), the index range would be (datetime(2000,10,10), MIN\_KEY) -to (datetime(2000,10,10), MAX\_KEY), a much larger range (in this case, +range from ``('127.0.0.1', datetime(2000,10,10))`` to +``('127.0.0.1', datetime(2000,10,11))`` to satisfy the above query, whereas if you +used (time, host), the index range would be ``(datetime(2000,10,10), MIN_KEY)`` +to ``(datetime(2000,10,10), MAX_KEY)``, a much larger range (in this case, 1296 entries) which will yield a correspondingly slower performance. Although the index order has an impact on the performance of the query, -one thing to keep in mind is that an index scan is *much* faster than a +one thing to keep in mind is that an index scan is still *much* faster than a collection scan. So using a (time, host) index would still be much faster than an index on (time) alone. There is also the issue of right-alignedness to consider, as the (time, host) index will be @@ -337,17 +349,17 @@ that the right-alignedness of a (time, host) index will make up for the increased number of index entries that need to be visited to satisfy this query. -Counting the number of requests by day and page -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Counting the Number of Requests by Day and Page +----------------------------------------------- -MongoDB 2.1 introduced a new aggregation framework that allows us to +MongoDB 2.1 introduced a new aggregation framework that allows you to perform queries that aggregate large numbers of documents significantly faster than the old 'mapreduce' and 'group' commands in prior versions -of MongoDB. Suppose we want to find out how many requests there were for -each day and page over the last month, for instance. In this case, we +of MongoDB. Suppose you'd like to find out how many requests there were for +each day and page over the last month, for instance. In this case, you could build up the following aggregation pipeline: -:: +.. code-block:: python >>> result = db.command('aggregate', 'events', pipeline=[ ... { '$match': { @@ -369,59 +381,59 @@ could build up the following aggregation pipeline: ... 'hits': { '$sum': 1 } } }, ... ]) -The performance of this aggregation is dependent, of course, on our -choice of shard key if we're sharding. What we'd like to ensure is that -all the items in a particular 'group' are on the same server, which we -can do by sharding on date (probably not wise, as we discuss below) or +The performance of this aggregation is dependent, of course, on your +choice of shard key if you're sharding. What you'd like to ensure is that +all the items in a particular 'group' are on the same server, which you +can do by sharding on date (probably not wise, as discussed below) or path (possibly a good idea).
-Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -In this case, we want to make sure we have an index on the initial +In this case, you want to make sure you have an index on the initial $match query: -:: +.. code-block:: python >>> db.events.ensure_index('time') -If we already have an index on ('time', 'host') as discussed above, +If you already have an index on (time, host) as discussed above, however, there is no need to create a separate index on 'time' alone, -since the ('time', 'host') index can be used to satisfy range queries on +since the (time, host) index can be used to satisfy range queries on 'time' alone. Sharding --------- +======== -Our insertion rate is going to be limited by the number of shards we -maintain in our cluster as well as by our choice of a shard key. The +Your insertion rate is going to be limited by the number of shards you +maintain in your cluster as well as by the choice of a shard key. The choice of a shard key is important because MongoDB uses *range-based -sharding* . What we *want* to happen is for the insertions to be -balanced equally among the shards, so we want to avoid using something -like a timestamp, sequence number, or ObjectId as a shard key, as new +sharding* . What you *want* to happen is for the insertions to be +balanced equally among the shards, so you'd like to avoid using something +like a timestamp, sequence number, or ``ObjectId`` as a shard key, as new inserts would tend to cluster around the same values (and thus the same -shard). But what we also *want* to happen is for each of our queries to -be routed to a single shard. Here, we discuss the pros and cons of each +shard). But what you also *want* to happen is for each of your queries to +be routed to a single shard. The following are the pros and cons of each approach. -Option 0: Shard on time -~~~~~~~~~~~~~~~~~~~~~~~ +Option 0: Shard on Time +----------------------- -Although an ObjectId or timestamp might seem to be an attractive +Although an ``ObjectId`` or timestamp might seem to be an attractive sharding key at first, particularly given the right-alignedness of the index, it turns out to provide the worst of all worlds when it comes to -read and write performance. In this case, all of our inserts will always +read and write performance. In this case, all of your inserts will always flow to the same shard, providing no performance benefit write-side from -sharding. Our reads will also tend to cluster in the same shard, so we +sharding. Your reads will also tend to cluster in the same shard, so you would get no performance benefit read-side either. -Option 1: Shard on a random(ish) key +Option 1: Shard On a Random(ish) Key ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Supose instead that we decided to shard on a key with a random -distribution, say the md5 or sha1 hash of the '\_id' field: +Suppose instead that you decided to shard on a key with a random +distribution, say the md5 or sha1 hash of the ``_id`` field: -:: +.. code-block:: python >>> from bson import Binary >>> from hashlib import sha1 @@ -437,25 +449,25 @@ distribution, say the md5 or sha1 hash of the '\_id' field: ... 
key : { 'ssk' : 1 } }) { "collectionsharded" : "events", "ok" : 1 } -This does introduce some complexity into our application in order to -generate the random key, but it provides us linear scaling on our +This does introduce some complexity into your application in order to +generate the random key, but it provides you linear scaling on your inserts, so 5 shards should yield a 5x speedup in inserting. The downsides to using a random shard key are the following: a) the shard -key's index will tend to take up more space (and we need an index to +key's index will tend to take up more space (and you need an index to determine where to place each new insert), and b) queries (unless they -include our synthetic, random-ish shard key) will need to be distributed -to all our shards in parallel. This may be acceptable, since in our -scenario our write performance is much more important than our read -performance, but we should be aware of the downsides to using a random +include the synthetic, random-ish shard key) will need to be distributed +to all your shards in parallel. This may be acceptable, since in this +scenario write performance is much more important than read +performance, but you should be aware of the downsides to using a random key distribution. -Option 2: Shard on a naturally evenly-distributed key -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Option 2: Shard On a Naturally Evenly-Distributed Key +----------------------------------------------------- -In this case, we might choose to shard on the 'path' attribute, since it +In this case, you might choose to shard on the 'path' attribute, since it seems to be relatively evenly distributed: -:: +.. code-block:: python >>> db.command('shardcollection', 'events', { ... key : { 'path' : 1 } }) @@ -467,29 +479,29 @@ attribute in the query). There is a potential downside to this approach, however, particularly in the case where there are a limited number of distinct values for the path. In that case, you can end up with large shard 'chunks' that cannot be split or rebalanced because they contain -only a single shard key. The rule of thumb here is that we should not +only a single shard key. The rule of thumb here is that you should not pick a shard key which allows large numbers of documents to have the same shard key since this prevents rebalancing. -Option 3: Combine a natural and synthetic key -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Option 3: Combine a Natural and Synthetic Key +--------------------------------------------- This approach is perhaps the best combination of read and write -performance for our application. We can define the shard key to be +performance for the application. You can define the shard key to be (path, sha1(\_id)): -:: +.. code-block:: python >>> db.command('shardcollection', 'events', { ... key : { 'path' : 1, 'ssk': 1 } }) { "collectionsharded" : "events", "ok" : 1 } -We still need to calculate a synthetic key in the application client, -but in return we get good write balancing as well as good read +You still need to calculate a synthetic key in the application client, +but in return you get good write balancing as well as good read selectivity. 
-Sharding conclusion: Test with your own data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Sharding Conclusion: Test With Your Own Data +-------------------------------------------- Picking a good shard key is unfortunately still one of those decisions that is simultaneously difficult to make, high-impact, and difficult to @@ -501,7 +513,7 @@ approach is to analyze the actual insertions and queries you are using in your own application. Variation: Capped Collections ------------------------------ +============================= One variation that you may want to consider based on your data retention requirements is whether you might be able to use a `capped @@ -516,39 +528,39 @@ collections (the default) will persist documents until they are explicitly removed from the collection or the collection is dropped. Appendix: Managing Event Data Growth ------------------------------------- +==================================== MongoDB databases, in the course of normal operation, never relinquish disk space back to the file system. This can create difficulties if you -don't manage the size of your databases up front. For event data, we +don't manage the size of your databases up front. For event data, you have a few options for managing the data growth: Single Collection -~~~~~~~~~~~~~~~~~ +----------------- This is the simplest option: keep all events in a single collection, -periodically removing documents that we don't need any more. The +periodically removing documents that you don't need any more. The advantage of simplicity, however, is offset by some performance -considerations. First, when we execute our remove, MongoDB will actually +considerations. First, when you execute your remove, MongoDB will actually bring the documents being removed into memory. Since these are documents -that presumably we haven't touched in a while (that's why we're deleting +that presumably you haven't touched in a while (that's why you're deleting them), this will force more relevant data to be flushed out to disk. -Second, in order to do a reasonably fast remove operation, we probably +Second, in order to do a reasonably fast remove operation, you probably want to keep an index on a timestamp field. This will tend to slow down -our inserts, as the inserts have to update the index as well as write +your inserts, as the inserts have to update the index as well as write the event data. Finally, removing data periodically will also be the option that has the most potential for fragmenting the database, as MongoDB attempts to reuse the space freed by the remove operations for new events. Multiple Collections, Single Database -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------------- -Our next option is to periodically *rename* our event collection, -rotating collections in much the same way we might rotate log files. We +The next option is to periodically *rename* your event collection, +rotating collections in much the same way you might rotate log files. You would then drop the oldest collection from the database. This has several advantages over the single collection approach. First off, -collection renames are both fast and atomic. Secondly, we don't actually +collection renames are both fast and atomic. Secondly, you don't actually have to touch any of the documents to drop a collection. 
Finally, since MongoDB allocates storage in *extents* that are owned by collections, dropping a collection will free up entire extents, mitigating the @@ -558,12 +570,12 @@ current event collection and the previous event collection for any data analysis you perform. Multiple Databases -~~~~~~~~~~~~~~~~~~ +------------------ -In the multiple database option, we take the multiple collection option -a step further. Now, rather than rotating our collections, we will -rotate our databases. At the cost of rather increased complexity both in -insertions and queries, we do gain one benefit: as our databases get +In the multiple database option, you take the multiple collection option +a step further. Now, rather than rotating your collections, you will +rotate your databases. At the cost of rather increased complexity both in +insertions and queries, you do gain one benefit: as your databases get dropped, disk space gets returned to the operating system. This option would only really make sense if you had extremely variable event insertion rates or if you had variable data retention requirements. For From 98e2212f62dab2e151f0be1b1e3ad801ec831caf Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 09:04:24 -0700 Subject: [PATCH 05/20] fix headings on rta: preagg Signed-off-by: Rick Copeland --- ...l-time-analytics-preaggregated-reports.txt | 42 +++++++++---------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt index 08ea26ffc8d..58822863080 100644 --- a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt +++ b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt @@ -1,14 +1,15 @@ +=========================================== Real Time Analytics: Pre-Aggregated Reports =========================================== Problem -------- +======= You have one or more servers generating events for which you want real-time statistical information in a MongoDB collection. Solution overview ------------------ +================= For this solution we will make a few assumptions: @@ -32,7 +33,7 @@ some set of logfile post-processors that can run in order to integrate the statistics. Schema design -------------- +============= There are two important considerations when designing the schema for a real-time analytics system: the ease & speed of updates and the ease & @@ -58,7 +59,7 @@ and discuss the problems with them before finally describing the solution we would like to go with. Design 0: one document per page/day -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +----------------------------------- Our initial approach will be to simply put all the statistics in which we're interested into a single document per page: @@ -94,7 +95,7 @@ end up needing to reallocate these documents multiple times throughout the day, copying the documents to areas with more space. Design #0.5: Preallocate documents -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +---------------------------------- In order to mitigate the repeated copying of documents, we can tweak our approach slightly by adding a process which will preallocate a document @@ -116,12 +117,12 @@ doesn't need to pad the records, leading to a more compact representation and better usage of our memory. 
Design #1: Add intra-document hierarchy -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +--------------------------------------- One thing to be aware of with BSON is that documents are stored as a sequence of (key, value) pairs, *not* as a hash table. What this means for us is that writing to stats.mn.0 is *much* faster than writing to -stats.mn.1439. [c] +stats.mn.1439. .. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyvpgRg8w&rev=1&h=82&w=410&ac=1 :align: center @@ -168,8 +169,9 @@ generally faster. .. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNPVyojcg&rev=21&h=148&w=410&ac=1 :align: center :alt: + Design #2: Create separate documents for different granularities -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +---------------------------------------------------------------- Design #1 is certainly a reasonable design for storing intraday statistics, but what happens when we want to draw a historical chart @@ -182,7 +184,7 @@ more than make up for it. At this point, our document structure is as follows: Daily Statistics -^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~ :: @@ -215,7 +217,7 @@ Daily Statistics } Monthly Statistics -^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~ :: @@ -232,7 +234,7 @@ Monthly Statistics } Operations ----------- +========== In this system, we want balance between read performance and write (upsert) performance. This section will describe each of the major @@ -241,7 +243,7 @@ pymongo MongoDB driver. These operations would be similar in other languages as well. Log a hit to a page -~~~~~~~~~~~~~~~~~~~ +------------------- Logging a hit to a page in our website is the main 'write' activity in our system. In order to maximize performance, we will be doing in-place @@ -293,8 +295,8 @@ without preallocation, we end up with a dynamically growing document, slowing down our upserts significantly as documents are moved in order to grow them. -Preallocate[d] -~~~~~~~~~~~~~~ +Preallocate +----------- In order to keep our documents from growing, we can preallocate them before they are needed. When preallocating, we set all the statistics to @@ -384,7 +386,7 @@ it's used, preventing the midnight spike as well as eliminating the movement of dynamically growing documents. Get data for a real-time chart -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------ One chart that we may be interested in seeing would be the number of hits to a particular page over the last hour. In that case, our query is @@ -424,7 +426,7 @@ day where a) we didn't happen to preallocate that day and b) there were no hits to the document on that day. Index support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ These operations would benefit significantly from indexes on the metadata of the daily statistics: @@ -440,10 +442,9 @@ Note in particular that we indexed on the page first, date second. This allows us to perform the third query above (a single page over a range of days) quite efficiently. Having any compound index on page and date, of course, allows us to look up a single day's statistics efficiently. 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Get data for a historical chart -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------- In order to retrieve daily data for a single month, we can perform the following query: @@ -471,7 +472,7 @@ same trick as above: ... sort=[('metadata.date', 1)]) Index support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ Once again, these operations would benefit significantly from indexes on the metadata of the monthly statistics: @@ -485,10 +486,9 @@ the metadata of the monthly statistics: The order of our index is once again designed to efficiently support range queries for a single page over several months, as above. -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sharding --------- +======== Our performance in this system will be limited by the number of shards in our cluster as well as the choice of our shard key. Our ideal shard From 3f88f1f6bdd9b7c612f7a958a83defcea60dba85 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 09:24:34 -0700 Subject: [PATCH 06/20] First pass at revoicing rta: preagg Signed-off-by: Rick Copeland --- ...l-time-analytics-preaggregated-reports.txt | 201 +++++++++--------- 1 file changed, 99 insertions(+), 102 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt index 58822863080..13e0c0b31cb 100644 --- a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt +++ b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt @@ -11,23 +11,23 @@ real-time statistical information in a MongoDB collection. Solution overview ================= -For this solution we will make a few assumptions: +This solution assumes the following: - There is no need to retain transactional event data in MongoDB, or that retention is handled outside the scope of this use case -- We need statistical data to be up-to-the minute (or up-to-the-second, - if possible) +- You need statistical data to be up-to-the minute (or up-to-the-second, + if possible.) - The queries to retrieve time series of statistical data need to be as fast as possible. -Our general approach is to use upserts and increments to generate the +The general approach is to use upserts and increments to generate the statistics and simple range-based queries and filters to draw the time series charts of the aggregated data. -To help anchor the solution, we will examine a simple scenario where we +To help anchor this solution, it will examine a simple scenario where you want to count the number of hits to a collection of web site at various levels of time-granularity (by minute, hour, day, week, and month) as -well as by path. We will assume that either you have some code that can +well as by path. It is assumed that either you have some code that can run as part of your web app when it is rendering the page, or you have some set of logfile post-processors that can run in order to integrate the statistics. @@ -37,7 +37,7 @@ Schema design There are two important considerations when designing the schema for a real-time analytics system: the ease & speed of updates and the ease & -speed of queries. 
In particular, we want to avoid the following +speed of queries. In particular, you want to avoid the following performance-killing circumstances: - documents changing in size significantly, causing reallocations on @@ -45,26 +45,25 @@ performance-killing circumstances: - queries that require large numbers of disk seeks to be satisfied - document structures that make accessing a particular field slow -One approach we *could* use to make updates easier would be to keep our +One approach you *could* use to make updates easier would be to keep your hit counts in individual documents, one document per minute/hour/day/etc. This approach, however, requires us to query several documents for nontrivial time range queries, slowing down our -queries significantly. In order to keep our queries fast, we will +queries significantly. In order to keep your queries fast, you will instead use somewhat more complex documents, keeping several aggregate values in each document. -In order to illustrate some of the other issues we might encounter, we -will consider several schema designs that yield suboptimal performance -and discuss the problems with them before finally describing the -solution we would like to go with. +In order to illustrate some of the other issues you might encounter, here are +several schema designs that you might try as well as discussion of +the problems with them. Design 0: one document per page/day ----------------------------------- -Our initial approach will be to simply put all the statistics in which -we're interested into a single document per page: +The initial approach will be to simply put all the statistics in which +you're interested into a single document per page: -:: +.. code-block:: javascript { _id: "20101010/site-1/apache_pb.gif", @@ -76,19 +75,19 @@ we're interested into a single document per page: hourly: { "0": 227850, "1": 210231, - … + ... "23": 20457 }, minute: { "0": 3612, "1": 3241, - … + ... "1439": 2819 } } This approach has a couple of advantages: a) it only requires a single update per hit to the website, b) intra-day reports for a single page require fetching only a single document. There are, however, significant -problems with this approach. The biggest problem is that, as we upsert +problems with this approach. The biggest problem is that, as you upsert data into the 'hy' and 'mn' properties, the document grows. Although MongoDB attempts to pad the space required for documents, it will still end up needing to reallocate these documents multiple times throughout @@ -97,16 +96,16 @@ the day, copying the documents to areas with more space. Design #0.5: Preallocate documents ---------------------------------- -In order to mitigate the repeated copying of documents, we can tweak our +In order to mitigate the repeated copying of documents, you can tweak your approach slightly by adding a process which will preallocate a document with initial zeros during the previous day. In order to avoid a -situation where we preallocate documents *en masse* at midnight, we will +situation where you preallocate documents *en masse* at midnight, you will (with a low probability) randomly upsert the next day's document each -time we update the current day's statistics. This requires some tuning; -we'd like to have almost all the documents preallocated by the end of +time you update the current day's statistics. 
This requires some tuning; +you'd like to have almost all the documents preallocated by the end of the day, without spending much time on extraneous upserts (preallocating a document that's already there). A reasonable first guess would be to -look at our average number of hits per day (call it *hits* ) and +look at your average number of hits per day (call it *hits* ) and preallocate with a probability of *1/hits* . Preallocating helps us mainly by ensuring that all the various 'buckets' @@ -114,7 +113,7 @@ are initialized with 0 hits. Once the document is initialized, then, it will never dynamically grow, meaning a) there is no need to perform the reallocations that could slow us down in design #0 and b) MongoDB doesn't need to pad the records, leading to a more compact -representation and better usage of our memory. +representation and better usage of your memory. Design #1: Add intra-document hierarchy --------------------------------------- @@ -127,11 +126,12 @@ stats.mn.1439. .. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyvpgRg8w&rev=1&h=82&w=410&ac=1 :align: center :alt: -In order to speed this up, we can introduce some intra-document -hierarchy. In particular, we can split the 'mn' field up into 24 hourly + +In order to speed this up, you can introduce some intra-document +hierarchy. In particular, you can split the 'mn' field up into 24 hourly fields: -:: +.. code-block:: javascript { _id: "20101010/site-1/apache_pb.gif", @@ -143,27 +143,26 @@ fields: hourly: { "0": 227850, "1": 210231, - … + ... "23": 20457 }, minute: { "0": { "0": 3612, "1": 3241, - … + ... "59": 2130 }, "1": { - "60": … , - + "60": ... , }, - … + ... "23": { - … + ... "1439": 2819 } } } This allows MongoDB to "skip forward" when updating the minute -statistics later in the day, making our performance more uniform and +statistics later in the day, making your performance more uniform and generally faster. .. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNPVyojcg&rev=21&h=148&w=410&ac=1 @@ -174,19 +173,19 @@ Design #2: Create separate documents for different granularities ---------------------------------------------------------------- Design #1 is certainly a reasonable design for storing intraday -statistics, but what happens when we want to draw a historical chart -over a month or two? In that case, we need to fetch 30+ individual +statistics, but what happens when you want to draw a historical chart +over a month or two? In that case, you need to fetch 30+ individual documents containing or daily statistics. A better approach would be to store daily statistics in a separate document, aggregated to the month. This does introduce a second upsert to the statistics generation side of -our system, but the reduction in disk seeks on the query side should -more than make up for it. At this point, our document structure is as +your system, but the reduction in disk seeks on the query side should +more than make up for it. At this point, your document structure is as follows: Daily Statistics ~~~~~~~~~~~~~~~~ -:: +.. code-block:: javascript { _id: "20101010/site-1/apache_pb.gif", @@ -197,21 +196,19 @@ Daily Statistics hourly: { "0": 227850, "1": 210231, - … + ... "23": 20457 }, minute: { "0": { "0": 3612, "1": 3241, - … - "59": 2130 }, + ... + "59": 2130 }, "1": { - "0": … , - + "0": ..., }, - … + ... "23": { - … "59": 2819 } } } @@ -219,7 +216,7 @@ Daily Statistics Monthly Statistics ~~~~~~~~~~~~~~~~~~ -:: +.. 
code-block:: javascript { _id: "201010/site-1/apache_pb.gif", @@ -230,26 +227,26 @@ Monthly Statistics daily: { "1": 5445326, "2": 5214121, - … } + ... } } Operations ========== -In this system, we want balance between read performance and write +In this system, you want a balance between read performance and write (upsert) performance. This section will describe each of the major -operations we perform, using the Python programming language and the +operations you perform, using the Python programming language and the pymongo MongoDB driver. These operations would be similar in other languages as well. Log a hit to a page ------------------- -Logging a hit to a page in our website is the main 'write' activity in -our system. In order to maximize performance, we will be doing in-place +Logging a hit to a page in your website is the main 'write' activity in +your system. In order to maximize performance, you will be doing in-place updates with the upsert operation: -:: +.. code-block:: python from datetime import datetime, time @@ -287,24 +284,24 @@ updates with the upsert operation: 'daily.%d' % day_of_month: 1} } db.stats.monthly.update(query, update, upsert=True) -Since we are using the upsert operation, this function will perform +Since you're using the upsert operation, this function will perform correctly whether the document is already present or not, which is -important, as our preallocation (the next operation) will only +important, as your preallocation (the next operation) will only preallocate documents with a high probability. Note however, that -without preallocation, we end up with a dynamically growing document, -slowing down our upserts significantly as documents are moved in order +without preallocation, you end up with a dynamically growing document, +slowing down your upserts significantly as documents are moved in order to grow them. Preallocate ----------- -In order to keep our documents from growing, we can preallocate them -before they are needed. When preallocating, we set all the statistics to +In order to keep your documents from growing, you can preallocate them +before they are needed. When preallocating, you set all the statistics to zero for all time periods so that later, the document doesn't need to -grow to accomodate the upserts. Here, we add this preallocation as its -own function: +grow to accommodate the upserts. Here, you add this preallocation as its +own function: -:: +.. code-block:: python def preallocate(db, dt_utc, site, page): @@ -349,23 +346,23 @@ own function: { '$set': { 'm': monthly_metadata }}, upsert=True) -In this case, note that we went ahead and preallocated the monthly -document while we were preallocating the daily document. While we could +In this case, note that you went ahead and preallocated the monthly +document while you were preallocating the daily document. While you could have split this into its own function and preallocated monthly documents less frequently that daily documents, the performance difference is -negligible, so we opted to simply combine monthly preallocation with +negligible, so you opted to simply combine monthly preallocation with daily preallocation. -The next question we must answer is when we should preallocate. We would +The next question you must answer is when you should preallocate.
You would like to have a high likelihood of the document being preallocated before -it is needed, but we don't want to preallocate all at once (say at -midnight) to ensure we don't create a spike in activity and a -corresponding increase in latency. Our solution here is to -probabilistically preallocate each time we log a hit, with a probability +it is needed, but you don't want to preallocate all at once (say at +midnight) to ensure you don't create a spike in activity and a +corresponding increase in latency. Your solution here is to +probabilistically preallocate each time you log a hit, with a probability tuned to make preallocation likely without performing too many unnecessary calls to preallocate: -:: +.. code-block:: python from random import random from datetime import datetime, timedelta, time @@ -381,36 +378,36 @@ unnecessary calls to preallocate: # Update daily stats doc … -Now with a high probability, we will preallocate each document before +Now with a high probability, you will preallocate each document before it's used, preventing the midnight spike as well as eliminating the movement of dynamically growing documents. Get data for a real-time chart ------------------------------ -One chart that we may be interested in seeing would be the number of -hits to a particular page over the last hour. In that case, our query is +One chart that you may be interested in seeing would be the number of +hits to a particular page over the last hour. In that case, your query is fairly straightforward: -:: +.. code-block:: python >>>``db.stats.daily.find_one( ... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}}, ... { 'minute': 1 }) -Likewise, we can get the number of hits to a page over the last day, +Likewise, you can get the number of hits to a page over the last day, with hourly granularity: -:: +.. code-block:: python >>> db.stats.daily.find_one( ... {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}}, ... { 'hy': 1 }) -If we want a few days' worth of hourly data, we can get it using the +If you want a few days' worth of hourly data, you can get it using the following query: -:: +.. code-block:: python >>> db.stats.daily.find( ... { @@ -420,9 +417,9 @@ following query: ... { 'metadata.date': 1, 'hourly': 1 } }, ... sort=[('metadata.date', 1)]) -In this case, we are retrieving the date along with the statistics since -it's possible (though highly unlikely) that we could have a gap of one -day where a) we didn't happen to preallocate that day and b) there were +In this case, you are retrieving the date along with the statistics since +it's possible (though highly unlikely) that you could have a gap of one +day where a) you didn't happen to preallocate that day and b) there were no hits to the document on that day. Index support @@ -431,14 +428,14 @@ Index support These operations would benefit significantly from indexes on the metadata of the daily statistics: -:: +.. code-block:: python >>> db.stats.daily.ensure_index([ ... ('metadata.site', 1), ... ('metadata.page', 1), ... ('metadata.date', 1)]) -Note in particular that we indexed on the page first, date second. This +Note in particular that you indexed on the page first, date second. This allows us to perform the third query above (a single page over a range of days) quite efficiently. Having any compound index on page and date, of course, allows us to look up a single day's statistics efficiently. @@ -446,10 +443,10 @@ of course, allows us to look up a single day's statistics efficiently. 
Get data for a historical chart ------------------------------- -In order to retrieve daily data for a single month, we can perform the +In order to retrieve daily data for a single month, you can perform the following query: -:: +.. code-block:: python >>> db.stats.monthly.find_one( ... {'metadata': @@ -458,10 +455,10 @@ following query: ... 'page':'/foo.gif'}}, ... { 'daily': 1 }) -If we want several months' worth of daily data, of course, we can do the +If you want several months' worth of daily data, of course, you can do the same trick as above: -:: +.. code-block:: python >>> db.stats.monthly.find( ... { @@ -477,28 +474,28 @@ same trick as above: ... sort=[('metadata.date', 1)]) Index support ~~~~~~~~~~~~~ Once again, these operations would benefit significantly from indexes on the metadata of the monthly statistics: -:: +.. code-block:: python >>> db.stats.monthly.ensure_index([ ... ('metadata.site', 1), ... ('metadata.page', 1), ... ('metadata.date', 1)]) -The order of our index is once again designed to efficiently support +The order of your index is once again designed to efficiently support range queries for a single page over several months, as above. Sharding ======== -Our performance in this system will be limited by the number of shards -in our cluster as well as the choice of our shard key. Our ideal shard -key will balance upserts between our shards evenly while routing any +Your performance in this system will be limited by the number of shards +in your cluster as well as the choice of your shard key. Your ideal shard +key will balance upserts between your shards evenly while routing any individual query to a single shard (or a small number of shards). A reasonable shard key for us would thus be ('metadata.site', -'metadata.page'), the site-page combination for which we are calculating +'metadata.page'), the site-page combination for which you are calculating statistics: -:: +.. code-block:: python >>> db.command('shardcollection', 'stats.daily', { ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) @@ -507,19 +504,19 @@ statistics: ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) { "collectionsharded" : "stats.monthly", "ok" : 1 } -One downside to using ('metadata.site', 'metadata.page') as our shard -key is that, if one page dominates all our traffic, all updates to that +One downside to using ('metadata.site', 'metadata.page') as your shard +key is that, if one page dominates all your traffic, all updates to that page will go to a single shard. The problem, however, is largely unavoidable, since all update for a single page are going to a single *document.* -We also have the problem using only ('metadata.site', 'metadata.page') -shard key that, if a high percentage of our queries go to the same page, +You also have the problem using only ('metadata.site', 'metadata.page') +shard key that, if a high percentage of your queries go to the same page, these will all be handled by the same shard. A (slightly) better shard -key would the include the date as well as the site/page so that we could +key would then include the date as well as the site/page so that you could serve different historical ranges with different shards: -:: +.. code-block:: python >>> db.command('shardcollection', 'stats.daily', { ...
key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}}) @@ -530,7 +527,7 @@ serve different historical ranges with different shards: It is worth noting in this discussion of sharding that, depending on the number of sites/pages you are tracking and the number of hits per page, -we are talking about a fairly small set of data with modest performance +you're talking about a fairly small set of data with modest performance requirements, so sharding may be overkill. In the case of the MongoDB Monitoring Service (MMS), a single shard is able to keep up with the totality of traffic generated by all the customers using this (free) From 4ef90f0253a485cd984720156bdd791e874a2dd9 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 09:53:40 -0700 Subject: [PATCH 07/20] Make images local Signed-off-by: Rick Copeland --- .../usecase/ecommerce-category-hierarchy.txt | 21 ++++++++++++------ .../ecommerce-inventory-management.txt | 3 ++- .../usecase/img/ecommerce-category1.png | Bin 0 -> 7514 bytes .../usecase/img/ecommerce-category2.png | Bin 0 -> 9004 bytes .../usecase/img/ecommerce-category3.png | Bin 0 -> 9131 bytes .../usecase/img/ecommerce-category4.png | Bin 0 -> 9404 bytes .../usecase/img/ecommerce-inventory1.png | Bin 0 -> 15587 bytes .../tutorial/usecase/img/rta-hierarchy1.png | Bin 0 -> 8351 bytes source/tutorial/usecase/img/rta-preagg1.png | Bin 0 -> 3464 bytes source/tutorial/usecase/img/rta-preagg2.png | Bin 0 -> 5874 bytes ...ime-analytics-hierarchical-aggregation.txt | 3 +-- ...l-time-analytics-preaggregated-reports.txt | 11 +++++++-- 12 files changed, 26 insertions(+), 12 deletions(-) create mode 100644 source/tutorial/usecase/img/ecommerce-category1.png create mode 100644 source/tutorial/usecase/img/ecommerce-category2.png create mode 100644 source/tutorial/usecase/img/ecommerce-category3.png create mode 100644 source/tutorial/usecase/img/ecommerce-category4.png create mode 100644 source/tutorial/usecase/img/ecommerce-inventory1.png create mode 100644 source/tutorial/usecase/img/rta-hierarchy1.png create mode 100644 source/tutorial/usecase/img/rta-preagg1.png create mode 100644 source/tutorial/usecase/img/rta-preagg2.png diff --git a/source/tutorial/usecase/ecommerce-category-hierarchy.txt b/source/tutorial/usecase/ecommerce-category-hierarchy.txt index a45fb4b48b8..2766ee18e44 100644 --- a/source/tutorial/usecase/ecommerce-category-hierarchy.txt +++ b/source/tutorial/usecase/ecommerce-category-hierarchy.txt @@ -14,9 +14,10 @@ We will keep each category in its own document, along with a list of its ancestors. The category hierarchy we will use in this solution will be based on different categories of music: -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sYoXu6LHwYVB_WXz1Y_k8XA&rev=27&h=250&w=443&ac=1 +.. figure:: img/ecommerce-category1.png :align: center :alt: + Since categories change relatively infrequently, we will focus mostly in this solution on the operations needed to keep the hierarchy up-to-date and less on the performance aspects of updating the hierarchy. @@ -84,7 +85,11 @@ Add a category to the hierarchy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Adding a category to a hierarchy is relatively simple. Suppose we wish -to add a new category 'Swing' as a child of 'Ragtime': |image0| +to add a new category 'Swing' as a child of 'Ragtime': + +.. 
figure:: img/ecommerce-category2.png + :align: center + :alt: In this case, the initial insert is simple enough, but after this insert, we are still missing the ancestors array in the 'Swing' @@ -126,9 +131,10 @@ Change the ancestry of a category Our goal here is to reorganize the hierarchy by moving 'bop' under 'swing': -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sFB8ph8n7c768f-MLTOkY-w&rev=6&h=354&w=443&ac=1 +.. figure:: img/ecommerce-category3.png :align: center :alt: + The initial update is straightforward: :: @@ -183,7 +189,11 @@ Renaming a category would normally be an extremely quick operation, but in this case due to our denormalization, we also need to update the descendants. Here, we will rename 'Bop' to 'BeBop': -|image1| First, we need to update the category name itself: +.. figure:: img/ecommerce-category4.png + :align: center + :alt: + +First, we need to update the category name itself: :: @@ -226,6 +236,3 @@ the category collection would then be the following: Note that there is no need to specify the shard key, as MongoDB will default to using \_id as a shard key. - -.. |image0| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sRXRjZMEZDN2azKBlsOoXoA&rev=7&h=250&w=443&ac=1 -.. |image1| image:: https://docs.google.com/a/arborian.com/drawings/image?id=sqRIXKA2lGr_bm5ysM7KWQA&rev=3&h=354&w=443&ac=1 diff --git a/source/tutorial/usecase/ecommerce-inventory-management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt index 7a53bb88222..637911a6a8c 100644 --- a/source/tutorial/usecase/ecommerce-inventory-management.txt +++ b/source/tutorial/usecase/ecommerce-inventory-management.txt @@ -24,9 +24,10 @@ for a certain period of time, all the items in the cart once again become part of available inventory and the cart is cleared. The state transition diagram for a shopping cart is below: -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sDw93URlN8GCsdNpACSXCVA&rev=76&h=186&w=578&ac=1 +.. 
figure:: img/ecommerce-inventory1.png :align: center :alt: + Schema design ------------- diff --git a/source/tutorial/usecase/img/ecommerce-category1.png b/source/tutorial/usecase/img/ecommerce-category1.png new file mode 100644 index 0000000000000000000000000000000000000000..0483e3785be91b10becb88a952530464a8000aeb GIT binary patch literal 7514 zcma)hcUV(Tx2}j_0mKL>y@Xx_X@Z2J5Sl;~5a}QSDj*#~2SvdkT{@vi?;xQQiU>#v zy^3@~Pax8yle_VE?s?94zWdyB{@C+mubDM_&6+jqeP<=)2~3sh8soJK7cNk#t3h=x zTp&RMzYnjF18q|TviHIT77lgj!>69VHc|sRt*jfrQCG%hkA4n$X1w3dh0e`Heg1Jv z(D+SdrVSIiJfj`VbXAz)52n)zbbN;=wEIzradGDK#u1agZ`bB6jqY1?O!Rc4InV3x zY#44~7sAwGN-6;--tW`bb8NYsnk1f1?dvsr-*q`fd_{KUP`0^GT$w}6Wb?Hu=B{^J zIlRZS2+_$UZuinbh<(fl_}zwa=vR7NeVw>eYvI=ZOJg<<4~0%Jbi@9KHZ~O{9-E5b zgD6e#LqOW_U=+=NwxAmS{j~161J;8s*OK?DXQC*}$wrR=+~oerlZg-nA9_i7@Rj0% zeu>fO(eA5e*x@<%6S#_2~r;nu^P>8imaZQJtl@&P*=9@R6=^GZwJlhx+4-hbK^VkPg{ zMijRa^U~nnb?Jig)%FfvW)+tA%hHJ(v)#>aB2QIIcGi@d9qrHB{0UmW%AHbu@t1LUgUj0GsrkJeUwst@Pr5zkB&!}a zU7mISqaUzYm;sh;iUGFR<|}^v;-I#;hO(+J;Y?Kkd8NQ~tIi<(7c=;(XpUA0Cfd}D zLg~3s|GqAlU${ihcKy0Gd9qkhbrBF&j9d^~N#@6GV~;oa>@*~Hv^fiprZN&D;&J}Y z;~w4q(%FH`!vqzr!qN~4`yVF7c$b1vxy-10MP_?jc9`)_s?<4xplO01B9|zccZAT<27U2&m!so_ijgZ=2HC)5 zaR1;EDl}Oi;WmJxI81XrQ+D+W7yS}!xgv)YRfaMOKq9hcqkPgdEE{)8?2~y4azN!N z$snAbnPM&0nrw>6rbh3kOafk4$EBq;~2qR@pvk`7x_pzEAMue-Z-?o#s3psSY^t zSy+oex7PU_cA9N`FkARqL+p9huzm#`m;X%*B$*66G`k#SnR{Aq-g7TkYGI!gNEMng zKA~3YLaq6Tgvl*oM>HS@^J}VD&o-**#w({vvX_L5a z0|HK%b8Wu=k2j5ZDnQ3KGUZd!Q6E`3V5&=-s~l)-dx+{yK%XaGZ$sK>H%%CqmBs@C zjJij6x)MA^u(=6}R)6U}C`Y|0A`>{7EabX=7X-4Qtq97x3w#N15JL0bRzs?8PzXZJ z@=#Y5E}nJ`1SD557<`_V|Bg6!83@REfa?JXkcfdy{NE)Az`6fT?cXK;uJ-SezY`hC zt3pu58wEE&(YZtG9}qx>{RrP!&|0+4R^FP&e`+GMMC6~cgD|MdSO>_2OuvS4A zy-L{)P?GP&hS#ckYvOHSy^P%G)`0sC^gD{@p>vKo|C{#ziA$srZ(tpDo}-0&B^7=0 zdG6Cx>5eSkYx!dTgs?f>5qZDDy}hHlxBWatV>}6`O!C=1=P-xvN?-}Z%=*knb{G*1 zJq!=eB3X$AihmW8c_9nSlLVpJ<~{WOQ~y7IA?Yl)th&jA$eF`Dyq|t>Na%JyAihhN z)h|76H=VVS17PSc$T6`^oFmBg>1l;X-yI(Sc`Ah)4u)I~ac0Eev3035qceq_2oHG4 z{#w6(or!;MM5?P+PMO@{B1+-L7J$4U>zYiIlG(`2(1W)eMZ4k>_Fy4fKW${z%a5Hi zsO1e;QX@rp=WhTT;xX4*4^?rf@Pq^mVR}Ci*gkD|7E?IGLwli(GHgJ+E*4YcK0Mx} z-Q1SMAW1@Ybt;y5D$_{_oCP)v*G6?CU*K$*CqhJY7;B*F$X@~ju&VQ2vHZo>c_Go} za7!PMcK1Cwci#WpX5uNWfrV#tpo);v6N#IE+7PXo@X-l#gy`NDss_Li88-WlJ5=FH z8wUhst99Z;1n=Y)6mr&t=s$IYT)%f9>#%zd1Y$71C!}bgVAsLFAw3YHD}%Xpbm*Su z0Cdo)dS6*5$^@`EbQHpi`OR#`mWVMjgdE(nc1PJ>Fj!0=YP3mS5`~6jzEADiMnW*2 zj*rvr@wkyoKx%sy=23->fHUJb3mmP}At?>}I)7$1{Hom>Sbz?b$IS=?C57#C{c3E% z3xD$|aXOb@h8&&#r%n*TF^dP7cc1(m)NR9nG>BOUTY$!3%`1$?rq%D@rZRo#Ppdy| z-~|ei(yrCC(~))!g3(K+hW6WpRK|Hmkr)I*h|JdXSZwPq{n8PY~Fd*kP1M8K_Szjx_HAQX_cMcuVN$o}^IaiKcEgL*~kvZyQmdIO;dEJaw=}eM69iKLr$IYuPeYB^o1i)ab z#V}a-ozzmh#MYxSp`-qY?H~@kN3OJ}CpX<4$#F-zGD>d%je^UYqzCuo$2?b&mfyPW zvi+PfaO)>!(8{T)!v2E2pz^$$P&9|`=y+S}@UEn(7{{!|ly!p69`3p* z+e%oMHq`3tK-e?MPf;twJodEnTwTIn?sX`wH4D{WuKfMGe#iL}LP-3*{lbn#3?G`- zc0hrzeg(S-5jCAaYE}aq`JRMVB4&vc_8MmQy@xGE82x43Hzgr4Oc`33aX&1*z~(3= zLMz(Tek*btMLL1(q*=n~028RZwr`DkpX9=xuV{-S8B@E#Y zcWcgy6K7{-VAKbp2ug0!sb^2Ar6(km8RjH9g>xLZN&w5$ve55;+Ulk|3+&F0oX(ax6k^Y4FjJFE?uW1u95zA3u|8MZt`7_At4GFL#7^%DtG zhwa@Fms;1f;<~4;oY!_&a(A4EI&i)zU{q$0n^;!XOBD^UzoTPgUZ$f(p@Te(LKiG1 zanD^Wxgp8M*g`x8#Uy+$CbaWO0g~wSc>XPob4@8@X2arhtg7>-q7UB zq_979)15zkpF{~dOEza zS6K%XErt)JmC3J=!aN_`V{=*{MRmiT#hdLor!B033}ttz?zRuU%S02NS39k9TJvTF zYxknBtzc;l#a_ADzFE_9YjCBxKL?%YI-q-0!!%szLO=JYYJuU4As?4>d##4aN$9$ZM%XJx?)P3{03@4<&YpWrJ6qeji zEi-;%^jI0|3^PiSS^xFtwB(L8Z*Sh_xQb_lV@aNB^4qLx=!Ck_Y6&tjW_QjvMa=oP zvEkucn)D8UYPM(s`z4k!K;HkB=9SS6YhDWOW$cOavq_&qCwUlyIK9*NHViQM*D8w> 
zz*BE^|ALw6NV4U5g@hGLROjSMz;kj$wc#kr*W54Ka!a6MV)RZfKB58Nt}4W@zopTJ zi@h#9B6IWd|ZKy%=;R^W08<@rUR1UMGGG41>b`>E&+?)*GVptdbfno|2uW@R}7j z?)s_lo-OrM7_FGrj=2s-aFq)cO*kyG5O#eTcNRv5VUBky3=@n@aieATyLKib0_64s z#faM`DF}f7Nc(-jLSMoUL^`M}%_twif$|7zf1apF>A}zHcb%Je>}EC(C6?QPLdV~t zN(O-AujtPKKlq&QoQ!;mtJWXce@YGE5+<&TP48*w zH%9P6R(2Njik?Xp-jyssg;tIV{kt0e)N7R zL(vnG*2&TlAF`qHhXGv?n6|_=DP8Ot!JWp45n$tYASCOH6f2%vaHM55N=rTf6vwPp+ zZFfnjC!hT1c>Afe;Cjchz}4Oey}y}m{0d8lTSTKOn9Zup3pjY6hxu1cx! zjAZg5qD%Yy_YbXQ`i!q1n4K4_mSjwHkCr?Ms*%J>&GE`NQLJ_M%%}EJtuYZBJO2J9 zb-Zdyc)_9DvkJXm4}>O{Kw*FK;mez&Y}JH@?j})^ zHLhNM1&L-*A@6WzcO2t%4c!dZzT(Jeb^GwD3)nuI2(_@r@p{A(22sixiFV(s<{9&X z<|VYIcdHy@_hk-Q(z%IDAMk1fAteW9@prMDsr!r%+Fuz&D*DO%7NlwtpA98M=gz8h zMG))0dFk>nTxF1#a!y-o3(^#ken|CkX{u)p=iA1cj&Zzrld&Tz;ll!0^*c*8UX80j zjkC0Qtw_%C))+78(#bC46wTzruzQxm{x=@V-xaVm_i7;lLEi}|NJXn~nmV_-_6`r- z?Zt&gvy97$P*sG51zt3*m$wPqKfk!x-M0y}-%1^#jIhsHi^eYQOl5UDINMN&>YXaW zjlM>nd^o<}>jTzWLM{CI?U%pNF7`($`4QUY%C#37Hn3jn+m^hA*9?VlOIL=dZIBI; z3Qd)JcRtkWRMkg!l@{{HAO*AIFcB0Qx|-A9XoM;K%%fKeALVexpYU{wsMLpBbPC&k zbw4itrJ8(~Z)0>Ll=8mgO>Ov&I9Ia&dhkcp9c`nEtdXSsqV3lgaa&2dQ;THML;Aw) z?PKHyA9vGVrcF@@g${`;bqpix3f0jAY|TZ^T}2&l_cs?^o(Z(+41|;>PbR|H2hBu$LwU4f8G3iySh+-R#5J&BBsG<#fZ;$ST^F<%M{$Y14v0<4uQeshxMDpQH27+?BX3Beuu82z-{7ApgHaX`@ z;i9y)Z@g(Ikxu3M4I}Yq!daf*?=y_g&8c@mb`gasbGHIZNb2oP3)S>T>~y-XB=XU= zpRJ35#zKAA$8reyn4Ta$2f>3#2$QF`I0+C-3VEh2>rb|L3oXa2BeULmci`+N2tGgRn9Hm{27 zA;E1e%G{#u?FPm8!bdG$Ds+>}Xcc3*<7_w6s>mQN%17RBB8}8dE8d}ulko>k3|DO8 zbndKu2(uMMWf8b%_1{?DM`@&M78Y^pF> zv`*Vc(lR#$2%ERdX+Hx-xD4aX4_5a-RWAm03Lnu9+iwLVYjd>n1+9ohG-bH4QK-{* zhJR*X;Y~)xPS080mhzn`yFvD~(!A)=qrt_el49|i7B^6#n!y`wt3CBEg@)8A84;AF zg*LK7J6|atcW=fmiAv`b9!Boo3R+v8EPAB7NOSsfl!Hy{a$+`FS(NDrtGcn*49z}N z0G%(r%Sgid-nMYN`3FY$=tgFC)HA;wOiZ>89lM+~Lejm7_6Je*XM;}_ z-E}^_dvfRwjm5;VFfd&VeP=3^HH+xnzbzR^-$1*6p=wmtwEc?K<^19l{6bQfEaB14 z<)cMUmT_rEtxp%9mXX-f!aI$Jc0pt(wIY^~L#(8mLi@METGbaPWQU%n87Ye9%sYH} zwMP7# zMMGPV83M0Z&&+u>Jh$}qtH@k-TW=xD<2OLqf9u+PVVZnL{af^xpD;afvJOUWZ?U36 z_rjx)*y#nU+hn5hm~faDZ}L>#4B!*B;C;S0S8R+5_WnRnXM`Kl%-EJB_N!&oqhIMv z;M`r0gwks<3Bf4(v!xIg3Bt+jZ+Z8KT<2mU?PMagDZ=X_7vTmc2U z2K++ilAn$qDOoK8rzmtFK1Kg{=VSL_=-Zz!>jQ4;$e6#9)d>1f`&OIO_;otVEO#~? zOYdQn`%TrOZ?Ew&1P1CZ-rL#t4unN7)d~kYw?0PVqm1QAE{sPaV=7h2L0aRDw$D3x zw3tkg?Gl#H$ku|s8;;h6T{Nqg--mfC1mSD<)r!hl)0sv;d>a;qT${If$hUpvTBBD9 zKP#-i%vrumMNgM^LGv<_J*#0f?2$6XGIB#&S7|Vx zQ!REwth@T|9zprh9Wq3E6;r%nFuvL+i@|RpeWTVYWX(W*HuqB^kg)9757IAAm%UN=l@MOUYXVMw zqVqE!d*1XmuHQn>;R*Q?lqST!C6oc$e>wGX@)f?U8|TsNUtnK{KreHQjOpH__Jg!^ zB^P*(viTo1wMdW#kqz&C2SyJ@;Hz>w)8q}sA8CO-%1h+Bo-x;=ajVsUao;F4N3W_+ zu3UOc4hBfKviK-ejD*2(vg=u55@lRR3|=J}_{JTs(yKmM{`q>79$>lxBf!fW|;!8!W8`ca`?E5~$fk@l&s+H-An}6B+Kg-&roy@zwp<5mAJX1;j8J)wJ(cOGUX z%$*2od3L&rR(&0b6Aa7KLm#Dj4^yigMy42zM%x~~!Ee}ot3J#;l~OIIaQy6d|Ci3- zuL`6QZeDCJ9|vRH}-lN8%)Sh=+G^M zXu=o#pUpl{xm7{>E(U~!c4tfBJY!@eO-?oh zN+$yWC#=RNaAY=jv(d?%Nq*nFEE_|db=T5#juBs~e-NV$(zb=f?d|>HTH5{sIH**c zjF($&5+Ay9uwH*wA`4e*Ay3;ungYWZRaH?8=v{XjtzHIVlQZj0L{*+tu3!r?dRiju zqKwe6@$|ulE&4enl!QSwfQw$J`^7gVD$c*Etw|PApYXf~juBH~UAU-`DToh?vaxBI zyR0=^pVDL26Mu2npwaE;XY-($VaS= z)9wi>Oz2(sXm_NFWdx;2V(*1rV7*?{u-ho(YzYdwvx=T zhZXbnlgnNrlOcLNVJH(k`c7K4DHZU7=4ikwh^;iSjjw|OW2V@)_BaqgUV0j=#6i+q5Q)B$BYqK64>I{+p_W^T6E#y%ixfsRq;>^Ak3lr z<7ctHQhtTOGAZSoU~&A{!hq$W(xZC>?!}>oZWD%(kttX2>i9^Mc8y;S`;VU&hy!KL z)J!$qw(=m*RcgF$%kttAj%O#*5kI-Menq0%D?J^+4-XQTIcl~E%CMar2%zOv)(l;$ z7?7>JdD%zTXN92Z24_g^(#4l@P^e>=w^8it<*tset3wi&DRsY5e-^T20@gAL(W_3( zpoF(AE9}?1YGg9HHC{+E!y*l(g z{HUO)WdmBwWBozKJd+s&vh*&znflVe-DS5cAGfJ!tDu|3? 
z5N(=1F|ohBe%01N-m(>Lv!GDzFtwIhYk!|?&albW+JgC5a?7Z{xzVaP!1HP<8aMJK zJIDnT*Qq!;01H?VRT>C|$^Qfta?F9$FIvC)Qi$ayElFytmvv|61*vs*=DXaeh?O7`*H0K=beNzu@LJ3lQ zPozsdq;&o$SQ!K|BDZpTZ}0LAZjw;#gEbAwfu~*nQscUDK6qR9pKAEA@+OTu4Inz- ziTy59h#DJTIe?eB6;#*)Vd%;jGgppWX4UY(j~y4<*fTQ#9EZ-!1=6$BMH@==RD8Kj zF%#>0b%qSW?&*5dq2^~uv0a0Vz&fjQa{lJN%MW|gO4^pbT!e2^UV(SmdiKFwxxgM% z$Q{;?EPt25kzJFsa)Di2HjA?q<%Z7prXK#H_{UVB7d=O&+|vawY8BTvere?W7@w}# z!PF`O_p43XL&-LsTk))b&%P3i2Rk9aRbECnFygq}t>_#tB~S44r^uB4UL-VPp}6y9 zHKC-fNGyHzA1PFF{)ZW35Q=R*wEt+wqK4;g_z;pcZ}TeE&PwJV+n8s_%jBU%4UapW z^-Yujz!io)y#<^ISANUSe}3PepoCZVVDso*{FuCGJOCyL{4^GWR-1EV`x&y^+Byob z!EHJVCV!u=dr-kl_md76BL8$w>Gl7f7-5 z3-9BwiMN1BysZD|X{H*`6bF;H0eC^s_KEMcC}+TNEoW6I0KTnlE#SOr3MkPI|Ei-& zEuJya>Ug0T*qz%k5C}K~^UNoyb2=PZK_8+Gqk*jjFRB#g{;a1f z&Fzc)EA-`Ky$~0Qe?BM9>r=djO*B)r<=*pO0Qb+Vj^6(*46@X;pgVqTHZ+ZcuLu^k z!OMd~eU}!qWGo-_Es|53tTipO5@Il0C1G(s%<^uA3Qyp-qD2s&>c}y80QcGT1&}2I zx!KIfPg}}xD~6GmL=Z(5fo7z39)l-V@f75Fg90xpNV8juT#K>N8#N~XP3=Ic|C`T& zR8PtKlw?o21%M3Cq|kl-E`{)S{yzsM! z&rU&XwoFEYR!d8{;k`Ww^4ccNjh~s3|}xEQGAcV}7^?9p{&S79IbDz$86` z+`I1?2#wx9eD)ppev{^E?8sHoXHE-OawASY=bOJjXmW2jP{z4;;o>eEjO<{t4Xswr=Vr`o#qYNwMn3Gdi;PW6p>L1mafk$}@9jj>2bFA(d0KbWx=+tKMMN73 z9y}b6Lbp>zj?Siyn;r8YI-Pfth5@|xvNTkE-qx?Amwv@^#dxUo;m?&6yvcMtXUmU#C-dnk`$TBPV-*XQj5DDYG_%SI3WT-dZY>>r6@% zbGB5uSin6T8%THQ_DO_{8kRBTHa0tS>N|ZlGR8IXd_?Qi()f{!C3k3+BHg_{PCXjF zgP9WFw%iImsyvv?7>u=0+xlfpsaj>!&ERu8_Bn-yar>|Flatpce3R%_PtUSl7|?KN zEo;SPWL|`n`)PuvF#61Gz$=nol6L)J@3)r6!Xs~d>90d-1&4j=erevta(G}n!{QT^ zU}_Qk^oHN1@Q7o9ClVuZMRha9S6O8<*V-KTpKutgGTgG*O&FqFSfk!34xZ zVG&CS5jUmS&sg7cf$}SP1dLi!Lxs?r&LQy$*3jQ!#K|+U{v5}eNzPqoZz85|AovU` zd6(WLIdE)I_ng||FE>$!-pKqEP19K_QD%sBaMb3deUX}J2TcGfMokAiTh?1?mdzQL zO?jV?Kv!A< z`|QSBBQ>(8jTpVU`l}^xPIfk&`=94bwP$EdkGY8NG#xO$0$Eahd&-L(!+rgs_BfG| z0+*!uDpp%X#`|B1k-8|NFeZ#xCH?3iTg=NwBGo~k^}Wmq$sbX5QPNH#v6piyv1hGy zdtR6CweM`!hFJxR0I98AWPO>T!$-u3l3$q=aPCj(cKkEc+qFqRyR5#D2hH z?AIh7M4@jnJF~zKnc^_M%kzRQ`hEM!Fo9_NZ=0Ij+-+0Bv2;B3N<+=Ioy|c{Av(E@ zl(9Sn5w4HpV6mt;egr9`d))RkN`nQz?XUQh9_e;}>*uIlY(P9-os6j+DXHw$wC4D; zH@GkjF7`A-F7KYAA70U!mo_ajKrHungh)4%=>76J|%IV z1$~^OQ0?I2YW7&Aj7Wx1NJR7?oXXrevmr~fao~{?f{-*wD6?ZK8BKjqdmRiGzlr$_ z`n~ov$=94PaNkKAg1D&m)|k$N)+>S6s!e zs+5Mh2>H6KadbF{)uED+j-fSRc^RSZQ508GF;^$`mY2H|3KFa9ALa*x;l0)cp%hC{08cy?@2sZ>kuH-S>{NdCrk|SnFJ@f9gwEwOmgw zly0IqPnex!_Q??XTTB$*#F&Tv7_Awm?PG;P%`TZ}1ULTBvC;Cr9%<82y(pg{XdQ0? zb;QMJr=HO_+ebO(M%grgikG~iP`CJ~(r*(=-<;7O;z!&lY`Xz2e#MY_@^E1OiiY;? 
zais<*E|GZdFI{^=f7qz&vaG#ud$+@3U8@n<_%BWtVlj+wD(*QWWQD25h@wF;n0Mlp z^=C!Oqs*2Y=MJsp{)DK%ja}U!Pw&s~wXhFPXPgHdkZ`V7bixrp_Sgp>DGZ4gf(XXnmT+MG0^b?^ z%goiGOf}mSg&6TEzfs*6LU8WJi#330x(sPiE=YZD&n8avX@Mx4VL(O#OP^#!XrOa8 zz%R`ZFf8YK|GD&uRz5q58u}Q@60w$A2u`0R4q)-s)57QlkE+xJ0GcN$bIj-c)%_bI zUC-S_h|2@p9j^wWsnR*S1JDO#-_^bLiUf8&BK3xz|EGa80Qv8aZ z#b3>Q@8JzZF`3KoJ(2P13;y>%R@C0#eSoUfS|#mH114j4ScIF0)_nb7;##BIRy;t7V>3NvOUdf7o^HsJj4rU9C| zDCwORK;EZ+E{G7fMB5UklgzMHpBw!ff5fTngi(6AOI{sp)71f4#K`6F(&~uv3QPnd zTS?N0?8ps>_$r&{2Q{5NEwDePflagrI9ehwpLP6d=VL!tf4Tixhe_e8Fc8h1Q?_{* zWd{obRhbkl&j8x8(9I1m^2bGf7@C*`!2G+@glWzfTfbj^(&K=8T?;(%7mtFjQA-FAe)N=>@I!-f-}=~<5HeDBa&PGK$-+{w2RusRp?{bzcpR-R?8t5Sx6qg)NlJVzQ?Znpzp{|CC?{&nI zhF8)zL`vOVO2`_bA$Og%D?~Ddb|e-E>zN{>L`x*T3jNIbS%&Wy@gel*5Ux~E4dj_( zzKha!U8ahZICl6Tb~RA1%u@5EBrVoK#nwPrW{B(G8U?uO{$6T-AODZ`?@jkV5dU6} z|33b6WnobrlC*^u8ZRq=cLyRig?9k3-{DOJx{LNXZEbT0RYgHd5wlffmCcq~l}=LO z!EQm+C@n+H=#EnV-b93oQAj*Sydq?Bthmcq$&YiXclsE!(^{K_hckjR@ zpI>zQFx=nCiNsFZXDktWi6Xc?UMu+TwV71H5T$z!KN#J)f#|%2Fsj%eH1^F3&yOzk zjT%L@%mGb*=apwyd7NQe$^~46QOa7I{nD zQosN)ol8iE7X_YBAXv}|#$Y_37@JR&QaagiVGVavKM$($a`|MIL|l6)T_tW;v9^w1 zlBugIU;m>bf@j-Wdr$B$dY9HTVx9D0<5!;!sDM$S+yLm*b0g0-3~>I02fDaol?V zG#lAT_Mw>yT8b&QZVPc@5m(74cC#qd%WQNNr3U!!agSYCA&^#%hQ9U~n+x6>&DQO= zoBQbB?YuBAM%oTtQWh^IiV>O)P`1@al*(byeg-)@ueh#Y>kk~zifGT?M=fq4t+;fK z9@-O8g0|O{mN?QI(g9Fw!RL+*dHhzf+0Sl_JGXGVaqeJOLOfUd5U&t1+|}E}GMNN^FW8Mh1C|{TsWs561kLzv^?&`em+a9!=H~ z%T4Ki=0tKiO&`Ng7|s+bp*MGreUcvwP5EQ$b6;rrckm%7rLe!~)ob5kdt%hckF> zj}+P_t+U-?_bplPFsp1H#vwgCp7u%oyHozNlUZx2+}@)< zJnGj|Q!+G1(|NO4S*2AngZ%K$H}1tF`(auywbZYurSg<4uvs$(kR+da+4_cClw{&i zr`)26bg;D)wqQP9->n;T@)J_A&CS&GJC(tCq`k9xjvifh^k76%cVIK$knfRQ-WzSfAYr-RAYE zVNmRt+X<`eFz`!ssEohaWATS;p{6~r9%B->(j8#V3*XfsQW0*0) zB+I$EyI|O3)4i`%nLWMgjOqH%Wlu;B-Fn_=@}5}!9izd;IMuL^S4=`5JEp~H^tg&~ zJvOG4Gd2V?nB8Hg4Ar%^2&j*<)nMNK<Hs|4gGAVSceUpm-}@0}!BDs_zs4~7^F%g1Xq ztfT{q^yrO45*Z$}#C9b0j0(Ra_MQTFZ1&6Y6ZljfJrVodO|o83IDhbQ^1;{0nY-wc z>Cm=ZsM2HxZp6k`U}^DjKf1aqDY?crk0bG*i$<9$NaS>L6E5{#fkKG&m%TU!9|z^K ziDEIUBF4`CiwsRrYC%y}blDz8u0N7%D>{H_*}-acsU@!Z1rdJvXf7x_wvchjEXro3 zW9B<4T(gM3fWf*&wdJxASpdN)suU7)X5srUi$AGp(H!=CFmpoU51!gQ(><~5$mh;1yh)D+m08Vhf5Z8ogd>| zP7l>ZqFTFn5F(dZrGr0g(5gCQ)S!%a1X$(3=+u$qg+B_`@6L#7QL*N&e2=M0Oss@|pao&NfOltt9&ADGrr zOK{@*XwZ%|Oe;8QZ>|(Hed3u$4?*}W<_S#lMQU7Ke8Jk}09wh;Blri;B$FjFZs6Qq zT0|C^VDI8n!C+Z$c$JEe_j(}bvZ2W}>Dhdqy+*#F2^yA2co#KY?-+*bAo)?`W~6*H zjXmB!A1rE~M9jVS5*PdQGeb-Zm>$*9cz5GH;i*Azi%TivtY2RKt&tfY{OzWrg;>wT zd|olJOxhk^7%RAQ@28O31UQ2K3}@w!r_+@&L$c0schAveD$*Hk*6-e`JK>kn*u*s9 zUJvIn#tImbr=)!8_$Jx^7;0U1X73okrZbr}r`+rJaPk71DVd>YbL>}2_E|Pkg2+{t zSpK#TiCRkP^RnDrRf@Uc%bOE7+R7sRXL4iSn2<3FH^n^KlrVaC;S0u~n0YtN37QK| z>hp`x+fracw(=Ef1``#RQx9{TdRqvxhjT6zaU&>rs2j$y(F!k&)6ih52-@w|uX} z!6oZ=PHToNKt&9>Hadw-sqscrH}Hk_>J?3(|QxFrP;_ z2ltGJm`kjPEo(J|;j*)NxMpt`ng=MCe)P%vwedmeBOdBZ8_IPJEwDJKV14w(`N zAo}uE`*wPDOFeybP3p5#%~Ue{{=N}+-)s{Lp;9XbFGsT1ROnY{rwORO=QXCt!%08+ z`M~#8DWjHU$M0@mAw*|itC#596RoVekK~}ym;DzOVGVNivhlMP5ARUgC~nuJb#-DX zpJY%zw73DZfiaX1FJKi{V)kY%Os)U6TK)?>{?mE++gbU)#Qz->O}m4Q-r|xHjwxeg zrnF?Nt|L+h+QL~B^wOHcuTY*kSOr6=N2vpjcI;|o_f6Hlxo?a%X;u%9aK=9v{3Egi~j?mq4W&^ literal 0 HcmV?d00001 diff --git a/source/tutorial/usecase/img/ecommerce-category3.png b/source/tutorial/usecase/img/ecommerce-category3.png new file mode 100644 index 0000000000000000000000000000000000000000..472fb06555ec9089e3e1bfa28c789188469502b2 GIT binary patch literal 9131 zcmcI~c|4SF-}aSCB}-CCLMbF`_AM%Vwo1q{mc(QmVUR3SX&DmPckU3$n%xYBQbNOT zj3vt$+nBM8vBYG?@?7J7-sirb_w#$+_xn z006*cY@~Mw0QRB4pZkZ|!JelXLLC5b!QWW#`rXGPR6_Viw?G1Y*uHNGm 
z3vm%6l_j}cxql^|;CS`)89rN17o~qZ_>~y{9ywsebxQYDLc+^effm8m1_rv(-)`Oa zJIQWfaPS*RWM@)jdy zV5*Z``!%GbUm5;Dt91T3{_bxsMMs{d^AjQwp$^FShB>s(c@02M0?7>^&5sEI$Fk39 z0OvHW!+~q?=M>;6r(H~UK79_$=eU)@#!4`)?`q*I?Lp`N7*NIn&|Y*Iv+Z? zGOEhf9~;C)@EIuAH(o9N(^Y(&mj1 zn=M<-(@V@lp>(S6kl(zNg_DD$$GoOx{?CPi-9erRT)Er=R^_hKRNE*w*M`F*7xY%7 zNaKBP8Ks2^rA>6QXACP2D%auh*ZVH|$h^_ZaM>kn(ohY_zwo(!4(LUcAk5K}e_!_U zRwBE_=x5hQOF8xRdEL7eR~Hsld^k3K%nd-Z23pKV4qsAOpyRB{XjtD+L#EKCY4}nB zf6L=X-=O7Q_2-Gnmm--w9!pn*tUrX^2RS(e;no&0`f*DZm070X%a(?N4I|Y zM8_OJ-zk2K>y#5(e6lGuvWs{!V?hme_|p<@(wCeSs}CpG*@@l}uFsLup}&j|e1N{Q zW`Zrxlx2&Hv6K?i0DiG6qohsnjvTwN05_c_1jWgR(Vrw*e)y7EbUb9-f~%M~Gfb7x zw8pI;Dm6uMy^pXTkNf$_Ln)VL_`-423=YV@&RZ}My^ma6b<`W&b?CFnb=Yw~W}E@w z!|r|gNE4PoeR}DWM47K1IHV?vGAjp@B|sK++{9C(_=8)L>#pM5er8WxgnHkJ)1Aqb zzPJ>)$Bjh<-D-$f`97~(Wl#m(3;;-2)H{XKA6}foDP><5CbXboc()9d3tbVpc3gMn zqRSa;gC_8h>@%KqMTqX}j^t(~%yA*WTM;MLS2|>7S^WX-4elH}R2EZXENqp^pje5S;jFkdjHJ8Fc9G51)SA1bM};h=pg> zUgv|}eEXww)Gv8k{wcn@hf;0yq3T?iEmm4R)08FoNEZfsa^cb zX}(r&JWdD?Is!Q7{2G!L@_AfEU<1CGTMz}ebZWNgXBLF>e* zPA0BVIC$6$cpBn7bPFE$>m&ex;Q5pSfB*|78ux!)f{6-%KzI!f03ZYd|91KJ_p^ro z{r%wZzj^+DPWLy@XH7R9vV99qA3h1L^8x)9{52J%w7-3NWwX)y9puC5TwUDuXv^uS zaC1JO`sX#I+tjIla!MRx-SE7@RCKR{Asm=Bg!AlB{lM#tiZLTF1O6v^+56I2Yy(E* zU-Y6E+`qDRXh-YMR-64Oo4OxmdA}Efpc+n>hz#9b=E=JN=287;zg^wkGg-UkZ#vAc zg^kQXOQV)Qy^CAi%P+z!17le4L4I!T@t}_t3vcqCRRk}n6*_sQ%2=D1>S z#PmcgnkJ!ts9kN!0jL4?sjEbP?dHF(Zq zor!uxi1+ejTm*{Ru|UVTZGzl&ZP+MKj_rcGwn8f}VdA;U$h4yr*N_g-v zh){p+%vzdX^?49-F^Qyeb+t$=9KM-35LYdy#17tTzZ@^ za!)U*O?sL=KZ%mVB4E*yUWd&_xteaBKk-th^tTcqa$ck%p%eHJq~soRmmX0xdoHK3 z`QbvotYtT&?=2Cezc%G;ZL|{9z*^u(0%@RBV3HKru4V=YN!Lk%O{p8NXB&BMm613U z5#mh+>Fg;bR^v{a%vAF;b;9hgQZI=XI~{ zFshr_JMOMQ>)CDN?L_ugL2+L)riGIYG2WSq=xzVvmu|KhCokmnQ|?rU;zN9Wyq*qZ zrHgkM({VBqsIlmGyO!Nq7poJ_u?EY8@<7;QOmA9-KT<}m2-Xt^T6&0w=fnWJ=zCzEds zQw&qEX@Y;*TtNLkzfn)?x5*jjM$V}egN$^H&~Iev>Ivhu7p*_0_dHSjxp`Z51ZrtD zCiX0SCNj(-WvDfE3%lS)%q2CJ{R5+={L>EV`DP&aA%{GeE$MH^YI5AWT5j=q_?zhn zu_A@_E|^W@&5E~M7b!fI#$3<168@T{$5aF2JM$E~s^`{)W;dQMJ>_R9#LVFz54)9| z3{ItIDq2`ArwRY>qVj*(^(-Mes1Hy7=7fYbfk_O4zQeMf;;f(Q4fxc;9^fqSPaXe! znl95$UVI&|hVx(p*4q4zuKc-iwo(ej{U77AorNfbY&$o4KOpZrh6j;9Fo&=YNRu#! z8627&b58|;*(eL8{0m%oo8YWH?B7A#YLrG)u-b=uyuTB5g6bjr!FCI%_2#Mrcmx3F z!cLz9aVm+>gNtV-o7dH5=s+<-&QUE&Rg*M|DS}>%BV(MLRIXc{>=P)N_ia^yMH5| zf8(KlBPx(cS{a`M&Id)QaC7Phx3=7ifvW2RKQC*Zn%Zz_lWnhQ@JVw5W2NS_Pc~L` zL(L8Nc_#Hz)3@90V&(#Xq6`WujM$pcm%!Xbsw9XY!7n#z={bE-Nb661$IFKoHlO{! 
z=mHj6vMP{&RxL1@=?S0L>_cIs@9_=b);RQE93jf$2-Q))*AWYrmuNaPlI7K)o3cr-Dq$?nv=9_o`(aCH~L&$HA- zKFr!hJ}s2K7wL4=HXO6v?HZ|yp$bt~kqggsjX?O?x8prA7mz(Xtgp@-2+R_MU$x)lxzFPsDbFPFGsz9h;Y@=TsokQ+s$ALU-J!)odtg(@ZP2)A zITT|gxm#oba~R=)Y^;doF_#T>u70NZ?k4Ag6~H+Z_9VpJc#cD=7R{+4kQxM%bDame z76Z$L=zB(o$Pq!2e+`~_8Fjhcsue#V5l*oRh{f{t|&F z*mJIqTf}We?fqv4aDj7u*|gP93%(COq|g;^RI%rVtub}H^i!WnC>x1gkW|!A^jI6- z`PluNKw9G+A8^uA3ueA!K_q>;YD!awLAonI>SUku>OPl@tYDBBdEG5HM+!Q#(us^VIn#aY zVCso{hHRW<)z2)o^1punx6=Cmgn|&bgw}3BkivuW$2)oS4mp?WNwh0~qupbJrC2)( zovi@4znwlcg*iEK|KLidOT@mX>8>3&nj+2z^?Hmoe|f%zI5o9>)Nf#U9s0Ur7y2@H z7ZcFL8 zp21^PF0Xh1D3zSkMLjgdO8KK6pBJXeV5)6LXz+&A*|sYS4b`dQL}rh>hm$kA?*p}# zeP@BER`KrNy!fYbg*vtBjyc&QcPRzTiXNtu;DyG-%;fcEQotb%niDaEmNfZ9m>MPD z9i)C2c$)7*uMAq(GrZhrHPz={qh9q(*}xQ{4TAR}@d8Xx!;&AHChT4&(~O2zXeEWc z<_6@W4tpaMpU{YWx|7+}H`L9i+K%Lf)}(9IM9?_~(|1tz8Mu8{;yAxO$eU=!)Z`+z!uVtu!+;vFFV>sOLj z$o{q3Zsbyr@{SKL8G?OS?UkXfIFFw=38V$w(d}Nmh@crwthl^gvRbI=!KiiXv}Nx-?qNMw}SY-jF%(KvUkdeE=A_&Rl|ef?`7kzn5vkX`RW zyRn6m_=SF5tAqF5*BW#DnOo9KLz@`cr%z=d{|#)oU>&;LWrn3pB=@W7;7ZL97Xjz} zKel~5qQgoBgd7rDsuER0Mpf+L`kwDZybQ?0TlC67uVoIv`+DEjMlQBbKu;mDQy)9% zku!QZZ=tFv#J_3|R=pJYb)>RQ3aIQXXq3Ui5VQeP1%6UV-NXh=A9u2t>PVE%J~(?# zQR65lIG@ql2o?L#$ueV7*0*j$de`0-{>|^J-6O>E&DnK_1$#LFXl7HxdJt?4`o>b+ z>#r)V3c6p1Lmcta;dgXBdu^22D}dIF+ld17L#YMn)ENP$Otm;FFQ=J)iza%Zc0S<2 zZ9wfwpH@m}6C={3FDvM_B0BP-yA>Sb@AVyL`j`u#yc4*g&`_WQoo>>41Q*t`DNrFG zUf`beD(nXoP1sY%24lEoLmEO>9V@Z}eruAA;e?T3&tbtUzL$WXC#G{sQf4bnjhdk5 zh&8R~^6n=(N7w-O=Md~42{$L=&)vi-G6epu2Fj!caR6yH^D9>fvt>liqq|gv{T)HP zkW>P;uT4t%{f2qb!Kb#+zWvuN|H^ITqIU$HfW#MeE4gfl#Mf9sKks{59Fr-w8~^^d z`Pg@+o93JHkFUqa@!r%B<9*IY5_3XRX1ms?!g+#U3na{bPCWnxO&a8_ge@OoKJE8n zt8K=I=}UDhKRt=2jTOIf!^#=$9@F);L}E8SW+tV*Kr z2%0c!b+FE7-@7(3`Z~u`({D44E_$kQi=N+gB&RWFB5j^FiYnLaB>qFO;i9FSfWr;= z=rg}wZKbP1@)9pScULZ@-=`-{0pq#>>0*j;BujfRWVZQ;|#@HCO{SJ@V#Ccl;0M$4zpxZ}s* zBO(q78{w>FubCMb#bpK$mR(N^5x>eWMg7CCYyjmIsuOJdNYycJf54Mw3@$}YHTy~5 z6&$aSDe-l>SfmNe%ivI|)|i&ME~-uJoJP#o7mL4EV?zJoIx?xqx7SsgMW2*PBUfm z>M4-mCH9?mV9Wc;u#B`>^7wASLlG71_}S1Aprw#B9u>p`hOV{X2$7J8=0s;oWRABd z#&fQb$CEAH6*jw6#wAL4ndL6Rd}aZDMb1?^Ine{N71YP(?yo-BXX4zkyY^KjA}LGVLd@S9E9osJWIixyn#ma4E- zO!-kR&_tO@F*yUIxdQ?9uw+jkL^jvC4YmVJ?x>0LI27%nPTpYI*m zI18GNaq|JMxpn@ESelm;zKD`JhdTRa=9w{#{6`%7c_x0nD~=HGe(DRIDj4=BKUX+x z?z=A!=tW<``BYKnwQ^74R|)PN24rRBJL<)=#*tK84%I)*LYtJCn^Ar!ZO|Im-lYe+{lKT-`v7f(e-~>Unz0 z4X;;y_P7(?W_dTL^?o$ZVt}Ou`<%dmN5JPz^T-3b*Y9PR2nK1w#esfUuOg z=5|aFWZ^tGchg$Vg=y{|zdmXgd3cD5=h-(?SZ63IL@^9$yXNLh2y`y9SZPkKI~Xy& zfgX9&P++26{^cR9W!p#l=V}vTxlY$h6cwgDiicQ-EV9lRRySv<|Ks_cLuM` ze2%%0mO`r@hXq@TlCMYTs`6MhYRihQ&V2Upn2KyX0a3Qt0AsLRrS($Z39fEziSLej zJg}}>lfD4@gY^xin+dcJCZ(JFTr%`W~52jkjm&%ci>!dsq z$tbdGPj;4|y5Xv0i&u5VXniE*vd^kfSDFJSNsZK;4vvL9 z=$Dxm?aeO>dD~6zRJ09Zo%itj4d+}ItRC;8OZry%oZNKiBUKNl9M;OE78*BZ5i_V* z^}kM%9iqz14=}K)!~4{q4o8LK1_n&(2NR9lFWC-s<_--{Zr^wxDqkI2Dm98D_M9Cn zkO?VIruhF=-Q)MtzAWHQfVU@4G4ITQ^dxbNBu0WMdAuL4ekIZ3?b1rrIn<4X*onp2 zs1eoST|!NxfA4D6)^-UgSt-$a++?z^nZx}2ZG{CDd&l6fSnBG|CCm(Ira3F1HaxI4 z$7ivy?C2+t{oj$Wxghe^SsT~6lq<<99o13=S$N6c%uR`S>yj+ImGFmJPY2u6=Sk?e zx>ri(@ZlxoCyY$8s(A9t;qR!Y8U50km!1_lt$WBz~t+BHUH{{R^hfgbpthN@MpBNY; zU@vrrZnq@NeE5_;bWop~8KG$TVF6*d+S;(CW(@b}vlxl4vAbJVDTQNjO=nboI+&j! 
zpIG#LJ(p8h6GizXi4w7LT_`N`FGrl$Ua~GYes4VUZmd=5MT@eUxhkhuvko6u z107_fX%bHEf(+~o;o}*n%PIi`!yzGvEjE?cv3`OQUCym%-tR}WH&Y&f#!HnL3(=<; zPJIrEb>s-l@RKN**uo+-=fp$2r^j#{-k|z8;WHga>QXn;m8_iCD<|GQ=t<3l;U@~T zYdkhr9%jf|9wV)V=wNkiQ>Z2~@R600qwg2rApOSZFoQKxkSA)fDEn?e#nCZ_DG4{z z=2zR{dB^D}?n9Gt_chNXLc~`;rYf5`s)_%Kg`JIZFKp20Hy1zTBwL_VG6B@!#-C^6y%d%C0&=y#kWJcnap}dP!R#1 z11mfu@mU};Tg41+b*)H+e=^X^8-n|gE}{9bZ~F#8JZ)rh`qk|0Kdjmkq-e(eYAJ=? z-nzl^#;TS!_&a3*4wDMyb5$I8538*%ElH)JP$YN|1L2B|o?r-V4Qy3qQ9czz9W18` zc1H4fAq<$;U)Sn&pbm~7?GVbEZcZ~(6}jbhHcAj!7UlA8jG$z8lO^O-ck;_l9z1*g zOu+W3iY{`Xi)F}V59W8fK)i#JM)!A1Zp~8_y|KY3nH?XG#2Pqv*|%r1gw@+Bl#62 z)*K$_(&x+F)9Ix9HI)yNaiRk zCgN0{y*-Y!Wwsgr=2MHq{8TeL*N0iA$3FG%Ue997jL2zzUg5zJ!_htQNUtrrRMvtMs zHd=lhxHYVF)WB1H!s;}fXZEI^O?KyIc)Yg)lxwZRC$pv}tPtH(^VUW)OkuQPHtNBu zC)&d2kweCO{dtrtAH@8xvlU9k5+bDSvzhb33#YUb^-@2EE+_B=7NrJT6U?@YhNp z&*$Z5TFkT6hn86FSm$JjCnnrmg?FXgJ@(1F)Xudab4W7rj=5XNidGt!`e9@o&9T z(uG1xJkqJ zNP0A3{7kVDuK$}+?O9Vk$i@Eb)^nSxFHPTeX%05ZT0WgQP0!R1(7sEA(Z5gcJRCd2 zaKx6ZGyUUdU#-05hYb8clXOY-5djanlFuNO%9_$^VqrPuM>rWWZh_|^ZV#hMR=01? z{#i^CkV0icW%cT<<4Xof1$dbzYXz=eJF~CJZ1=_RiDe`Wvwd|tn=%KD?knTZ>0c}U z=6hE}e=u>~3?=qKQ_*B~$^bU16$QI0O41ExeDSalR!i|XP?BX=ezwem0~;t=XZBc_ z5cch}xxYE69uKgD7BBd!$aVMz+1-#)B4PH0b%F=>?{$J_dFHG)Nv^|h{PWMhI`;JI z-beqv^|O+qQ2Y?C5z=Ew`pdT>8otulHmuy)N-ny-EvAph`>i#JzG|j_{Lj;W z?%TKTxS;{Wa^JrF1mNRxSCnB z?V!yMwXb;om3&I(^I^@kCN`&Uq1o*_3G;6~J%YNG>nEROg$u1=pA8IT3}gfidk!~R zodkUtU$r91$ASIJYojr7@*ZP)<(I$1K_C%1Li)~5!VSbn5zX*xq-YH8;>j$`e{UsK zbVyPv92STu9M|7kuNrzpI6YoIpft898|_X?m|fT0tyGKhj9A6*4GoA5okPEwWchF= z<1+DAUXG^Hvg`UvSF3|FaiNbdW2gUZME%~tB`)fCzt2;Pr<4VHaw(%`2pSPFv>=fp z_VR|{;ewjW3AQz>?{XLY5_ABSB?+kLGKXTX(e&CB#?UWn)-h=n3LOGDNxhFL^A(hL z7Y*JSG*9O!c#lV3!Hdd0+cII`j5QptS%4I4S{u#hLMy7hmiq3T@r>AdHZ@*h=9A1R zhOXR6L@#9CLC>VSVro8S)vwg_IY$K6ZJkpZavxuzmukZv48oLY*U?JU-L3>VWY!{w6Gh{|5CPJ%vIma zGpDw2D>IOTUA}`t$9)Z&nkKQw(|vjAY|yo9 zpTQ54vi4r^9TChyPIYd0`!d3$1cbdVor+;A*C!oBDRRRkuAc2}Th5L%$`(h&v0@#+yR7)$0|?K-ZUNk|;Tr|MyxLoe8u}g<3ZGV&uF(C5m@j+}*CBR~_pxA( z@awJ6muKL2wu~h>@%xH3SZ{o+DS_}5yn4t93sV~0R{|=%V~qfTj`P4U50g$5TnbU1 zpyeK)iFV$p$&+Spe@MTn@2ZYxUxNElcHZ5ONpOvGMF!sk+&@!TR3-1h7j{!DV$pyy zN5&iD6*=683r}T1ElIke!>=H%w3MsL*2<$zjloe=%2rRil{l?C}?3ZgtZ zho0z)rS*Kkx+89C#j#}RjR@qA&aGv_zmz27d$eM~g9q2uNp*+Xn;%EwrR{YkK=Oxh zu{RE|>=rHS&Nt)K6B5WmV#+O)<|p+B*E2qo0Ei-z@gV+Q^9@N|6u<_O1@-#^YV>RjeC_$PwBsV#pi;M-2Eg0K}&jS@emM9 zBnd4?jO}M8Ws@{$aH+E8H*EjbS zTZ#)PYdy*waOqxj4SjeSiUB-~ivzGCungci5C~ux4AUSSfEWCS9vl$NAk475H;p;w z?>&DP^j}2pP5awo|69=i?)jHS&;Tt5gj}#R1d<fw zXmNm(h>SWq<}G};m-@|zHFQ3XCLTU<96?FkS{~?ILhlsetPp_tPANKt z%{rF?q59>syK654y7)0aLCN~a-@+We9BD6Y+?wv|f`uAa2f9Qa5wb zYR3yvu?aQrB`-`cgdO+RBj_zZiR(Ipx@wu1qE{>nYoVpoNDfrQzgTp05blPW0bmD> zFC@pgVpbr(C!CPO_--TiX~`Ho_`X|}y1*w^(jdyKuFy0580!Dp2LSC3c&815rS*z~ zu=aScRJdh~%uJvZ)4a3(eCoVQSI(ylEzS?EzRMe*WO4~t<<~OOhn{!qBdUZaT7xE9 zyQNe==3C>LrPFV52FpI$}U(ii?-@z&J_;SL_kfr^U8z@BB59gh-L5&W+ zYmeoM9`J4ESMkChne7(`meW(@tuj+^2Se{m+YC5qb1mnSg4-w%h+f9A!jTYH6{jaL_3CVb{z(A; zHMnT2!98!%e8T)85R|c|OpvH8bjc{v0YZJt&d4^h*jP;v1R{%G7#4HTN(7wdcd7OR-ha=8(JikfaKCgyx^I!z*ZO!CSDP~pX^cl*tY9PTl3 zb1Z-l@;Pz@bM&v!-#L&kfBs()W^dO22MjP3`a9+@d$2z$t7Qp?)wuz~ zLFT+;=fMB3%KwFi@@m&injcuSB-GH6!QjGCTR*^v)@i2AbI2B7UaE_YDiJW_N&e1v z@pd0W{P;Nd6v~vA6g@clLpv~S2C;KM80*>iIo-pW$4ZXGabfkM=N4}%Ry%EMszzBH zn$r4rr1;Ot&VNjBs3Xn&JH3R-okyM@6ZCU@)E7EvUAHKW8q0d)H9>QOl)=#NkAH(p z)S2$Bx!y9a6Xh!7U;BW>3TZ}}2&LA9_bE?SXNT-?>jG#8Asu%{Ff$HI6OI&u?^d7s zILk+tg37v>1X_-7CV8Yq&M#xQ{2vm*gr)un!s1fq{G0NvFv`l|2^m6p`1Z=DZY9U+ zimkA>*7M@8vwJbC*SPgFBBb?>#xAKlhhh8HbGNCF 
zN#_GsP`B)XbK{V9;deg}rvVx~$1ht^sf-q7+bK@Ev#>#4yF%;`g#OB#x>^3o*lL;R4w{Pd-}lpMx_SQ*oRHA0AsC zx$BBoH8V=D0Be_U!Kj|UckqbDvHhgNmR&wqCEAaJEVI5qTL-p$gtb{q^d0rK8YP==C$BF)A$7hX+p1^_2>iVd17Q{& zsBL>BYKpj##YiGMFXK#%AIHU>8&uagy3@^DhiKDk1F5`wHIGwQ^3>P|1CC`}1^q>W z$z-lcUO)dK$vTqqK@RyZsx5B{DvQ3t>-&c?+G3IfdY@ZcJ$NXC+fK;N6wpywa|9+z z60kXPmDhQuaTS3nqT-u|C$lJ^9bAr7$5?q8uM?SbQNuD+It8vLuO5 zFA?}6ekai>Prly8IP^zu7m_qzN^x1qMacSH+qBO zIcSmfnQto_r9$sR^InH<`V7@Z-9b8^8DhIf|6Cv;`2}!W0IX9p9vHrHO{YqqgYh#5NE?}j{2bLFRecE zaX{Q2_+9{jnF{;2K3u>IEfoHvlwp++*Z>tFe9#T=-)PrZF$|@;>UutUqC{~~Z#R|( zfv!w);`w{J#Guy0)(y0acivLw``X>u3tlA;-U$UjxZ!olWqC7G2&9G8$ihT9HHcmL zi<3s%{?64Ec53$iUB!><2$J2bk%9zKW3~Z6quY|X7gXCdIgd86ijwJ|=L`Z3widPVc^Wg)F+&UUj zrdA7^Ca*RJ3p9&poE&@>?7T$L5|wnv*9NU8%fTT&*cJ21RQ;H)u5&4u8|_#N)h!{A zQ~O$KEx6Z_q=C8zHLrVP!+c_833oZilw!J2iZpn72vf<|EV&!AOU<2pkSFb9gDz9h zZdq)Px5f9!OEvGi&VfJ&M8q$4sit53Todv!qT@w07!x#4OjCEeig)tlO}iO8-kDu zST{}CfmEsqVzFH~GpoBqOQ9l22dTreN>j3DL8f9(XOge^ZRcjKCw8z3jDiR$?JgdZ zsZXwkKCFF+iBk!bV-d^KM6-6cUXB4jpH7nQ+~PJg4m`!rfh~i6PTKWw2$6_iK=FB! z4la->+knB+p%Ax2f%EElggcQ%`XuYCtA)`P&seWOPIp3nWWMC4ZoLQ%`YvfYFAh5H z7G7VKhLnFE&_40LzaF9Lxq$R579T#O>CJ_VHPEpob=`K7Q&CMn|Bb?fT=94vj6PqD zD59KuqA@kafz{%w{5F|o+=ek*T$UsVTdrEd2Jb7`g`Nd1<<1e^HV635?iL$G9H)jj-EsUe$Kxy|Pm{arhYR>r1*D z`&Xg^X=)x4Wzz6gwAAUwb0zp4rVx1ndc&dtX#n-`QK#Q2#bX%O2Lv8Va@`oGc_J?P zpoo*jxQsUF!2Ja{uy|z^@D2;NO8{NXPI0T0UzU5>Ch#J3{?x?`M~83V=XAGXi`7Cl z30L-Z`ii)_YWk(v!1fQZmmU#aq3?g^I#SJy5ij;!FTir_CRIb}{+b^WRdmicpiB)2p+91@i~K=n%Yz z?H(O!Z?jr;l9^s8mz~Kfpt}uy&s9VxW0KX|3@ld*T@Ph#6uJVW@#EB{n_F%`dl|6V z?TltDA9>ahz&reHZKT~Il=uZ366~Q}8xSq2GL`OI{_V~wQM)SE*eMrwYGt3WWK|rLzJyXJzJ{IwX|F@!wyV^`?J} zv-&;uZ{z;`aSIgD53ehJ=H~-kPat4^t$F5V-KL84ZQ5FCsK++(ayGNLdyPg3{eE?w z2rzIflI(txzpTN5z?^_^J5R8eo9)yEA`hr+hdj{-T#$q^D<=7YKYx9Q@x|)$tfFe^ zX>+GkPpMFqoDe=r$mOxOJh%Fd#o2_`b~4nEH8c^cAGX=+)RF*#|6y4lU8PmXuq-?v z-1Ys`!^_zbWY>Hm;W~e%X?l$B%~t9!71`dUw_m0rWFV`|8w}Nr>G;}fWlJTIXemu+ z{}Cg=A}3+0v*UcLR^5&A+XF>?u$T0-yYx3gdBMZnRCZsV`vOl~YSnbYM$#MMS>eM^ zry|w82VVE-riayp5ls6-GnZj3%F(jSTZ-w^qs)RH zii-ZZ)@|?k4AqAS!5ro$+vlITe=C?mq~%lyE~*(VdOdw+ta;yeDto)|vnk8!{zhR_ z9c;PHL6Agzpu^!amn;z$>KQi9cNT2TJg~6|nJGy^N+UJ1Y$?%<3Dvnwr&H#dL(dH& zSzZCTO~)@>O55vYm!$PqiOd3Iow6qg+_)x=>FE{VqKdUNq2#fkDyrD zbQ~o#M4V%*^dv+Vqh1Uc)tah|?@SB|=UqzF$*z<|Sr>$oSv$Z^XuEax)&r+#U%7k?oOJ@a~s3HY#X-ap=+H9oYp-g8H@Hy^vw=14BOE+75a&K{1U z9Y{>;SaOQUcP9(2ItnaAvV!F7&mYipFSwhN_FR@gc^VN?&gUI7_=(Q!<F>MfU1%#uJLL6TTD>niK{~uI{&^qB=wUHr zR0(Q*L-)8I^hQ!4e()3Oy;EElcOIDVBBo2k$Y+Xx?H*?vvDU$AUk|&Mo~XN=T)0t6 z=v=xw_pr^m#OOwt8DWff(R=X{?xTKFAqkG~n*a3TEhN~h1sekfvnYA@BJ=R#gEGOd z@weWTd!us+QH03`s*f#>;hBRP5!PW`Ga6MxGCt$!6T^C~$rr74+-&_tTIVsW3SFl5 z3by@GeX{CZh0{rKnE55;ULBdyHOYs_`iOg&267&1+jmU9rdmhP->vM;Y{HNY;vn+7S-2K_pg#;RJD=(i#zLst+ z0@A8%PN^h+ugB`44Db%+Hofjnma*_c1rO|D%bRD}KBl4p2AbCD1%|35H_|ydQFR0(?)~cG>2ia~h3i~i zF@XVqJA+lq#(FDIPe<;RJDiDZiAgxdtHZLmLTI>XH1rO20Cj2QktpNzZeF&ke42sn zdXM>~!6H6wOWAsQtWCe?s!y#RzB7Y+v(UuaPOw&UjpA7JZ~iwEV=xL}J+w zB1Avcr8n8vEqI9=HY8E8O&?1tx+SkEbSSUVW{^)zP#mjqJ@A&N1tOALa8+*oFq?Iq z)xp+dn)cj8#z(-{KrTI zy$PtiV6e)dVcXX_i@IoKlH(Izr^gx>9$l`xQMY!1Rd}|`wD1(nz4#Qw+wo@`D>-%` zk&e5NL3`w8_>A;fGojQsCrpAW5jbkWX6{mg&pwOQr>Yk;;RA70J;-=14_dmVV^y}>BW*z;fW+VHR1ihOZv4w-BeCMS&!h2|t*Z>Yg>s)kkwJ}jmR8~_3_-FWrUgw-7J8!B zx%^n=J5>*kMT#n;tcfFQ5pH^p=Z&mqREhUEv{>=+w9i}@GVHcALoh9CjlPYEH1?yp z-rrH0o1#rE-$#8(-71mU21o%4{SB$-meEsOjJS{ z%d?mHJIQ|9W{1~Ly&G*%@d;c<#QqAlc*3{i?Zi#8~dn^~E-S7#!^ zyACZ6yHqHJC^L5TdA~|hI5(lvUiX_R?$w_5nB!zXMdZpf!I1ag2&5E1B7gHnCSTm+ zlz(;oMLTopzi0jHv51U{ICs2q7pn-!Nid%tR{&NKA6>~-SfsOO-B`w;V!O^0J~rrj z@~Y#$DRKLq)=U3(DKYM5qf!P*1lcJPP95L&TE@V3*$O8uJ?6{8)qsB`?K9LhgOq7s 
HzyE&#O~7Q# literal 0 HcmV?d00001 diff --git a/source/tutorial/usecase/img/ecommerce-inventory1.png b/source/tutorial/usecase/img/ecommerce-inventory1.png new file mode 100644 index 0000000000000000000000000000000000000000..3ac5a359de1bc04bd106c42b21a4a2bb9515d31a GIT binary patch literal 15587 zcmYkj2Q-`S8#msns??TRHDlJ^6hXy`QPgZ}i&?97?3S2SVpFS1sl95i8nud`YSb!; zy=QFy`2NoSec$68M@~HVecji6&Cln0qI9%WNgh6ac<0U?5;drj?wvb$Al%V1M^dxaOcX_y2)Wqa!m#&OCsI|)<3Vmf8jBx z<1U9PCpjZ{idpXedGeLtV4dbseE9c11wvINQ{H@5iYN&Z=ScNrgXci&ctX{NPRef$ z9aOQE@-m%g41gMJbGyuHyzkY?(cKbAO9 ztUpmOmNQVN@BLN8!d)ymS3kDUecmR>cqYvEaOH~UO1_9|eX||Ya2hBL~=5+S2x!e^i*D-6jsY79~YHEzv{_X;TB{O%XTUicg4#N z14{VPXl?m;&B4~cf@L4wp6 z+N$!+c-#!3%J&t>5j^kl`-j$*unF3JvLEE0Y6^5=vv1P-6z+xbFjD&NGG}SKvyQvF z|NYv8lLJx{Qk=plwrAy@_a#(hJor5IhWzf6+<#H5k5-Z;*LP1o=-{lmc8)ze3OTb=5!m%snvzTj}Rp1(ZHmaH~1qrhEGF5kX0uj$Q6>OI<4 zfv2NXT1rqegG15VZ*yNdNZot2s`Y&0kKjH>Wx{6uKm7mMD&8D9d8gpkpE31B z;e}LAi|aJC^vhJkvumd&Rd^5B;xAYmr@S=1?}fmT51MUGXU+X<{VZ(GtYSf+8m;KT{9^|-a3Bj@jT8+~r!TxxlOv%X@knpUSNCD)EDpGk>Hjhh3x z*!KF1_n2R(AU67p8!3V7!3rkn1le7+W^mA2d~dZIZI0>N3^&PtxiseeCwWya4LE5+ zgf`b~4y@Z#ZJoNk4*BcZw+1y%@yG7O6TZ;uB{PFdv2i+`_H2?zM=<+?8!xgis?}2G z=@?Mv3)X@`^44f{@VI;r?9TdI?Sad>x-F@mRWNP4-DpAqpW{mNCrKr>;@cyahVG>I zw}pm2m$ZaDVPQGjDT9lzg>~FqZnwY`CYZa-Bb>nJr@-ljFK;z9AuM(+3xNCRf7>kS ziM>g}iqZ#|NMY{)2lL;idc)JuzMIi95YCGSCA<-wsHh)#ccVgRk@^qS{#7L6cM2x< z1N!}S;Vr9<3xiCF4ZltUIOEQkDe+!K-QU%i4;|x6hFDr3Hj5THJ-5kq0p7Z<#4}2( zu^5|1I7eSQVG0r6B1zMU_sJ?1+L|D=$QS3Q4gw7{X|q;L*CD2!N~(y zy+)^-cHkREPk*oG+qfbc^;1)9tKqK`kdx^Fn1VfgrAt@~(sv(pq^qBx-npR!jr5l% zTtMDXM3ncbO;@^kShlH)`X#eyOU}OSYR;VyM4Vh@=t^<#M&4eH@l%Bw9_>M=f&^Vk z&BJ6~%@i%j%V|ZHrISM>As5qv|FzvD2M81T&*A;xp0C?7i5UenF2b(GDnc0Snm;{| ztS4pine7+EIVW9`ZI`gOY;>CqC#L3uR!u$)CO`O7t$%#Y|8oDefbft6U`ix@w<0Gw zAr~3q4LM<$e-QM|&t?Vj4@QJxA+bgC^WX)I`|kGumH`PCvJ6N6BA6YDH)z3bL-Rvky)BgFmm0zSGFvy@_hDaugo@fty+>ymT7T-XN4Xy`8Xg2!fPbu#qLC{}?_*ef_(Bv?bOg_qmN$X*>k!nrR~$4wT87sC$^se=|uGz;x+Qz%()_y=oU8QOs%L`8&1zUZq}`eP(zNv5zR%q{_aje3~Fk)5%bRJ z(|yK}k#cRDCkL@6LUS7v{G2%L#YC9R{FOb7hSvzF9X_;9Kp0c%oS8bWVwOEVYbejh zvmM0Vn!=><(2hD+iTQ()YzjIDvxCvAHY^$a*+1;;-#d6L|C7@}NP#$QQ|{GwjAGZ( zkG|b$R&H|{e;%)$#@+X60mzFN;npu&*qU`4(+*>g`eklY*Q574#TUQ(x12di>jDdd z31oWTj0wg~hK4(fxi|0kWMl@kP`z11iWn9@dphyeT?{9=SjbqOu@+8ceW5Ay8M1cx zxdS2AsuT%5gx#Sehc3-mv5WR1g>&aUIz*vj_x#-FCWkH3m*h?;qxqXD^w|6z_ID+e2qP*1SZ{w|{Is(JBu5>Ct ziivmjx+!{YTs@r6%AuFY+_LAcvau&IN-5XYR@7n>Y-FHQ=oh1J`eZNm`(_Fdx91WS zpG}(nGo*ZVI-tt~YVzdlr(WlZJHj>TpY-bnCWKLg99$Y)C3fKbOxixKZIpBUu7BA^ zld_s%e~Ho7#TQ3R6W^aQE}X7PYx^xLE%08&4^VTWJShYD>l07|s?(maSLQGGYkU`js?ggel>_NJbgUJ%R0RRVI_ zPdg$Ww$e#`QGjnG#n)xkg_v7TzU8qupk)2e=#Z8$>&tY6_Tb*Bsl*E`UB~-;hVM}; zp0e5lCJ%>Zx4>G{z9h8$Q`WpE%fLPID1;`nW6Y19`S{i%zTaPQKK+d|JSLzUc(_F1 zYwb6fHkm?GlL~WCfA*@-+8U`5>xylAFYE{>uuw z{qeXfeLlv!;z_$7n_0Ii?d6LzP$h^G1kw0oK;|k+d}r(Luboz|PTXQwDO1b!8mX%P=P^DF z8&{mnCTh!Xc^o|*{;$-Yt!&Osh}%tZC+zeTeuYa5WCYHYl6s%pe+kss^?9>J3T#67 z;|f(LP5yV(V!YkHbN%p6`&~lpD<98U7yoVfP0P05}|)L z5UnY9uL;b5zlBX++vaKE18|T`@}DLplKcIgS1Kx6o9=`mP^ba%%;tTd_7}B64z)gb zetl6=bl#W2Ys~?QnE9koeqeT^{V7J#F&lyX*wEp#S#WvqIUbBIy zK%>749*2MEgfOmCu5Ns+0<=4SR19s4i^d!Emkvz}W^&he=ES{kmPY%E`E*-@yuLJf zug+mjlzsfE`PB@;`=gH4CuDN*0b68}n$X*>>E`bRQ{ZJ3Vjnpp?V2;vip&-S<hP;G>^bN zr*{b&8L9#Ban}}$Mbj?wKZSCQ!oLBY^zQq4q7~xFE|YAVvxP`b!tqaWvdQ41J{Vmi z&&aV6o9QmepB*p{B`dUuUJ26o=f_;Y6LIMZ3<*Vv+-wIe3``rUDDUj22u`QsOe9Mr zC!nv;qL5OnnEACbw;@?!7=zB8_^w`c*7JELDE00|R{50+oPgv~^X(gKy`PMS7WEpF1WC1r*nL$zeXpZb z6ZT?&6CN4)6Brt&7Dcy_6ZV?*bHW|F2hJ*B((Xkn2r+^l*HbKHt~%%&u(&LM&+`V2 zA?1MFX(4I-Mo}?4u9xxT3hdV59SBvXPB%%)QmZ~ROT3nfuXmV`*1QfqNXizWR5nvc 
zX1H(xi(~5kgwTZW9(WFZEJTo#g34xG+S#wH)|VVJ>TNnPF$;=_qEA@lhGv?`wNh>wNQMatA;x^+-|Jme_DU*6G7 zH6hTj8YYB$N_y7kU!)$Lmr_Q|@&m zqHa1&B}7UDa{raxhbc#eZ!x-hfXX2cxhF*i2@YO;gi)oI`L&nl)NR3a&9Ue?;Mq4r zj`RozC&5f{R#39~B|(Q=(g6&gO0)9G}m4f8&Dp}R1uvz;UV#=^BT#&m_GfnU-ZSH@D|w;+BcvCi4UNU|*BksHnHb zw9}#A2AJX{E@G#zWy>be_u|MF0>KF#tterGQrZuMSWcXUJW189_azLUvrv{(Vdu0# z2=x_yTf%l9+7S2|h3(Ruhz(GB8ox*dF;8l=gniTZqQIl*V6)0iH@kf{PS^E_JSI{# zkW%-!Jc-%w^aYCIl|hHUjc_9B3kUOK0T03+Mb6(+s$g(T$4j7W_Q$$IwDbRS?(RfkDB5+CKz1Mlu) zJyu26od)~|>_2xwgP?z&txn~aMop{{H~FBSZ%?}lwU_Y|B}&mL-1uh8rd`UeQJ``+ zk)&8_fri7Kzoc#JFJsov)5v1(}<3*xA&%d&rky9;Uy%C9#U1 zI-v$>I4cu-Su}lzM^wOu14JuJTp|C!?O&#FG_8X9{<4uw3umhol^_RFVza8VGF6hCq+M_2g6_1qY62|JP?eeKY`=*`O^=)`Ml3g2UxEd39KHGwXrFGtswYRm8QkJs-6_Y zfM$w2&GuqM336su9{N!L6m~0ck9UW|OnUdPsSIKiOa?R|J1^g^8pI=YKMzrNhJZoK z#|WhV3qi|quIDD5v|TM0@VRHJhj(dbNH~Ip&y_iv(DBHCu{Tbi)1d)BX^Bw3fFi4l zY4!U|N&BCgdp}0WaDoJ?xIW(Oe;Q{4o%J|sjcvvw!NR<(AiYty77m*~Zsy(jB2OsC z-X6`kg1!p@)=4-}fPVUcDd3WTWKP6ia!ApmIyDR zH1i+kQ)E=Q4S|TyqB;I<`MdK3kMe8(2-o+0GU7FRs9>e5mKKj4CIBo3_n?fH;0Hc* zUbBi&%N?0-(YT5~67i_9YrupN?Y249_p{2KlgcpMR}r9K8MdMpFv$i|mI9gaix^OL z7K`uNSOv%x#t<|u{SglA;CVLik}kCT3$QA?M(A0+DHn(s`I<>l%c|9f@isG7Pyizz zoT+917N%!M_#+P^X@rA^nz13S%311`%?nHRAN>Q*M(Ee@oK4 zmkR59ut+Oe2?T*?or+=PhWXv|fclrF zr3+HZglWQwyY6=U*i)ce>U+9VhdE~gglwIL{u+*&^d#EHs%%c{F zB$T{7i!IuBbhEO{gl=ZoFGB2;)i2x4uA`(2hNYWUhcd84og!}8%IVU6UuMYfB`Bah zlp+Ey-o316satc%40xU(0ec&;H8<#^scduz`W(=!e~n7?VOm9nRnd_N5~67+GF3hR z`D(nJpF361zt<~b7?!Q)_H)suogod@5E$>g>3tMS0Vs;l-C@*Ifl0i0e~zEJIN93771-JySZ5%)F`?0n2QPYNihRoP1u^`L^=9IX z^7g#P#nYRO2MDD7r`^@6sTQ`m+!Xr(@f|+GUe8<4(o|lrdrjbLYq8YI%C2WDoMLGM zjU=?PD_@@+^7>nBq5Idex!7#R9D4qD7Bc)Wn=VN25MEZ&(3(xeP_w7r^3rl2;-yhR z5k=A{g+a(Jh~A|fD_S>U&nFa`%iXfgmJdz?hga_=a2ILI1++6Hz8AuBMBDtVeu+|6 z`IH)vec@+g_)bzlm%~Gdti(HKcynW|0AA^ra;`!=B zb&Ogl#quuIzirR_xF%w3qIyh_u=9KFS#b1iW5q>b$$wFF% zh}U7|?@)Wa72isY&wcD!IUCat$^a*}RP~SxFIuI&++Q`$GBQ}3P^;mRc#>x-hhTF> zc0p+%P}SfO6>Xc7gi77QqTd#o8kGJ7XCL>|kX?$avkl9>REhb4Xte_=`~7ULaTV#| zu-9Su-d6wXg6x0@7kifOYdlrH9+`nB@#r5+yqHvrmY2_~weQa-+HM+zfNVKpO6a~Q z!ym#lvEz8viv(wtWtID$-{Q3o@j9V${iJN16_2h~JS)EbB^HTo%i~%j@>5Lv5>vLk zzlFpq2a|Rp*taOv73yETB>%!Q;tYvMQqb#Blbhb49^2)KC#OzHxUsByi3#6_i zby4EfpA`NwyvaI^oe-VU(w*8)c?wqg>BIE1^@BopUWl8LP!jFIdIC)!KKftLe$~E3 z!(H1#^Wl4Jnip>%+Q-8SF~Tz@H4d{b#&@fu4eh_4uL*TE&-VdjX|#HzIF&o|V}}(D zzlA_1O?LOcH10Gv#A+!um|q6M?PEy5kVr#k#59(Xe5|{#AhFts6Sq6!<186|PMX!=IBS={hf1VMkuUMT>3TXR|tOR3NE5;Mz*iz#C|cQN1qh7B+(6QtO9+X=|);I>oXSh zvnI)=FS2yDDyf)qYKG-f`|~v9!-P~{+WRfQ&Lr^-BvKeO>hQ5QJB(FZ`$Hd9>LY`&*eHrx5 z!`cnmo@wTf_Z*H-Nbgy88%vO@U?pC^vG(cP@(F6^t@yO$&sc?85E%_K3r@QV+RyT? z8B&fZ;vaN&C5T6+htjW^nLfpx4MjEqKoXx^%|X_Pv&cOi!UigJSz!aSGd4g-eSszmJU4j0qF04W$gI}5F?AR?T6HnR``*q{uOqWjK=OpsKgg1i z(mf#a@Qoy@SHgec9DqdbNPGBXWE~E#HIY7=Iy~H`pGfl$7p!V{huHfHc4=(-N92i! zXXo#E!n3|nP1HUoV0Q78-MUdW7IlAe)bJvvjp6XMNksQ4EfZGVDaL?GcLT>1@Q?s; zfU|Awfm|`3<2yNb0PBo)mb5@J zlp7B_DC;D*!GU1gdcH~!%@x!kfY+P#5C)Ud$b%7dK{Xw0c9u-)_Q_^83%6M@YJEI1 zCvQC$>uGD24&$bzvWeN$B#%YlfwSVHjX6SfT$E#(?D?vx5S+F@jhkI`%f^T8fijE* zS657OL?nzFMCGJ^h2OYH6+a`_R5>X{K}w^iC!rz%nscL#k^S6D%RU&(6y6 z?_K;ItaYZ?dJLFX)hR}T?Xl~Mzz-{WnB$IyVC|t4Ed_G<Ru%};0*Q;9YlfPuds9K?%-3|{6mEFdmk1{*vODk7TxHNly z>pR5v>De}U$?(2J zA!cF;z^*9WwoE%arEn!kaU8B*3V}un;gF5Vn~)(vURyT>FOKt=V@vkGcbbT(cjgN| zi+&QNgMHv1_gQVB#8t#h>wAsxyrEjRuUh6}@)d+IMIqxalc~xJ%hqCcaT*Xx*sEim zq_oGuQt}-iMiKt=ZE3(e^8J3|9YG2}kHySBMj3naCvn$n;M@K0-ojBzY05&oPLg=! 
zD7CZUYF^tPq|Op34NCqt88%}do|xgXttzySipPps27PPrSW~7X>rR^AlNkP}pTchk z+t`TC^Hl6(m3zGr@nRoLa~oi9f@%AC8!ci(8z@&A_TSMOZ=E?tebVCqIkHWR8?%^F z0M;8w0aE0RZc83ZOIZSN^Grad4_zZ9>Xx*))d!tWTxnhZQN#1GV@fqaW$$ZE+PPa3_42I(_A#&V z<3?wJh$rqH@ljrk)|MWhk?>R}4r(y(3if}i5W?GqkOhITAEjgrD%@s4uIGwCV zs~r&4|FO4>KpFBW1z{gYe0zr^N7J*Es}4??_qD@y-VI0oTz+L4)FrrUnxieYT_;Q( zcPQ2N{*_%U;qfC)fw}ukL|`UfMK{s!rwr68aWW6(N*q%vmYX8e`aciy0Vr_PlBLAk z0qXw1rbm$U{@2>H>(B_^*lP~|kDl~ z1UyS~W1zmp1o7jM5OR*E79Yf;6}<76QR1?F$x@6eSoubGM#`majIUr+lXFvyXgKIU zJ0u6t-2)#X;)fpt_QSulo~xA*q4*b7V--aZ=1VG)-L$ofpZ@dLNW>s@$|_V9hqXNu zC;KA<*V0;Z|DWwZ`}pp3E|MrXm{wJj0xTC}e6nLJGrFd-U!OhvOm&~go=K4Z@aMn$ ziy_fUu^4A9&ac!sIdMlh@mslbcYjvdoidIFNnWh1288#T$-{`OT}7g4Zk%E(P3qjx zs{&zF+mjmib~oMn+c57q&Et`;SU@zMbU1w}ocKNR>sp&P^`#WC@bK9Bx%d2N&!w~u zO{0qTZGRLTCwZ|Yu&F#nuOgjKfU`s7D^?s)j5Bu@e_1!mF&6}D`v-m$dV5a*eYu-W z@i>a~Z{`mDtov+{h&kf@>_@#g1Rn`3BU!EJ{(s3SzBSvIa_Oc*5Wc{jQPPnsG^Umb zR_PRYkhl8n(VW=|vw(o!wjc`#8JF0Io_d__Qb5!-Rkv+s^{6)|fUQM<+w0d4V8KJPDdh~gj^$0y&BuHt}3vky0Jvs zQA?P%the4zcKmu&*qT@6MB84Ff57a^R9G-=bMLPp87=iW=_#fSto3x+FE)MrQ1sdv zrzz9)rUMdb;jukG>o5Ir%O#GNDD`hS%ybG_ljhq7o|J7*pa{riDlo|luB*35i3l^3&Hr&7dd-1V5H={LI?QzaWRmcgZh2e~7)qJPaSNg4dc65?e ziSP4aXpu3cafrz{atb9 z4w8$)@LEZPk464-X`!PgK3E4_r z@C=fsRE|sW;}(g8*&l(88A$Y78sDwx>My^%VdY7>^?b2m6h_Airlh%Fp3yvy(D!J12ZU>t8k_TFeJJ8eI4f~;lZDOzD)IBRhED0(e$W#GKA%; z08K2Ssx0Du?wh9qvZ?>|B!gX|75#%dB;u~LUoGf|)>!p*Nyg)I@()nxmbTwLQ!Z%M zz>`-=aGy;s@)dHT=L`PN&ske%nbuo=x%@^m7an9~YV}r&Keqe$aeJoTXRZ1duG1^w z_A<71l1zI)=bBCQx+JH4F`WLJz1jfF;mXj(*DKD_r)2c|xgm~ObRO|ki+&iZu=r?x z6(;*KEv?yf{Yi7@{Gh^|8yB96saDBH@#!B`UZ+Hr;zEY*hOKxT`N!uHZp^}Y7cR1) z4Y!n-?1vO~-t1HR1LO9Ki~io5Z}MlqEtr{i`>N)0(%Gp8G%J1?{%$IcCnA zbM(z)4tetDjrCz0DLBTTPet#Gw?zMGrGSTf$JZrw1-yFKQ)Bn`Q?}etzDvG&nwe3MTvmd4k z!&Z_#c;wvGGnC$z7g`>7K}u@&TMy%px+&F3g-Hc$g}vB4JE&P`kYGZyX$QK)s#jpz zHs`Il366x@=DCgZm9Ol0F_0|H=e};O{>f{0C8TcZUD*{n3p_bpSFecCtB>;w!^Hpo zV@PiYVCmF|xs-AvpCzdUd9Ku2zg;Vw?yf{@T!_GiG5fa@w)JyCsa!}I%}ek2lAKDgWJ*Ba>XC}U$(?|`=dvsuhelNn%nE%h;29lTfkFsNX0SN0jYW@+J8}73~4ut4*uhVF3YkYJ$e7i-$)+`psN`EbYptxT?Z0fAySB9LmZ9~lH|u~!S*Ao2oxkG z!P#9IWx9LzSJDPi2ts(hlv4ahsHcoTWTvB69CXJvXWw0NKme-feEXJ{nCIl`x3kJ> zu3R9v9H4fIgae;Gk@A-y=&rZ3eTb~ee>u3ZON5PS)yhZI&6!g(K6aXe1FjC$9V}zb zLVB*Bq`AfQ5*+TBvN2)Y{;ji=a<*o^(hZSBl564MlHlj2anvWDH{KkdmySI>=}&Z|Y9k>a`Dy708|+H=8%J&3 zdP=&`O9y7ai?7tVA$~S<35`Ja{yij|u4OrE+Zk^+2bVi3MMw@5faQEU&x?v6{L3^D zeE2uk-99NIZo*>)m*1Zg$qs-}9;#6RuEicUIyz#@;6Q`XG0)Pqexgku8ri3U=#|H8&$M zn{Za&!puuQTr(_%WkIKz*E;S*)~4x-q=iDyKUZAW>JEg$!Je>3=f$uoYwjtfr^;FDt& z!@lj^#BKg`K^E)TlPGX;766f(XG6)fh^-<+%Zr?t@K zP=eDxd;r=zk-x(BXX2~la-5(5d^u6yp|t$}3|wW*f?Li?JJ zvVX?Rv`t8A-TxyJb9F1&YcdT&)yY4EMYDMKW)1pJxg}x6yYe5YoC|!1(8Euj^h0fU z4O=ypQ-2Aac)I5{F|Reg@zp6&hT!L#sLyGwbCy_gV1`QFmQQV2K!m39&b9E*OITvd zLhF7-GBU@7FB2@FHX{;W`H9tUtx>m>V=SU_)Wm^P!?3{f#yWXJX0yjQUGJ+R9*83e zdhKb&Q||h{hOvlH)DLvW z!qXUK<T|2CS!6Zs8y>}&$RqvJ?EZsi0}Sq%*%8GPz!&l0(uFlS z)bcSoqhg$|;LRbEXQrI+mJQIT-YxEPPL&H#0s~iXY}5`&wUdiG{CD)5Ip*J$EHf1Mii3|WZTi}3eXR}p+=JF2hGN$zZxA-$>qxu(lT0p&ad)w(#m(k{ zPLItlwHYjc>GxFhq?CNygVo`*w#P8do|R@dftFrW-e_lIM*RGX(Tn@;$?+rAl;oaq z9!eDr4@R6ldptgj$e3Z(nRbkK?)KHKo+6y;H#nxTO7$kImrSTl_sTKroswZ0YU^%V zNSAO+>Y+vHsfTP##Pxw$A8GGFBaUti2yQBUJ@o=t{||f~$AnbA77UI5d>< zVO0oTIxpm>xl(0!7u}E8A()86!e-y~EON$3lnucsDY?cMi+#otBxjKzx!?U^M7;-0 z*h);7uqK}-fg4kZwshOM(@}w?3M3ZC?^vSpoW2O4*#V2n?VxW}cI2!dRtIwwcUc-3 z$j-tLPXRuZcy4=W(4#fp>{uMVi5EM3=u3zt?!~kUWiq4Ph!GrZ4;pBE%S+k2X`~9! 
zM7O57ul_tZQaI5e55Fr)VMu>Yy(Wk}W>y_Pfkwo%( zvLZfQ>x6KB;^i>=8=(~qRmog?rIa>Xd3xi^s`sc+fE9FBUU}wr1(!TJ?p72L2E2b8 zgUznSU|u?6VN|W=5uwVvvNfw zhF)H24u@~L;?OijeJu#|9ltagM9gd=tNp=1W?D3|#>)SfSP7017>Gx1=d!!G$4T{C z-#K`1X&ZK*n`MO|)zAvrU-Q;=*b_>)n0!h*J30TEkLjL~&kMq^Vx|K}aP08DKGo?a z`PYCh(H<)Zw_@Zw4^hhM>MvLg`sFott`rW6=5OX2ZTj=P`k9EQuioGJ?7Eh|0MvpU zoOjEH6uPmp<1fk`qS2XTJEA@1s>QOY`K65d&)Xm1d^s1a-YO(M=sM?p!PxcDMSmsz zT63Uu2de(_P(hcG0BJxA@Wq59eH229joKR)vvEzQO~2ARw^gs6W8L9P+EP9#nn=Q4 zG~*%|jSp#i(6c}SCdfd&PZJWz+5AKHapx6$!l^RRmd|#ah-4iEjR%S6YcZis3`D`_B-MHfo^luuHQrMzSZPMvc2pW zloVj`DU|#MR@p;Olee-Q$AB7$k78*HB7aOQQsKMB07Kz-_-E>fH`V%&_pPUrh>duJ zdCzG z*Hg_|wDc-1TCN9>heWLVNrGhLCg9BzDZb4Q7QP85=jzNzLU9YJ7mt768a{dTL~#K( znLAAdzN^`-@@%4hsU%8Gql2@^~a@4AXa^p*zWcI4{d_@>l&18s<6eV7r2=|L( zwU~M;iPJaYlp%$`ntPTgEFdnU7j_QkS2tW!E=>k|ccKw7`hFByF&&)pV8&PMWdu%LhiX0L52rJF=Ek;w zkKX*bYueOm9WpjV4>W<^>hc4M=lK!Muv)&Z7V3 zTNy2PJmeb+Q|P9DG((dYbM>_%Mz?naQz97v5WNKQUR_QxA5)A&$vChnUDcoGs)*r`Di27x%gWOw)vQ(^rUx^c z&wOEP9usQ&EG+%cF{pjG^heP*(5PhlD$?4v)sLJE0=q9QHxuZ5zAcJUb?0Q|^rR`c ze<{FrkSiJuqVk;uL`68gO{X+U&CRBx5tCv%lC${TftNRH2P_b3o~M}I z;6f)0GJ1m$N&cypqJ6f*`z+kF&F`-wJsFxIPvMGBR2$1!2y-(z%(LoPSqiS-IDHf5 z`R--Q`gKgD$1eajcn!j^B8eFiNfUq9;TiKUjp{(E0rkxk9gb#{uS8-f`(>`x2$G3CrcuPZm zrT<~}_tSaESaF9A%j-%@6^L~ zQ6`}GkTPojlaNB<>hDtgF#kmVM`UKDtMfUWg=pmW#Fh0Q8Q=AJ*a(Lnr~@M0B~M>v z{5UD7<{!8E)|xv(S5ue|d~D<5xE&y@&K2DkMZ_o%;--E2HxBlri1>Z!zw7YIf2Kdu zP~Yu-N*+N7x%XdTT67o@T=%+uNmY%7T0i@?5lD7O%vie+{bzxPB33=GD+tZM@~0Da z@LKlj2b&6nQ|j3o)p=NKo~zb?U9(KFk42c$VLCa-$=^j2!rXszAZE^u&Wv%;AN^<@ z&Usfe(&XZYke_{h5^T&sZsuQ;6{%F3ebyNj^}F?xZRq~}7$4bOSAv-s8h~&>3hkCm5;p+|3;0_WB9cBipq6qqLyo^tNpw6kZctXbh!S-n?}q z792G5bCW#M=Qm7%*3vsm-V(<6@hGs6AOF+|g5cfGR1!gqE#2I^^Fc!EU*j4P?&ak> NYRXzl0;!g36=M*_57O3QSx zud(Za81&+oIeoLF99F+E^sE5X^5^VD%*7{sc%InyzvZ(WtKSR3yj+7}h@sWJ|h|=IG<1-D=Qnw)eaP^q=j2eOHsYF+c!Na$Ur0QFFM_)KVEFxt=4+6i< z4hq&J`5L6`X-l%0JS)vCTR086O&dj5Yw|pQJ4Gwdd^^b^MgaX}0*s8Zx^=`$z3zTJ z4cffOe2B48Y91Z=%^?k#(g?1cTAU}MyObpU3Aj~Daio&hl*$O9A^bSj%}5IUEk@;t zyRmK<$19buC7`hWyEGV~v=2I{SJ9ERI-7&LhH%16RM#e|Kb%CS#yuM6>IpbahtvN& zb${k$Fz*rlL5)dSigb^6d1NE}(-pj;?J$4!}F(eJ6(&nXS z;azD~z3d?0{PLsVtHGwNI=YP|lFt9lY3jD_=lrlngqM7Pc)qSX5SuYFjx`<$hXSuJrVJ+uyvaBk2y=}K=4tPbD?>^YC5_(5XUv{ zDW;8GKWTRBjOK;Ki0T_Qq8u_wa&Ope*G-vx0R^GE925borK$FJ`6!!0z=cMzz?XM4 zZ5pZn8!EoW`aBjwO!h>R#<2Wb;FxSLhex}T(F8ZyD;ry~(VIf5rJ4W51p`y`pJWMdTA~C_AAyHB#85qn(Xy8;)6L;*VLaVEtRtx{47Wsr zyCZEg{^}#q7NMPb+yt1?Jom+a`9?1?|xxf!&6*YGg>?JK<2&TRKs4$vkiMk_2(ZQU0E-Nm+F7N zeB6cadA9!4@ykPaQOdxy*uOjH^1BanrBn|xY~MnxRrhiCFhC5D8ndJVvqcA!PE|Wz zTY)(80xu7D1!n5GKIlyCvOePM+kHPgmzQJDj;?Hst?%QQjaKpPN}=&=;DvK^(_VM( zRJomdGa}wg^-l+VrK~lTdf2!Bm0;O9<_uW(>|y999dy_60j4*IkLuJ0t;@m+A?Y;y zl(u%ZO^aDfL1)_+{SHs^XjMx{AhBz&j?~p1u5|^M@=Z7((}z^;!uUD-hvOjlVoJ&J z-k?WR;8DU;7FxfhFLhCjWAPn63ciL%$j~%&@XlMeLjewJ_AksWFP$QGxAWsIrHWga&&9wlVW7rrzf43X*u0lU!>gqZ#Kt)y~Z%wK2DU zB@NbUzYMr1mwV5foz_GDP$Op}x1qb~va?q#QzeeS)ek~j&_O;TU8UV-7n1$IagQ$e zMhrk*O9TARrwrpd14=uKA?cZXl%CEwWV;-Kfw^vi`dV(PO)M>2nG8l(^d4e`+MB_T zCU3W>C1~5elegF&z!Jt=WHI7d#$H0^qA<}uTakw51LG~dwI1kig}oIA=DyJ+1PPheCTP2u)+E>XoVSj=^St>s=wFs#(q`gg0kw!@P-S=i=`;+)sB% z2FA*z3Ca~$|xnkf8B&<(bZ~f+ADUNGeT$Olv{YYxN$`-V`5=XdoM`T<{%7E=bn- zp@A$~qxJ4WLwO;g%u#Pr^j*-&1B1qbncVk3jr6n7Ht3%|@;8Uxz2}SJGfq%UvC~e^ z6r}9oAd{7ACT|y@aHus{dbS1DEzuNB39%P>kJW16znc{2RXy=1n>Y}8{*}Sjyjys_ zzR?V=c2|fpF@`^;>d`IAPr#34X$*q+tS0La=U=t3$%s(e66131j0}qc%hxKsUJ?2T zIu+DtD(`g}gD864nsoMZ;%D_>lIyN>g0+0yGm@|u>foGz_-Jpmv~oP;@7k-^*ojFf zZbFGLwc1r)!y8a)HR=Ae<0`PjXof2+XCFlsV9Cp!`}DK3(O9-fWKk0Nqsj@iRjRwR zwGXX!>=>aqF*0QjT+oSq;JT8uH%&8RNjhUGBEi?I?&?1`S|qZJc_qrV-zB4^rs0Q{*1HWe5WQ#f!U3vJ 
z2>X3@+>q-J2=9MlkN!FMpO#YwT`TMDjZ!;Kbx)JK1J7=6{91h#%~RuZohS0V`!5Mo z8oc^2j{)=_0SM&F9r{9<(7U_uJ{8P{dOn;9E=5Zf@3x;#4160A!X)mPo6LjHYHPx) z_fCV7Ud)fxl@$LnU|Sy_i|;g1i+%6_x!t7P!1)Wkd-7Z#g+aCqE^ggr%D?5`6P+|R z^wSBQg)WaU_g~D*v|LZSUsB{oGxd7U*2-$8vD@EIS$L(Uz_BP7q^r|&Z6v8gLb|&a zMl(4(Kn6MAM2V|>JQF;4uV5u_PeV217a7WeXU~!bN|$?@$T<3spYqqxE|B!r$^lMN zJQKLxW-4{S)?VJfX8tVe7wZmfAWdx&9GEs@n6!Ik;x4C=iN^kWqO#mxz9{u`8Bza| zHu^?{Cn7DNgpnvQPOe@+(`pa8KWak_lUF8%I>gM$>(f(v!3r7`wdKW*lon9Mla<=%E|EjPVGWW*PewW z9P7{mk)24Ds0my=U6jAmCNL+DEfd#|i9RE0NPxQ8YV%Sr(URRD{_&N<_6IyJN~;`I zL^qyTNQ|h5^{~NoFrNQL9%!@$Cb1WtfF<9~oT~=XqKFUm$M|kzpB5(lS`>rP3F0~b zRIB3}i#0nT5$+3}xbQOl<0w29iM){Ebw9?&^9n-aB4~sxKeVSqn6xcgt>fiq(jnji znX;(E=T)}S#jkAg&A0hAZ}3y%?$C8MN;G8=>yX4vn;A61R6VlXL9Z9x9c?Zd8Z(Z> z^arZkSTL-2(Y(B={3GA^+GUPBe`HeVIu>5YS8es?T0Bf8EVH^}D`F9{=Z10-HP~$< zvw$x#5Ht6v->i}A20Ms6cjNr1+nhe5WA#BsBHy?HUJ*0B*PM?^2`nStxdpRQ+WRM6 zb5bcb_yQNvfPf0ZDB{3aMPrAJh(I?s4u>I-T(24p zNVWEXy99(L*E#Z)4o5<{hx+y^x4rB~)MqDk!tsJ=Z_;&bC5n-eM@Jhea5RD>${%wh z-z=&gJrK0+p@i$|6ANxq7N#7hqex6(mdzd1^9wG+08^_5;aWjldfRNbBHhrfR6%bF zn%28jwV4-Glb@k~d~8G>Bh(t78rq8FB;i>pcR>&eH!kgqOr(2v!v1KUfOGiSJ>E=5 zc^Rm39%P{=b;Wn{@$_ZByJGUl{1Qrv(7Uw~-LWb)8_rZ(e{4Fx436bVY--Wq%o=F#i)IZwFaLBEOY`lwPi%e?w5BF<5;on0qr;|K)7LH5%Mp?Fq51+)4?KAx#Nn^7Isp%iA@@pbpK5 zrgE5bE?>JRy!qGlD(#*0qwnTL;^_B|Gtf9^YQnP?&lJ;>o3W5^L%SzJ zd*@sngUz_uIbm8;UKJBoy(00wq`s;xi+`CJKl2gN=L6JL+X!i}ZI5{G7u}PmMr$=8 zodr?kpA6Kty59QB_bv{}xgN|4Zf$Dy)Qeqxxpu==^99SzmGKOi?yfn3SpR&L#InGj zZE@HA*{?Jxb04UAYo*)FlBD_a^RbYTd;Iq zbi|{@=GAPNU3Gj`&%6Z0Y4OOizXI^&0(O;?^Ox7g`zVy~$Xe~v87PH%lgBX7laP$y zo!FS}YZl;!`{=!S4pb|Mdk>va6Cd>stzp-m8AOB)51UfLTqo}M+d#SUD6+$S<7+~V zXZfB)cc9_!k2rIq_d48#Gl5cbl#FekrbU}f^fLV&W9-`Y;!>>d-AVBZ(no*j$C>z; zST6NX8-P?fO2z#=LW$>XrZV@J?gPb7R8O_4%4x!_9Djg%^qYj1io4VDm5m5`1k>*Q z{uI37nL0RReXn@n-vkXCt6pFgJwlwb4}e3T5E2Axi>68qKpU) zrV*CTmI>8{gEzHW4PZ+GIjY#?{9{b}NW7FK`d z?*#a1h$`(!{fKAPsU^)JZVK;L_juA9?J6?^uj{qgFz~9zE6;n89#(LrSLnetxkq@S ztI_;0X+ukrXr`_7%s@4>5N#t6I~E0vvjxJI2No06rKm1-8xP5*SSnK~#&0s^sj`d| zj3h$HO2v;5p{pmdYHPCTn)a!hf z^1r|8hq&>?R*T7n0aeH5^$aY(h$)HAopM#4+HgTel>bA&bQ1KN^jfN1MFuX<&dhNv z;d~dc08Um}C&#qfB~Q-!711Bxp(r*0QgZrY6b?g!ol;!CruW}v(BBBA1DBp=e($}!B|ynt zG7Y$d4Q+7UeJ~&L~E?#*XpRF5wT~?@w=Z+)FTApZ+y-V8BcH zg+Hg1%{eel(dQOLfR->=vRwf|Xu0mL7&~Imll6Lp<|bueE{35}`plSZOuHzZHkA}J z=iAImxKZ=QQZ8&$HbYAIjq3BI^y8ABJeNPlcRb(~iceQXy0_N-F?*aHb9A>AJq_iJ zOiIKVyTUjbYr#P#86p_&3!b? 
zx8ohe;!XyasEl6@-a-%LV13sh{xz6qt)2}A_CgT9Y}01T(v;nbzf5*GcHR|&ac0Sf zx}%X(HaD$>vk;9`kX*E!{k>f!&3g9m$g&N1oaFQ^;@9#oX|^7a(Ml-n;J3uP3&eAl zG8`f?lNs3fAlC^mE&ePkS;huObXE~r5#D3+i#88px!-Av7NdwG-MKFdwnXv1KP2(CoH!zsGFbN#hZDrWKI|Ok&17;6-~;X!FtT6 zTwJo=NzBOf+#zjbQnt_6>w{Oo5~D^^MD{J|w=T1jJhQ!Cb{K>9kj%IJMTTxOV6~O44B{;K?*2q|gg#hoqetREk7`t&OAwY>)>aI%-w;v<))T{84<1&nqpig`2{d(f|nTF>^cPm2>$Hl=tIN)j~yO&$^dRPk0vSs8klf-|S5!&5te9^|1M>yLHm3wh3@t>&lm)~7{{9ogKH~nofQ_)BbknR4G z5pZT#_gsq3b!pxe{Zrl!__R=E)33q?BA9{#ETk6v zp!pi?D8ie)*fX zrsrSHhkKtbUeJu7@2THmO2dTO1~|KR3p>;Y|0goJh9 zyXG8qY_N&&0@h^9S#E&WB9GIpRf zYMu4*zz1oVJqqAOP{OhJ9jF_qE2nGuv4Cm*N_%agtbg2fQ$jAgpC_hhpaT2GPhJWq zzjLY!FvgCA7=X_UBZcz`j{H&}7LqyF+<+Io(6*t`1{BGLr9jPidJ@eOYopo`=)>L$ zpw`NH`9yb;gJoJZr^Xd@|NO?9!r+14>7ye6tQuTdgEm&K&I8&2B7vz7*LctL!5&EE*gL^hr#FO(J!ZF7sG0z^6;}4^tjYbu=WP`;(^u!p3tZ8HULED);11Y z=Uu%I0VI|Q319|?ud>}2p1xh3p4a+S%0sgPXkyu_lZw8e5{(sF6xV<`2waehWSYjB ztKmN$q($WRSzcWC$ZAHp5F32AaM6A%`kRF^2H`W^frsW(<^VmBl8IY-(Xg!!^pleE zt8_!GK0M3+YT2UVp^vs;ClmeT4P>dCs)E2wDXn{4WQ>!U$bH#e|rbJM+S*!;?y(2 z+mVL+yGvmk{i0H~I=AHUMd?^NdTAwxMm@NC}XgyLQTzU z54fh<#Kmh>&tC>KN-zzo>a=E3S#{YUSz_(Lsu}HDD2?mXPRA*d&HiR}!pH8;>!foL z=O*KTq$oZzfqf{v69QCNwKO>&D)_= zisDhhAaoW|7@61^SdxPRqNuQtI5BS3qM3=!PV?=A4EhZ|H4tGe)h8n=w^58rIzI)_ zl8uPy`Za7pI1AhJe5B6Ks(}JXu78^d_90m&j?+M2rHy3W(*=d4_RfQZI5!G8)|om{ zLjg!ynZ5eVJ*LV+<}4VUHE8xk)_Lp=Gd53A1o=};gR}6@G_{u?9Rq~$vQGr_ zI9s*VdWw_Ou6M^|Z5d%>@v0O9Ga~%5IZ4xj@V7wvlqpU^CP_$PuLnB5vg~my5oqDT zC1sB;XgkGXcuuY4CZdUPD^Oes4)&&Hr1%EEN zIDvE}rJ~{h zbP-OorDr5`~&nJ`vOM zt@(Y#J@D vu^}(E{~G$&xOZ`VvHpCiYQ>BAiZ*I{bI1GJeOKU*2R96L&2*ZzT;l%^P6qHT literal 0 HcmV?d00001 diff --git a/source/tutorial/usecase/img/rta-preagg1.png b/source/tutorial/usecase/img/rta-preagg1.png new file mode 100644 index 0000000000000000000000000000000000000000..8a5b2d27532124aab899746179663874d9636a3b GIT binary patch literal 3464 zcmXX}2RPLKA3x5P>{Z#>t4Juh_?az)qB1Ufoj5bxStTPxlyyd;vuF3Sj*t~+9cP?T zW^qo+IPUm={r~^x`+Udy{XE~#^PaEw>m6@?&zOVl92*D(;xIKaxDNu+Wdq-mEKI=D zbzN*2c+tV{8|#Cra2I|80@Ksm#s;91)2pDZG#${eJ~Od}gFx)Or#BraJ68x$G6$I6 zHDo4&dAZcNRi=5?fa6(A4fL!+W=MHq=2phoUdWS@*ho(LbKwONQPBn;&bJ9McBxfK z3NS1wvfC!P^SOLAxj83fk;L>M{E1bj1&=|umJ!(gjKvu?Jp(;(0=smWDf%|oxM8jzCTC^jvSBUmZEV zsFIQKR4M&3ZYYlxMZUE~cX4YvU?B93;N8)S)Qj8MzVi}ywVD@DmxFKQM(Xq3RV z!*P`i{X+LvK(N;JLv*JirVne5)564eS8&Si$DR3Z!RRA-^8S+sAR{aOhH7WY@$!-8 zXE9#ag+1gChS(+Rw|P8dPjFQ+z5m+6V>E7q1AkU!IzU~v!|g9^nC|?pzbFQ|z4yXD zoF>TA_*@=>(k&Fzhc{B$riJ6Oc2Hu;T}JCw%*u~n??&nsBl4U_GD@o+e7&=Oxx3ff zdFJRiO;}|t4G7gz4L497K#bKjWlebdbG3rH#sAnLD6%-zK#x~8r)7kjlTtbq>hX2v zBpxhI%4zGYXA1z*(lI>yx|=kbvLf}%(>gYeY3GPFeYPz>v!c6mEFraqCCTMiR0U8{ z6k2RQAOEuD!*Y7@KqL1#kd`pfG&Gm6=h~3pEzp{4FhLEUk zz%+9PI$S8WK2d5`zA^J0fk9h@Jd8@_bD`(hZLigasYcj0gcs-A)L%)pZ#&u^O=!#= zJj4z~tbD%!nJA1U=m86Ou3Qy%>8;) zAjZl}>B^jLzysPhjk{)5K2h)8tr`ol_V_ETZ!(UR(MRb*+B3ogNQh84f_znjXK^XM z_dUU6KBptijL>-Q79M^`|2Yzkn!$`rkts^-qw{|5IqM&80QkWp!fn)Hj11j`9`t31 zEgkkaNO7n1iL!LL>rpz78z6&=lqG47LKSZ;gd+}9@a!5m8cwNZZ z7zaegUB59cy(3&TruLWHkXF-mSJzd!V1+YKn5U9C2oE}8fYHU@JW;7Tkx4xg+iR!y zZXQ8=?%9H9oXA>BT^p8G9vZ4R)B^QFyRuG3cY2Ropi)NM%?do#Ke5YsTK5VT`0Uo^ zdy#amtE;i9c74I}w*bVHQ`Q~+F+CoDtQJ{CRJ|y9=$mLC_J-OnDlxTF z6|T3NlJVCfB09_RH7si54ZL5=0-Y)&BNNLw{ExL|Wwq_j_^ZhDtfr=*S3P|1!F+*H zLapW8wV1}e*Dh+7j+v{?5wfH(_szbOu4AjAT>`3CKV^Db_~AEI-lo7cCqu}il0#e% zU>x`3CDuox_W-aw3i2UL4EeOAg?_1!{azRq(sZVaETRd(a!H3e5&1>Ih8Au zV0@TSS+#=~oLo{;Xv-;dv9*S7e2jQ!c_1a;RxCNaUU8sHR$5Ap6rW7aeUV)lZ+ZBh z@v)SY6r*@f+okvE(is=}Tl(I#CeGXtqebDolQ%dw8JrP+qECSaP&?<+ASB&4xETy6 z2wdkuK7zMy*Ws|Pj1L{|F8_T}ScY~5P;?#obfDQxfOm1pG%`h2QDB%(Tsc&gP8YSj 
z;;f6uZ8D0{HDojU`v*yrmKUb$;E;`?C6ggt;{YmwTCOLuYtwqLd_Jkg&b`4Y@AOtk&#f^YsNj2a+X%}C5`|*f)%Mw$`Xk~>yw?8xdZ|#&2j&G|=6^E(hxfX(3 zYwz{SOx)Q9;m*(9&ZPQ0V!VVLvY-%bA!mu1MS&=eVxodbYdl>!#Km{1+W=(gYSlcv2H&* zISVm0gLaq-I_&oICSACOw+Adg4jY`c*{zpRww?CGmNXpbNUyQKU)Qz{aRSTu^$Mts zbAG?>CIMtPUv8P;c06I8qNb;p;QufuOtwCmesTNoS;S!WjS*N2X`+R*%NYBSHc=I` z=Ky|&jEpF)ZJr~yZxqlpK{_Ra-tewLVs7n;J)`z!XkzRVXEBCi@O6vHaUiVz6(lt| zan@jPa7bQJ85o4_>N+`@b!N(svi+&Q@vQXWI{nw4f3iL$D5qm!K+G+0{U|6*Q4<)A zcKoI`+u_l!X1%@$t_ealw;=7|X~DIaT=)w8^vSa- z>$d*NqIk=!bKXx5H6HHn0Xy}!PW|kX#vEpX$@NB&#Vju@<)}(C5Qng9(cFNc8z9G% zdPnVa%^Zn7*b`FogN1(_SnX?*U83r3)b>*tURF7G6nt4ePK|mxehrB4!Fa17=-+$W zjs-(pMN*xYu4Y2lwy#MGvQ6{U@;tuIOx2~!&$tx85w!e$qTU!^MT-+4_0Q7ygFAMuwanIfRMGsUb?DDm}L?xha4znB8 zJE%7tyt~3&H)ujl5NGp8;r1KFssSU%tm5WbHF;?4@!wxGGDr34v=mwOV7M1Nf{>~4 zMve&%=I%ljPb~ehk>kI#lU>D)^EVb4hJ%6W?mF(4zaCyyd#C5Xlw9-U&rq{J3HmhQ zN4ZsJG?;}QwIE{`#{5hdqYqxZG0q8;7=EJcQfB`nYjJsrk6nB#>8+_|J89kqZWF`V z|3-K`QBYhd9pf1s3{wj)PR9mUrA1P#Gx(hr;K;;+mG7;1q&UC^HYsXGIj**{lF2b0 z@xL|*%h+_gfIU?{)Tm)^he82RVCs?N4NvQI+Pt5XJH?af%sKu37y3UihI3vXu&eY% zkq#(py6f64_pSWaYR7=y+UaNzomU6Anx3@_V=#U73m)06-P?u-H@uO6O4Pv(W(9MDa`uQqcmfd*T5fTeMY1G>nj#}yjSu|T_C1^ zxC1W#u*RKIVJ>|sfF%E)7jv-`asz_iz7qox+`x=CO~}zKlW(V{FDph>=7A#|(k6{; zNul=uk|@33TCRWuTAlURQ0fx5&fPk?rbMD0#@1E=pc&~|2$ljO%fDUfi2jtD;vq8s z-w}*0R4JKi?~=^L65OPIM{C8P@?G$HcP!d#ZY3l!Ua%nBw9AjGY0>ksDTg$JnIdz5 cJ%e}xPD=20UV3S23am*WQ^R`z$99bV5A1`waR2}S literal 0 HcmV?d00001 diff --git a/source/tutorial/usecase/img/rta-preagg2.png b/source/tutorial/usecase/img/rta-preagg2.png new file mode 100644 index 0000000000000000000000000000000000000000..a8af8112f0ea4bbcb44ddd786781ed140224ea9f GIT binary patch literal 5874 zcmZ`-WmuHK+Fp7Q5mXSQ8|f11&IJ~f6lsMeWkEWmK@<=H>6WeqSwd<_L8L>v6_9SE zIm7q;J?GkMV`hK6^UU1O9rOOFrV8;bs#_2Ugczo(s11Q&rGVe!P<-(D{&l7p_`-70 zR*{FG`f1j{0RN@BiX!Ct=9Se{5C>)mom357AP}PaH*YLRN*X1YByfenl?Y~V?vp;I z2tQ{ehd>^@fho%AdQNSn`#k@wN7p5T>}~&q%M|hlkC}DyyX*HVc#k5hK%>QakOe0v z%#PD;&Phq5@lPnfa5s_dvx+J0g#49I7FGqhE*9q)B~(tFsnZOJ2NZ{xJlu%3^tk$*kdB?s~_(z&*J3Vq9pb$D(rSGsM(ST zceo2p!SWI_geasI4T@qNZq6DsLjT1K$G#lqEhdxoNFRh$T#qWnNMOqewUh#DRD?#!GLS+;2Eiy$_WuXC@-{aEh7e^ZY9i|s zG_UI-Z99%{$^doJ<_aH){L0KG_LhV?mjbBF4sD*9#O5EPO z_6u!|>0?^TH3=-0f{y|2o!bQxH4D-4tYAsw8groVq$ZDLug#@@Ao?VpMFxbQtzDIRib3Ihw$z^wfEy9du$lGkiSt8CwO zks^INo3pRozeqz1^tdD%-*#>6bT__9F65!!&3r;opqYNkgb;yH!SMv>m6nc;kqM|t zU9PgL427>~4}MYWSN8AVToNA~jR{)(r)a{Wk4GYIXI-Dm#HT2~bzc-05|CfP7vyuI zQ4+ajJB`61R3TY9SEJXCy?3-!A?(YV=#MdXn%`3}yW8{@eRXZrZg5X<2mYm|boE`x zZ)4rOQ8g~G@~ng;*Yyi)2+H`%kW$fnxbVQGyQjO@&E_$A7*8ZgTe6t*<1&XWXc{_c zyU81Ep~T;c#yK2M`*5A`Iade&pGk8*Vm`wLWu}AkDc1Evy3rWgYiqwjf63JU6`LvZ z+uQwlEUenkvpTDdNFY-+JF??+O^1sBt4#W#3!_Csvp&-Eg{Lep9xDfMimj2jad*QI|5%|KJD?Y z>IGdWo*mf8rN5&wF|}1muAY(`8Ww>m4-yg-H)n$DhUKAfm*39kW4GyrQ2iBUi(JGn zSfA8{!I46t*z<%s$rPd=b@gt!m(gB1)$@7X9usLsx<#!G3b&TMn4(NG;EacSPMjy} z6f#xLHgWp*kb!oBNOA5}+ICB6il{#pM!BJ4wdqjOgW$pFh0lFaXlc>uIY!7DM){=1 zy&BL?W8lve{k~*+Oo6se>!W5^Sm-$yvyFvqchL!QgY~id@tvc4$)LLsXVKa9#UJ|U zyMA#nElx|9*m}KY7(|L1Nq;O^=A@8+wN&Ni8(Sn7;b&m8|E6-?LDR#1>y|8{bJ?3P zs{Q`fHq~6d2|?5O!q~57VX89!@t^g1{29*s()ssYHqGlYl;5-1+^?0~%@eW~Ner;zJBA5;FKQ~|{kgbuDz!&+Vspo>47Ul5o97=2 z5B8+i>tp6?*;5)fOU}k{WlN@UyEU>~^5bcjtwc5q=`Qx-kNH^G?~?I9;Mz#oyxMTiw!V{b3n=kY}pClAN)cnH;f6FBN*oZ79%m1e5y zzOsO{WZ=VoUQ`FIOiv7YjYcLBM*lAn3xB*ybzNNx7ArqpwZYKk>k`Y{ZIBz1`6iFC z{EVq*-z>m`iqzT%7W>3V8gcZaW&)=pvze3CxHqO#D?OCEM{6wHDo)m5v3O&en~zV( z#$v{=&;Q!UWP@&*-qrp){hU_TJioZ8o8`I8ozb%^P|4c4Ba&Q(V%!n56}2!)$Y2sF zYIa6dOFgeDcVgk9PDlo_aAK*_WTZa6KIq)E@S+Ba~ zctm@W_F?P7Vou0Luc-5|{tNHX;&_Rw*X*#^Lsy(7*RbmbC$}Un9?ze3LY3UpSkb~P 
znzPU@Bj=&~oI$A`g@HlM^tBza&rv0bZy0zJ8q>J_Bp(h8r}TS~l;8ckFR2$VzlkF0 z?(a1R3pnNx=M{HA`y*=>y4B&`DrfUJ|=* zqs@QridsA8tOVY`jV4|E9!4tcyHLR(5XUEFn~%7kj?B9}%HF75FE^?b49asUSlTN^ zA4j-3v+~;rIA^}q-FaNlZ>p$|z{ip{y%7X)nZ#zyL?&L(I8iVqdtv~EPV7BmyD325 zfN|v$mNTg-*;m2Tnb6OkpC{va5tS@3uS)Lm+GsY~g@S3e2l@tj*=m~DuVu6mI^Cn> zJFnR#xswJ@g$hS?&GVlwEH3v=GnO2o&_!s9m~^kuU;ddT6osrMOwg1iqw*>sWn%+Z z(NmS#`X{>re&0x_g3hxTLzEn>+dr5uNl)i#xx5<9P~OVK`V`=gqsd{Iw{+h=2mtrl zBQlRoGqA=szzK32H#7qn56IR>P1+3y{KUlgm|`{MqNkn7&$Fi7IdLBO8_*8Ag5^bRT^o4eB zr!Zr^`~h=I^T+d#INKy8jU}VHPE89`(bKh<-=s@GxH*N%qvi^{MF96b<&2L?SmBXy z;Kh*-1qMsr#*sZYKcGb;``fQ~^f-~P%`0_SVN={am6NZXJ8{Y(?)28YeSK<|>UjL43U)Q7}b`DdjshOK{F_Rf?T zpM=ecWC|Z(!aEw}sA6xrt(i1W!vyAUwRK`a%E~wXt}x;xB}d>id&NeB7Yj+~BK*3e zPzbm^QfT&#B@85``oF=t;=Iaz0Ng)ez>7Wz=_MvhcA~F)n|GJ+2SEm@Av5V?E}$v0 z2$$bNEe1)~fickApCFT?nYu-~`$>en@7|zuzm26>Vdwm2HXm0|{l`w%vfc?N#T&%D zzeq4_*+hI=)#8wCP&{0R()o-1tSC zH~ghviX6974)9f6_5APSGWb3-88qeo%hBZj<#~4(yv!r)^zAsqBPQX+oDW^{@G{z! zlP%eiZC@1>Uz-l}50Ipex-2hS?)}OD`i6Rms%j}SKuW%gYSXZ|*r)`p<9uc*Ilg~T zQRxVq56tkG;ZkI<;oNcf(J102wNTcuef>H1^A8Gv2I#Fj?VV$q{+PeYjG5arZAyB# z0E|}h0?YX;5CsE9J)09I^s%v@Rf&4K{z^FU)0Ics42oR;2>$uGWjqWGoR|LClRwLr zcNcyu0s4XeJmG-7=T4Kxa`QvMq{c)4uyi(F&=CCF`u=k&tPNXOeHDMbHbNNj!I5l9 zq?BGxr*nTGDH@tClfKC3#=iAM?K2{XP2F?i-to1Na$>#(zv`v@{*i9W%Q^an6 zwBKp@=Jfq2bOhJ0U%$AM1Y&k8SkLSMiR{uZ!#dVUfhM*CjkSnCe02)DF7qK2E_={7 z_ctAQf{F!EN;C9KY}RcbH`kF3myn*V3)R)D zY^BK@`pvc|bb^POI_H?f{7yBESQ*4L9r|L*O)Vi8Y+5-$jV=~1m5~>PJc|y~C%otf zVM`)j^{}Ri_>^0q-y`WnCR@@U+e|!yXb-ScA05Oe#tV4QSf+`);kW#(XU%9=iOWb( z^D<0+G>HqpIf{0(bxqS_y*8$r`J*Wu=dWp}j9AyUF7}~yN9%pPp*_Sko*q6Z5y)in zLr9wF88K?so6;3|Q~JQ%Td{9QD#SUvBiZ(y_ZPwnE^f_OhYWqcXavMt{-Li?CZAMUI>h`?@hhB$wG-;$%r&I}BQ^R-3aRcO0+Jy=#gPAS!S0OpVvW*E{jy#vOUdjd zW+3MG*B(TNSEwomDIPt_^ZrO|4E7Z`|E4u)t)7T=kKTy1XpxUtN0e<;Rhi3gogNpaMUhRLB4MZ@Jc| z_~h0Vw;jkJkUYj6#})bg=*WL`0rM4pkK)yQ*)+cQ&t65CvQ*WM`~dGkhtu{ zX^D25R-JZB?^!zusing*~a`jak5l`yj0wOSlpYeX08rP)0 z&Y4jR=HEZyb)E4ooXdvf+efde>aotR6#w`FjBzdz zw+^F(P%wx8L7TB{Ha)(>Ex>vAPo{QX_&Mk#g?MLD5?36~_6?Wi9O^gLrgxQ=TA6rC zJ!AxmePOkuR*%sH0?}%*GgR{ycws6~%DW$x4yT=^c(IP7pOfWVtGxUR?4wN~!2)s= zDYZMe3CIaMr$`zJtN=0fHW1EL_{7x_T19TxN7&VV#V|lKA@6uvjco&gy9MSP;NMB` z%E(*V?E3vNiK30^QF=>q2ku9@lY4___~W6xE&X$W7#ykZtJ58ZqV1XM>zsg7vWIMb z^mGr?cS-^p2+=3=;QIQEW~n0iYl z6_vI1II6m1Zt^?1VFFK4b4iI*OJ+m;FU`$s9qu%EpQvM#w~#|sW@~oI50bN+BUiX=RBC*_BF_vp)C5D(PXNsAB9A7(`yT(G2r%Xn{sY!50>MHK}mVvJ?P_Ddg&gMo-DWVWD7b^ z7YzKdyuRIi9@ucI`YLWpnJ1E3%I6;|faWq0V%Ny44)?L8#z|r<*Ot-7Kr$Ckc=(h;F=sMYqJL+vMk{)t`MJI3%sTwp2|Ps^(P>p$ z`r7mOl!99Fr|w>s%sLf)QH$yzyj0ir;L}wFjfdW&DGS@|sCgSXa+T)COK#o^3OqdsRR{-hjYVb>k z$GM+L!_lDDVB@@1&EF$3DDII(6nD=&R$QQU08LeU`l}trev?Y>%<}hW_FW<;(g`eG zVgCNhN^Y58_m5gyS1czMj>}`V)qqY%TT5)YHe(X{uY3Q+ z{<(dDeDi}}Hk9~MG1ucs2=ST;)-9i72@Y55U;#vH^!tfV16hwrjI(-E!`0{q|16CS z(R}L>a$!8;)8*zf4N(o?*5DzPe4H>;!PtoXo1OCc(Cv(oXMygsnd};eSbYJSEqGN5 zZ15Of%F{pShKYUx8>~Or&B=wA9ruR%5}{S??eJtKhE@M2epx(d`H(8OkcPmNG!;?u H=0X1f9AQ|r literal 0 HcmV?d00001 diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt index 0624b67fa59..0526418e463 100644 --- a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -25,11 +25,10 @@ hourly, daily, weekly, monthly, and yearly. We will use a hierarchical approach to running our map-reduce jobs. The input and output of each job is illustrated below: -.. 
figure:: https://docs.google.com/a/arborian.com/drawings/image?id=syuQgkoNVdeOo7UC4WepaPQ&rev=1&h=208&w=268&ac=1 +.. figure:: img/rta-hierarchy1.png :align: center :alt: Hierarchy - Hierarchy Note that the events rolling into the hourly collection is qualitatively different than the hourly statistics rolling into the daily collection. diff --git a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt index 13e0c0b31cb..b1ff7006043 100644 --- a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt +++ b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt @@ -123,10 +123,13 @@ sequence of (key, value) pairs, *not* as a hash table. What this means for us is that writing to stats.mn.0 is *much* faster than writing to stats.mn.1439. -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sg_d2tpKfXUsecEyvpgRg8w&rev=1&h=82&w=410&ac=1 +.. figure:: img/rta-preagg1.png :align: center :alt: + In order to update the value in cell #1349, MongoDB must skip over all 1349 + entries before it. + In order to speed this up, you can introduce some intra-document hierarchy. In particular, you can split the 'mn' field up into 24 hourly fields: @@ -165,10 +168,14 @@ This allows MongoDB to "skip forward" when updating the minute statistics later in the day, making your performance more uniform and generally faster. -.. figure:: https://docs.google.com/a/arborian.com/drawings/image?id=sGv9KIXyF_XZvpnNPVyojcg&rev=21&h=148&w=410&ac=1 +.. figure:: img/rta-preagg2.png :align: center :alt: + To update the value in cell #1349, MongoDB first skips the first 23 hours and + then skips 59 minutes for only 82 skips as opposed to 1439 skips in the + previous schema. + Design #2: Create separate documents for different granularities ---------------------------------------------------------------- From ba9b16070b079014f3549b290b4c03e2b662b669 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 12:09:56 -0700 Subject: [PATCH 08/20] Fix sphinx warning Signed-off-by: Rick Copeland --- source/tutorial/usecase/ecommerce-product-catalog.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/tutorial/usecase/ecommerce-product-catalog.txt b/source/tutorial/usecase/ecommerce-product-catalog.txt index c04de86184a..0a5f4a112f1 100644 --- a/source/tutorial/usecase/ecommerce-product-catalog.txt +++ b/source/tutorial/usecase/ecommerce-product-catalog.txt @@ -466,8 +466,8 @@ benefits from sharding due to a) the larger amount of memory available to store our indexes and b) the fact that searches will be parallelized across shards, reducing search latency. 
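As a brief illustration of the sharding setup that paragraph alludes to, the snippet below shows what enabling sharding on the catalog might look like. It is not part of this patch; the ``catalog.products`` namespace and the choice of ``sku`` as the shard key are assumptions made only for the example, following the ``db.command('shardcollection', ...)`` style used elsewhere in these documents.

.. code-block:: python

    # Hypothetical sketch only: shard the product catalog so that index memory
    # and search work are spread across shards. The namespace and the shard
    # key are assumptions, not values taken from this patch.
    db.command('shardcollection', 'catalog.products', key={'sku': 1})

Partitioning the collection this way is what gives each shard a smaller working set of index data, which is the memory benefit described above.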
-Scaling queries with read\_preference -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Scaling Queries With ``read_preference`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Although sharding is the best way to scale reads and writes, it's not always possible to partition our data so that the queries can be routed From 095d264a8327d10bc717c747297f42b838728e8e Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 12:10:14 -0700 Subject: [PATCH 09/20] Fix style problems in rta-preagg Signed-off-by: Rick Copeland --- ...l-time-analytics-preaggregated-reports.txt | 154 +++++++++--------- 1 file changed, 78 insertions(+), 76 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt index b1ff7006043..6f13b9a088b 100644 --- a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt +++ b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt @@ -8,13 +8,13 @@ Problem You have one or more servers generating events for which you want real-time statistical information in a MongoDB collection. -Solution overview +Solution Overview ================= This solution assumes the following: - There is no need to retain transactional event data in MongoDB, or - that retention is handled outside the scope of this use case + that retention is handled outside the scope of this use case. - You need statistical data to be up-to-the minute (or up-to-the-second, if possible.) - The queries to retrieve time series of statistical data need to be as @@ -24,15 +24,12 @@ The general approach is to use upserts and increments to generate the statistics and simple range-based queries and filters to draw the time series charts of the aggregated data. -To help anchor this solution, it will examine a simple scenario where you +This use case assumes a simple scenario where you want to count the number of hits to a collection of web site at various levels of time-granularity (by minute, hour, day, week, and month) as -well as by path. It is assumed that either you have some code that can -run as part of your web app when it is rendering the page, or you have -some set of logfile post-processors that can run in order to integrate -the statistics. +well as by path. -Schema design +Schema Design ============= There are two important considerations when designing the schema for a @@ -42,26 +39,26 @@ performance-killing circumstances: - documents changing in size significantly, causing reallocations on disk -- queries that require large numbers of disk seeks to be satisfied +- queries that require large numbers of disk seeks in order to be satisfied - document structures that make accessing a particular field slow One approach you *could* use to make updates easier would be to keep your hit counts in individual documents, one document per -minute/hour/day/etc. This approach, however, requires us to query -several documents for nontrivial time range queries, slowing down our -queries significantly. In order to keep your queries fast, you will -instead use somewhat more complex documents, keeping several aggregate +minute/hour/day/etc. This approach, however, requires you to query +several documents for nontrivial time range queries, slowing down your +queries significantly. In order to keep your queries fast, it's actually better +to use somewhat more complex documents, keeping several aggregate values in each document. 
-In order to illustrate some of the other issues you might encounter, here are -several schema designs that you might try as well as discussion of -the problems with them. +In order to illustrate some of the other issues you might encounter, the +following are several schema designs that you might try as well as discussion of +the problems each of with them. -Design 0: one document per page/day +Design 0: One Document Per Page/Day ----------------------------------- -The initial approach will be to simply put all the statistics in which -you're interested into a single document per page: +Initially, you might try putting all the statistics you need into a single +document per page: .. code-block:: javascript @@ -84,54 +81,57 @@ you're interested into a single document per page: "1439": 2819 } } -This approach has a couple of advantages: a) it only requires a single -update per hit to the website, b) intra-day reports for a single page -require fetching only a single document. There are, however, significant +This approach has a couple of advantages: + +- It only requires a single update per hit to the website. +- Intra-day reports for a single page require fetching only a single document. + +There are, however, significant problems with this approach. The biggest problem is that, as you upsert -data into the 'hy' and 'mn' properties, the document grows. Although +data into the ``hy`` and ``mn`` properties, the document grows. Although MongoDB attempts to pad the space required for documents, it will still end up needing to reallocate these documents multiple times throughout -the day, copying the documents to areas with more space. +the day, copying the documents to areas with more space and impacting performance. -Design #0.5: Preallocate documents +Design #0.5: Preallocate Documents ---------------------------------- In order to mitigate the repeated copying of documents, you can tweak your approach slightly by adding a process which will preallocate a document with initial zeros during the previous day. In order to avoid a -situation where you preallocate documents *en masse* at midnight, you will -(with a low probability) randomly upsert the next day's document each +situation where you preallocate documents *en masse* at midnight, it's best to +randomly (with a low probability) upsert the next day's document each time you update the current day's statistics. This requires some tuning; you'd like to have almost all the documents preallocated by the end of the day, without spending much time on extraneous upserts (preallocating a document that's already there). A reasonable first guess would be to -look at your average number of hits per day (call it *hits* ) and -preallocate with a probability of *1/hits* . +look at the average number of hits per day (call it *hits*) and +preallocate with a probability of *1/hits*. -Preallocating helps us mainly by ensuring that all the various 'buckets' +Preallocating helps performance mainly by ensuring that all the various 'buckets' are initialized with 0 hits. Once the document is initialized, then, it will never dynamically grow, meaning a) there is no need to perform the reallocations that could slow us down in design #0 and b) MongoDB doesn't need to pad the records, leading to a more compact representation and better usage of your memory. 
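As a rough sketch of that zero-filling, the function below creates a preallocated daily document up front. This is illustrative only; it is not the ``preallocate`` function used later in this file, and the ``_id`` scheme, the helper name, and the use of a plain insert are assumptions.

.. code-block:: python

    from pymongo.errors import DuplicateKeyError

    def preallocate_day(db, day, site, page):
        # Zero-fill every bucket up front so the document never needs to grow
        # in place. The 'hy'/'mn' names follow the flat layout shown above;
        # the _id scheme is an assumption made for this sketch.
        doc = {
            '_id': '%s/%s/%s' % (site, page, day.strftime('%Y%m%d')),
            'metadata': {'site': site, 'page': page, 'date': day},
            'daily': 0,
            'hy': dict((str(h), 0) for h in range(24)),
            'mn': dict((str(m), 0) for m in range(1440)),
        }
        try:
            db.stats.daily.insert(doc)
        except DuplicateKeyError:
            # Already preallocated, or a hit has already created the document.
            pass

An insert that ignores duplicate-key errors is used here so that an existing day's counters are never overwritten.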
-Design #1: Add intra-document hierarchy +Design #1: Add Intra-Document Hierarchy --------------------------------------- One thing to be aware of with BSON is that documents are stored as a sequence of (key, value) pairs, *not* as a hash table. What this means -for us is that writing to stats.mn.0 is *much* faster than writing to -stats.mn.1439. +for us is that writing to ``stats.mn.0`` is *much* faster than writing to +``stats.mn.1439``. .. figure:: img/rta-preagg1.png :align: center - :alt: + :alt: BSON memory layout (unoptimized) - In order to update the value in cell #1349, MongoDB must skip over all 1349 + In order to update the value in minute #1349, MongoDB must skip over all 1349 entries before it. In order to speed this up, you can introduce some intra-document -hierarchy. In particular, you can split the 'mn' field up into 24 hourly +hierarchy. In particular, you can split the ``mn`` field up into 24 hourly fields: .. code-block:: javascript @@ -170,9 +170,9 @@ generally faster. .. figure:: img/rta-preagg2.png :align: center - :alt: + :alt: BSON memory layout (optimized) - To update the value in cell #1349, MongoDB first skips the first 23 hours and + To update the value in minute #1349, MongoDB first skips the first 23 hours and then skips 59 minutes for only 82 skips as opposed to 1439 skips in the previous schema. @@ -186,7 +186,7 @@ documents containing or daily statistics. A better approach would be to store daily statistics in a separate document, aggregated to the month. This does introduce a second upsert to the statistics generation side of your system, but the reduction in disk seeks on the query side should -more than make up for it. At this point, your document structure is as +more than make up for it. At this point, the document structure is as follows: Daily Statistics @@ -237,20 +237,23 @@ Monthly Statistics ... } } +This is actually the schema design that this use case uses, since it allows for a +good balance between update efficiency and query performance. + Operations ========== -In this system, you want balance between read performance and write +In this system, you'd like to balance between read performance and write (upsert) performance. This section will describe each of the major -operations you perform, using the Python programming language and the +operations you might perform, using the Python programming language and the pymongo MongoDB driver. These operations would be similar in other languages as well. -Log a hit to a page +Log a Hit to a Page ------------------- -Logging a hit to a page in your website is the main 'write' activity in -your system. In order to maximize performance, you will be doing in-place +Logging a hit to a page in your website is the main "write" activity in +your system. In order to maximize performance, you'll be doing in-place updates with the upsert operation: .. code-block:: python @@ -294,8 +297,8 @@ updates with the upsert operation: Since you're using the upsert operation, this function will perform correctly whether the document is already present or not, which is important, as your preallocation (the next operation) will only -preallocate documents with a high probability. Note however, that -without preallocation, you end up with a dynamically growing document, +preallocate documents with a high probability, not 100% certainty. Note however, +that without preallocation, you end up with a dynamically growing document, slowing down your upserts significantly as documents are moved in order to grow them. 
@@ -304,8 +307,8 @@ Preallocate In order to keep your documents from growing, you can preallocate them before they are needed. When preallocating, you set all the statistics to -zero for all time periods so that later, the document doesn't need to -grow to accomodate the upserts. Here, you add this preallocation as its +zero for all time periods so that later, so that the document doesn't need to +grow to accomodate the upserts. Here, is this preallocation as its own function: .. code-block:: python @@ -353,19 +356,18 @@ own function: { '$set': { 'm': monthly_metadata }}, upsert=True) -In this case, note that you went ahead and preallocated the monthly -document while you were preallocating the daily document. While you could -have split this into its own function and preallocated monthly documents +In this case, note that the function went ahead and preallocated the monthly +document while preallocating the daily document. While you could +split this into its own function and preallocated monthly documents less frequently that daily documents, the performance difference is -negligible, so you opted to simply combine monthly preallocation with -daily preallocation. +negligible, so the above code is reasonable solution. -The next question you must answer is when you should preallocate. You would +The next question you must answer is *when* you should preallocate. You'd like to have a high likelihood of the document being preallocated before it is needed, but you don't want to preallocate all at once (say at midnight) to ensure you don't create a spike in activity and a -corresponding increase in latency. Your solution here is to -probabilistically preallocate each time you log a hit, with a probability +corresponding increase in latency. The solution here is to +probabilistically preallocate each time a hit is logged, with a probability tuned to make preallocation likely without performing too many unnecessary calls to preallocate: @@ -383,17 +385,17 @@ unnecessary calls to preallocate: if random.random() < prob_preallocate: preallocate(db, dt_utc + timedelta(days=1), site_page) # Update daily stats doc - … + ... -Now with a high probability, you will preallocate each document before +Now with a high probability, you'll preallocate each document before it's used, preventing the midnight spike as well as eliminating the movement of dynamically growing documents. -Get data for a real-time chart +Get Data for a Real-Time Chart ------------------------------ -One chart that you may be interested in seeing would be the number of -hits to a particular page over the last hour. In that case, your query is +One chart that you might be interested in seeing would be the number of +hits to a particular page over the last hour. In that case, the query is fairly straightforward: .. code-block:: python @@ -424,12 +426,12 @@ following query: ... { 'metadata.date': 1, 'hourly': 1 } }, ... sort=[('metadata.date', 1)]) -In this case, you are retrieving the date along with the statistics since +In this case, you're retrieving the date along with the statistics since it's possible (though highly unlikely) that you could have a gap of one day where a) you didn't happen to preallocate that day and b) there were no hits to the document on that day. -Index support +Index Support ~~~~~~~~~~~~~ These operations would benefit significantly from indexes on the @@ -442,12 +444,12 @@ metadata of the daily statistics: ... ('metadata.page', 1), ... 
('metadata.date', 1)]) -Note in particular that you indexed on the page first, date second. This -allows us to perform the third query above (a single page over a range +Note in particular that the index is first on site, then page, then date. This +allows you to perform the third query above (a single page over a range of days) quite efficiently. Having any compound index on page and date, -of course, allows us to look up a single day's statistics efficiently. +of course, allows you to look up a single day's statistics efficiently. -Get data for a historical chart +Get Data for a Historical Chart ------------------------------- In order to retrieve daily data for a single month, you can perform the @@ -488,18 +490,18 @@ the metadata of the monthly statistics: ... ('metadata.page', 1), ... ('metadata.date', 1)]) -The order of your index is once again designed to efficiently support +The order of the index is once again designed to efficiently support range queries for a single page over several months, as above. Sharding ======== -Your performance in this system will be limited by the number of shards -in your cluster as well as the choice of your shard key. Your ideal shard -key will balance upserts beteen your shards evenly while routing any +The performance of this system will be limited by the number of shards +in your cluster as well as the choice of your shard key. The ideal shard +key balances upserts beteen the shards evenly while routing any individual query to a single shard (or a small number of shards). A -reasonable shard key for us would thus be ('metadata.site', -'metadata.page'), the site-page combination for which you are calculating +reasonable shard key for this solution would thus be (``metadata.site``, +``metadata.page``), the site-page combination for which you're calculating statistics: .. code-block:: python @@ -511,13 +513,13 @@ statistics: ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) { "collectionsharded" : "stats.monthly", "ok" : 1 } -One downside to using ('metadata.site', 'metadata.page') as your shard +One downside to using (``metadata.site``,``metadata.page``) as your shard key is that, if one page dominates all your traffic, all updates to that page will go to a single shard. The problem, however, is largely unavoidable, since all update for a single page are going to a single -*document.* +*document*. -You also have the problem using only ('metadata.site', 'metadata.page') +You also have the problem using only (``metadata.site``, ``metadata.page``) shard key that, if a high percentage of your queries go to the same page, these will all be handled by the same shard. 
A (slightly) better shard key would the include the date as well as the site/page so that you could From 2e348219ba7fc6c7b90cf56d6dea41720b9ca4a4 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 12:19:48 -0700 Subject: [PATCH 10/20] Fix headings and code blocks for rta-hierarchical Signed-off-by: Rick Copeland --- ...ime-analytics-hierarchical-aggregation.txt | 174 +++++++++--------- 1 file changed, 90 insertions(+), 84 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt index 0526418e463..e4069ff1e81 100644 --- a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -1,14 +1,15 @@ +============================================= Real Time Analytics: Hierarchical Aggregation ============================================= Problem -------- +======= You have a large amount of event data that you want to analyze at multiple levels of aggregation. -Solution overview ------------------ +Solution Overview +================= For this solution we will assume that the incoming event data is already stored in an incoming 'events' collection. For details on how we could @@ -32,56 +33,55 @@ job is illustrated below: Note that the events rolling into the hourly collection is qualitatively different than the hourly statistics rolling into the daily collection. -Aside: Map-Reduce Algorithm -~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. note:: -Map/reduce is a popular aggregation algorithm that is optimized for -embarrassingly parallel problems. The psuedocode (in Python) of the -map/reduce algorithm appears below. Note that we are providing the -psuedocode for a particular type of map/reduce where the results of the -map/reduce operation are *reduced* into the result collection, allowing -us to perform incremental aggregation which we'll need in this case. + **Map/reduce** is a popular aggregation algorithm that is optimized for + embarrassingly parallel problems. The psuedocode (in Python) of the + map/reduce algorithm appears below. Note that we are providing the + psuedocode for a particular type of map/reduce where the results of the + map/reduce operation are *reduced* into the result collection, allowing + us to perform incremental aggregation which we'll need in this case. -:: + .. 
code-block:: python - def map_reduce(icollection, query, - mapf, reducef, finalizef, ocollection): - '''Psuedocode for map/reduce with output type="reduce" in MongoDB''' - map_results = defaultdict(list) - def emit(key, value): - '''helper function used inside mapf''' - map_results[key].append(value) + def map_reduce(icollection, query, + mapf, reducef, finalizef, ocollection): + '''Psuedocode for map/reduce with output type="reduce" in MongoDB''' + map_results = defaultdict(list) + def emit(key, value): + '''helper function used inside mapf''' + map_results[key].append(value) - # The map phase - for doc in icollection.find(query): - mapf(doc) + # The map phase + for doc in icollection.find(query): + mapf(doc) - # Pull in documents from the output collection for - # output type='reduce' - for doc in ocollection.find({'_id': {'$in': map_results.keys() } }): - map_results[doc['_id']].append(doc['value']) + # Pull in documents from the output collection for + # output type='reduce' + for doc in ocollection.find({'_id': {'$in': map_results.keys() } }): + map_results[doc['_id']].append(doc['value']) - # The reduce phase - for key, values in map_results.items(): - reduce_results[key] = reducef(key, values) + # The reduce phase + for key, values in map_results.items(): + reduce_results[key] = reducef(key, values) - # Finalize and save the results back - for key, value in reduce_results.items(): - final_value = finalizef(key, value) - ocollection.save({'_id': key, 'value': final_value}) + # Finalize and save the results back + for key, value in reduce_results.items(): + final_value = finalizef(key, value) + ocollection.save({'_id': key, 'value': final_value}) -The embarrassingly parallel part of the map/reduce algorithm lies in the -fact that each invocation of mapf, reducef, and finalizef are -independent of each other and can, in fact, be distributed to different -servers. In the case of MongoDB, this parallelism can be achieved by -using sharding on the collection on which we are performing map/reduce. + The embarrassingly parallel part of the map/reduce algorithm lies in the + fact that each invocation of mapf, reducef, and finalizef are + independent of each other and can, in fact, be distributed to different + servers. In the case of MongoDB, this parallelism can be achieved by + using sharding on the collection on which we are performing map/reduce. -Schema design -------------- +Schema Design +============= When designing the schema for event storage, we need to keep in mind the necessity to differentiate between events which have been included in @@ -92,9 +92,9 @@ our event logging process as it has to fetch event keys one-by one. If we are able to batch up our inserts into the event table, we can still use an auto-increment primary key by using the find\_and\_modify -command to generate our \_id values: +command to generate our ``_id`` values: -:: +.. code-block:: python >>> obj = db.my_sequence.find_and_modify( ... query={'_id':0}, @@ -110,7 +110,7 @@ we'll assume that we are calculating average session length for logged-in users on a website. Our event format will thus be the following: -:: +.. code-block:: javascript { "userid": "rick", @@ -124,7 +124,7 @@ the number of sessions to enable us to incrementally recompute the average session times. Each of our aggregate documents, then, looks like the following: -:: +.. code-block:: javascript { _id: { u: "rick", d: ISODate("2010-10-10T14:00:00Z") }, @@ -140,7 +140,7 @@ document. 
This will help us as we incrementally update the various levels of the hierarchy. Operations ----------- +========== In the discussion below, we will assume that all the events have been inserted and appropriately timestamped, so our main operations are @@ -150,8 +150,8 @@ case, we will assume that the last time the particular aggregation was run is stored in a last\_run variable. (This variable might be loaded from MongoDB or another persistence mechanism.) -Aggregate from events to the hourly level -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Aggregate From Events to the Hourly Level +----------------------------------------- Here, we want to load all the events since our last run until one minute ago (to allow for some lag in logging events). The first thing we need @@ -160,7 +160,7 @@ and PyMongo to interface with the MongoDB server, note that the various functions (map, reduce, and finalize) that we pass to the mapreduce command must be Javascript functions. The map function appears below: -:: +.. code-block:: python mapf_hour = bson.Code('''function() { var key = { @@ -188,7 +188,7 @@ was performed. Our reduce function is also fairly straightforward: -:: +.. code-block:: python reducef = bson.Code('''function(key, values) { var r = { total: 0, count: 0, mean: 0, ts: null }; @@ -207,7 +207,7 @@ finalize results can lead to difficult-to-debug errors. Also note that we are ignoring the 'mean' and 'ts' values. These will be provided in the 'finalize' step: -:: +.. code-block:: python finalizef = bson.Code('''function(key, value) { if(value.count > 0) { @@ -221,7 +221,7 @@ Here, we compute the mean value as well as the timestamp we will use to write back to the output collection. Now, to bind it all together, here is our Python code to invoke the mapreduce command: -:: +.. code-block:: python cutoff = datetime.utcnow() - timedelta(seconds=60) query = { 'ts': { '$gt': last_run, '$lt': cutoff } } @@ -241,14 +241,14 @@ Because we used the 'reduce' option on our output, we are able to run this aggregation as often as we like as long as we update the last\_run variable. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Since we are going to be running the initial query on the input events frequently, we would benefit significantly from and index on the timestamp of incoming events: -:: +.. code-block:: python >>> db.stats.hourly.ensure_index('ts') @@ -257,14 +257,14 @@ index has the advantage of being right-aligned, which basically means we only need a thin slice of the index (the most recent values) in RAM to achieve good performance. -Aggregate from hour to day -~~~~~~~~~~~~~~~~~~~~~~~~~~ +Aggregate from Hour to Day +-------------------------- In calculating the daily statistics, we will use the hourly statistics as input. Our map function looks quite similar to our hourly map function: -:: +.. code-block:: python mapf_day = bson.Code('''function() { var key = { @@ -295,7 +295,7 @@ hourly aggregations, we can, in fact, use the same reduce and finalize functions. The actual Python code driving this level of aggregation is as follows: -:: +.. code-block:: python cutoff = datetime.utcnow() - timedelta(seconds=60) query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } } @@ -318,27 +318,26 @@ aggregating from the stats.hourly collection into the stats.daily collection. Index support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ Since we are going to be running the initial query on the hourly statistics collection frequently, an index on 'value.ts' would be nice to have: -:: +.. 
code-block:: python >>> db.stats.hourly.ensure_index('value.ts') Once again, this is a right-aligned index that will use very little RAM for efficient operation. -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Other aggregations -~~~~~~~~~~~~~~~~~~ +Other Aggregations +------------------ Once we have our daily statistics, we can use them to calculate our weekly and monthly statistics. Our weekly map function is as follows: -:: +.. code-block:: python mapf_week = bson.Code('''function() { var key = { @@ -360,7 +359,7 @@ subtracting days until we get to the beginning of the week. In our weekly map function, we will choose the first day of the month as our group key: -:: +.. code-block:: python mapf_month = bson.Code('''function() { d: new Date( @@ -381,7 +380,7 @@ are identical to one another except for the date calculation. We can use Python's string interpolation to refactor our map function definitions as follows: -:: +.. code-block:: python mapf_hierarchical = '''function() { var key = { @@ -426,7 +425,7 @@ as follows: Our Python driver can also be refactored so we have much less code duplication: -:: +.. code-block:: python def aggregate(icollection, ocollection, mapf, cutoff, last_run): query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } } @@ -439,7 +438,7 @@ duplication: Once this is defined, we can perform all our aggregations as follows: -:: +.. code-block:: python cutoff = datetime.utcnow() - timedelta(seconds=60) aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run) @@ -455,20 +454,20 @@ So long as we save/restore our 'last\_run' variable between aggregations, we can run these aggregations as often as we like since each aggregation individually is incremental. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Our indexes will continue to be on the value's timestamp to ensure efficient operation of the next level of the aggregation (and they continue to be right-aligned): -:: +.. code-block:: python >>> db.stats.daily.ensure_index('value.ts') >>> db.stats.monthly.ensure_index('value.ts') Sharding --------- +======== To take advantage of distinct shards when performing map/reduce, our input collections should be sharded. In order to achieve good balancing @@ -479,22 +478,29 @@ makes sense as the most significant part of the shard key. In order to prevent a single, active user from creating a large, unsplittable chunk, we will use a compound shard key with (username, -timestamp) on each of our collections: >>> db.command('shardcollection', -'events', { ... key : { 'userid': 1, 'ts' : 1} } ) { "collectionsharded" -: "events", "ok" : 1 } >>> db.command('shardcollection', 'stats.daily', -{ ... key : { '\_id': 1} } ) { "collectionsharded" : "stats.daily", "ok" -: 1 } >>> db.command('shardcollection', 'stats.weekly', { ... key : { -'\_id': 1} } ) { "collectionsharded" : "stats.weekly", "ok" : 1 } >>> -db.command('shardcollection', 'stats.monthly', { ... key : { '\_id': 1} -} ) { "collectionsharded" : "stats.monthly", "ok" : 1 } >>> -db.command('shardcollection', 'stats.yearly', { ... key : { '\_id': 1} } -) { "collectionsharded" : "stats.yearly", "ok" : 1 } +timestamp) on each of our collections: + +.. code-block:: python + + >>> db.command('shardcollection','events', { + ... 
key : { 'userid': 1, 'ts' : 1} } ) + { "collectionsharded": "events", "ok" : 1 } + >>> db.command('shardcollection', 'stats.daily') + { "collectionsharded" : "stats.daily", "ok": 1 } + >>> db.command('shardcollection', 'stats.weekly') + { "collectionsharded" : "stats.weekly", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly') + { "collectionsharded" : "stats.monthly", "ok" : 1 } + >>> db.command('shardcollection', 'stats.yearly') + { "collectionsharded" : "stats.yearly", "ok" : 1 } We should also update our map/reduce driver so that it notes the output should be sharded. This is accomplished by adding 'sharded':True to the output argument: -… out={ 'reduce': ocollection.name, 'sharded': True }) … +.. code-block:: python + + ... out={ 'reduce': ocollection.name, 'sharded': True })... Note that the output collection of a mapreduce command, if sharded, must -be sharded using \_id as the shard key. +be sharded using ``_id`` as the shard key. From 6509d0975111c93e50fd2b8aaf033d5e23c91583 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 12:44:05 -0700 Subject: [PATCH 11/20] writing style updates for rta-hier Signed-off-by: Rick Copeland --- ...ime-analytics-hierarchical-aggregation.txt | 199 +++++++++--------- 1 file changed, 103 insertions(+), 96 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt index e4069ff1e81..cc4b13d6571 100644 --- a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -11,25 +11,27 @@ multiple levels of aggregation. Solution Overview ================= -For this solution we will assume that the incoming event data is already -stored in an incoming 'events' collection. For details on how we could -get the event data into the events collection, please see "Real Time -Analytics: Storing Log Data." - -Once the event data is in the events collection, we need to aggregate -event data to the finest time granularity we're interested in. Once that -data is aggregated, we will use it to aggregate up to the next level of -the hierarchy, and so on. To perform the aggregations, we will use -MongoDB's mapreduce command. Our schema will use several collections: +This solution assumes that the incoming event data is already +stored in an incoming ``events`` collection. For details on how you might +get the event data into the events collection, please see :doc:`Real Time +Analytics: Storing Log Data `. + +Once the event data is in the events collection, you need to aggregate +event data to the finest time granularity you're interested in. Once that +data is aggregated, you'll use it to aggregate up to the next level of +the hierarchy, and so on. To perform the aggregations, you'll use +MongoDB's ``mapreduce`` command. The schema will use several collections: the raw data (event) logs and collections for statistics aggregated -hourly, daily, weekly, monthly, and yearly. We will use a hierarchical -approach to running our map-reduce jobs. The input and output of each +hourly, daily, weekly, monthly, and yearly. This solution uses a hierarchical +approach to running your map-reduce jobs. The input and output of each job is illustrated below: .. 
figure:: img/rta-hierarchy1.png :align: center :alt: Hierarchy + Hierarchy of statistics collected + Note that the events rolling into the hourly collection is qualitatively different than the hourly statistics rolling into the daily collection. @@ -37,10 +39,10 @@ different than the hourly statistics rolling into the daily collection. **Map/reduce** is a popular aggregation algorithm that is optimized for embarrassingly parallel problems. The psuedocode (in Python) of the - map/reduce algorithm appears below. Note that we are providing the - psuedocode for a particular type of map/reduce where the results of the + map/reduce algorithm appears below. Note that this psuedocode is for a + particular type of map/reduce where the results of the map/reduce operation are *reduced* into the result collection, allowing - us to perform incremental aggregation which we'll need in this case. + you to perform incremental aggregation which you'll need in this case. .. code-block:: python @@ -83,16 +85,15 @@ different than the hourly statistics rolling into the daily collection. Schema Design ============= -When designing the schema for event storage, we need to keep in mind the -necessity to differentiate between events which have been included in -our aggregations and events which have not yet been included. A simple -approach in a relational database would be to use an auto-increment +When designing the schema for event storage, it's important to track whichevents +which have been included in your aggregations and events which have not yet been +included. A simple approach in a relational database would be to use an auto-increment integer primary key, but this introduces a big performance penalty to -our event logging process as it has to fetch event keys one-by one. +your event logging process as it has to fetch event keys one-by one. -If we are able to batch up our inserts into the event table, we can -still use an auto-increment primary key by using the find\_and\_modify -command to generate our ``_id`` values: +If you're able to batch up your inserts into the event table, you can +still use an auto-increment primary key by using the ``find_and_modify`` +command to generate your ``_id`` values: .. code-block:: python @@ -103,11 +104,13 @@ command to generate our ``_id`` values: ... new=True) >>> batch_of_ids = range(obj['inc']-50, obj['inc']) -In most cases, however, it is sufficient to include a timestamp with -each event that we can use as a marker of which events have been -processed and which ones remain to be processed. For this use case, -we'll assume that we are calculating average session length for -logged-in users on a website. Our event format will thus be the +In most cases, however, it's sufficient to include a timestamp with +each event that you can use as a marker of which events have been +processed and which ones remain to be processed. + +This use case assumes that you +are calculating average session length for +logged-in users on a website. Your event format will thus be the following: .. code-block:: javascript @@ -118,10 +121,10 @@ following: "length":95 } -We want to calculate total and average session times for each user at -the hour, day, week, month, and year. In each case, we will also store -the number of sessions to enable us to incrementally recompute the -average session times. Each of our aggregate documents, then, looks like +You want to calculate total and average session times for each user at +the hour, day, week, month, and year. 
In each case, you will also store +the number of sessions to enable MongoDB to incrementally recompute the +average session times. Each of your aggregate documents, then, looks like the following: .. code-block:: javascript @@ -135,30 +138,29 @@ the following: mean: 25.4 } } -Note in particular that we have added a timestamp to the aggregate -document. This will help us as we incrementally update the various -levels of the hierarchy. +Note in particular the timestamp field in the aggregate document. This allows you +to incrementally update the various levels of the hierarchy. Operations ========== -In the discussion below, we will assume that all the events have been -inserted and appropriately timestamped, so our main operations are +In the discussion below, it is assume that all the events have been +inserted and appropriately timestamped, so your main operations are aggregating from events into the smallest aggregate (the hourly totals) and aggregating from smaller granularity to larger granularity. In each -case, we will assume that the last time the particular aggregation was -run is stored in a last\_run variable. (This variable might be loaded -from MongoDB or another persistence mechanism.) +case, the last time the particular aggregation is run is stored in a ``last_run`` +variable. (This variable might be loaded from MongoDB or another persistence +mechanism.) Aggregate From Events to the Hourly Level ----------------------------------------- -Here, we want to load all the events since our last run until one minute -ago (to allow for some lag in logging events). The first thing we need -to do is create our map function. Even though we will be using Python -and PyMongo to interface with the MongoDB server, note that the various -functions (map, reduce, and finalize) that we pass to the mapreduce -command must be Javascript functions. The map function appears below: +Here, you want to load all the events since your last run until one minute +ago (to allow for some lag in logging events). The first thing you +need to do is create your map function. Even though this solution uses Python +and ``pymongo`` to interface with the MongoDB server, note that the various +functions (``mapf``, ``reducef``, and ``finalizef``) that we pass to the +``mapreduce`` command must be Javascript functions. The map function appears below: .. code-block:: python @@ -180,13 +182,13 @@ command must be Javascript functions. The map function appears below: ts: new Date(); }); }''') -In this case, we are emitting key, value pairs which contain the -statistics we want to aggregate as you'd expect, but we are also -emitting 'ts' value. This will be used in the cascaded aggregations +In this case, it emits key, value pairs which contain the +statistics you want to aggregate as you'd expect, but it also emits a `ts` +value. This will be used in the cascaded aggregations (hour to day, etc.) to determine when a particular hourly aggregation was performed. -Our reduce function is also fairly straightforward: +The reduce function is also fairly straightforward: .. code-block:: python @@ -200,12 +202,12 @@ Our reduce function is also fairly straightforward: }''') A few things are notable here. First of all, note that the returned -document from our reduce function has the same format as the result of -our map. This is a characteristic of our map/reduce that we would like +document from the reduce function has the same format as the result of +our map. 
This is a characteristic of our map/reduce that it's nice to maintain, as differences in structure between map, reduce, and finalize results can lead to difficult-to-debug errors. Also note that -we are ignoring the 'mean' and 'ts' values. These will be provided in -the 'finalize' step: +the ``mean`` and ``ts`` values are ignored in the ``reduce`` function. These will be +computed in the 'finalize' step: .. code-block:: python @@ -217,9 +219,9 @@ the 'finalize' step: return value; }''') -Here, we compute the mean value as well as the timestamp we will use to -write back to the output collection. Now, to bind it all together, here -is our Python code to invoke the mapreduce command: +The finalize function computes the mean value as well as the timestamp you'll +use to write back to the output collection. Now, to bind it all together, here +is the Python code to invoke the ``mapreduce`` command: .. code-block:: python @@ -237,31 +239,31 @@ is our Python code to invoke the mapreduce command: last_run = cutoff -Because we used the 'reduce' option on our output, we are able to run -this aggregation as often as we like as long as we update the last\_run -variable. +Through the use you the 'reduce' option on your output, you can safely run this +aggregation as often as you like so long as you update the ``last_run`` variable +each time. Index Support ~~~~~~~~~~~~~ -Since we are going to be running the initial query on the input events -frequently, we would benefit significantly from and index on the +Since you'll be running the initial query on the input events +frequently, you'd benefit significantly from an index on the timestamp of incoming events: .. code-block:: python >>> db.stats.hourly.ensure_index('ts') -Since we are always reading and writing the most recent events, this -index has the advantage of being right-aligned, which basically means we -only need a thin slice of the index (the most recent values) in RAM to +Since you're always reading and writing the most recent events, this +index has the advantage of being right-aligned, which basically means MongoDB +only needs a thin slice of the index (the most recent values) in RAM to achieve good performance. Aggregate from Hour to Day -------------------------- -In calculating the daily statistics, we will use the hourly statistics -as input. Our map function looks quite similar to our hourly map +In calculating the daily statistics, you'll use the hourly statistics +as input. The daily map function looks quite similar to the hourly map function: .. code-block:: python @@ -283,15 +285,15 @@ function: ts: null }); }''') -There are a few differences to note here. First of all, the key to which -we aggregate is the (userid, date) rather than (userid, hour) to allow -for daily aggregation. Secondly, note that the keys and values we emit -are actually the total and count values from our hourly aggregates +There are a few differences to note here. First of all, the aggregation key is +the (userid, date) rather than (userid, hour) to allow +for daily aggregation. Secondly, note that the keys and values ``emit``\ ted +are actually the total and count values from the hourly aggregates rather than properties from event documents. This will be the case in -all our higher-level hierarchical aggregations. +all the higher-level hierarchical aggregations. 
-Since we are using the same format for map output as we used in the -hourly aggregations, we can, in fact, use the same reduce and finalize +Since you're using the same format for map output as we used in the +hourly aggregations, you can, in fact, use the same reduce and finalize functions. The actual Python code driving this level of aggregation is as follows: @@ -311,16 +313,16 @@ as follows: last_run = cutoff -There are a couple of things to note here. First of all, our query is -not on 'ts' now, but 'value.ts', the timestamp we wrote during the -finalization of our hourly aggregates. Also note that we are, in fact, -aggregating from the stats.hourly collection into the stats.daily +There are a couple of things to note here. First of all, the query is +not on ``ts`` now, but ``value.ts``, the timestamp written during the +finalization of the hourly aggregates. Also note that you are, in fact, +aggregating from the ``stats.hourly`` collection into the ``stats.daily`` collection. Index support ~~~~~~~~~~~~~ -Since we are going to be running the initial query on the hourly +Since you're going to be running the initial query on the hourly statistics collection frequently, an index on 'value.ts' would be nice to have: @@ -334,8 +336,8 @@ for efficient operation. Other Aggregations ------------------ -Once we have our daily statistics, we can use them to calculate our -weekly and monthly statistics. Our weekly map function is as follows: +Once you have your daily statistics, you can use them to calculate your +weekly and monthly statistics. The weekly map function is as follows: .. code-block:: python @@ -354,9 +356,9 @@ weekly and monthly statistics. Our weekly map function is as follows: ts: null }); }''') -Here, in order to get our group key, we are simply taking the date and -subtracting days until we get to the beginning of the week. In our -weekly map function, we will choose the first day of the month as our +Here, in order to get the group key, you simply takes the date and subtracts days +until you get to the beginning of the week. In the +weekly map function, you'll use the first day of the month as the group key: .. code-block:: python @@ -376,7 +378,7 @@ group key: }''') One thing in particular to notice about these map functions is that they -are identical to one another except for the date calculation. We can use +are identical to one another except for the date calculation. You can use Python's string interpolation to refactor our map function definitions as follows: @@ -422,7 +424,7 @@ as follows: this._id.d.getFullYear(), 1, 1, 0, 0, 0, 0)''') -Our Python driver can also be refactored so we have much less code +The Python driver can also be refactored so there is much less code duplication: .. code-block:: python @@ -436,7 +438,7 @@ duplication: query=query, out={ 'reduce': ocollection.name }) -Once this is defined, we can perform all our aggregations as follows: +Once this is defined, you can perform all the aggregations as follows: .. code-block:: python @@ -450,14 +452,14 @@ Once this is defined, we can perform all our aggregations as follows: last_run) last_run = cutoff -So long as we save/restore our 'last\_run' variable between -aggregations, we can run these aggregations as often as we like since +So long as you save/restore the ``last_run`` variable between +aggregations, you can run these aggregations as often as we like since each aggregation individually is incremental. 
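One way to persist that ``last_run`` value between runs, purely as an illustration (the ``agg_state`` collection and its field names are assumptions, not part of this patch), is to keep a small marker document per aggregation level:

.. code-block:: python

    def load_last_run(db, level, default):
        # 'level' is a label such as 'hourly', 'daily', or 'weekly'; return the
        # cutoff stored by the previous run, or the default on the first run.
        doc = db.agg_state.find_one({'_id': level})
        return doc['last_run'] if doc else default

    def save_last_run(db, level, when):
        # Record the cutoff that was just processed for this level.
        db.agg_state.update({'_id': level},
                            {'$set': {'last_run': when}},
                            upsert=True)

Any durable store would work here; the only requirement is that each aggregation level keeps its own high-water mark.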
Index Support ~~~~~~~~~~~~~ -Our indexes will continue to be on the value's timestamp to ensure +Your indexes will continue to be on the value's timestamp to ensure efficient operation of the next level of the aggregation (and they continue to be right-aligned): @@ -469,22 +471,29 @@ continue to be right-aligned): Sharding ======== -To take advantage of distinct shards when performing map/reduce, our +To take advantage of distinct shards when performing map/reduce, your input collections should be sharded. In order to achieve good balancing -between nodes, we should make sure that the shard key we use is not +between nodes, you should make sure that the shard key is not simply the incoming timestamp, but rather something that varies significantly in the most recent documents. In this case, the username makes sense as the most significant part of the shard key. In order to prevent a single, active user from creating a large, unsplittable chunk, we will use a compound shard key with (username, -timestamp) on each of our collections: +timestamp) on the events collection. .. code-block:: python >>> db.command('shardcollection','events', { ... key : { 'userid': 1, 'ts' : 1} } ) { "collectionsharded": "events", "ok" : 1 } + +In order to take advantage of sharding on +the aggregate collections, you *must* shard on the ``_id`` field (if you decide +to shard these collections:) + +.. code-block:: python + >>> db.command('shardcollection', 'stats.daily') { "collectionsharded" : "stats.daily", "ok": 1 } >>> db.command('shardcollection', 'stats.weekly') @@ -494,7 +503,7 @@ timestamp) on each of our collections: >>> db.command('shardcollection', 'stats.yearly') { "collectionsharded" : "stats.yearly", "ok" : 1 } -We should also update our map/reduce driver so that it notes the output +You should also update your map/reduce driver so that it notes the output should be sharded. This is accomplished by adding 'sharded':True to the output argument: @@ -502,5 +511,3 @@ output argument: ... out={ 'reduce': ocollection.name, 'sharded': True })... -Note that the output collection of a mapreduce command, if sharded, must -be sharded using ``_id`` as the shard key. From 2d131d8c722891be375ff3f520f0ab62ae96369a Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 13:31:36 -0700 Subject: [PATCH 12/20] Final first person removal from rta-hier Signed-off-by: Rick Copeland --- ...eal-time-analytics-hierarchical-aggregation.txt | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt index cc4b13d6571..799fe365ee6 100644 --- a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -80,7 +80,7 @@ different than the hourly statistics rolling into the daily collection. fact that each invocation of mapf, reducef, and finalizef are independent of each other and can, in fact, be distributed to different servers. In the case of MongoDB, this parallelism can be achieved by - using sharding on the collection on which we are performing map/reduce. + using sharding on the collection on you're are performing map/reduce. Schema Design ============= @@ -159,7 +159,7 @@ Here, you want to load all the events since your last run until one minute ago (to allow for some lag in logging events). The first thing you need to do is create your map function. 
Even though this solution uses Python and ``pymongo`` to interface with the MongoDB server, note that the various -functions (``mapf``, ``reducef``, and ``finalizef``) that we pass to the +functions (``mapf``, ``reducef``, and ``finalizef``) that is passed to the ``mapreduce`` command must be Javascript functions. The map function appears below: .. code-block:: python @@ -203,7 +203,7 @@ The reduce function is also fairly straightforward: A few things are notable here. First of all, note that the returned document from the reduce function has the same format as the result of -our map. This is a characteristic of our map/reduce that it's nice +map. This is a characteristic of map/reduce that it's nice to maintain, as differences in structure between map, reduce, and finalize results can lead to difficult-to-debug errors. Also note that the ``mean`` and ``ts`` values are ignored in the ``reduce`` function. These will be @@ -292,7 +292,7 @@ are actually the total and count values from the hourly aggregates rather than properties from event documents. This will be the case in all the higher-level hierarchical aggregations. -Since you're using the same format for map output as we used in the +Since you're using the same format for map output as was used in the hourly aggregations, you can, in fact, use the same reduce and finalize functions. The actual Python code driving this level of aggregation is as follows: @@ -379,7 +379,7 @@ group key: One thing in particular to notice about these map functions is that they are identical to one another except for the date calculation. You can use -Python's string interpolation to refactor our map function definitions +Python's string interpolation to refactor the map function definitions as follows: .. code-block:: python @@ -453,7 +453,7 @@ Once this is defined, you can perform all the aggregations as follows: last_run = cutoff So long as you save/restore the ``last_run`` variable between -aggregations, you can run these aggregations as often as we like since +aggregations, you can run these aggregations as often as you like since each aggregation individually is incremental. Index Support @@ -479,7 +479,7 @@ significantly in the most recent documents. In this case, the username makes sense as the most significant part of the shard key. In order to prevent a single, active user from creating a large, -unsplittable chunk, we will use a compound shard key with (username, +unsplittable chunk, it's best to use a compound shard key with (username, timestamp) on the events collection. .. code-block:: python From 22d76c192ec74d0d0a428e032cdbf68154ea9262 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 17:08:30 -0400 Subject: [PATCH 13/20] ecommerce-product-catalog style updates Signed-off-by: Rick Copeland --- .../usecase/ecommerce-product-catalog.txt | 324 +++++++++--------- 1 file changed, 164 insertions(+), 160 deletions(-) diff --git a/source/tutorial/usecase/ecommerce-product-catalog.txt b/source/tutorial/usecase/ecommerce-product-catalog.txt index 0a5f4a112f1..d225bfd2b67 100644 --- a/source/tutorial/usecase/ecommerce-product-catalog.txt +++ b/source/tutorial/usecase/ecommerce-product-catalog.txt @@ -1,105 +1,104 @@ +=========================== E-Commerce: Product Catalog =========================== Problem -------- +======= You have a product catalog that you would like to store in MongoDB with products of various types and various relevant attributes. 
-Solution overview ------------------ +Solution Overview +================= In the relational database world, there are several solutions of varying -performance characteristics used to solve this problem. In this section -we will examine a few options and then describe the solution that -MongoDB enables. +performance characteristics used to solve this problem. This section +examines a few options and then describes the solution enabled by MongoDB. One approach ("concrete table inheritance") to solving this problem is to create a table for each product category: -:: +.. code-block:: sql CREATE TABLE `product_audio_album` ( `sku` char(8) NOT NULL, - … + ... `artist` varchar(255) DEFAULT NULL, `genre_0` varchar(255) DEFAULT NULL, `genre_1` varchar(255) DEFAULT NULL, - …, + ..., PRIMARY KEY(`sku`)) - … + ... CREATE TABLE `product_film` ( `sku` char(8) NOT NULL, - … + ... `title` varchar(255) DEFAULT NULL, `rating` char(8) DEFAULT NULL, - …, + ..., PRIMARY KEY(`sku`)) - … + ... The main problem with this approach is a lack of flexibility. Each time -we add a new product category, we need to create a new table. +you add a new product category, you need to create a new table. Furthermore, queries must be tailored to the exact type of product expected. -Another approach ("single table inheritance") would be to use a single -table for all products and add new columns each time we needed to store +Another approach ("single table inheritance") is to use a single +table for all products and add new columns each time you need to store a new type of product: -:: +.. code-block:: sql CREATE TABLE `product` ( `sku` char(8) NOT NULL, - … + ... `artist` varchar(255) DEFAULT NULL, `genre_0` varchar(255) DEFAULT NULL, `genre_1` varchar(255) DEFAULT NULL, - … + ... `title` varchar(255) DEFAULT NULL, `rating` char(8) DEFAULT NULL, - …, + ..., PRIMARY KEY(`sku`)) -This is more flexible, allowing us to query across different types of +This is more flexible, allowing queries to span different types of product, but it's quite wasteful of space. One possible space -optimization would be to name our columns generically (str\_0, str\_1, -etc), but then we lose visibility into the meaning of the actual data in +optimization would be to name the columns generically (``str_0``, ``str_1``, +etc.,) but then you lose visibility into the meaning of the actual data in the columns. -Multiple table inheritance is yet another approach where we represent -common attributes in a generic 'product' table and the variations in -individual category product tables: +Multiple table inheritance is yet another approach where common attributes are +represented in a generic 'product' table and the variations in individual +category product tables: -:: +.. code-block:: sql CREATE TABLE `product` ( `sku` char(8) NOT NULL, `title` varchar(255) DEFAULT NULL, `description` varchar(255) DEFAULT NULL, - `price` …, + `price`, ... PRIMARY KEY(`sku`)) - CREATE TABLE `product_audio_album` ( `sku` char(8) NOT NULL, - … + ... `artist` varchar(255) DEFAULT NULL, `genre_0` varchar(255) DEFAULT NULL, `genre_1` varchar(255) DEFAULT NULL, - …, + ..., PRIMARY KEY(`sku`), FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) - … + ... CREATE TABLE `product_film` ( `sku` char(8) NOT NULL, - … + ... `title` varchar(255) DEFAULT NULL, `rating` char(8) DEFAULT NULL, - …, + ..., PRIMARY KEY(`sku`), FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) - … + ... 
This is more space-efficient than single-table inheritance and somewhat more flexible than concrete-table inheritance, but it does require a @@ -108,27 +107,27 @@ product. Entity-attribute-value schemas are yet another solution, basically creating a meta-model for your product data. In this approach, you -maintain a table with (entity\_id, attribute\_id, value) triples that -describe your product. For instance, suppose you are describing an audio +maintain a table with (``entity_id``, ``attribute_id``, ``value``) triples that +describe each product. For instance, suppose you are describing an audio album. In that case you might have a series of rows representing the following relationships: +-----------------+-------------+------------------+ | Entity | Attribute | Value | +=================+=============+==================+ -| sku\_00e8da9b | type | Audio Album | +| sku_00e8da9b | type | Audio Album | +-----------------+-------------+------------------+ -| sku\_00e8da9b | title | A Love Supreme | +| sku_00e8da9b | title | A Love Supreme | +-----------------+-------------+------------------+ -| sku\_00e8da9b | … | … | +| sku_00e8da9b | ... | ... | +-----------------+-------------+------------------+ -| sku\_00e8da9b | artist | John Coltrane | +| sku_00e8da9b | artist | John Coltrane | +-----------------+-------------+------------------+ -| sku\_00e8da9b | genre | Jazz | +| sku_00e8da9b | genre | Jazz | +-----------------+-------------+------------------+ -| sku\_00e8da9b | genre | General | +| sku_00e8da9b | genre | General | +-----------------+-------------+------------------+ -| … | … | … | +| ... | ... | ... | +-----------------+-------------+------------------+ This schema has the advantage of being completely flexible; any entity @@ -138,26 +137,26 @@ schema is that any nontrivial query requires large numbers of join operations, which results in a large performance penalty. One other approach that has been used in relational world is to "punt" -so to speak on the product details and serialize them all into a BLOB +so to speak on the product details and serialize them all into a ``BLOB`` column. The problem with this approach is that the details become -difficult to search and sort by. (One exception is with Oracle's XMLTYPE +difficult to search and sort by. (One exception is with Oracle's ``XMLTYPE`` columns, which actually resemble a NoSQL document database.) -Our approach in MongoDB will be to use a single collection to store all +The approach best suited to MongoDB is to use a single collection to store all the product data, similar to single-table inheritance. Due to MongoDB's -dynamic schema, however, we need not conform each document to the same -schema. This allows us to tailor each product's document to only contain +dynamic schema, however, you need not conform each document to the same +schema. This allows you to tailor each product's document to only contain attributes relevant to that product category. -Schema design -------------- +Schema Design +============= -Our schema will contain general product information that needs to be +Your schema should contain general product information that needs to be searchable across all products at the beginning of each document, with properties that vary from category to category encapsulated in a 'details' property. Thus an audio album might look like the following: -:: +.. 
code-block:: javascript { sku: "00e8da9b", @@ -166,7 +165,6 @@ properties that vary from category to category encapsulated in a description: "by John Coltrane", asin: "B0000A118M", - shipping: { weight: 6, dimensions: { @@ -176,7 +174,6 @@ properties that vary from category to category encapsulated in a }, }, - pricing: { list: 1200, retail: 1100, @@ -184,12 +181,11 @@ properties that vary from category to category encapsulated in a pct_savings: 8 }, - details: { title: "A Love Supreme [Original Recording Reissued]", artist: "John Coltrane", genre: [ "Jazz", "General" ], - … + ... tracks: [ "A Love Supreme Part I: Acknowledgement", "A Love Supreme Part II - Resolution", @@ -201,94 +197,102 @@ properties that vary from category to category encapsulated in a A movie title would have the same fields stored for general product information, shipping, and pricing, but have quite a different details -attribute: { sku: "00e8da9d", type: "Film", … asin: "B000P0J0AQ", - -:: +attribute: - shipping: { … }, +.. code-block:: javascript + { + sku: "00e8da9d", + type: "Film", + ..., + asin: "B000P0J0AQ", - pricing: { … }, + shipping: { ... }, + pricing: { ... }, details: { title: "The Matrix", director: [ "Andy Wachowski", "Larry Wachowski" ], writer: [ "Andy Wachowski", "Larry Wachowski" ], - … + ..., aspect_ratio: "1.66:1" }, } -Another thing to note in the MongoDB schema is that we can have +Another thing to note in the MongoDB schema is that you can have multi-valued attributes without any arbitrary restriction on the number -of attributes (as we might have if we had ``genre_0`` and ``genre_1`` +of attributes (as you might have if you had ``genre_0`` and ``genre_1`` columns in a relational database, for instance) and without the need for -a join (as we might have if we normalize the many-to-many "genre" +a join (as you might have if you normalized the many-to-many "genre" relation). Operations ----------- +========== -We will be using the product catalog mainly to perform search -operations. Thus our focus in this section will be on the various types -of queries we might want to support in an e-commerce site. These +You'll be primarily using the product catalog mainly to perform search +operations. Thus the focus in this section will be on the various types +of queries you might want to support in an e-commerce site. These examples will be written in the Python programming language using the -pymongo driver, but other language/driver combinations should be +``pymongo`` driver, but other language/driver combinations should be similar. -Find all jazz albums, sorted by year produced -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Find All Jazz Albums, Sorted by Year Produced +--------------------------------------------- -Here, we would like to see a group of products with a particular genre, +Here, you'd like to see a group of products with a particular genre, sorted by the year in which they were produced: -:: +.. code-block:: python query = db.products.find({'type':'Audio Album', 'details.genre': 'jazz'}) query = query.sort([('details.issue_date', -1)]) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -In order to efficiently support this type of query, we need to create a +In order to efficiently support this type of query, you need to create a compound index on all the properties used in the filter and in the sort: -:: +.. 
code-block:: python db.products.ensure_index([ ('type', 1), ('details.genre', 1), ('details.issue_date', -1)]) -Again, notice that the final component of our index is the sort field. +Note here that the final component of the index is the sort field. This allows +MongoDB to traverse the index in the order in which the data is to be returned, +rather than performing a slow in-memory sort of the data. -Find all products sorted by percentage discount descending -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Find All Products Sorted by Percentage Discount Descending +---------------------------------------------------------- While most searches would be for a particular type of product (audio -album or movie, for instance), there may be cases where we would like to -find all products in a certain price range, perhaps for a 'best daily -deals' of our website. In this case, we will use the pricing information +album or movie, for instance), there may be cases where you'd like to +find all products in a certain price range, perhaps for a "best daily +deals" of your website. In this case, you'll use the pricing information that exists in all products to find the products with the highest percentage discount: -:: +.. code-block:: python query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 }) query = query.sort([('pricing.pct_savings', -1)]) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ + +In order to efficiently support this type of query, you'll need an index on the +percentage savings: -In order to efficiently support this type of query, we need to have an -index on the percentage savings: +.. code-block:: python -\`db.products.ensure\_index('pricing.pct\_savings') + db.products.ensure_index('pricing.pct_savings') Since the index is only on a single key, it does not matter in which -order the index is sorted. Note that, had we wanted to perform a range +order the index is sorted. Note that, had you wanted to perform a range query (say all products over $25 retail) and sort by another property (perhaps percentage savings), MongoDB would not have been able to use an index as effectively. Range queries or sorts must always be the *last* @@ -296,45 +300,44 @@ property in a compound index in order to avoid scanning entirely. Thus using a different property for a range query and a sort requires some degree of scanning, slowing down your query. -Find all movies in which Keanu Reeves acted -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Find All Movies in Which Keanu Reeves Acted +------------------------------------------- -In this case, we want to search inside the details of a particular type +In this case, you want to search inside the details of a particular type of product (a movie) to find all movies containing Keanu Reeves, sorted by date descending: -:: +.. code-block:: python query = db.products.find({'type': 'Film', 'details.actor': 'Keanu Reeves'}) query = query.sort([('details.issue_date', -1)]) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -Here, we wish to once again index by type first, followed the details -we're interested in: +Here, you wish to once again index by type first, followed the details +you're interested in: -:: +.. code-block:: python db.products.ensure_index([ ('type', 1), ('details.actor', 1), ('details.issue_date', -1)]) -And once again, the final component of our index is the sort field. +And once again, the final component of the index is the sort field. 
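If you'd like to confirm that a query of this shape is fully served by the
compound index, you can inspect the query plan from the driver. The sketch
below uses ``pymongo``'s ``explain()`` cursor method; the ``cursor`` and
``scanAndOrder`` fields it examines are the same ones that appear in the query
plans shown later in this section, though the exact output varies by server
version:

.. code-block:: python

    query = db.products.find({'type': 'Film',
                              'details.actor': 'Keanu Reeves'})
    query = query.sort([('details.issue_date', -1)])

    plan = query.explain()

    # The cursor should be a BtreeCursor naming the compound index on
    # (type, details.actor, details.issue_date), and 'scanAndOrder'
    # should be absent, meaning no in-memory sort was required.
    uses_index = plan['cursor'].startswith('BtreeCursor')
    in_memory_sort = plan.get('scanAndOrder', False)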
-Find all movies with the word "hacker" in the title -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Find All Movies With the Word "Hacker" in the Title +--------------------------------------------------- Those experienced with relational databases may shudder at this operation, since it implies an inefficient LIKE query. In fact, without a full-text search engine, some scanning will always be required to -satisfy this query. In the case of MongoDB, we will use a regular -expression. First, we will see how we might do this using Python's re -module: +satisfy this query. In the case of MongoDB, the solution is to use a regular +expression. In Python, you can use the ``re`` module to construct the query: -:: +.. code-block:: python import re re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE) @@ -343,45 +346,45 @@ module: query = db.products.find({'type': 'Film', 'title': re_hacker}) query = query.sort([('details.issue_date', -1)]) -Although this is fairly convenient, MongoDB also gives us the option to -use a special syntax in our query instead of importing the Python re -module: +Although this is fairly convenient, MongoDB also provides the option to +use a special syntax rather than importing the Python ``re`` module: -:: +.. code-block:: python query = db.products.find({ 'type': 'Film', 'title': {'$regex': '.*hacker.*', '$options':'i'}}) query = query.sort([('details.issue_date', -1)]) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -Here, we will diverge a bit from our typical index order: +Here, the best index diverges a bit from the previous index orders: -:: +.. code-block:: python db.products.ensure_index([ ('type', 1), ('details.issue_date', -1), ('title', 1)]) -You may be wondering why we are including the title field in the index -if we have to scan anyway. The reason is that there are two types of +You may be wondering why you should include the title field in the index +if MongoDB has to scan anyway. The reason is that there are two types of scans: index scans and document scans. Document scans require entire documents to be loaded into memory, while index scans only require index entries to be loaded. So while an index scan on title isn't as efficient as a direct lookup, it is certainly faster than a document scan. -The order in which we include our index keys is also different than what -you might expect. This is once again due to the fact that we are -scanning. Since our results need to be in sorted order by -'details.issue\_date', we should make sure that's the order in which -we're scanning titles. You can observe the difference looking at the -query plans we get for different orderings. If we use the (type, title, -details.issue\_date) index, we get the following plan: +The order in which you include the index keys is also different than what +you might expect. This is once again due to the fact that you're +scanning. Since the results need to be in sorted order by +``'details.issue_date``, you should make sure that's the order in which +MongoDB scans titles. You can observe the difference looking at the +query plans for different orderings. If you use the (``type``, ``title``, +``details.issue_date``) index, you get the following plan: -:: +.. 
code-block:: python + :emphasize-lines: 11,17 {u'allPlans': [...], u'cursor': u'BtreeCursor type_1_title_1_details.issue_date_-1 multi', @@ -401,10 +404,11 @@ details.issue\_date) index, we get the following plan: u'nscannedObjects': 0, u'scanAndOrder': True} -If, however, we use the (type, details.issue\_date, title) index, we get +If, however, you use the (``type``, ``details.issue_date``, ``title``) index, you get the following plan: -:: +.. code-block:: python + :emphasize-lines: 11 {u'allPlans': [...], u'cursor': u'BtreeCursor type_1_details.issue_date_-1_title_1 multi', @@ -424,83 +428,83 @@ the following plan: u'nscannedObjects': 0} The two salient features to note are a) the absence of the -'scanAndOrder: True' in the optmal query and b) the difference in time +``scanAndOrder: True`` in the optmal query and b) the difference in time (208ms for the suboptimal query versus 157ms for the optimal one). The lesson learned here is that if you absolutely have to scan, you should make the elements you're scanning the *least* significant part of the index (even after the sort). Sharding --------- +======== -Though our performance in this system is highly dependent on the indexes -we maintain, sharding can enhance that performance further by allowing -us to keep larger portions of those indexes in RAM. In order to maximize -our read scaling, we would also like to choose a shard key that allows +Though the performance in this system is highly dependent on the indexes, +sharding can enhance that performance further by allowing +MongoDB to keep larger portions of those indexes in RAM. In order to maximize +your read scaling, it's also nice to choose a shard key that allows mongos to route queries to only one or a few shards rather than all the shards globally. -Since most of the queries in our system include type, we should probably -also include that in our shard key. You may note that most of the -queries also included 'details.issue\_date', so there may be a -temptation to include it in our shard key, but this actually wouldn't -help us much since none of the queries were *selective* by date. +Since most of the queries in this system include type, it should probably be +included in the shard key. You may note that most of the +queries also included ``details.issue_date``, so there may be a +temptation to include it in the shard key, but this actually wouldn't +help much since none of the queries were *selective* by date. -Since our schema is so flexible, it's hard to say *a priori* what the +Since this schema is so flexible, it's hard to say *a priori* what the ideal shard key would be, but a reasonable guess would be to include the -'type' field, one or more detail fields that are commonly queried, and -one final random-ish field to ensure we don't get large unsplittable -chunks. For this example, we will assume that 'details.genre' is our -second-most queried field after 'type', and thus our sharding setup +``type`` field, one or more detail fields that are commonly queried, and +one final random-ish field to ensure you don't get large unsplittable +chunks. For this example, assuming that ``details.genre`` is the +second-most queried field after ``type``, the sharding setup would be as follows: -:: +.. code-block:: python >>> db.command('shardcollection', 'product', { ... 
key : { 'type': 1, 'details.genre' : 1, 'sku':1 } }) { "collectionsharded" : "details.genre", "ok" : 1 } -One important note here is that, even if we choose a shard key that -requires all queries to be broadcast to all shards, we still get some +One important note here is that, even if you choose a shard key that +requires all queries to be broadcast to all shards, you still get some benefits from sharding due to a) the larger amount of memory available -to store our indexes and b) the fact that searches will be parallelized +to store indexes and b) the fact that searches will be parallelized across shards, reducing search latency. -Scaling Queries With ``read_preference`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Scaling Queries with ``read_preference`` +---------------------------------------- Although sharding is the best way to scale reads and writes, it's not -always possible to partition our data so that the queries can be routed -by mongos to a subset of shards. In this case, mongos will broadcast the +always possible to partition your data so that the queries can be routed +by mongos to a subset of shards. In this case, ``mongos`` will broadcast the query to all shards and then accumulate the results before returning to -the client. In cases like this, we can still scale our query performance -by allowing mongos to read from the secondary servers in a replica set. -This is achieved via the 'read\_preference' argument, and can be set at +the client. In cases like this, you can still scale query performance +by allowing ``mongos`` to read from the secondary servers in a replica set. +This is achieved via the ``read_preference`` argument, and can be set at the connection or individual query level. For instance, to allow all reads on a connection to go to a secondary, the syntax is: -:: +.. code-block:: python conn = pymongo.Connection(read_preference=pymongo.SECONDARY) or -:: +.. code-block:: python conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY) In the first instance, reads will be distributed among all the secondaries and the primary, whereas in the second reads will only be sent to the secondary. To allow queries to go to a secondary on a -per-query basis, we can also specify a read\_preference: +per-query basis, you can also specify a ``read_preference``: -:: +.. code-block:: python results = db.product.find(..., read_preference=pymongo.SECONDARY) or -:: +.. code-block:: python results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY) From 7c7b00ed144c41972d100ab3821dbfd651613650 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 17:12:45 -0400 Subject: [PATCH 14/20] Fix headings and code blocks for ecommerce-inventory Signed-off-by: Rick Copeland --- .../ecommerce-inventory-management.txt | 67 ++++++++++--------- 1 file changed, 34 insertions(+), 33 deletions(-) diff --git a/source/tutorial/usecase/ecommerce-inventory-management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt index 637911a6a8c..4934580167c 100644 --- a/source/tutorial/usecase/ecommerce-inventory-management.txt +++ b/source/tutorial/usecase/ecommerce-inventory-management.txt @@ -1,15 +1,16 @@ +================================ E-Commerce: Inventory Management ================================ Problem -------- +======= You have a product catalog and you would like to maintain an accurate inventory count as users shop your online store, adding and removing things from their cart. 
-Solution overview ------------------ +Solution Overview +================= In an ideal world, consumers would begin browsing an online store, add items to their shopping cart, and proceed in a timely manner to checkout @@ -28,15 +29,15 @@ transition diagram for a shopping cart is below: :align: center :alt: -Schema design -------------- +Schema Design +============= In our inventory collection, we will maintain the current available inventory of each stock-keeping unit (SKU) as well as a list of 'carted' items that may be released back to available inventory if their shopping cart times out: -:: +.. code-block:: javascript { _id: '00e8da9b', @@ -59,7 +60,7 @@ for a total of 19 unsold items of merchandise. For our shopping cart model, we will maintain a list of (sku, quantity, price) line items: -:: +.. code-block:: javascript { _id: 42, @@ -77,13 +78,13 @@ without needing a second query back to the catalog collection to display the details. Operations ----------- +========== Here, we will describe the various inventory-related operations we will perform during the course of operation. -Add an item to a shopping cart -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Add an Item to a Shopping Cart +------------------------------ Our most basic operation is moving an item off the 'shelf' in to the 'cart'. Our constraint is that we would like to guarantee that we never @@ -91,7 +92,7 @@ move an unavailable item off the shelf into the cart. To solve this problem, we will ensure that inventory is only updated if there is sufficient inventory to satisfy the request: -:: +.. code-block:: python def add_item_to_cart(cart_id, sku, qty, details): now = datetime.utcnow() @@ -136,21 +137,21 @@ these two updates allows us to report back an error to the user if the cart has become inactive or available quantity is insufficient to satisfy the request. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ To support this query efficiently, all we really need is an index on \_id, which MongoDB provides us by default. -Modifying the quantity in the cart -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Modifying the Quantity in the Cart +---------------------------------- Here, we want to allow the user to adjust the quantity of items in their cart. We must make sure that when they adjust the quantity upward, there is sufficient inventory to cover the quantity, as well as updating the particular 'carted' entry for the user's cart. -:: +.. code-block:: python def update_quantity(cart_id, sku, old_qty, new_qty): now = datetime.utcnow() @@ -192,19 +193,19 @@ we need to 'rollback' the cart in a single atomic operation. We will also ensure the cart is active and timestamp it as in the case of adding items to the cart. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ To support this query efficiently, all we really need is an index on \_id, which MongoDB provides us by default. -Checking out -~~~~~~~~~~~~ +Checking Out +------------ During checkout, we want to validate the method of payment and remove the various 'carted' items after the transaction has succeeded. -:: +.. code-block:: python def checkout(cart_id): now = datetime.utcnow() @@ -242,20 +243,20 @@ inventory and set the cart to 'complete'. If payment is unsuccessful, we unlock the cart by setting its status back to 'active' and report a payment error. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ To support this query efficiently, all we really need is an index on \_id, which MongoDB provides us by default. 
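Since no additional index needs to be created here, the only verification you
might add (in a deployment or test script, for instance) is a check that the
automatic ``_id`` index is present. A minimal sketch using ``pymongo``'s
``index_information()``:

.. code-block:: python

    # MongoDB builds the '_id_' index automatically when the collection
    # is created, so no ensure_index() call is needed for these updates.
    indexes = db.cart.index_information()
    assert '_id_' in indexes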
-Returning timed-out items to inventory -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Returning Timed-Out Items to Inventory +-------------------------------------- Periodically, we want to expire carts that have been inactive for a given number of seconds, returning their line items to available inventory: -:: +.. code-block:: python def expire_carts(timeout): now = datetime.utcnow() @@ -284,14 +285,14 @@ Here, we first find all carts to be expired and then, for each cart, return its items to inventory. Once all items have been returned to inventory, the cart is moved to the 'expired' state. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ In this case, we need to be able to efficiently query carts based on their status and last\_modified values, so an index on these would help the performance of our periodic expiration process: -:: +.. code-block:: python >>> db.cart.ensure_index([('status', 1), ('last_modified', 1)]) @@ -302,7 +303,7 @@ define an index on the 'status' field alone, as any queries for status can use the compound index we have defined here. Error Handling -~~~~~~~~~~~~~~ +-------------- There is one failure mode above that we have not handled adequately: the case of an exception that occurs after updating the inventory collection @@ -312,7 +313,7 @@ items in the inventory have not been returned to available inventory. To account for this case, we will run a cleanup method periodically that will find old 'carted' items and check the status of their cart: -:: +.. code-block:: python def cleanup_inventory(timeout): now = datetime.utcnow() @@ -360,7 +361,7 @@ slowing down other updates and queries, so it should be used infrequently. Sharding --------- +======== If we choose to shard this system, the use of an \_id field for most of our updates makes \_id an ideal sharding candidate, for both carts and @@ -383,7 +384,7 @@ minimize server load. The sharding commands we would use to shard the cart and inventory collections, then, would be the following: -:: +.. code-block:: python >>> db.command('shardcollection', 'inventory') { "collectionsharded" : "inventory", "ok" : 1 } From 7aa2903a153ba0b8f5b3e18569d63a1b742050d3 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 17:28:59 -0400 Subject: [PATCH 15/20] Fix style of ecommerce-inventory Signed-off-by: Rick Copeland --- .../ecommerce-inventory-management.txt | 165 +++++++++--------- ...ime-analytics-hierarchical-aggregation.txt | 2 +- 2 files changed, 82 insertions(+), 85 deletions(-) diff --git a/source/tutorial/usecase/ecommerce-inventory-management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt index 4934580167c..935cb707b9c 100644 --- a/source/tutorial/usecase/ecommerce-inventory-management.txt +++ b/source/tutorial/usecase/ecommerce-inventory-management.txt @@ -19,7 +19,7 @@ charged. In the real world, however, customers often add or remove items from their shopping cart, change quantities, abandon the cart, and have problems at checkout time. -In this solution, we will keep the metaphor of the shopping cart, but +This solution keeps the traditional metaphor of the shopping cart, but the shopping cart will *age* . Once a shopping cart has not been active for a certain period of time, all the items in the cart once again become part of available inventory and the cart is cleared. 
The state @@ -32,7 +32,7 @@ transition diagram for a shopping cart is below: Schema Design ============= -In our inventory collection, we will maintain the current available +In your inventory collection, you need to maintain the current available inventory of each stock-keeping unit (SKU) as well as a list of 'carted' items that may be released back to available inventory if their shopping cart times out: @@ -50,15 +50,15 @@ cart times out: ] } -(Note that, while in an actual implementation, we might choose to merge -this schema with the product catalog schema described in "E-Commerce: -Product Catalog", we've simplified the inventory schema here for -brevity.) If we continue the metaphor of the brick-and-mortar store, -then our SKU has 16 items on the shelf, 1 in one cart, and 2 in another -for a total of 19 unsold items of merchandise. +(Note that, while in an actual implementation, you might choose to merge +this schema with the product catalog schema described in +:doc:`E-Commerce: Product Catalog `, the inventory +schema is simplified here for brevity.) Continuing the metaphor of the +brick-and-mortar store, your SKU above has 16 items on the shelf, 1 in one cart, +and 2 in another for a total of 19 unsold items of merchandise. -For our shopping cart model, we will maintain a list of (sku, quantity, -price) line items: +For the shopping cart model, you need to maintain a list of (``sku``, +``quantity``, ``price``) line items: .. code-block:: javascript @@ -72,24 +72,26 @@ price) line items: ] } -Note in the cart model that we have included item details in each line -item. This allows us to display the contents of the cart to the user -without needing a second query back to the catalog collection to display +Note that the cart model includes item details in each line +item. This allows your app to display the contents of the cart to the user +without needing a second query back to the catalog collection to fetch the details. Operations ========== -Here, we will describe the various inventory-related operations we will -perform during the course of operation. +Here, the various inventory-related operations in an ecommerce site are described +as they would occur using our schema above. The examples use the Python +programming language and the ``pymongo`` MongoDB driver, but implementations +would be similar in other languages as well. Add an Item to a Shopping Cart ------------------------------ -Our most basic operation is moving an item off the 'shelf' in to the -'cart'. Our constraint is that we would like to guarantee that we never +The most basic operation is moving an item off the "shelf" in to the +"cart." The constraint is that you would like to guarantee that you never move an unavailable item off the shelf into the cart. To solve this -problem, we will ensure that inventory is only updated if there is +problem, this solution ensures that inventory is only updated if there is sufficient inventory to satisfy the request: .. 
code-block:: python @@ -97,7 +99,6 @@ sufficient inventory to satisfy the request: def add_item_to_cart(cart_id, sku, qty, details): now = datetime.utcnow() - # Make sure the cart is still active and add the line item result = db.cart.update( {'_id': cart_id, 'status': 'active' }, @@ -109,7 +110,6 @@ sufficient inventory to satisfy the request: if not result['updatedExisting']: raise CartInactive() - # Update the inventory result = db.inventory.update( {'_id':sku, 'qty': {'$gte': qty}}, @@ -126,30 +126,30 @@ sufficient inventory to satisfy the request: ) raise InadequateInventory() -Note here in particular that we do not trust that the request is -satisfiable. Our first check makes sure that the cart is still 'active' -(more on inactive carts below) before adding a line item. Our next check +Note here in particular that the system does not trust that the request is +satisfiable. The first check makes sure that the cart is still "active" +(more on inactive carts below) before adding a line item. The next check verifies that sufficient inventory exists to satisfy the request before -decrementing inventory. In the case of inadequate inventory, we -*compensate* for the non-transactional nature of MongoDB by removing our +decrementing inventory. In the case of inadequate inventory, the system +*compensates* for the non-transactional nature of MongoDB by removing the cart update. Using safe=True and checking the result in the case of -these two updates allows us to report back an error to the user if the +these two updates allows you to report back an error to the user if the cart has become inactive or available quantity is insufficient to satisfy the request. Index Support ~~~~~~~~~~~~~ -To support this query efficiently, all we really need is an index on -\_id, which MongoDB provides us by default. +To support this query efficiently, all you really need is an index on +``_id``, which MongoDB provides us by default. Modifying the Quantity in the Cart ---------------------------------- -Here, we want to allow the user to adjust the quantity of items in their -cart. We must make sure that when they adjust the quantity upward, there +Here, you'd like to allow the user to adjust the quantity of items in their +cart. The system must ensure that when they adjust the quantity upward, there is sufficient inventory to cover the quantity, as well as updating the -particular 'carted' entry for the user's cart. +particular ``carted`` entry for the user's cart. .. code-block:: python @@ -157,7 +157,6 @@ particular 'carted' entry for the user's cart. now = datetime.utcnow() delta_qty = new_qty - old_qty - # Make sure the cart is still active and add the line item result = db.cart.update( {'_id': cart_id, 'status': 'active', 'items.sku': sku }, @@ -169,7 +168,6 @@ particular 'carted' entry for the user's cart. if not result['updatedExisting']: raise CartInactive() - # Update the inventory result = db.inventory.update( {'_id':sku, @@ -186,29 +184,29 @@ particular 'carted' entry for the user's cart. }) raise InadequateInventory() -Note in particular here that we are using the positional operator '$' to -update the particular 'carted' entry and line item that matched for our -query. This allows us to update the inventory and keep track of the data -we need to 'rollback' the cart in a single atomic operation. 
We will -also ensure the cart is active and timestamp it as in the case of adding +Note in particular here the use of the positional operator '$' to +update the particular ``carted`` entry and line item that matched for the +query. This allows the system to update the inventory and keep track of the data +necessary need to "rollback" the cart in a single atomic operation. The code above +also ensures the cart is active and timestamp it as in the case of adding items to the cart. Index Support ~~~~~~~~~~~~~ -To support this query efficiently, all we really need is an index on -\_id, which MongoDB provides us by default. +To support this query efficiently, again all we really need is an index on ``_id``. Checking Out ------------ -During checkout, we want to validate the method of payment and remove -the various 'carted' items after the transaction has succeeded. +During checkout, you'd like to validate the method of payment and remove +the various ``carted`` items after the transaction has succeeded. .. code-block:: python def checkout(cart_id): now = datetime.utcnow() + # Make sure the cart is still active and set to 'pending'. Also # fetch the cart details so we can calculate the checkout price cart = db.cart.find_and_modify( @@ -217,9 +215,9 @@ the various 'carted' items after the transaction has succeeded. if cart is None: raise CartInactive() - # Validate payment details; collect payment - if payment_is_successful(cart): + try: + collect_payment(cart) db.cart.update( {'_id': cart_id }, {'$set': { 'status': 'complete' } } ) @@ -227,32 +225,31 @@ the various 'carted' items after the transaction has succeeded. {'carted.cart_id': cart_id}, {'$pull': {'cart_id': cart_id} }, multi=True) - else: + except: db.cart.update( {'_id': cart_id }, {'$set': { 'status': 'active' } } ) - raise PaymentError() - -Here, we first 'lock' the cart by setting its status to 'pending' -(disabling any modifications) and then collect payment data, verifying -at the same time that the cart is still active. We use MongoDB's -'findAndModify' command to atomically update the cart and return its -details so we can capture payment information. If the payment is -successful, we remove the 'carted' items from individual items' -inventory and set the cart to 'complete'. If payment is unsuccessful, we -unlock the cart by setting its status back to 'active' and report a + raise + +Here, the cart is first "locked" by setting its status to "pending" +(disabling any modifications.) Then the system collects payment data, verifying +at the same time that the cart is still active. MongoDB's +``findAndModify`` command is used to atomically update the cart and return its +details so you can capture payment information. If the payment is +successful, you then remove the ``carted`` items from individual items' +inventory and set the cart to "complete." If payment is unsuccessful, you +unlock the cart by setting its status back to "active" and report a payment error. Index Support ~~~~~~~~~~~~~ -To support this query efficiently, all we really need is an index on -\_id, which MongoDB provides us by default. +Once again the ``_id`` default index is enough to make this operation efficient. 
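To illustrate how an application might consume the failure modes described
above, the sketch below shows a hypothetical web-tier handler around the
``checkout()`` function; the ``render()`` helper and template names are
placeholders, not part of the schema:

.. code-block:: python

    def handle_checkout_request(cart_id):
        try:
            checkout(cart_id)
        except CartInactive:
            # The cart expired or was already checked out.
            return render('cart_expired.html')
        except Exception:
            # collect_payment() failed; checkout() has already set the
            # cart status back to 'active', so the user can retry.
            return render('payment_failed.html', cart_id=cart_id)
        return render('receipt.html', cart_id=cart_id)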
Returning Timed-Out Items to Inventory -------------------------------------- -Periodically, we want to expire carts that have been inactive for a +Periodically, you'd like to expire carts that have been inactive for a given number of seconds, returning their line items to available inventory: @@ -261,13 +258,16 @@ inventory: def expire_carts(timeout): now = datetime.utcnow() threshold = now - timedelta(seconds=timeout) + # Lock and find all the expiring carts db.cart.update( {'status': 'active', 'last_modified': { '$lt': threshold } }, {'$set': { 'status': 'expiring' } }, multi=True ) + # Actually expire each cart for cart in db.cart.find({'status': 'expiring'}): + # Return all line items to inventory for item in cart['items']: db.inventory.update( @@ -277,41 +277,42 @@ inventory: }, {'$inc': { 'qty': item['qty'] }, '$pull': { 'carted': { 'cart_id': cart['id'] } } }) + db.cart.update( {'_id': cart['id'] }, {'$set': { status': 'expired' }) -Here, we first find all carts to be expired and then, for each cart, +Here, you first find all carts to be expired and then, for each cart, return its items to inventory. Once all items have been returned to inventory, the cart is moved to the 'expired' state. Index Support ~~~~~~~~~~~~~ -In this case, we need to be able to efficiently query carts based on -their status and last\_modified values, so an index on these would help -the performance of our periodic expiration process: +In this case, you need to be able to efficiently query carts based on +their ``status`` and ``last_modified`` values, so an index on these would help +the performance of the periodic expiration process: .. code-block:: python >>> db.cart.ensure_index([('status', 1), ('last_modified', 1)]) -Note in particular the order in which we defined the index: in order to +Note in particular the order in which the index is defined: in order to efficiently support range queries ('$lt' in this case), the ranged item must be the last item in the index. Also note that there is no need to -define an index on the 'status' field alone, as any queries for status +define an index on the ``status`` field alone, as any queries for status can use the compound index we have defined here. Error Handling -------------- -There is one failure mode above that we have not handled adequately: the +There is one failure mode above that thusfar has not been handled adequately: the case of an exception that occurs after updating the inventory collection but before updating the shopping cart. The result of this failure mode is a shopping cart that may be absent or expired where the 'carted' items in the inventory have not been returned to available inventory. To -account for this case, we will run a cleanup method periodically that -will find old 'carted' items and check the status of their cart: +account for this case, you'll need to run a cleanup method periodically that +will find old ``carted`` items and check the status of their cart: .. 
code-block:: python @@ -319,19 +320,16 @@ will find old 'carted' items and check the status of their cart: now = datetime.utcnow() threshold = now - timedelta(seconds=timeout) - # Find all the expiring carted items for item in db.inventory.find( {'carted.timestamp': {'$lt': threshold }}): - # Find all the carted items that matched carted = dict( (carted_item['cart_id'], carted_item) for carted_item in item['carted'] if carted_item['timestamp'] < threshold) - # Find any carts that are active and refresh the carted items for cart in db.cart.find( { '_id': {'$in': carted.keys() }, @@ -343,7 +341,6 @@ will find old 'carted' items and check the status of their cart: { '$set': {'carted.$.timestamp': now } }) del carted[cart['_id']] - # All the carted items left in the dict need to now be # returned to inventory for cart_id, carted_item in carted.items(): @@ -363,25 +360,25 @@ infrequently. Sharding ======== -If we choose to shard this system, the use of an \_id field for most of -our updates makes \_id an ideal sharding candidate, for both carts and -products. Using \_id as our shard key allows all updates that query on -\_id to be routed to a single mongod process. There are two potential -drawbacks with using \_id as a shard key, however. +If you choose to shard this system, the use of an ``_id`` field for most of +our updates makes ``_id`` an ideal sharding candidate, for both carts and +products. Using ``_id`` as your shard key allows all updates that query on +``_id`` to be routed to a single mongod process. There are two potential +drawbacks with using ``_id`` as a shard key, however. -- If the cart collection's \_id is generated in a generally increasing +- If the cart collection's ``_id`` is generated in a generally increasing order, new carts will all initially be assigned to a single shard. - Cart expiration and inventory adjustment requires several broadcast - queries and updates if \_id is used as a shard key. + queries and updates if ``_id`` is used as a shard key. -It turns out we can mitigate the first pitfall by choosing a random -value (perhaps the sha-1 hash of on ObjectId) as the \_id of each cart +It turns out you can mitigate the first pitfall by choosing a random +value (perhaps the sha-1 hash of an ``ObjectId``) as the ``_id`` of each cart as it is created. The second objection is valid, but relatively -unimportant, as our expiration process is an infrequent one and in fact -can be slowed down by the judicious use of sleep() calls in order to +unimportant, as the expiration function runs relatively infrequently and can be +slowed down by the judicious use of ``sleep()`` calls in order to minimize server load. -The sharding commands we would use to shard the cart and inventory +The sharding commands you'd use to shard the cart and inventory collections, then, would be the following: .. code-block:: python @@ -392,4 +389,4 @@ collections, then, would be the following: { "collectionsharded" : "cart", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. +default to using ``_id`` as a shard key. 
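To make the randomized ``_id`` suggestion above concrete, the following sketch
hashes a fresh ``ObjectId`` when a cart is created; the ``new_cart()`` helper
is hypothetical and shown only as one way to generate such keys:

.. code-block:: python

    import hashlib
    from datetime import datetime
    from bson import ObjectId

    def new_cart():
        # Hashing an ObjectId spreads newly created carts evenly across
        # chunks instead of piling them onto the shard that currently
        # holds the highest _id range.
        cart_id = hashlib.sha1(str(ObjectId()).encode('utf-8')).hexdigest()
        db.cart.insert({
            '_id': cart_id,
            'status': 'active',
            'last_modified': datetime.utcnow(),
            'items': [] })
        return cart_id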
diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt index 799fe365ee6..7f149827e48 100644 --- a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -144,7 +144,7 @@ to incrementally update the various levels of the hierarchy. Operations ========== -In the discussion below, it is assume that all the events have been +In the discussion below, it is assumed that all the events have been inserted and appropriately timestamped, so your main operations are aggregating from events into the smallest aggregate (the hourly totals) and aggregating from smaller granularity to larger granularity. In each From 41c3ed5e5e4a57f4bec6caf6b725072b9f70311f Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 17:56:09 -0400 Subject: [PATCH 16/20] ecommerce-inventory-management and ecommerce-category-hierarchy style updates Signed-off-by: Rick Copeland --- .../usecase/ecommerce-category-hierarchy.txt | 171 ++++++++++-------- .../ecommerce-inventory-management.txt | 2 +- 2 files changed, 92 insertions(+), 81 deletions(-) diff --git a/source/tutorial/usecase/ecommerce-category-hierarchy.txt b/source/tutorial/usecase/ecommerce-category-hierarchy.txt index 2766ee18e44..9b03bae98f8 100644 --- a/source/tutorial/usecase/ecommerce-category-hierarchy.txt +++ b/source/tutorial/usecase/ecommerce-category-hierarchy.txt @@ -1,38 +1,41 @@ +============================== E-Commerce: Category Hierarchy ============================== Problem -------- +======= You have a product hierarchy for an e-commerce site that you want to query frequently and update somewhat frequently. -Solution overview ------------------ +Solution Overview +================= -We will keep each category in its own document, along with a list of its -ancestors. The category hierarchy we will use in this solution will be +This solution keeps each category in its own document, along with a list of its +ancestors. The category hierarchy used in this example will be based on different categories of music: .. figure:: img/ecommerce-category1.png :align: center - :alt: + :alt: Initial category hierarchy -Since categories change relatively infrequently, we will focus mostly in -this solution on the operations needed to keep the hierarchy up-to-date -and less on the performance aspects of updating the hierarchy. + Initial category hierarchy -Schema design -------------- +Since categories change relatively infrequently, the focus here will be on the +operations needed to keep the hierarchy up-to-date and less on the performance +aspects of updating the hierarchy. -Each category in our hierarchy will be represented by a document. That -document will be identified by an ObjectId for internal +Schema Design +============= + +Each category in the hierarchy will be represented by a document. That +document will be identified by an ``ObjectId`` for internal cross-referencing as well as a human-readable name and a url-friendly -'slug' property. Additionally, we will store an ancestors list along +``slug`` property. Additionally, the schema stores an ancestors list along with each document to facilitate displaying a category along with all its ancestors in a single query. -:: +.. code-block:: javascript { "_id" : ObjectId("4f5ec858eb03303a11000002"), "name" : "Modal Jazz", @@ -48,55 +51,59 @@ its ancestors in a single query. 
} Operations ----------- +========== -Here, we will describe the various queries and updates we will use -during the lifecycle of our hierarchy. +Here, the various category manipulations you may need in an ecommerce site are +described as they would occur using the schema above. The examples use the Python +programming language and the ``pymongo`` MongoDB driver, but implementations +would be similar in other languages as well. -Read and display a category -~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Read and Display a Category +--------------------------- The simplest operation is reading and displaying a hierarchy. In this -case, we might want to display a category along with a list of 'bread -crumbs' leading back up the hierarchy. In an E-commerce site, we will -most likely have the slug of the category available for our query. +case, you might want to display a category along with a list of "bread +crumbs" leading back up the hierarchy. In an E-commerce site, you'll +most likely have the slug of the category available for your query, as it can be +parsed from the URL. -:: +.. code-block:: python category = db.categories.find( {'slug':slug}, {'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 }) -Here, we use the slug to retrieve the category and retrieve only those -fields we wish to display. +Here, the slug is used to retrieve the category, fetching only those +fields needed for display. Index Support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ -In order to support this common operation efficiently, we need an index -on the 'slug' field. Since slug is also intended to be unique, we will -add that constraint to our index as well: +In order to support this common operation efficiently, you'll need an index +on the 'slug' field. Since slug is also intended to be unique, the index over it +should be unique as well: -:: +.. code-block:: python db.categories.ensure_index('slug', unique=True) -Add a category to the hierarchy -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Add a Category to the Hierarchy +------------------------------- -Adding a category to a hierarchy is relatively simple. Suppose we wish +Adding a category to a hierarchy is relatively simple. Suppose you wish to add a new category 'Swing' as a child of 'Ragtime': .. figure:: img/ecommerce-category2.png :align: center - :alt: + :alt: Adding a category + + Adding a category In this case, the initial insert is simple enough, but after this -insert, we are still missing the ancestors array in the 'Swing' -category. To define this, we will add a helper function to build our -ancestor list: +insert, the "Swing" category is still missing its ancestors array. To define +this, you'll need a helper function to build the ancestor list: -:: +.. code-block:: python def build_ancestors(_id, parent_id): parent = db.categories.find_one( @@ -108,47 +115,49 @@ ancestor list: {'_id': _id}, {'$set': { 'ancestors': ancestors } }) -Note that we only need to travel one level in our hierarchy to get the -ragtime's ancestors and build swing's entire ancestor list. Now we can +Note that you only need to travel one level in the hierarchy to get the +ragtime's ancestors and build swing's entire ancestor list. Now you can actually perform the insert and rebuild the ancestor list: -:: +.. 
code-block:: python doc = dict(name='Swing', slug='swing', parent=ragtime_id) swing_id = db.categories.insert(doc) build_ancestors(swing_id, ragtime_id) Index Support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ -Since these queries and updates all selected based on \_id, we only need -the default MongoDB-supplied index on \_id to support this operation +Since these queries and updates all selected based on ``_id``, you only need +the default MongoDB-supplied index on ``_id`` to support this operation efficiently. -Change the ancestry of a category -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Change the Ancestry of a Category +--------------------------------- -Our goal here is to reorganize the hierarchy by moving 'bop' under +Suppose you wish to reorganize the hierarchy by moving 'bop' under 'swing': .. figure:: img/ecommerce-category3.png :align: center - :alt: + :alt: Change the parent of a category + + Change the parent of a category The initial update is straightforward: -:: +.. code-block:: python db.categories.update( {'_id':bop_id}, {'$set': { 'parent': swing_id } } ) -Now, we need to update the ancestor list for bop and all its -descendants. In this case, we can't guarantee that the ancestor list of -the parent category is always correct, however (since we may be -processing the categories out-of-order), so we will need a new +Now, you still need to update the ancestor list for bop and all its +descendants. In this case, you can't guarantee that the ancestor list of +the parent category is always correct, since MongoDB may +process the categories out-of-order. To handle this, you'll need a new ancestor-building function: -:: +.. code-block:: python def build_ancestors_full(_id, parent_id): ancestors = [] @@ -162,10 +171,10 @@ ancestor-building function: {'_id': _id}, {'$set': { 'ancestors': ancestors } }) -Now, at the expense of a few more queries up the hierarchy, we can +Now, at the expense of a few more queries up the hierarchy, you can easily reconstruct all the descendants of 'bop': -:: +.. code-block:: python for cat in db.categories.find( {'ancestors._id': bop_id}, @@ -173,66 +182,68 @@ easily reconstruct all the descendants of 'bop': build_ancestors_full(cat['_id'], cat['parent_id']) Index Support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ -In this case, an index on 'ancestors.\_id' would be helpful in +In this case, an index on ``ancestors._id`` would be helpful in determining which descendants need to be updated: -:: +.. code-block:: python db.categories.ensure_index('ancestors._id') -Renaming a category -~~~~~~~~~~~~~~~~~~~ +Rename a Category +----------------- Renaming a category would normally be an extremely quick operation, but -in this case due to our denormalization, we also need to update the -descendants. Here, we will rename 'Bop' to 'BeBop': +in this case due to denormalization, you also need to update the +descendants. Suppose you need to rename "Bop" to "BeBop:" .. figure:: img/ecommerce-category4.png :align: center - :alt: + :alt: Rename a category + + Rename a category -First, we need to update the category name itself: +First, you need to update the category name itself: -:: +.. code-block:: python db.categories.update( {'_id':bop_id}, {'$set': { 'name': 'BeBop' } } ) -Next, we need to update each descendant's ancestors list: +Next, you need to update each descendant's ancestors list: -:: +.. 
code-block:: python db.categories.update( {'ancestors._id': bop_id}, {'$set': { 'ancestors.$.name': 'BeBop' } }, multi=True) -Here, we use the positional operation '$' to match the exact 'ancestor' -entry that matches our query, as well as the 'multi' option on our +Here, you can use the positional operation ``$`` to match the exact "ancestor" +entry that matches the query, as well as the ``multi`` option on the update to ensure the rename operation occurs in a single server round-trip. Index Support -^^^^^^^^^^^^^ +~~~~~~~~~~~~~ -In this case, the index we have already defined on 'ancestors.\_id' is +In this case, the index you have already defined on ``ancestors._id`` is sufficient to ensure good performance. Sharding --------- +======== -In this solution, it is unlikely that we would want to shard the -collection since it's likely to be quite small. If we *should* decide to -shard, the use of an \_id field for most of our updates makes \_id an -ideal sharding candidate. The sharding commands we would use to shard +In this solution, it is unlikely that you would want to shard the +collection since it's likely to be quite small. If you *should* decide to +shard, the use of an ``_id`` field for most updates makes it an +ideal sharding candidate. The sharding commands you'd use to shard the category collection would then be the following: -:: +.. code-block:: python >>> db.command('shardcollection', 'categories') { "collectionsharded" : "categories", "ok" : 1 } Note that there is no need to specify the shard key, as MongoDB will -default to using \_id as a shard key. +default to using ``_id`` as a shard key. diff --git a/source/tutorial/usecase/ecommerce-inventory-management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt index 935cb707b9c..0c1b065a27e 100644 --- a/source/tutorial/usecase/ecommerce-inventory-management.txt +++ b/source/tutorial/usecase/ecommerce-inventory-management.txt @@ -81,7 +81,7 @@ Operations ========== Here, the various inventory-related operations in an ecommerce site are described -as they would occur using our schema above. The examples use the Python +as they would occur using the schema above. The examples use the Python programming language and the ``pymongo`` MongoDB driver, but implementations would be similar in other languages as well. From 9bbc0e27d751f098773099f1a1bdf8d8fc82e866 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 20:45:03 -0400 Subject: [PATCH 17/20] fix headers and code blocks for cms-metadata Signed-off-by: Rick Copeland --- .../cms-metadata-and-asset-management.txt | 100 +++++++++--------- 1 file changed, 50 insertions(+), 50 deletions(-) diff --git a/source/tutorial/usecase/cms-metadata-and-asset-management.txt b/source/tutorial/usecase/cms-metadata-and-asset-management.txt index 3ab74be575f..295195e626d 100644 --- a/source/tutorial/usecase/cms-metadata-and-asset-management.txt +++ b/source/tutorial/usecase/cms-metadata-and-asset-management.txt @@ -35,7 +35,7 @@ Photo description, author, and date along with the actual photo binary data. -Schema design +Schema Design ============= Your node collection will contain documents of various formats, but they @@ -68,7 +68,7 @@ For the basic page above, the detail field might simply contain the text of the page. In the case of a blog entry, the document might resemble the following instead: -:: +.. 
code-block:: javascript { … @@ -95,7 +95,7 @@ our case, we will call the two collections 'cms.assets.files' and collection to store the normal GridFS metadata as well as our node metadata: -:: +.. code-block:: javascript { _id: ObjectId(…), @@ -123,20 +123,20 @@ Here, we have embedded the schema for our 'normal' nodes so we can share node-manipulation code among all types of nodes. Operations ----------- +========== Here, we will describe common queries and updates used in our CMS, paying particular attention to 'tweaks' we need to make for our various node types. -Create and edit content nodes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Create and Edit Content Nodes +----------------------------- The content producers using our CMS will be creating and editing content most of the time. Most content-creation activities are relatively straightforward: -:: +.. code-block:: python db.cms.nodes.insert({ 'nonce': ObjectId(), @@ -159,7 +159,7 @@ multiple editors. In order to support this, we use the special 'nonce' value to detect when another editor may have modified the document and allow the application to resolve any conflicts: -:: +.. code-block:: python def update_text(section, slug, nonce, text): result = db.cms.nodes.update( @@ -174,7 +174,7 @@ allow the application to resolve any conflicts: We might also want to perform metadata edits to the item such as adding tags: -:: +.. code-block:: python db.cms.nodes.update( { 'metadata.section': section, 'metadata.slug': slug }, @@ -183,14 +183,14 @@ tags: In this case, we don't actually need to supply the nonce (nor update it) since we are using the atomic $addToSet modifier in MongoDB. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Our updates in this case are based on equality queries containing the (section, slug, and nonce) values. To support these queries, we might use the following index: -:: +.. code-block:: python >>> db.cms.nodes.ensure_index([ ... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ]) @@ -199,7 +199,7 @@ Also note, however, that we would like to ensure that two editors don't create two documents with the same section and slug. To support this, we will use a second index with a unique constraint: -:: +.. code-block:: python >>> db.cms.nodes.ensure_index([ ... ('metadata.section', 1), ('metadata.slug', 1)], unique=True) @@ -209,13 +209,13 @@ going to be unique, we don't actually get much benefit from the first index and can use only the second one to satisfy our update queries as well. -Upload a photo -~~~~~~~~~~~~~~ +Upload a Photo +-------------- Uploading photos to our site shares some things in common with node update, but it also has some extra nuances: -:: +.. code-block:: python def upload_new_photo( input_file, section, slug, title, author, tags, details): @@ -247,7 +247,7 @@ record. This lets us detect when a file upload may be stalled, which is helpful when working with multiple editors. In this case, we will assume that the last update wins: -:: +.. code-block:: python def update_photo_content(input_file, section, slug): fs = GridFS(db, 'cms.assets') @@ -284,15 +284,15 @@ that the last update wins: We can, of course, perform metadata edits to the item such as adding tags without the extra complexity: -:: +.. 
code-block:: python db.cms.assets.files.update( { 'metadata.section': section, 'metadata.slug': slug }, { '$addToSet': { 'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } }) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Our updates here are also based on equality queries containing the (section, slug) values, so we can use the same types of indexes as we @@ -301,74 +301,74 @@ unique constraint on (section, slug) to ensure that one of the calls to GridFS.new\_file() will fail multiple editors try to create or update the file simultaneously. -:: +.. code-block:: python >>> db.cms.assets.files.ensure_index([ ... ('metadata.section', 1), ('metadata.slug', 1)], unique=True) -Locate and render a node -~~~~~~~~~~~~~~~~~~~~~~~~ +Locate and Render a Node +------------------------ We want to be able to locate a node based on its section and slug, which we assume have been extracted from the page definition and URL by some other technology. -:: +.. code-block:: python node = db.nodes.find_one( {'metadata.section': section, 'metadata.slug': slug }) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ The same indexes we have defined above on (section, slug) would efficiently render this node. -Locate and render a file -~~~~~~~~~~~~~~~~~~~~~~~~ +Locate and Render a File +------------------------ We want to be able to locate an image based on its section and slug, which we assume have been extracted from the page definition and URL just as with other nodes. -:: +.. code-block:: python fs = GridFS(db, 'cms.assets') with fs.get_version( **{'metadata.section': section, 'metadata.slug': slug }) as img_fp: # do something with our image file -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ The same indexes we have defined above on (section, slug) would also efficiently render this image. -Search for nodes by tag -~~~~~~~~~~~~~~~~~~~~~~~ +Search for Nodes by Tag +----------------------- Here we would like to retrieve a list of nodes based on their tag: -:: +.. code-block:: python nodes = db.nodes.find({'metadata.tags': tag }) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ To support searching efficiently, we should define indexes on any fields we intend on using in our query: -:: +.. code-block:: python >>> db.cms.nodes.ensure_index('tags') -Search for images by tag -~~~~~~~~~~~~~~~~~~~~~~~~ +Search for Images by Tag +------------------------ Here we would like to retrieve a list of images based on their tag: -:: +.. code-block:: python image_file_objects = db.cms.assets.files.find({'metadata.tags': tag }) fs = GridFS(db, 'cms.assets') @@ -377,23 +377,23 @@ Here we would like to retrieve a list of images based on their tag: image_file = fs.get(image_file_object['_id']) # do something with the image file -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ As above, in order to support searching efficiently, we should define indexes on any fields we intend on using in our query: -:: +.. code-block:: python >>> db.cms.assets.files.ensure_index('tags') -Generate a feed of recently published blog articles -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Generate a Feed of Recently Published Blog Articles +--------------------------------------------------- Here, we wish to generate an .rss or .atom feed for our recently published blog articles, sorted by date descending: -:: +.. 
code-block:: python articles = db.nodes.find({ 'metadata.section': 'my-blog' @@ -406,13 +406,13 @@ where we are sorting or using range queries, as here, the field on which we're sorting or performing a range query must be the final field in our index: -:: +.. code-block:: python >>> db.cms.nodes.ensure_index( ... [ ('metadata.section', 1), ('metadata.published', -1) ]) Sharding --------- +======== In a CMS system, our read performance is generally much more important than our write performance. As such, we will optimize the sharding setup @@ -424,7 +424,7 @@ defined in order to get the same semantics as we have described. Given these constraints, sharding the nodes and assets on (section, slug) seems to be a reasonable approach: -:: +.. code-block:: python >>> db.command('shardcollection', 'cms.nodes', { ... key : { 'metadata.section': 1, 'metadata.slug' : 1 } }) @@ -437,7 +437,7 @@ If we wish to shard our 'cms.assets.chunks' collection, we need to shard on the \_id field (none of our metadata is available on the chunks collection in gridfs): -:: +.. code-block:: python >>> db.command('shardcollection', 'cms.assets.chunks' { "collectionsharded" : "cms.assets.chunks", "ok" : 1 } From 60171275eced81f95236d8e5670cfc480b1833b4 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 21:01:59 -0400 Subject: [PATCH 18/20] fix style for cms-metadata --- .../cms-metadata-and-asset-management.txt | 171 +++++++++--------- 1 file changed, 88 insertions(+), 83 deletions(-) diff --git a/source/tutorial/usecase/cms-metadata-and-asset-management.txt b/source/tutorial/usecase/cms-metadata-and-asset-management.txt index 295195e626d..e3a4886c3e0 100644 --- a/source/tutorial/usecase/cms-metadata-and-asset-management.txt +++ b/source/tutorial/usecase/cms-metadata-and-asset-management.txt @@ -16,10 +16,11 @@ open source CMS written in PHP on relational databases that is available at `http://www.drupal.org `_. In this case, you will take advantage of MongoDB's dynamically typed collections to *polymorphically* store all your content nodes in the same collection. -Your navigational information will be stored in its own collection since -it has relatively little in common with our content nodes. +Navigational information will be stored in its own collection since +it has relatively little in common with the content nodes, and is not covered in +this use case. -The main node types with which this use case is concerned are: +The main node types which are covered here are: Basic page Basic pages are useful for displaying @@ -38,14 +39,15 @@ Photo Schema Design ============= -Your node collection will contain documents of various formats, but they +The node collection contains documents of various formats, but they will all share a similar structure, with each document including an -\_id, type, section, slug, title, creation date, author, and tags. The -`section` property is used to identify groupings of items (grouped to a -particular blog or photo gallery, for instance). The `slug` property is +``_id``, ``type``, ``section``, ``slug``, ``title``, ``created`` date, +``author``, and ``tags``. The +``section`` property is used to identify groupings of items (grouped to a +particular blog or photo gallery, for instance). The ``slug`` property is a url-friendly representation of the node that is unique within its section, and is used for mapping URLs to nodes. Each document also -contains a `detail` field which will vary per document type: +contains a ``detail`` field which will vary per document type: .. 
code-block:: javascript @@ -85,14 +87,14 @@ the following instead: } } -Photos present something of a special case. Since we will need to store -potentially very large photos, we would like separate our binary storage -of photo data from the metadata storage. GridFS provides just such a -mechanism, splitting a 'filesystem' of potentially very large files into -two collections, the 'files' collection and the 'chunks' collection. In -our case, we will call the two collections 'cms.assets.files' and -'cms.assets.chunks'. We will use documents in the 'assets.files' -collection to store the normal GridFS metadata as well as our node +Photos present something of a special case. Since you'll need to store +potentially very large photos, it's nice to be able to separate the binary +storage of photo data from the metadata storage. GridFS provides just such a +mechanism, splitting a "filesystem" of potentially very large "files" into +two collections, the ``files`` collection and the ``chunks`` collection. In +this case, the two collections will be called ``cms.assets.files`` and +``cms.assets.chunks``. Documents in the ``cms.assets.files`` +collection will be used to store the normal GridFS metadata as well as CMS node metadata: .. code-block:: javascript @@ -119,20 +121,22 @@ metadata: } } -Here, we have embedded the schema for our 'normal' nodes so we can share -node-manipulation code among all types of nodes. +NOte that the "normal" node schema is embedded here in the photo schema, allowing +the use of the same code to manipulate nodes of all types. Operations ========== -Here, we will describe common queries and updates used in our CMS, -paying particular attention to 'tweaks' we need to make for our various -node types. +Here, some common queries and updates that you might need for your CMS are +described, paying particular attention to any "tweaks" necessary for the various +node types. The examples use the Python +programming language and the ``pymongo`` MongoDB driver, but implementations +would be similar in other languages as well. Create and Edit Content Nodes ----------------------------- -The content producers using our CMS will be creating and editing content +The content producers using your CMS will be creating and editing content most of the time. Most content-creation activities are relatively straightforward: @@ -154,8 +158,8 @@ straightforward: } }) -Once the node is in the database, we have a potential problem with -multiple editors. In order to support this, we use the special 'nonce' +Once the node is in the database, there is a potential problem with +multiple editors. In order to support this, the schema uses the special ``nonce`` value to detect when another editor may have modified the document and allow the application to resolve any conflicts: @@ -171,7 +175,7 @@ allow the application to resolve any conflicts: if not result['updatedExisting']: raise ConflictError() -We might also want to perform metadata edits to the item such as adding +You might also want to perform metadata edits to the item such as adding tags: .. code-block:: python @@ -180,39 +184,39 @@ tags: { 'metadata.section': section, 'metadata.slug': slug }, { '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } }) -In this case, we don't actually need to supply the nonce (nor update it) -since we are using the atomic $addToSet modifier in MongoDB. +In this case, you don't actually need to supply the nonce (nor update it) +since you're using the atomic ``$addToSet`` modifier in MongoDB. 
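For the text edits that *do* rely on the nonce, a caller might read the node,
hold on to its current ``nonce``, and fall back to a re-read when
``update_text()`` raises ``ConflictError``. The sketch below is illustrative
only; ``section``, ``slug``, and ``new_text`` are assumed to be in scope, and
the "re-read on conflict" policy is an assumption rather than part of the
original design:

.. code-block:: python

    node = db.cms.nodes.find_one(
        {'metadata.section': section, 'metadata.slug': slug})
    try:
        update_text(section, slug, node['nonce'], new_text)
    except ConflictError:
        # Another editor updated the node first; re-read it so the user
        # can review the newer text before resubmitting their edit.
        node = db.cms.nodes.find_one(
            {'metadata.section': section, 'metadata.slug': slug})
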
Index Support ~~~~~~~~~~~~~ -Our updates in this case are based on equality queries containing the -(section, slug, and nonce) values. To support these queries, we might -use the following index: +Updates in this case are based on equality queries containing the +(``section``, ``slug``, and ``nonce``) values. To support these queries, you +*might* use the following index: .. code-block:: python >>> db.cms.nodes.ensure_index([ ... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ]) -Also note, however, that we would like to ensure that two editors don't -create two documents with the same section and slug. To support this, we -will use a second index with a unique constraint: +Also note, however, that you'd like to ensure that two editors don't +create two documents with the same section and slug. To support this, you need a +second index with a unique constraint: .. code-block:: python >>> db.cms.nodes.ensure_index([ ... ('metadata.section', 1), ('metadata.slug', 1)], unique=True) -In fact, since we expect that most of the time (section, slug, nonce) is -going to be unique, we don't actually get much benefit from the first -index and can use only the second one to satisfy our update queries as +In fact, since the expectation is that most of the time (``section``, ``slug``, +``nonce``) is going to be unique, you don't actually get much benefit from the +first index and can use only the second one to satisfy the update queries as well. Upload a Photo -------------- -Uploading photos to our site shares some things in common with node +Uploading photos shares some things in common with node update, but it also has some extra nuances: .. code-block:: python @@ -241,18 +245,17 @@ update, but it also has some extra nuances: {'_id': upload_file._id}, {'$set': { 'locked': None } } ) -Here, since uploading the photo is a non-atomic operation, we have -locked the file during upload by writing the current datetime into the -record. This lets us detect when a file upload may be stalled, which is -helpful when working with multiple editors. In this case, we will assume -that the last update wins: +Here, since uploading the photo is a non-atomic operation, you need to +"lock" the file during upload by writing the current datetime into the +record. This lets the application detect when a file upload may be stalled, which +is helpful when working with multiple editors. This solution assumes that, for +photo upload, the last update wins: .. code-block:: python def update_photo_content(input_file, section, slug): fs = GridFS(db, 'cms.assets') - # Delete the old version if it's unlocked or was locked more than 5 # minutes ago file_obj = db.cms.assets.find_one( @@ -268,7 +271,6 @@ that the last update wins: if file_obj is None: raise FileDoesNotExist() fs.delete(file_obj['_id']) - # update content, keep metadata unchanged file_obj['locked'] = datetime.utcnow() with fs.new_file(**file_obj): @@ -281,7 +283,7 @@ that the last update wins: {'_id': upload_file._id}, {'$set': { 'locked': None } } ) -We can, of course, perform metadata edits to the item such as adding +You can, of course, perform metadata edits to the item such as adding tags without the extra complexity: .. code-block:: python @@ -294,11 +296,11 @@ tags without the extra complexity: Index Support ~~~~~~~~~~~~~ -Our updates here are also based on equality queries containing the -(section, slug) values, so we can use the same types of indexes as we -used in the 'regular' node case. 
Note in particular that we need a -unique constraint on (section, slug) to ensure that one of the calls to -GridFS.new\_file() will fail multiple editors try to create or update +Updates here are also based on equality queries containing the +(``section``, ``slug``) values, so you can use the same types of indexes as were +used in the "regular" node case. Note in particular that you need a +unique constraint on (``section``, ``slug``) to ensure that one of the calls to +``GridFS.new_file()`` will fail if multiple editors try to create or update the file simultaneously. .. code-block:: python @@ -309,8 +311,8 @@ the file simultaneously. Locate and Render a Node ------------------------ -We want to be able to locate a node based on its section and slug, which -we assume have been extracted from the page definition and URL by some +You need to be able to locate a node based on its section and slug, which +have been extracted from the page definition and URL by some other technology. .. code-block:: python @@ -321,14 +323,14 @@ other technology. Index Support ~~~~~~~~~~~~~ -The same indexes we have defined above on (section, slug) would +The same indexes defined above on (``section``, ``slug``) would efficiently render this node. -Locate and Render a File ------------------------- +Locate and Render a Photo +------------------------- -We want to be able to locate an image based on its section and slug, -which we assume have been extracted from the page definition and URL +You want to locate an image based on its section and slug, +which have been extracted from the page definition and URL just as with other nodes. .. code-block:: python @@ -336,18 +338,18 @@ just as with other nodes. fs = GridFS(db, 'cms.assets') with fs.get_version( **{'metadata.section': section, 'metadata.slug': slug }) as img_fp: - # do something with our image file + # do something with the image file Index Support ~~~~~~~~~~~~~ -The same indexes we have defined above on (section, slug) would also +The same indexes defined above on (``section``, ``slug``) would also efficiently render this image. Search for Nodes by Tag ----------------------- -Here we would like to retrieve a list of nodes based on their tag: +You'd like to retrieve a list of nodes based on their tags: .. code-block:: python @@ -356,8 +358,8 @@ Here we would like to retrieve a list of nodes based on their tag: Index Support ~~~~~~~~~~~~~ -To support searching efficiently, we should define indexes on any fields -we intend on using in our query: +To support searching efficiently, you should define indexes on any fields +you intend on using in your query: .. code-block:: python @@ -366,7 +368,7 @@ we intend on using in our query: Search for Images by Tag ------------------------ -Here we would like to retrieve a list of images based on their tag: +Here, you'd like to retrieve a list of images based on their tags: .. code-block:: python @@ -380,8 +382,8 @@ Here we would like to retrieve a list of images based on their tag: Index Support ~~~~~~~~~~~~~ -As above, in order to support searching efficiently, we should define -indexes on any fields we intend on using in our query: +As above, in order to support searching efficiently, you should define +indexes on any fields you expect to use in the query: .. 
code-block:: python @@ -390,7 +392,7 @@ indexes on any fields we intend on using in our query: Generate a Feed of Recently Published Blog Articles --------------------------------------------------- -Here, we wish to generate an .rss or .atom feed for our recently +Here, you need to generate an .rss or .atom feed for your recently published blog articles, sorted by date descending: .. code-block:: python @@ -400,10 +402,13 @@ published blog articles, sorted by date descending: 'metadata.published': { '$lt': datetime.utcnow() } }) articles = articles.sort({'metadata.published': -1}) -In order to support this operation, we will create an index on (section, -published) so the items are 'in order' for our query. Note that in cases -where we are sorting or using range queries, as here, the field on which -we're sorting or performing a range query must be the final field in our +Index Support +~~~~~~~~~~~~~ + +In order to support this operation, you'll need to create an index on (``section``, +``published``) so the items are 'in order' for the query. Note that in cases +where you're sorting or using range queries, as here, the field on which +you're sorting or performing a range query must be the final field in the index: .. code-block:: python @@ -414,15 +419,15 @@ index: Sharding ======== -In a CMS system, our read performance is generally much more important -than our write performance. As such, we will optimize the sharding setup -for read performance. In order to achieve the best read performance, we +In a CMS system, read performance is generally much more important +than write performance. As such, you'll want to optimize the sharding setup +for read performance. In order to achieve the best read performance, you need to ensure that queries are *routeable* by the mongos process. A second consideration when sharding is that unique indexes do not span -shards. As such, our shard key must include the unique indexes we have -defined in order to get the same semantics as we have described. Given -these constraints, sharding the nodes and assets on (section, slug) -seems to be a reasonable approach: +shards. As such, the shard key must include the unique indexes in order to get +the same semantics as described above. Given +these constraints, sharding the nodes and assets on (``section``, ``slug``) +is a reasonable approach: .. code-block:: python @@ -433,16 +438,16 @@ seems to be a reasonable approach: ... key : { 'metadata.section': 1, 'metadata.slug' : 1 } }) { "collectionsharded" : "cms.assets.files", "ok" : 1 } -If we wish to shard our 'cms.assets.chunks' collection, we need to shard -on the \_id field (none of our metadata is available on the chunks -collection in gridfs): +If you wish to shard the ``cms.assets.chunks`` collection, you need to shard +on the ``_id`` field (none of the node metadata is available on the +``cms.assets.chunks`` collection in GridFS:) .. code-block:: python >>> db.command('shardcollection', 'cms.assets.chunks' { "collectionsharded" : "cms.assets.chunks", "ok" : 1 } -This actually still maintains our query-routability constraint, since -all reads from gridfs must first look up the document in 'files' and +This actually still maintains the query-routability constraint, since +all reads from GridFS must first look up the document in ``cms.assets.files`` and then look up the chunks separately (though the GridFS API sometimes -hides this detail from us.) +hides this detail.) 
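To make that two-step lookup concrete, a read such as the following first
queries ``cms.assets.files`` (which the ``(section, slug)`` shard key can
route) and only then pulls the matching documents from ``cms.assets.chunks``
by the file's ``_id``. This is just a sketch; ``section`` and ``slug`` are
assumed to be in scope, as in the earlier examples:

.. code-block:: python

    from gridfs import GridFS

    fs = GridFS(db, 'cms.assets')
    # Step 1: the files lookup, routeable by (section, slug)
    img = fs.get_version(
        **{'metadata.section': section, 'metadata.slug': slug})
    # Step 2: reading the handle fetches the chunks for that file's _id
    data = img.read()
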
From 97e17a5d067d8afe73c58ff7d636b2594b43e9fb Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 21:10:04 -0400 Subject: [PATCH 19/20] fix headers and code blocks for cms-comments Signed-off-by: Rick Copeland --- .../tutorial/usecase/cms-storing-comments.txt | 148 +++++++++--------- 1 file changed, 76 insertions(+), 72 deletions(-) diff --git a/source/tutorial/usecase/cms-storing-comments.txt b/source/tutorial/usecase/cms-storing-comments.txt index 0062111cb4c..a028cbca89d 100644 --- a/source/tutorial/usecase/cms-storing-comments.txt +++ b/source/tutorial/usecase/cms-storing-comments.txt @@ -1,25 +1,28 @@ +===================== CMS: Storing Comments ===================== Problem -------- +======= In your content management system (CMS) you would like to store user-generated comments on the various types of content you generate. -Solution overview ------------------ +Solution Overview +================= Rather than describing the One True Way to implement comments in this solution, we will explore different options and the trade-offs with each. The three major designs we will discuss here are: -- **One document per comment** - This provides the greatest degree of +One document per comment + This provides the greatest degree of flexibility, as it is relatively straightforward to display the comments as either threaded or chronological. There are also no restrictions on the number of comments that can participate in a discussion. -- **All comments embedded** - In this design, all the comments are +All comments embedded + In this design, all the comments are embedded in their parent document, whether that be a blog article, news story, or forum topic. This can be the highest performance design, but is also the most restrictive, as the display format of @@ -27,7 +30,8 @@ each. The three major designs we will discuss here are: potential problems with extremely active discussions where the total data (topic data + comments) exceeds the 16MB limit of MongoDB documents. -- **Hybrid design** - Here, we store comments separately from their +Hybrid design + Here, we store comments separately from their parent topic, but we aggregate comments together into a few documents, each containing many comments. @@ -37,12 +41,12 @@ parent comment). We will explore how this threaded comment support decision affects our schema design and operations as well. Schema design: One Document Per Comment ---------------------------------------- +======================================= A comment in the one document per comment format might have a structure similar to the following: -:: +.. code-block:: javascript { _id: ObjectId(…), @@ -60,7 +64,7 @@ and author, and the comment text. If we want to support threading in this format, we need to maintain some notion of hierarchy in the comment model as well: -:: +.. code-block:: javascript { _id: ObjectId(…), @@ -81,19 +85,19 @@ consisting of the parent's slug plus the comment's unique slug portion. The full\_slug is also included to facilitate sorting documents in a threaded discussion by posting date. -Operations: One comment per document ------------------------------------- +Operations: One Comment Per Document +==================================== Here, we describe the various operations we might perform with the above single comment per document schema. 
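The snippets below assume that a ``db`` handle already exists. With a
reasonably recent PyMongo that setup might look like the following sketch; the
connection URI and the database name ``cms`` are assumptions, since the
examples never name them:

.. code-block:: python

    from datetime import datetime   # used by the posting examples below

    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    db = client['cms']   # every example below reads and writes through `db`
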
-Post a new comment -~~~~~~~~~~~~~~~~~~ +Post a New Comment +------------------ In order to post a new comment in a chronologically ordered (unthreaded) system, all we need to do is the following: -:: +.. code-block:: python slug = generate_psuedorandom_slug() db.comments.insert({ @@ -106,7 +110,7 @@ system, all we need to do is the following: In the case of a threaded discussion, we have a bit more work to do in order to generate a 'pathed' slug and full\_slug: -:: +.. code-block:: python posted = datetime.utcnow() @@ -136,13 +140,13 @@ order to generate a 'pathed' slug and full\_slug: 'author': author_info, 'text': comment_text }) -View the (paginated) comments for a discussion -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +View the (Paginated) Comments for a Discussion +---------------------------------------------- To actually view the comments in the non-threaded design, we need merely to select all comments participating in a discussion, sorted by date: -:: +.. code-block:: python cursor = db.comments.find({'discussion_id': discussion_id}) cursor = cursor.sort('posted') @@ -153,21 +157,21 @@ Since the full\_slug embeds both hierarchical information via the path and chronological information, we can use a simple sort on the full\_slug property to retrieve a threaded view: -:: +.. code-block:: python cursor = db.comments.find({'discussion_id': discussion_id}) cursor = cursor.sort('full_slug') cursor = cursor.skip(page_num * page_size) cursor = cursor.limit(page_size) -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ In order to efficiently support the queries above, we should maintain two compound indexes, one on (discussion\_id, posted), and the other on (discussion\_id, full\_slug): -:: +.. code-block:: python >>> db.comments.ensure_index([ ... ('discussion_id', 1), ('posted', 1)]) @@ -178,14 +182,14 @@ Note that we must ensure that the final element in a compound index is the field by which we are sorting to ensure efficient performance of these queries. -Retrieve a comment via slug ("permalink") -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Retrieve a Comment Via Slug ("Permalink") +----------------------------------------- Here, we wish to directly retrieve a comment (e.g. *not* requiring paging through all preceeding pages of commentary). In this case, we simply use the slug: -:: +.. code-block:: python comment = db.comments.find_one({ 'discussion_id': discussion_id, @@ -195,33 +199,33 @@ We can also retrieve a sub-discussion (a comment and all of its descendants recursively) by performing a prefix query on the full\_slug field: -:: +.. code-block:: python subdiscussion = db.comments.find_one({ 'discussion_id': discussion_id, 'full_slug': re.compile('^' + re.escape(parent_slug)) }) subdiscussion = subdiscussion.sort('full_slug') -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Since we already have indexes on (discussion\_id, full\_slug) to support retrieval of subdiscussion, all we need is an index on (discussion\_id, slug) to efficiently support retrieval of a comment by 'permalink': -:: +.. code-block:: python >>> db.comments.ensure_index([ ... ('discussion_id', 1), ('slug', 1)]) -Schema design: All comments embedded ------------------------------------- +Schema Design: All Comments Embedded +==================================== In this design, we wish to embed an entire discussion within its topic document, be it a blog article, news story, or discussion thread. A topic document, then, might look something like the following: -:: +.. 
code-block:: python { _id: ObjectId(…), @@ -240,7 +244,7 @@ comments in sorted order, there is no need to maintain a slug per comment. If we want to support threading in the embedded format, we need to embed comments within comments: -:: +.. code-block:: python { _id: ObjectId(…), @@ -271,8 +275,8 @@ run into scaling issues, particularly in the threaded design, as documents need to be frequently moved on disk as they outgrow the space allocated to them. -Operations: All comments embedded ---------------------------------- +Operations: All Comments Embedded +================================= Here, we describe the various operations we might perform with the above single comment per document schema. Note that, in all the cases below, @@ -281,12 +285,12 @@ intra-document, and the document itself (the 'discussion') is retrieved by its \_id field, which is automatically indexed by MongoDB. Post a new comment -~~~~~~~~~~~~~~~~~~ +------------------ In order to post a new comment in a chronologically ordered (unthreaded) system, all we need to do is the following: -:: +.. code-block:: python db.discussion.update( { 'discussion_id': discussion_id }, @@ -301,7 +305,7 @@ discussion, we have a good bit more work to do. In order to reply to a comment, we will assume that we have the 'path' to the comment we are replying to as a list of positions: -:: +.. code-block:: python if path != []: str_path = '.'.join('replies.%d' % part for part in path) @@ -320,13 +324,13 @@ Here, we first construct a field name of the form 'replies.0.replies.2...' as str\_path and then use that to $push the new comment into its parent comment's 'replies' property. -View the (paginated) comments for a discussion -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +View the (Paginated) Comments for a Discussion +----------------------------------------------- To actually view the comments in the non-threaded design, we need to use the $slice operator: -:: +.. code-block:: python discussion = db.discussion.find_one( {'discussion_id': discussion_id}, @@ -337,7 +341,7 @@ the $slice operator: If we wish to view paginated comments for the threaded design, we need to do retrieve the whole document and paginate in our application: -:: +.. code-block:: python discussion = db.discussion.find_one({'discussion_id': discussion_id}) @@ -354,15 +358,15 @@ to do retrieve the whole document and paginate in our application: page_size * page_num, page_size * (page_num + 1)) -Retrieve a comment via position or path ("permalink") -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Retrieve a Comment Via Position or Path ("Permalink") +----------------------------------------------------- Instead of using slugs as above, here we retrieve comments by their position in the comment list or tree. In the case of the chronological (non-threaded) design, we need simply to use the $slice operator to extract the correct comment: -:: +.. code-block:: python discussion = db.discussion.find_one( {'discussion_id': discussion_id}, @@ -372,7 +376,7 @@ extract the correct comment: In the case of the threaded design, we are faced with the task of finding the correct path through the tree in our application: -:: +.. code-block:: python discussion = db.discussion.find_one({'discussion_id': discussion_id}) current = discussion @@ -384,13 +388,13 @@ Note that, since the replies to comments are embedded in their parents, we have actually retrieved the entire sub-discussion rooted in the comment we were looking for as well. 
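For illustration only: if such a permalink encoded the position list as a
dotted string (say ``'1.3.2'`` -- this URL format is an assumption, not
something specified by the schema), decoding it and walking the tree would
look like this:

.. code-block:: python

    path = [int(part) for part in '1.3.2'.split('.')]

    discussion = db.discussion.find_one({'discussion_id': discussion_id})
    current = discussion
    for part in path:
        current = current['replies'][part]
    comment = current   # also carries the embedded sub-discussion below it
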
-Schema design: Hybrid ---------------------- +Schema Design: Hybrid +===================== Comments in the hybrid format are stored in 'buckets' of about 100 comments each: -:: +.. code-block:: python { _id: ObjectId(…), @@ -419,13 +423,13 @@ comments is slightly larger than 100, but this does not affect the correctness of the design. Operations: Hybrid ------------------- +================== Here, we describe the various operations we might perform with the above 100-comment 'pages'. -Post a new comment -~~~~~~~~~~~~~~~~~~ +Post a New Comment +------------------ In order to post a new comment, we need to $push the comment onto the last page and $inc its comment count. If the page has more than 100 @@ -434,7 +438,7 @@ assume that we already have a reference to the discussion document, and that the discussion document has a property that tracks the number of pages: -:: +.. code-block:: python page = db.comment_pages.find_and_modify( { 'discussion_id': discussion['_id'], @@ -452,7 +456,7 @@ create it for us, initialized with appropriate values for 'count' and 'comments'. Since we are limiting the number of comments per page, we also need to create new pages as they become necessary: -:: +.. code-block:: python if page['count'] > 100: db.discussion.update( @@ -466,25 +470,25 @@ double-incremented, resulting in a nearly or totally empty page. If some other process has incremented the number of pages in the discussion, then update above simply does nothing. -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ In order to efficiently support our find\_and\_modify and update operations above, we need to maintain a compound index on (discussion\_id, page) in the comment\_pages collection: -:: +.. code-block:: python >>> db.comment_pages.ensure_index([ ... ('discussion_id', 1), ('page', 1)]) -View the (paginated) comments for a discussion -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +View the (Paginated) Comments for a Discussion +---------------------------------------------- In order to paginate our comments with a fixed page size, we need to do a bit of extra work in Python: -:: +.. code-block:: python def find_comments(discussion_id, skip, limit): result = [] @@ -522,22 +526,22 @@ continue 0 26 There are no more pages; terminate loop -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ -SInce we already have an index on (discussion\_id, page) in our +Since we already have an index on (discussion\_id, page) in our comment\_pages collection, we will be able to satisfy these queries efficiently. -Retrieve a comment via slug ("permalink") -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Retrieve a Comment Via Slug ("Permalink") +----------------------------------------- Here, we wish to directly retrieve a comment (e.g. *not* requiring paging through all preceeding pages of commentary). In this case, we can use the slug to find the correct page, and then use our application to find the correct comment: -:: +.. code-block:: python page = db.comment_pages.find_one( { 'discussion_id': discussion_id, @@ -547,19 +551,19 @@ find the correct comment: if comment['slug'] = comment_slug: break -Index support -^^^^^^^^^^^^^ +Index Support +~~~~~~~~~~~~~ Here, we need a new index on (discussion\_id, comments.slug) to efficiently support retrieving the page number of the comment by slug: -:: +.. code-block:: python >>> db.comment_pages.ensure_index([ ... 
('discussion_id', 1), ('comments.slug', 1)]) Sharding --------- +======== In each of the cases above, it's likely that our discussion\_id will at least participate in the shard key if we should choose to shard. @@ -568,7 +572,7 @@ In the case of the one document per comment approach, it would be nice to use our slug (or full\_slug, in the case of threaded comments) as part of the shard key to allow routing of requests by slug: -:: +.. code-block:: python >>> db.command('shardcollection', 'comments', { ... key : { 'discussion_id' : 1, 'full_slug': 1 } }) @@ -581,7 +585,7 @@ determined by concerns outside the scope of this document. In the case of hybrid documents, we want to use the page number of the comment page in our shard key: -:: +.. code-block:: python >>> db.command('shardcollection', 'comment_pages', { ... key : { 'discussion_id' : 1, ``'page'``: 1 } }) From ba11fdf12d5883bfdcff924dfae5bb53a7193ce6 Mon Sep 17 00:00:00 2001 From: Rick Copeland Date: Mon, 19 Mar 2012 21:50:03 -0400 Subject: [PATCH 20/20] style updates for cms-comments Signed-off-by: Rick Copeland --- .../tutorial/usecase/cms-storing-comments.txt | 338 +++++++++--------- 1 file changed, 173 insertions(+), 165 deletions(-) diff --git a/source/tutorial/usecase/cms-storing-comments.txt b/source/tutorial/usecase/cms-storing-comments.txt index a028cbca89d..fd3551fa696 100644 --- a/source/tutorial/usecase/cms-storing-comments.txt +++ b/source/tutorial/usecase/cms-storing-comments.txt @@ -5,15 +5,15 @@ CMS: Storing Comments Problem ======= -In your content management system (CMS) you would like to store +In your content management system (CMS), you would like to store user-generated comments on the various types of content you generate. Solution Overview ================= Rather than describing the One True Way to implement comments in this -solution, we will explore different options and the trade-offs with -each. The three major designs we will discuss here are: +solution, this use case explores different options and the trade-offs with +each. The three major designs discussed here are: One document per comment This provides the greatest degree of @@ -31,14 +31,14 @@ All comments embedded data (topic data + comments) exceeds the 16MB limit of MongoDB documents. Hybrid design - Here, we store comments separately from their - parent topic, but we aggregate comments together into a few + Here, you store comments separately from their + parent topic, but aggregate comments together into a few documents, each containing many comments. -Another decision that needs to be considered in desniging a commenting +Another decision that needs to be considered in designing a commenting system is whether to support threaded commenting (explicit replies to a -parent comment). We will explore how this threaded comment support -decision affects our schema design and operations as well. +parent comment). This threaded comment support +decision will also be discussed below. Schema design: One Document Per Comment ======================================= @@ -49,53 +49,55 @@ similar to the following: .. code-block:: javascript { - _id: ObjectId(…), - discussion_id: ObjectId(…), + _id: ObjectId(...), + discussion_id: ObjectId(...), slug: '34db', - posted: ISODateTime(…), - author: { id: ObjectId(…), name: 'Rick' }, - text: 'This is so bogus … ' + posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, + text: 'This is so bogus ... ' } The format above is really only suitable for chronological display of -commentary. 
We maintain a reference to the discussion in which this -comment participates, a url-friendly 'slug' to identify it, posting time -and author, and the comment text. If we want to support threading in -this format, we need to maintain some notion of hierarchy in the comment +commentary. It maintains a reference to the discussion in which this +comment participates, a url-friendly ``slug`` to identify it, ``posted`` time +and ``author``, and the comment's ``text``. If you want to support threading in +this format, you need to maintain some notion of hierarchy in the comment model as well: .. code-block:: javascript { - _id: ObjectId(…), - discussion_id: ObjectId(…), - parent_id: ObjectId(…), + _id: ObjectId(...), + discussion_id: ObjectId(...), + parent_id: ObjectId(...), slug: '34db/8bda', full_slug: '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16', - posted: ISODateTime(…), - author: { id: ObjectId(…), name: 'Rick' }, - text: 'This is so bogus … ' + posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, + text: 'This is so bogus ... ' } -Here, we have stored some extra information into the document that +Here, the schema includes some extra information into the document that represents this document's position in the hierarchy. In addition to -maintaining the parent\_id for the comment, we have modified the slug -format and added a new field, full\_slug. The slug is now a path +maintaining the ``parent_id`` for the comment, the slug format has been modified +and a new field ``full_slug`` has been added. The slug is now a path consisting of the parent's slug plus the comment's unique slug portion. -The full\_slug is also included to facilitate sorting documents in a +The ``full_slug`` is also included to facilitate sorting documents in a threaded discussion by posting date. Operations: One Comment Per Document ==================================== -Here, we describe the various operations we might perform with the above -single comment per document schema. +Here, some common operations that you might need for your CMS are +described in the context of the single comment per document schema. All of the +following examples use the Python programming language and the ``pymongo`` +MongoDB driver, but implementations would be similar in other languages as well. Post a New Comment ------------------ In order to post a new comment in a chronologically ordered (unthreaded) -system, all we need to do is the following: +system, all you need to do is ``insert()``: .. code-block:: python @@ -107,20 +109,18 @@ system, all we need to do is the following: 'author': author_info, 'text': comment_text }) -In the case of a threaded discussion, we have a bit more work to do in -order to generate a 'pathed' slug and full\_slug: +In the case of a threaded discussion, there is a bit more work to do in +order to generate a "pathed" ``slug`` and ``full_slug``: .. 
code-block:: python posted = datetime.utcnow() - # generate the unique portions of the slug and full_slug slug_part = generate_psuedorandom_slug() full_slug_part = slug_part + ':' + posted.strftime( '%Y.%m.%d.%H.%M.%S') - # load the parent comment (if any) if parent_slug: parent = db.comments.find_one( @@ -131,7 +131,6 @@ order to generate a 'pathed' slug and full\_slug: slug = slug_part full_slug = full_slug_part - # actually insert the comment db.comments.insert({ 'discussion_id': discussion_id, @@ -143,8 +142,8 @@ order to generate a 'pathed' slug and full\_slug: View the (Paginated) Comments for a Discussion ---------------------------------------------- -To actually view the comments in the non-threaded design, we need merely -to select all comments participating in a discussion, sorted by date: +To actually view the comments in the non-threaded design, you need merely +to select all comments participating in a discussion, sorted by ``posted``: .. code-block:: python @@ -153,9 +152,9 @@ to select all comments participating in a discussion, sorted by date: cursor = cursor.skip(page_num * page_size) cursor = cursor.limit(page_size) -Since the full\_slug embeds both hierarchical information via the path -and chronological information, we can use a simple sort on the -full\_slug property to retrieve a threaded view: +Since the ``full_slug`` embeds both hierarchical information via the path +and chronological information, you can use a simple sort on the +``full_slug`` property to retrieve a threaded view: .. code-block:: python @@ -167,9 +166,9 @@ full\_slug property to retrieve a threaded view: Index Support ~~~~~~~~~~~~~ -In order to efficiently support the queries above, we should maintain -two compound indexes, one on (discussion\_id, posted), and the other on -(discussion\_id, full\_slug): +In order to efficiently support the queries above, you should maintain +two compound indexes, one on (``discussion_id``, ``posted``), and the other on +(``discussion_id``, ``full_slug``): .. code-block:: python @@ -178,16 +177,16 @@ two compound indexes, one on (discussion\_id, posted), and the other on >>> db.comments.ensure_index([ ... ('discussion_id', 1), ('full_slug', 1)]) -Note that we must ensure that the final element in a compound index is -the field by which we are sorting to ensure efficient performance of +Note that you must ensure that the final element in a compound index is +the field by which you are sorting to ensure efficient performance of these queries. Retrieve a Comment Via Slug ("Permalink") ----------------------------------------- -Here, we wish to directly retrieve a comment (e.g. *not* requiring -paging through all preceeding pages of commentary). In this case, we -simply use the slug: +Suppose you wish to directly retrieve a comment (e.g. *not* requiring +paging through all preceeding pages of commentary). In this case, you'd +simply use the ``slug``: .. code-block:: python @@ -195,8 +194,8 @@ simply use the slug: 'discussion_id': discussion_id, 'slug': comment_slug}) -We can also retrieve a sub-discussion (a comment and all of its -descendants recursively) by performing a prefix query on the full\_slug +You can also retrieve a sub-discussion (a comment and all of its +descendants recursively) by performing a prefix query on the ``full_slug`` field: .. 
code-block:: python @@ -209,9 +208,10 @@ field: Index Support ~~~~~~~~~~~~~ -Since we already have indexes on (discussion\_id, full\_slug) to support -retrieval of subdiscussion, all we need is an index on (discussion\_id, -slug) to efficiently support retrieval of a comment by 'permalink': +Since you already have indexes on (``discussion_id``, ``full_slug``) necessary to +support retrieval of subdiscussions, all you need to add here is an index on +(``discussion_id``, ``slug``) to efficiently support retrieval of a comment by +'permalink': .. code-block:: python @@ -221,56 +221,56 @@ slug) to efficiently support retrieval of a comment by 'permalink': Schema Design: All Comments Embedded ==================================== -In this design, we wish to embed an entire discussion within its topic +In this design, you wish to embed an entire discussion within its topic document, be it a blog article, news story, or discussion thread. A topic document, then, might look something like the following: .. code-block:: python { - _id: ObjectId(…), - … lots of topic data … + _id: ObjectId(...), + ... lots of topic data ... comments: [ - { posted: ISODateTime(…), - author: { id: ObjectId(…), name: 'Rick' }, - text: 'This is so bogus … ' }, - … ] + { posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, + text: 'This is so bogus ... ' }, + ... ] } The format above is really only suitable for chronological display of commentary. The comments are embedded in chronological order, with their -posting date, author, and text. Note that, since we are storing the -comments in sorted order, there is no need to maintain a slug per -comment. If we want to support threading in the embedded format, we need +posting date, author, and text. Note that, since you're storing the +comments in sorted order, there is no longer need to maintain a slug per +comment. If you wanted to support threading in the embedded format, you'd need to embed comments within comments: .. code-block:: python { - _id: ObjectId(…), - … lots of topic data … + _id: ObjectId(...), + ... lots of topic data ... replies: [ - { posted: ISODateTime(…), - author: { id: ObjectId(…), name: 'Rick' }, + { posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, - text: 'This is so bogus … ', + text: 'This is so bogus ... ', replies: [ - { author: { … }, … }, - … ] + { author: { ... }, ... }, + ... ] } -Here, we have added a 'replies' property to each comment which can hold +Here, there is a ``replies`` property added to each comment which can hold sub-comments and so on. One thing in particular to note about the -embedded document formats is we give up some flexibility when we embed -the documents, effectively 'baking in' the decisions we've made about -the proper display format. If we (or our users) someday wish to switch +embedded document formats is you give up some flexibility when embedding +the comments, effectively "baking in" the decisions made about +the proper display format. If you (or your users) someday wish to switch from chronological or vice-versa, this schema makes such a migration quite expensive. -In popular discussions, we also have a potential issue with document -size. If we have a particularly avid discussion, for example, we may -outgrow the 16MB limit that MongoDB places on document size. We can also +In popular discussions, you also might have an issue with document size. 
+If you have a particularly avid discussion, for example, it might +outgrow the 16MB limit that MongoDB places on document size. You can also run into scaling issues, particularly in the threaded design, as documents need to be frequently moved on disk as they outgrow the space allocated to them. @@ -278,17 +278,18 @@ allocated to them. Operations: All Comments Embedded ================================= -Here, we describe the various operations we might perform with the above -single comment per document schema. Note that, in all the cases below, -we need no additional indexes since all our operations are -intra-document, and the document itself (the 'discussion') is retrieved -by its \_id field, which is automatically indexed by MongoDB. +Here, some common operations that you might need for your CMS are +described in the context of embedded comment schema. Once again, the examples are +in Python. Note that, in all the cases below, +there is no need for additional indexes since all the operations are +intra-document, and the document itself (the "discussion") is retrieved +by its ``_id`` field, which is automatically indexed by MongoDB anyway. Post a new comment ------------------ In order to post a new comment in a chronologically ordered (unthreaded) -system, all we need to do is the following: +system, you need the following ``update()``: .. code-block:: python @@ -299,11 +300,11 @@ system, all we need to do is the following: 'author': author_info, 'text': comment_text } } } ) -Note that since we use the $push operator, all the comments will be +Note that since you used the ``$push`` operator, all the comments will be inserted in their correct chronological order. In the case of a threaded -discussion, we have a good bit more work to do. In order to reply to a -comment, we will assume that we have the 'path' to the comment we are -replying to as a list of positions: +discussion, there si a good bit more work to do. In order to reply to a +comment, the code below assumes that it has access to the 'path' to the comment +you're replying to as a list of positions: .. code-block:: python @@ -320,39 +321,37 @@ replying to as a list of positions: 'author': author_info, 'text': comment_text } } } ) -Here, we first construct a field name of the form -'replies.0.replies.2...' as str\_path and then use that to $push the new -comment into its parent comment's 'replies' property. +Here, you first construct a field name of the form +``replies.0.replies.2...`` as ``str_path`` and then use that to ``$push`` the new +comment into its parent comment's ``replies`` property. View the (Paginated) Comments for a Discussion ----------------------------------------------- -To actually view the comments in the non-threaded design, we need to use -the $slice operator: +To actually view the comments in the non-threaded design, you need to use +the ``$slice`` operator: .. code-block:: python discussion = db.discussion.find_one( {'discussion_id': discussion_id}, - { … some fields relevant to our page from the root discussion …, + { ... some fields relevant to your page from the root discussion ..., 'comments': { '$slice': [ page_num * page_size, page_size ] } }) -If we wish to view paginated comments for the threaded design, we need -to do retrieve the whole document and paginate in our application: +If you wish to view paginated comments for the threaded design, you need +to retrieve the whole document and paginate in your application: .. 
code-block:: python discussion = db.discussion.find_one({'discussion_id': discussion_id}) - def iter_comments(obj): for reply in obj['replies']: yield reply for subreply in iter_comments(reply): yield subreply - paginated_comments = itertools.slice( iter_comments(discussion), page_size * page_num, @@ -361,9 +360,9 @@ to do retrieve the whole document and paginate in our application: Retrieve a Comment Via Position or Path ("Permalink") ----------------------------------------------------- -Instead of using slugs as above, here we retrieve comments by their +Instead of using slugs as above, this example retrieves comments by their position in the comment list or tree. In the case of the chronological -(non-threaded) design, we need simply to use the $slice operator to +(non-threaded) design, you need simply to use the ``$slice`` operator to extract the correct comment: .. code-block:: python @@ -373,8 +372,8 @@ extract the correct comment: {'comments': { '$slice': [ position, position ] } }) comment = discussion['comments'][0] -In the case of the threaded design, we are faced with the task of -finding the correct path through the tree in our application: +In the case of the threaded design, you're faced with the task of +finding the correct path through the tree in your application: .. code-block:: python @@ -385,8 +384,8 @@ finding the correct path through the tree in our application: comment = current Note that, since the replies to comments are embedded in their parents, -we have actually retrieved the entire sub-discussion rooted in the -comment we were looking for as well. +you've have actually retrieved the entire sub-discussion rooted in the +comment you were looking for as well. Schema Design: Hybrid ===================== @@ -397,23 +396,23 @@ comments each: .. code-block:: python { - _id: ObjectId(…), - discussion_id: ObjectId(…), + _id: ObjectId(...), + discussion_id: ObjectId(...), page: 1, count: 42, comments: [ { slug: '34db', - posted: ISODateTime(…), - author: { id: ObjectId(…), name: 'Rick' }, - text: 'This is so bogus … ' }, - … ] + posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, + text: 'This is so bogus ... ' }, + ... ] } -Here, we have a 'page' of comment data, containing a bit of metadata +Here, you maintain a "page" of comment data, containing a bit of metadata about the page (in particular, the page number and the comment count), as well as the comment bodies themselves. Using a hybrid format actually -makes storing comments hierarchically quite complex, so we won't cover -it in this document. +makes storing comments hierarchically quite complex, that approach is not covered +in this document. Note that in this design, 100 comments is a 'soft' limit to the number of comments per page, chosen mainly for performance reasons and to @@ -425,17 +424,18 @@ correctness of the design. Operations: Hybrid ================== -Here, we describe the various operations we might perform with the above -100-comment 'pages'. +Here, some common operations that you might need for your CMS are +described in the context of 100-comment "pages". Once again, the examples are +in Python. Post a New Comment ------------------ -In order to post a new comment, we need to $push the comment onto the -last page and $inc its comment count. If the page has more than 100 -comments, we will insert a new page as well. 
For this operation, we -assume that we already have a reference to the discussion document, and -that the discussion document has a property that tracks the number of +In order to post a new comment, you need to ``$push`` the comment onto the +last page and ``$inc`` that page's comment count. If the page has more than 100 +comments, you then must will insert a new page as well. This operation starts +with a reference to the discussion document, and assumes that the discussion +document has a property that tracks the number of pages: .. code-block:: python @@ -445,16 +445,16 @@ pages: 'page': discussion['num_pages'] }, { '$inc': { 'count': 1 }, '$push': { - 'comments': { 'slug': slug, … } } }, + 'comments': { 'slug': slug, ... } } }, fields={'count':1}, upsert=True, new=True ) -Note that we have written the find\_and\_modify above as an upsert -operation; if we don't find the page number, the find\_and\_modify will -create it for us, initialized with appropriate values for 'count' and -'comments'. Since we are limiting the number of comments per page, we -also need to create new pages as they become necessary: +Note that the ``find_and_modify()`` above is written as an upsert +operation; if MongoDB doesn't findfind the page number, the ``find_and_modify()`` +will create it for you, initialized with appropriate values for ``count`` and +``comments``. Since you're limiting the number of comments per page to around +100, you also need to create new pages as they become necessary: .. code-block:: python @@ -464,8 +464,8 @@ also need to create new pages as they become necessary: 'num_pages': discussion['num_pages'] }, { '$inc': { 'num_pages': 1 } } ) -Our update here includes the last know number of pages in the query to -ensure we don't have a race condition where the number of pages is +The update here includes the last known number of pages in the query in order to +ensure that you don't have a race condition where the number of pages is double-incremented, resulting in a nearly or totally empty page. If some other process has incremented the number of pages in the discussion, then update above simply does nothing. @@ -473,9 +473,9 @@ then update above simply does nothing. Index Support ~~~~~~~~~~~~~ -In order to efficiently support our find\_and\_modify and update -operations above, we need to maintain a compound index on -(discussion\_id, page) in the comment\_pages collection: +In order to efficiently support the ``find_and_modify()`` and ``update()`` +operations above, you need to maintain a compound index on +(``discussion_id``, ``page``) in the ``comment_pages`` collection: .. code-block:: python @@ -485,7 +485,8 @@ operations above, we need to maintain a compound index on View the (Paginated) Comments for a Discussion ---------------------------------------------- -In order to paginate our comments with a fixed page size, we need to do +In order to paginate comments with a fixed page size (i.e. not with the 100-ish +number of comments on a database "page"), you need to do a bit of extra work in Python: .. code-block:: python @@ -503,42 +504,47 @@ a bit of extra work in Python: if limit == 0: break return result -Here, we use the $slice operator to pull out comments from each page, -but *only if we have satisfied our skip requirement* . An example will -help illustrate the logic here. Suppose we have 3 pages with 100, 102, -101, and 22 comments on each. respectively. 
@@ -473,9 +473,9 @@ then update above simply does nothing.
 
 Index Support
 ~~~~~~~~~~~~~
 
-In order to efficiently support our find\_and\_modify and update
-operations above, we need to maintain a compound index on
-(discussion\_id, page) in the comment\_pages collection:
+In order to efficiently support the ``find_and_modify()`` and ``update()``
+operations above, you need to maintain a compound index on
+(``discussion_id``, ``page``) in the ``comment_pages`` collection:
 
 .. code-block:: python
 
@@ -485,7 +485,8 @@ operations above, we need to maintain a compound index on
 View the (Paginated) Comments for a Discussion
 ----------------------------------------------
 
-In order to paginate our comments with a fixed page size, we need to do
+In order to paginate comments with a fixed page size (i.e., not with the
+roughly 100 comments on a database "page"), you need to do
 a bit of extra work in Python:
 
 .. code-block:: python
@@ -503,42 +504,47 @@ a bit of extra work in Python:
         if limit == 0: break
     return result
 
-Here, we use the $slice operator to pull out comments from each page,
-but *only if we have satisfied our skip requirement* . An example will
-help illustrate the logic here. Suppose we have 3 pages with 100, 102,
-101, and 22 comments on each. respectively. We wish to retrieve comments
+Here, the ``$slice`` operator is used to pull out comments from each page,
+but *only* if the ``skip`` requirement is satisfied. An example helps illustrate
+the logic here. Suppose you have 4 pages with 100, 102,
+100, and 22 comments, respectively. You wish to retrieve comments
 with skip=300 and limit=50. The algorithm proceeds as follows:
 
-Skip Limit Discussion
-
-300 50 {$slice: [ 300, 50 ] } matches no comments in page #1; subtract
-page #1's count from 'skip' and continue
-
-200 50 {$slice: [ 200, 50 ] } matches no comments in page #2; subtract
-page #2's count from 'skip' and continue
-
-98 50 {$slice: [ 98, 50 ] } matches 2 comments in page #3; subtract page
-#3's count from 'skip' (saturating at 0), subtract 2 from limit, and
-continue
-
-0 48 {$slice: [ 0, 48 ] } matches all 22 comments in page #4; subtract
-22 from limit and continue
-
-0 26 There are no more pages; terminate loop
++-------+-------+-------------------------------------------------------+
+| Skip  | Limit | Discussion                                            |
++=======+=======+=======================================================+
+| 300   | 50    | ``{$slice: [ 300, 50 ] }`` matches nothing in page    |
+|       |       | #1; subtract page #1's ``count`` from ``skip`` and    |
+|       |       | continue.                                             |
++-------+-------+-------------------------------------------------------+
+| 200   | 50    | ``{$slice: [ 200, 50 ] }`` matches nothing in page    |
+|       |       | #2; subtract page #2's ``count`` from ``skip`` and    |
+|       |       | continue.                                             |
++-------+-------+-------------------------------------------------------+
+| 98    | 50    | ``{$slice: [ 98, 50 ] }`` matches 2 comments in page  |
+|       |       | #3; subtract page #3's ``count`` from ``skip``        |
+|       |       | (saturating at 0), subtract 2 from ``limit``, and     |
+|       |       | continue.                                             |
++-------+-------+-------------------------------------------------------+
+| 0     | 48    | ``{$slice: [ 0, 48 ] }`` matches all 22 comments in   |
+|       |       | page #4; subtract 22 from ``limit`` and continue.     |
++-------+-------+-------------------------------------------------------+
+| 0     | 26    | There are no more pages; terminate loop.              |
++-------+-------+-------------------------------------------------------+
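The bookkeeping that the table traces can also be sketched directly in Python.
This illustration is not the ``find_comments()`` shown earlier: it assumes the
``comment_pages`` schema above, uses the hypothetical name
``find_comments_paged()``, and issues one ``find_one()`` per page so that the
``$slice`` arguments always reflect the remaining ``skip`` and ``limit``
values.

.. code-block:: python

    from pymongo import MongoClient

    db = MongoClient().cms

    def find_comments_paged(discussion_id, skip, limit):
        result = []
        page_num = 1
        while limit > 0:
            # Slice this page with the *remaining* skip and limit values.
            page = db.comment_pages.find_one(
                {'discussion_id': discussion_id, 'page': page_num},
                {'count': 1, 'comments': {'$slice': [skip, limit]}})
            if page is None:
                break  # no more pages; terminate the loop
            comments = page.get('comments', [])
            result += comments
            # Consume this page's count from 'skip' (saturating at zero)
            # and the comments actually returned from 'limit'.
            skip = max(0, skip - page['count'])
            limit -= len(comments)
            page_num += 1
        return result

Fetching one page per query is slightly chattier than a single cursor over all
pages, but it keeps the ``$slice`` arguments in step with the running totals in
the table.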
 
 Index Support
 ~~~~~~~~~~~~~
 
-Since we already have an index on (discussion\_id, page) in our
-comment\_pages collection, we will be able to satisfy these queries
+Since you already have an index on (``discussion_id``, ``page``) in your
+``comment_pages`` collection, MongoDB can satisfy these queries
 efficiently.
 
 Retrieve a Comment Via Slug ("Permalink")
 -----------------------------------------
 
-Here, we wish to directly retrieve a comment (e.g. *not* requiring
-paging through all preceeding pages of commentary). In this case, we can
-use the slug to find the correct page, and then use our application to
+Suppose you wish to directly retrieve a comment (i.e., *without* paging
+through all preceding pages of commentary). In this case, you can
+use the slug to find the correct page, and then use the application to
 find the correct comment:
 
 .. code-block:: python
 
@@ -554,7 +560,7 @@ find the correct comment:
 
 Index Support
 ~~~~~~~~~~~~~
 
-Here, we need a new index on (discussion\_id, comments.slug) to
+Here, you'll need a new index on (``discussion_id``, ``comments.slug``) to
 efficiently support retrieving the page number of the comment by slug:
 
 .. code-block:: python
 
@@ -565,12 +571,12 @@ efficiently support retrieving the page number of the comment by slug:
 
 Sharding
 ========
 
-In each of the cases above, it's likely that our discussion\_id will at
-least participate in the shard key if we should choose to shard.
+In each of the cases above, it's likely that your ``discussion_id`` will at
+least participate in the shard key if you choose to shard.
 
 In the case of the one document per comment approach, it would be nice
-to use our slug (or full\_slug, in the case of threaded comments) as
-part of the shard key to allow routing of requests by slug:
+to use the ``slug`` (or ``full_slug``, in the case of threaded comments) as
+part of the shard key to allow routing of requests by ``slug``:
 
 .. code-block:: python
 
@@ -579,14 +585,16 @@ part of the shard key to allow routing of requests by slug:
     { "collectionsharded" : "comments", "ok" : 1 }
 
 In the case of the fully-embedded comments, of course, the discussion is
-the only thing we need to shard, and its shard key will probably be
+the only thing you need to shard, and its shard key will probably be
 determined by concerns outside the scope of this document.
 
-In the case of hybrid documents, we want to use the page number of the
-comment page in our shard key:
+In the case of hybrid documents, you'll want to use the page number of the
+comment page in the shard key as well as the ``discussion_id`` to allow MongoDB
+to split popular discussions among different shards:
 
 .. code-block:: python
 
     >>> db.command('shardcollection', 'comment_pages', {
-    ...     key : { 'discussion_id' : 1, ``'page'``: 1 } })
+    ...     key : { 'discussion_id' : 1, 'page': 1 } })
     { "collectionsharded" : "comment_pages", "ok" : 1 }
+
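For completeness, the sharding commands for the hybrid collection can also be
issued from pymongo rather than the shell. This is a minimal sketch that
mirrors the commands shown above and assumes a cluster reached through
``mongos`` and a database named ``cms``; neither assumption comes from the
text above.

.. code-block:: python

    from pymongo import MongoClient

    client = MongoClient()  # assumed to point at a mongos
    admin = client.admin

    # Sharding must be enabled on the database before any of its
    # collections can be sharded.
    admin.command('enablesharding', 'cms')

    # Shard the hybrid comment pages on (discussion_id, page) so that a
    # single busy discussion can be split across shards.
    admin.command(
        'shardcollection', 'cms.comment_pages',
        key={'discussion_id': 1, 'page': 1})

Queries that include ``discussion_id`` (and, better still, ``page``) supply a
prefix of this shard key, so ``mongos`` can route them to the relevant shard
instead of broadcasting them to the whole cluster.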