DBPedias

Your Database Knowledge Community

Kyle Banker

  1. The Joy of Indexing

    We spend quite a lot of time at 10gen supporting MongoDB users. The questions we receive are truly legion but, as you might guess, they tend to overlap. We get frequent queries on sharding, replica sets, and the idiosyncrasies of JavaScript, but the one subject that never fails to appear each day on our mailing list is indexing.

    Now, to be clear, I’m not talking about how to create an index. That’s easy. The trouble runs much deeper. It’s knowing how indexes work and having the intuition to create the best indexes for your queries and your data set. Lacking this intuition, your production database will eventually slow to a crawl, you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.

    This need not be your fate. You can understand indexing! All that’s required is the right mental model, and over the course of this series, that’s just what I hope to provide.

    But caveat emptor: what follows is a thought experiment. To get the most out of this post, you can’t skim it. Read every word. Use your imagination. Think through the quizzes. Do this, and your indexing struggles may soon be no more.

    Picturing an index

    To understand indexing, you need a picture in your head. So imagine a cookbook. And not just any cookbook. A massive cookbook, five thousand pages long with the most delicious recipes for every occasion, cuisine, and season with all the good ingredients you might find at home. This is the cookbook to end them all. Let’s call it The Cookbook Omega.

    Now, although this might be best of all possible cookbooks, there are two tiny problems with the Cookbook Omega. The first is that the recipes are in random order. On page 3,475 you have Australian Braised Duck, and on page 2 there’s Zacatecan Tacos.

    That would be mangeable were it not for the second problem: the Cookbook Omega has no index.

    Quiz:
    With no index, how do you find the recipe for Rosemary Potatoes in the Cookbook Omega?
    Answer:
    Your only choice is to scan through every page of the book until you find the recipe. If the recipe is on page 3,973, that’s how many pages you have to look through. In the worst case, you have to look at every single page!

    The solution is to build an index.

    The recipe search

    There are few ways you can imagine searching for a recipe, but the recipe’s name is probably a good place to start. If we create an alphabetical listing at the end of the book of each recipe’s name followed by its page number, then we’ll have indexed the book by recipe name. A few entries might look like this:

    Tibetan Yak Soufflé
    45
    Toasted Sesame Dumplings
    4,011
    Turkey à la King
    943

    As long as we know the name of recipe, or even the first few letters of that name, we can use this index quickly find that recipe in the book. If that’s the only way we expect to search for recipes, then our work is done.

    But, of course, this is unrealistic. Because we can also imagine wanting to find recipes based on the ingredients we have in our pantry. Or perhaps we want to search by cuisine. For those cases, we need more indexes.

    Quiz:
    With just one index in recipe name, how do you find all the Chicken recipes?
    Answer:
    Lacking the proper indexes, you have to scan the entire book, all 5,000 pages. This is true for any search on ingredients or cuisine.

    So, we need to build another index, this time on ingredients. In this index, we have an alphabetical listing of ingredients each pointing to the page number of recipes containing that ingredient. The most basic index on ingredients would thus look like this:

    Cashews
    3, 20, 42, 88, 103, 1,215…
    Cauliflower
    2, 47, 88, 89, 90, 275…
    Chicken
    7, 9, 80, 81, 82, 83, 84…
    Currants
    1,001, 1,050, 2,000, 2,133…

    Is the index you thought you were going to get? Is it even helpful?

    This index is good if all you need is a list of recipes for a given ingredient. But if you have any other information about the recipe that you want to include in your search, you you still have some scanning to do. Once you know the page numbers where Cauliflower is referenced, you then to go to each of those pages to get the name of the recipe and what cuisine it’s part of. This is, of course, better than paging through the whole book, but there are still improvements to be made.

    Quiz:
    You randomly discovered a great chicken recipe in the Cookbook Omega several months ago, but you’ve forgotten its name. At this point, we have two indexes, one on recipe name and the other on ingredients. Can you think of a way to use these two indexes in combination to find your long lost chicken recipe?
    Answer:
    I sure hope not. Think about it. If you start with the index on recipe name, but don’t remember the name of the recipe, then searching this index is little better than paging through the entire book. If, on the other hand, you start with the index on ingredients, then you’ll have a list of page numbers to check, but those page numbers can in no way be plugged in to the index on recipe name. Therefore, you can only use one index in this case, and it turns out that the one on ingredients is the more helpful of the two.
    Discussion:
    We commonly encounter users who believe that searching on two fields can be facilitated by creating two separate indexes on those fields. This example should give you some intuition about why this just isn’t possible.

    Happily, there is a solution to the long lost chicken recipe, and its answer lies in the use of compound indexes.

    Compound indexes

    The two indexes we’ve created so far are single-key indexes: they both order just one data item from each recipe. We’re now going to build out yet another index for the Cookbook Omega, but this time, instead of using just one data item, we’ll use two. Indexes that use more than one key like this are called compound indexes.

    This compound index uses both ingredients and recipe name, in that order. We’ll notate the index like this: ingredientname. Here’s what part of this index would look like:

    Cashews
    Cashew Marinade
    1,215
    Chicken with Cashews
    88
    Rosemary-Roasted Cashews
    103
    Cauliflower
    Bacon Cauliflower Salad
    875
    Lemon-baked Cauliflower
    89
    Spicy Cauliflower Cheese Soup
    47
    Currants
    Creamed Scones with Currants
    2,000
    Fettuccini with Glazed Duck
    2,133
    Saffron Rice with Currants
    1,050

    The value of this index for a human is obvious. You can now search by ingredient and probably very quickly find the recipe you want. For a machine, it’s still valuable for this use case, but only if we can provide the first letters of the recipe name, which will keep the database from having to scan every recipe name listed for that ingredient. This compound index would be especially useful if, as with the Cookbook Omega, we had several hundred (or thousand) chicken recipes.

    Quiz:
    Remember: with compound indexes, order matters. Imagine a compound index on nameingredient. Would this index be interchangeable with the inverse compound index we just explored?
    Answer:
    Definitely not. With this index, once we have the recipe name, our search is already limited to a single recipe, a single page in our cookbook. So if this index were used on a search for the recipe “Cashew Marinade” and the ingredient “Bananas” then the index could confirm that no such recipe exists. But that’s not exactly our use case.
    Bonus Quiz:
    Our cookbook now has three indexes: one on recipe name, one on ingredients, and one on ingredientname. With the compound index in place, is it possible to eliminate either of the first two indexes we created?
    Answer:
    Yes! We can safely tear out the single-key index on ingredients. Why? Because a search that just specifies an ingredient can use the index on ingredientname. If we know an ingredient, we can, using that compound index, easily get a list of all page numbers containing said ingredient. Look again at the sample entries for this index to see why this is so.
    Exercise:
    If we only know the recipe name, why can’t we use the index on ingredientname to find that recipe’s page number?

    Selectivity

    We’re going to cover one more concept that you can use to design better indexes. It’s the idea that some keys for your data will be more selective than others. Imagine that the recipes in the Cookbook Omega consist of a total of 200 different ingredients but only represent 15 different cuisines. If the cookbook contains 5,000 recipes, which key — ingredients or cuisine — is more selective?

    Intuitively, it should be easy to see that ingredient narrows the the number of recipes much more than cuisine does. On averge, there will be 417 (5,000 / 12) recipes per cuisine but only 125 recipes per ingredient (5,000 / 200 * 5). This assumes an average of five ingredients per recipe, but clearly, the actual selectivity of any given ingredient will always hard to estimate. Some ingredients (chicken) will be less selective than others (anise). But ingredient is generally the more selective of the two fields.

    Quiz:
    Suppose you want to search the Cookbook Omega on both cuisine and ingredient. You’ll need to build a compound index, but the ordering will make a difference. Which should come first? (Hint: think about selectivity).
    Answer:
    You’re probably better off building the compound index on ingredientcuisine. In addition to providing a more selective first key, this compound index will actually be useful for queries on ingredient only. It’s easy to see how the inverse compound index on cuisine→ingredient wouldn’t be all that useful for the opposite case, where the search is on cuisine alone. (And of course, we all know now that queries on the second key of a compound index aren’t possible.)
    Exercise:
    Visualize for yourself the difference between the representatons of ingredientcuisine and cuisineingredient as they would appear as indexes in the Cookbook Omega.

    Lessons and Limits

    The goal of this post was to lay a groundwork for readers needing a better mental model of indexes. And while I’m convinced that having a solid mental model is always better than memorizing rules of thumb, here are few concrete lessons that can derived from this thought experiment:

    • Indexes make for fast retrieval. Without the proper indexes, you’re forced to manually page through your data, which is slow.
    • Indexes on a single key cannot be used in combination with one another. For queries that use multiple keys (e.g., ingredient and recipe name), you need a compound index.
    • An index on ingredient can and should be eliminated if you have a second index on ingredientcuisine. More generally, if you have an index on ab, then a second index on a alone will be redundant.
    • The order of keys in a compound index matters greatly. It’s usually better if the more selective key comes first. (But what’s best will always be dicated by the range of queries you expect to perform.)
    • Keep a mind to selectivity. You should probably avoid creating indexes that aren’t very selective, particularly if they’re on a single key only.

    One final note is to bear in mind that the cookbook analogy can be taken only so far. It’s a model for understanding indexes, but it doesn’t fully correspond to the way that B-tree indexes actually work (B-trees are the data structure used to represent indexes in most databases, including MongoDB). The ideas presented here generally hold true, but if you’d like more nuance, stay tuned for the next post, where I’ll introduce B-trees and build on the mental model begun here.

  2. E-commerce Inventory with MongoDB

    In my last post, I sketched out how a few e-commerce domains might be modeled using MongoDB. Among the examples was an illustration of how inventory could be managed using the database’s find_and_modify command.

    Inventory as Documents

    To recap, I suggested that each physical inventory item be represented as a separate entry in an inventory collection. When a user adds an item to her cart, we run a find_and_modify operation, which searches for an item with the given sku that’s marked as available. If the item is found, we mark the item as in-cart and set an expiration date so that a separate reaper process can periodically comb through the collection and reset any unpurchaed items.

    Given that find_and_modify works atomically, the query and update all happen within the scope of a single atomic operation. This works great when the user is merely adding one item to the cart (or checking out with just one item.) But, as two commenters pointed out, the situation becomes a bit dicier when the user is trying to add or check out with more than one item. How do we handle this situation atomically?

    The short answer is that, strictly speaking, we can’t. MongoDB does not support multi-object transactions at the moment and probably won’t anytime soon (doing so would add too much complexity in a sharded situation). However, there is a workaround that will serve in most situations, which is to simulate a transaction at the application layer.

    Add to cart

    To show that this is possible, I built a simple test application which you can check out on github. If you clone the repo, you can run the test suite; the code is relatively easy to follow.

    The algorithm goes something like this, assuming that multiple items are added to the cart in a single operation:

    For each item:

    • Determine if that item is available.
    • If the item is available, set its state to in-cart, and push that item’s _id value onto the order document.
    • If at any point an item can’t be transitioned, perform a rollback. This involves removing all the added items from the cart, reverting their state to available, and then raising an exception.

    That’s it. The biggest drawback here is that the process isn’t atomic. If either the database server or the application server crashes in the middle of the operation, then we’re probably left in an inconsistent state. Thus, this techinque shouldn’t be used in an accounting system. But you already knew that.

    Checkout

    With just a few slight modifications, the code I posted can also be used at checkout. The checkout process is quite similar:

    • Use find_and_modify to move all cart items to a purchased state. If this fails for any one item, roll all the items back to the in-cart state, and abort the purchase.
    • If all items advance, authorize the credit card. If the credit card
    • authorizes, the order can be transitioned to the next logical state (e.g., ready-to-ship). If the card fails to authorize, revert the cart items to in-cart.

    Conclusion

    Is the inventory control technique outlined here as robust as what you’d get with a transactional relational database engine? Obviously not.

    Is it sufficient for ensuring that no more inventoy is sold than is available? Yes.

    Not all e-commerce sites need to limit purchases based on available inventory. But if you’re building one that does, and you want to use MongoDB, the find_and_modify technique presented here just may do the trick.

  3. MongoDB and E-commerce

    There are still a number of misconceptions regarding MongoDB’s suitability for e-commerce sites. I refer, in particular, to this stackoverflow posting.

    The hesitant reaction there expressed is understandable, as most of the databases falling under the NoSQL umbrella would fare poorly in an e-commerce setting. But this is not the case with MongoDB. Indeed, with its support for rich data models, dynamic, indexed queries, atomic operations, and replicated writes, MongoDB can be a viable and even desirable database for e-commerce; here’s why.

    Catalog Management

    If you need some insight into the way in which catalog management is handled with relational databases, have a quick look at the schemas of the popular Magento e-commerce framework or Apache’s OfBiz. What you’ll see is a flurry of tables working together to provide a flexible schema on top of a fundamentally inflexible style of database system.

    What this means is that the data for any given product is spread out across dozens of tables. This increases the complexity of the code required to persist and query individual products and makes a shell-based inqury all but impossible. Ask yourself if you’d rather write the SQL JOIN to pull together a product modeled like this:


    Or issue a simple find from the MongoDB JavaScript shell to pull back a JSON object like this:

    // Find a product object
    db.products.find({'_id': ObjectID("4bd87bd8277d094c458d2fa5")});
    
    {_id: ObjectID("4bd87bd8277d094c458d2a43"),
     title: "A Love Supreme [Original Recording Reissued]"
     author: "John Coltrane",
     author_id: ObjectID("4bd87bd8277d094c458d2fa5"),
    
     details: {
       number_of_discs: 1,
       label: "Impulse Records",
       issue_date: "December 9, 1964",
       average_customer_review: 4.95,
       asin: "B0000A118M"
     },
    
     pricing: {
      list: 1198,
      retail: 1099,
      savings: 99,
      pct_savings: 8
     },
    
     categories: [
       ObjectID("4bd87bd8277d094c458d2a43"),
       ObjectID("4bd87bd8277d094c458d2b44"),
       ObjectID("4bd87bd8277d094c458d29a1")
     ]
    }
    

    Of course, this is not a complete representation of a product, but it does demonstrate how many of the more trivial tables existing in a relational representation can be dispensed with in a document representation.

    For object-oriented data, documents simply make more sense, both conceptually and performance-wise. A document-oriented representation of product data means fewer entities (a handful of collections vs. dozens of tables), better query performance (no server-side joins), and structures that fit the product precisely. There’s no longer any need to design some master schema that can account for every single conceviable product.

    Catalog management is essentially content management, a domain MongoDB excels at. To read more about this, see the Drupal presentation on their move to MongoDB.

    Shopping Carts and Orders

    Allowing that a shopping cart is merely an order in a ‘cart’ state, the modeling of shopping carts and orders in MongoDB becomes quite straightforward:

    {'_id': objectid('4b980a6dea2c3f4579da141e'),
     'user_id': objectid('4b980a6dea2c3f4579a4f54'),
     'state': 'cart',
    
     'line_items': [
        {'sku': 'jc-432',
         'name': 'John Coltrane: A Love Supreme',
         'retail_price': 1099
        },
    
        {'sku': 'ly-211',
         'name': 'Larry Young: Unity',
         'retail_price': 1199
        },
      ],
    
     'shipping_address': {
       'street': '3333 Greene Ave.',
       'city': 'Brooklyn',
       'state': 'NY',
       'zip': '11216'
      },
    
      'subtotal': 2199
    }
    

    Notice that we can represent the order line items as an array of products. As is usual with documents, this makes displaying the shopping cart more straighforward, since no joins are involved. But it also solves the problem of product versioning. It’s often necessary to take a snapshot of a product when a user purchases it. This can be accomplished in an RDBMS by linking to a particular version of a product. Here, however, we simply store a snapshot of the product in the order itself.

    Querying Orders

    Since MongoDB supports dynamic queries and secondary indexes, querying for orders is a snap. It’s possible, for example, to define an index on product sku, which allows for efficient queries on all orders of a given product:

    db.orders.ensureIndex({'line_items.sku': 1});
    db.orders.find({'line_items.sku' => 'jc-431'});
    

    With MongoDB, we can query on arbitrary attributes, so any query on the orders collection is possible. And for common queries, we can define indexes for greater efficiency.

    Aggregation

    Of course, aggregation is also necessary. We’ll want to report on orders in different ways, and for that, map-reduce is available. As an example, here’s how to write a map-reduce command that aggregates order totals per zip code:

    map = "
      function() {
        emit(this['shipping_address']['zip'], {total: this.total})
      }"
    
    reduce = "
      function(key, values) {
        var sum = 0;
        values.forEach(function(doc) {
          sum += doc.total;
        }
    
        return {total: sum};
      }"
    
    
    db.orders.mapReduce(map, reduce, {out: 'order_totals_by_zip'});
    

    For more on MongoDB’s map-reduce, see Map-Reduce Basics.

    Updating Orders

    Incrementing Quantity

    One way to go about adjusting quantity is to use the positional operator, which lets you apply atomic operations to individual objects within an array. Here’s how we change the number of Coltrane albums we’re ordering:

    db.orders.update({'_id': order_id, 'line_items.sku':'jc-431'},
                     {'$set': {'line_items.$.quantity': 2}});
    

    Adding and Removing Items

    Likewise, atomic operators solve the problem of adding and removing products from the cart. For instance, we can use the $push atomic operator to add an item to our cart:

    db.orders.update({'_id': order_id},
        {'$push': {'line_items':
          {'sku': 'md-12', 'price': 2500, 'title': 'Basketball'}}
         '$inc': {'subtotal': 2500}});
    

    When adjusting quantity and changing the cart items themselves, it’s necessary to update the order totals. Notice that we use the $inc operator to handle this.

    Inventory

    Not all e-commerce sites need inventory management. But for those that do, MongoDB can rise to the ocassion.

    One way to model inventory is to store a separate document per physical item in our warehouse. So, for example, if our warehouse has twenty copies of the Coltrane album, that translates to twenty distinct documents in our inventory collection. Each document looks something like this:

    {'_id': objectid('4b980a6dea2c3f4579da432a'),
     'sku': 'jc-431',
     'state': 'available',
     'expires': null,
     'order_id': null
    }
    

    When a user tries to add an item to the cart, a findAndModify command can be issued to atomically mark the item as in-cart, associate the item with a given order, and set an expiration time:

    query = {'sku': 'jc-431', 'state': 'available'};
    
    update = {'$set':
              {'state':    'cart',
               'order_id': order_id,
               'expires':  Date.now() + 15 * 60}};
    
    item = db.inventory.findAndModify(query: query, update: update);
    

    If we get an item back from the findAndModify operation, we know we have a unique lock on that item, and we can store it in our cart. When the user goes to check out, the item’s state can be advance to ‘purchased,’ or whatever the case calls for.

    Meanwhile, we can run a script in the background that frees cart inventory not purchased in the fifteen-minute window. The update is trivial:

    db.inventory.update({'state': 'cart', 'expires': {'$lt': Date.now()}},
      {'$set' {'state': 'available', 'expires': null, 'order_id': null}},
      {multi: true});
    

    Transactions, Consistency, and Durability

    A lot of the arguments levied against NoSQL in e-commerce center around transactions, consistency, and durability. A couple points, then, are worth noting.

    Concerning transactions, it is true that MongoDB doesn’t support the multi-object kind; however, it does support atomic operations on individual documents. And this, combined with the document-oriented modeling just described and a little creativity, is adequate to many e-commerce problems. Certainly, if we needed to literally debit one account and credit another in the same operation, or if we wanted rollback, we’d need full-fledged transactions. However, the transactionality provided by MongoDB may be sufficient for most, if not all, e-commerce operations.

    If you’re concerned about consistency and durability, know that write operations in MongoDB can be made consistent across connections. Furthermore, MongoDB 1.5 support near-real-time replication, so that it’s now possible to assure that an operation has been replicated before returning. See the documentation on replication acknowldgment for details.

    Conclusion

    Certainly most NoSQL databases weren’t built with e-commerce in mind. Databases that lack rich data models, dynamic queries, and any notion of transactionality can’t be expected to compete in the e-commerce space, and so it’s understandable how one might think that MongoDB couldn’t, either.

    But for the parts of an e-commerce site comprising content management, using MongoDB will mean a clear win. And even for the more transactional components of the system, MongoDB has features that make the prospect of running an entire e-commerce site on it a real possibility. So while the ideas in this post are just a sketch, let’s not run from the notion that a document database just might do e-commerce, and do it well.

  4. Do the MongoDB Drivers Support Feature <i>X</i>?

    With MongoDB’s aggressive roadmap, new features are constantly being added. Notable in the 1.4 release are the $addToSet update operator, the findAndModify command, and indexes supporting two-dimensional coordinates. Usually the first question from users is whether their driver supports the new features. Happily, the answer is almost always yes; in this post, I explain why that is so.

    New MongoDB features, not counting the many internal improvements, usually fit into one of three categories. Either we’re dealing with a new query or update operator, a new database command, or an indexing enhancement. We’ll look at each of these in turn, and show how it is that the drivers can immediately support these types of new features.

    Query and Update Operators

    MongoDB 1.4 features an $addToSet update operator, which pushes an item onto an array only if that item doesn’t yet exist in the array. If we want to uniquely add a tag to an existing array of tags, the following code will work:

    @posts.update({'slug' => 'mongodb-in-ruby'},
      {'$addToSet' => {'tags' => 'technology'}})
    

    This update will succeed only if the tags array doesn’t contain the term ‘technology.’

    Nothing in the driver has to change to support this new feature. This is because the documents sent to the update method are translated directly into BSON and sent to the MongoDB server. Since there’s no driver intervention, new update and query operators are automatically supported.

    Database Commands

    Another new 1.4 feature is the findandmodify command. findandmodify is used to update a document atomically and immediately return it. The is useful for implementing queues. For instance, suppose we want to grab the latest item from a queue and mark it as being in progress. First, we need to construct the parameters of the command:

    command = OrderedHash.new
    
    # The 'findAndModify' key reference the name of the collection.
    command['findandmodify'] = 'jobs'
    
    # The 'query' filters the collection; here we only want 'new' items.
    command['query'] = {'state' => 'new'}
    
    # The sort order in this case allows us to fetch the oldest, unprocessed item.
    command['sort'] = {'created_at' => -1}
    
    # Here we specify the update we want,
    # which is to mark the object as being in progress
    command['update'] = {'$set' => {'state' => 'processing'}}
    

    Once we’ve constructed the command, we send it to the database’s command method, which all the drivers implement:

    # Send the command we just specified.
    result = @db.command(command)
    

    Straightforward enough. But to see what’s really happening here, have a look at another way of accomplishing the same thing:

    @cmd_collection = @db['$cmd']
    
    result = @cmd_collection.find(command).limit(-1).next_document
    

    Do you see how it works? We’re simply querying a special collection called $cmd. The limit of -1 indicates that we don’t want to create a cursor on the server. Since database commands are nothing more than queries on the $cmd collection, the drivers will always support new commands right away.

    Note that some commands need to be run on the admin database. For instance, the new logrotate command falls into this category. But most commands are run on whichever database your application is using. Have a look a the list of database commands for more on this, and look at the implementation of some commonly-used commands. Did you know that count() is actually a command? Check out the Ruby driver’s implementation for a bit of illumination.

    Indexing Enhancements

    1.4 adds support for two-dimensional indexes, and accordingly, we’ve added a few conveniences to the Ruby driver that simplify the creation of these indexes. If you have a places collection and a field called loc that specifies x- and y-coordinates, here’s how you can create the index:

    @places = @db['places']
    @places.create_index([['loc', Mongo::GEO2D]])
    

    But there’s a more direct way of creating indexes: simply insert a document into the database’s system.indexes collection. See, each database has a special collection called system.indexes. You can query the collection to find out which indexes have been defined, and you add and remove documents from that collection to define and delete indexes. If we were going to create the above index directly, it’d look like this:

    # Create a document specifying the index
    doc = {'name' => 'loc2d', 'ns' => 'myapp.places', 'key' => {'loc' => '2d'}}
    
    # Get the system.indexes collection
    @indexes = @db['system.indexes']
    
    # Insert the document
    @indexes.insert(doc)
    

    The index name can be anything. The ns key specifies the namespace, which is the database name followed by a dot followed by the collection name. The key defines the index itself. Note that if you’re creating a compound index, key should point to an ordered hash, since order in that case does matter.

    But that’s all there is to it. If you want to explore further, look at the Ruby implementation of Collection#create_index.

    The Answer is Yes

    Next time you’re wondering if the the drivers support some new MongoDB feature, ask yourself whether that new feature is new update or query operator, and new command, or an indexing enhancement. If so, the answer is an emphatic yes.

  5. Try MongoDB

    Want to try MongoDB in your browser? Just deployed this a few days ago:

    Try MongoDB Screenshot

    One of the nice things about MongoDB’s JavaScript shell is that we can run a variation on it directly in the browser. It doesn’t provide near the functionality of the actual shell, but users can still get a feel for CRUD in MongoDB, and it’s possible to use any of the nifty query and update operators.

    Anyone who’d like to contribute can fork the code on github.

    (Almost forgot to mention that this is indeed inspired by, and is a little homage to, the great _why.)

  6. MongoDB Aggregation III: Map-Reduce Basics

    If you’re accustomed to working with relational databases, the thought of specifying aggregations with map-reduce may be a bit intimidating. Here, in the third in a series of articles on MongoDB aggregation, I explain map-reduce. After reading this, and with a little practice, you’ll be able to apply the map-reduce paradigm to a huge number of aggregation problems.

    A comments example, of course

    Let’s start with an example. Suppose we have a collection of comments with the following document structure:

    { text: "lmao! great article!",
      author: 'kbanker',
      votes: 2
    }
    

    Here we have a comment authored by ‘kbanker’ with two votes. Now, we want to find the total number of votes each comment author has earned across the entire comment collection. It’s a problem easily solved with map-reduce.

    Mapping

    As its name suggests, map-reduce essentially involves two operations. The first, specified by our map function, formats our data as a series of key-value pairs. Our key is the comment author’s name (this makes sense only if this username is unique). Our value is a document containing the number of votes. We generate these key-value pairs by emitting them. See below:

    // Our key is author's userame; 
    // our value, the number of votes for the current comment.
    var map = function() {
      emit(this.author, {votes: this.votes});
    };
    

    When we run map-reduce, the map function is applied to each document. This results in a collection of key-value pairs. What do we do with these results? It turns out that we don’t even have to think about them because they’re automatically passed on to our reduce function.

    Reducing

    Specifically, the reduce function will be invoked with two arguments: a key and an array of values associated with that key. Returning to our example, we can imagine our reduce function receiving something like this:

    reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);
    

    Given that, it’s easy to come up with a reduce function for tallying these votes:

    // Add up all the votes for each key.
    var reduce = function(key, values) {
      var sum = 0;
      values.forEach(function(doc) {
        sum += doc.votes;
      });
      return {votes: sum};
    };
    

    Results

    So how do we we run it? From the shell, we pass our map and reduce functions to the mapReduce helper:

    // Running mapReduce.
    var op = db.comments.mapReduce(map, reduce);
    
    {
      "result" : "tmp.mr.mapreduce_1260567078_11",
      "timeMillis" : 8,
      "counts" : {
      "input" : 6,
        "emit" : 6,
        "output" : 2
      },
      "ok" : 1,
    }
    

    Notice that running the mapReduce helper returns stats on the operation; the results of the operation itself are stored in a temporary collection. In this case, that collection is tmp.mr.mapreduce_1260567078_11.

    The other stats also prove informative. First is the operation time in milliseconds. Next are the number of input documents, the number of times we called emit (this can be more than once per document), and the number of output documents. Finally, we can be assured that the operation has succeeded because “ok” is 1.

    Of course, what we really want are the results. To get them, just query the output collection:

    // Getting the results from the shell
    db[op.result].find();
    
    { "_id" : "hwaet", "value" : { "votes" : 21 } }
    { "_id" : "kbanker", "value" : { "votes" : 13 } }
    

    How do I execute map-reduce from Ruby?

    Like this:

    # Running map-reduce from Ruby (irb) assuming
    # that @comments references the comments collection
    
    # Specify the map and reduce functions in JavaScript, as strings
    >> map    = "function() { emit(this.author, {votes: this.votes}); }"
    >> reduce = "function(key, values) { " +
      "var sum = 0; " +
      "values.forEach(function(doc) { " +
      " sum += doc.votes; " +
      "}); " +
      "return {votes: sum}; " +
    "};"
    
    # Pass those to the map_reduce helper method
    @results = @comments.map_reduce(map, reduce)
    
    # Since this method returns an instantiated results collection,
    # we just have to query that collection and iterate over the cursor.
    >> @results.find().to_a
    => [{"_id" => "hwaet",   "value"=>{"votes"=>21.0}}, 
        {"_id" => "kbanker", "value"=>{"votes"=>13.0}}
       ]
    

    Practice

    If you’ve followed along, you should understand the basics of map-reduce in MongoDB. For all the details on options, see the docs. For extra practice, fire up the MongoDB shell and experiment away.

  7. MongoDB Aggregation II: Grouping Elaborated

    This is the second in a series of articles on MongoDB’s aggregation functions. In the first installment, we looked at count(), distinct(), and some of the basics of group(). But group() is rather a beast, so here we take an extended example, and look at two deep features: finalizers and key functions.

    Counting and Sums

    Let’s assume we’re hosting a social news application with a collection of comments. A comment document might look like this:

    # Comment document
    { :id         => 'fcfa23342'
      :user_id    => 'a0fb00004',
      :text       => 'mongodb be so funky',
      :upvotes    => 12
      :downvotes  => 4
      :upvoters   => ['a0fb00004', 'a0fb00005', ...]
      :downvoters => ['a0fb00000', 'a0fb00001', ...]
    }
    

    Now, we want to know which users have garnered the greatest number of upvotes and downvotes. With group(), this is straightforward. Our results will include totals for upvotes and downvotes, with a running total counting each comment. Hence, our initial document has three keys:

    // Initial, aggregator document (in JavaScript)
    var setupDoc = {
      upvote_total: 0,
      downvote_total: 0,
      count: 0
    }
    

    Next, we need a reduce function that adds to these totals:

    // Reduce function (in JavaScript)
    var voteAdder = function(doc, prev) {
      prev.upvote_total   += doc.upvotes;
      prev.downvote_total  += doc.downvotes;
      prev.count          += 1;
    }
    

    Finally, we pass these to group(), specifying user_id as the key to group by:

    // In the MongoDB JS shell:
    db.comments.group({
      key:     {user_id: true},
      initial:  setupDoc,
      reduce:   voteAdder
    });
    

    As expected, this returns an array of documents with the totals we sought:

    // Result of group()
    [
      {
        "user_id"   : "a0fc46004",
        "upvote_total"   : 24,
        "downvote_total" : 30,
        "count"     : 2
      },
      { 
        "user_id"   : "a0fc46005",
        "upvote_total"   : 68,
        "downvote_total" : 3,
        "count"     : 3
      }
    ]
    

    Limiting

    If we don’t need totals for the entire collection of comments, we might be tempted to limit our results by modifying the reduce function:

    // Reduce function, totaling comments from 2009
    var voteAdderWithFilter = function(doc, prev) {
      var startDate = new Date(2009, 0).getTime();
      if(doc.created_at.getTime() > startDate) {
        prev.upvote_total  += doc.upvotes;
        prev.downvote_tota += doc.downvotes;
      }
    }
    

    This gets us our totals, albeit inefficiently. The better route is to specify a query filter. That way, we can take advantage of MongoDB’s query engine and any indexes we may have declared. So we add a cond: key to group(), passing in a document selector like we might pass to find():

    // group(), with a query condition
    db.comments.group({
      key:     {user_id: true},
      initial:  setupDoc,
      reduce:   voteAdder,
      cond:    {created_at: {"$gte": new Date(2009, 0)}}
    });
    

    This will assure that our groupings only include comments from 2009.

    The Finalizer

    But what if counting and summing aren’t enough? Maybe we need the average number of votes per user, or perhaps we need to extract a weighted score. Enter the finalizer.

    Sounds foreboding.

    But it’s just an arbitrary function. It receives each of our result documents and performs whatever operations we wish on them:

    // A finalizer for averaging votes and calculating a score
    var finalizer = function(doc) {
      doc.average_upvotes = doc.upvotes / doc.count;
      doc.score = (0.8 * doc.upvotes) - (0.2 * doc.downvotes);
    }
    

    Run group() again, this time specifying the finalizer:

    // group(), with a query condition and finalizer
    db.comments.group({
      key:     {user_id: true},
      initial:  setupDoc,
      reduce:   voteAdder,
      cond:    {created_at: {"$gte": new Date(2009, 0)}},
      finalize: finalizer
    });
    

    And the richer our results become:

    // group() results, enriched with a finalizer
    [
      {
        "user_id"   : "a0fc46004",
        "uptotal"   : 24,
        "downtotal" : 30,
        "count"     : 2
        "average_upvotes": 12,
        "score"          : 13.2
      },
      { 
        "user_id"   : "a0fc46005",
        "uptotal"   : 68,
        "downtotal" : 3,
        "count"     : 3
        "average_upvotes": 22.667,
        "score"          : 53.8
      }
    ]
    

    But wait!

    MongoDB’s group() function has yet another trick up its sleeves (right?!). Because you might not want to group by any of the available keys in your documents, there’s the keyf option. You pass it a function that returns a document to group by.

    This actually makes loads of sense. You may want to group users by last name: A-F, G-L, etc. Or what if you want to arrange pageviews by month, or week, or hour. If you haven’t cached these values, this isn’t so easy to do. But with a keyfunction, you’ve got it made.

    So, to take an arbitrary but potentially interesting example, what if we decided to group our comments by length? We could define our comments as short (< 10 words), medium (between 10 and 50 words), and long (> 50 words), and see if any correlation between comment length and votes exists.

    Can you see how our keyf function would look?

    // A key function for grouping comments as short, medium, and long
    var commentLength = function(comment) {
      var length = comment.text.split(' ').length;
      if(length < 10) {
        return {short: true};
      else if(length >= 10 && length <= 50) {
        return {medium: true};
      else
        return {large: true};
    }
    

    Then, just replace key with keyf:

    // group(), with a query condition and finalizer
    db.comments.group({
      keyf:     commentLength,
      initial:  setupDoc,
      reduce:   voteAdder,
      cond:    {created_at: {"$gte": new Date(2009, 0)}},
      finalize: finalizer
    });
    

    keyf allows us to group our data in many a sane and zany way. Exploring that range of possibility is left as an exercise to the reader.

    Et, violà

    So, that is indeed the gamut of group() as of MongoDB 1.1.4. Please feel free to drop a note on how you’re using it in your apps.

    In the next article, I present the basics of the not-so-widely-understood map-reduce. Curious? Read on…

  8. MongoDB Aggregation I: Counting and Grouping

    This is the first in a series of articles detailing the syntax, patterns, and use cases for MongoDB’s aggregation functions:

    1. Counting and Grouping
    2. Grouping Elaborated
    3. Map-reduce Basics

    Here, as an introduction, we cover the basics of count(), distinct(), and group().

    Count

    count() simply returns the number of documents in a collection.

    Suppose we have a collection where each document represents a pageview. We can get the total number of pageviews like so:

    // Javascript
    db.pageviews.count()
    
    # Ruby
    @db['pageviews'].count
    

    Admittedly more useful would be the number of pageviews from a given month:

    // Javascript
    db.pageviews.count({date: 
      {'$gte': new Date("Nov 1, 2009"), '$lt': new Date("Dec 1, 2009")}})
    
    # Ruby
    @db['pageviews'].find({:date => 
      {'$gte' => Time.utc(2009, 11), '$lt' => Time.utc(2009, 12)}}).count
    

    Notice that we’ve passed a document selector to the count() method. This ensures that only documents matching the selector are included in the total.

    Distinct

    Another simple but useful aggregator is distinct(), which returns an array of unique values for a given key in a collection. We can specify a root-level key or a nested key. Here, we request unique ip-addresses and user agents:

    // Javascript
    db.pageviews.distinct('ip-address')
    db.pageviews.distinct({'user.agent'})
    
    # Ruby
    @db['pageviews'].distinct('ip-address')
    @db['pageviews'].distinct('user.agent')
    

    Group

    MongoDB’s group() command provides some of the same functionality as SQL’s GROUP BY, although the syntax is rather different. Like most database commands, group() takes a document whose keys designate the parameters of the operation:

    // Group by user-agent, returning a sum for each user-agent.
    db.pageviews.group(
    {
     key: {'user.agent': true}, 
     initial: {sum: 0}, 
     reduce: function(doc, prev) { prev.sum += 1}
    });
    
    // Returns (Note: simplified)
    [
      {
        'user.agent': 'Mozilla 5.0 / Gecko'
        'sum': 241
      },
      {
        'user.agent': 'Mozilla 5.0 / Webkit' 
        'sum': 79
      }
    ]
    

    Looking at the command itself, the first parameter is clear enough: key specifies which key or keys to group by.

    The next two parameters, initial and reduce, may be less familiar. Their use is analgous to most programming language implementations of inject.

    With initial, we provide a base, aggregator document for each grouped result. Here, we’re just saying that the base document should contain one key, sum, having a value of 0.

    // Initial, aggregator document
    {sum: 0}
    

    The reduce function take two parameters: 1) the current document being processed, and 2) the document we’re aggregating over (which starts out as the initial document described above).

    // Reduce function
    function(doc, prev) {
      prev.sum += 1;
    }
    

    In this simple case, we don’t even use the doc parameter; we just increment our initial document’s sum key. Since a new document is used for each grouping, the group() function returns all those documents as an array, adding any keys we’re grouping by, so that we’re given a result like this:

    // Result (Note: simplified)
    [
      {
        'user.agent': 'Mozilla 5.0 / Gecko'
        'sum': 241
      },
      {
        'user.agent': 'Mozilla 5.0 / Webkit' 
        'sum': 79
      }
    ]
    

    The command we’ve been describing could be run from Ruby as follows:

    # Group from Ruby
    @db['pageviews'].group(['user.agent'], nil, {'sum' => 0},
      "function(doc, prev) { prev.sum += 1}")
    

    With the second parameter, here nil, we can provide a query selector, so that group() will only operate over a certain subset of the collection. In the next installment, we explore that along with some of group()’s more advanced features.

    In the meantime, fire the up the MongoDB shell, experiment with some of these functions, and don’t hesitate to consult the docs.

  9. MongoDB in Three Minutes

    Couple nights back, I was given three minutes to demonstrate MongoDB before a somewhat large group of people who’d never heard of it. Source code is on github. At one minute each, the highlights are these:

    1. Schema-free documents.

    MongoDB is schema-free; this means that the structure of MongoDB data need not be defined up front. MongoDB stores data as collections of documents. A document can be thought of as a JSON object, Python dictionary, or Ruby hash (among other things). Documents are natural, elegant ways of representing data, and are of the essence of MongoDB.

    Suppose we want to download a few tweets.

    # DB connection and collection
    @db  = Mongo::Connection.new.db(DATABASE_NAME)
    @nyc = @db.collection('nyc')
     
    (1..5).each do |page|
      Twitter::Search.new('nyc').page(page).each do |tweet|
        @nyc.save(tweet)
      end
    end
    

    That gets us the first five pages of ‘nyc’-related tweets and saves them to the our ‘nyc’ collection in the db. For each tweet, the Twitter gem returns a Ruby hash, which saves naturally to the database.

    2. Dynamic queries.

    MongoDB speaks the language of documents, enabling expressive queries. To take a few examples:

    We can query for a specific key:

      @nyc = DB.collection('nyc')
      @nyc.find(:username => "hwaet")  
    

    Or on a nested key pointing to an array:

      @nyc.find('user.followers' => {'234352343'})  
    

    We can search a field using a regular expression:

      @nyc.find('text' => /z.+/})  
    

    Or query across a date range:

      @nyc.find('created_at' => {'$lte' => Time.now - (60*60*24)}})  
    

    And all of this can be efficient because each collection can define up to 40 secondary indexes:

      @nyc.create_index(['text', 1])
      @nyc.create_index(['user.followers', 1])
      @nyc.create_index([from_user, 1], ['created_at', -1])
    

    3. Binary Storage

    What if we want to store images, videos, or music? MongoDB stores arbitrary binary data, too.

    This code goes through our collection of tweets, fetches each user’s profile image, and saves it to the database using GridFS (i.e., MongoDB’s specification for storing larger binary objects).

    @nyc.find.each do |tweet|
      filename = tweet['from_user'].downcase + ".jpg"
      next if GridStore.exist?(@db, filename)
     
      GridStore.open @db, filename, 'w+' do |file|
        data = open(tweet['profile_image_url']).read
        file.content_type = 'image/jpeg'
        file.puts data
      end
    end
    

    If we want to serve those images with sinatra:

    get '/images/:id' do
      content_type "image/jpeg"
      filename = params[:id].downcase + ".jpg"
      GridStore.read(DB, filename)
    end
    

    Speed, scalability, ease of use…

    All this is to say nothing of production case studies or paths to scalability. Interested readers are encouraged to download a binary and start experimenting.