DBPedias

Your Database Knowledge Community

Snail in a Turtleneck

  1. Replica Set Internals Bootcamp Part III: Reconfiguring

    I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

    There are 8 subjects I cover in my bootcamp:

    1. Elections
    2. Creating a set
    3. Reconfiguring
    4. Syncing
    5. Initial Sync
    6. Rollback
    7. Authentication
    8. Debugging

    Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

    Reconfiguring Prerequisites

    One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node and tally up how many votes there will be and if a majority of those is up (the reconfig logic sends out heartbeats).

    Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to become primary, but its priority has to be greater than 0. So, you can’t have all of the members have a priority of 0.

    The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).

    One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).

    Additive Reconfiguration vs. Full Reconfigs

    Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.

    A simple reconfig starts a new heartbeat thread for the new member and it’s done.

    A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.

    We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.

    Propegation of Reconfiguration

    Even if you have a node that is behind on replication or slave delayed, reconfiguration will propegate almost immediately. How? New configs are communicated via heartbeat.

    Suppose you have 2 nodes, A and B.

    You run a reconfig on A, changing the version number from 6 to 7.

    B sends a heartbeat request to A, which includes a field stating that B‘s version number is 6.

    When A gets that heartbeat request, it will see that B‘s config version is less than it’s own, so it’ll send back its config (at version 7) as part of its heartbeat response.

    When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.

    Force reconfig to the face.

    Forcing Reconfig

    Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary. They’d permanently lose a couple servers or a data center and suddenly be stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allowed it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true wouldn’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints. You still have to pass in a valid config. You can just pass it to a secondary.

    Why is my version number 6,203,493?

    When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If we ended up with both sides having a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has a higher version number will “win” and your set will end up with a consistent config.

    It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.

    Removing Nodes and Sharding

    I’d just like to rant for a second: removing nodes sucks! You’d think it’s would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).

    You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo") so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in any way that can be restarted on a dime.

    Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.

    Please let me know what thought of this post, anonymously here or in the comments below:
    Funny (0) Useful (0) Too short (0) Too long (0)
  2. ––thursday #5: diagnosing high readahead

    Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

    The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running Mongo Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. On the chart below, resident is ~3GB:

    Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like at least 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).

    Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we try turning down readahead and the resident size jumps to 15GB and our app starts going faster. But why is this?

    Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.

    Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

    Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --set-ra 100. What happens when we run our application?

    Picture our disk as looking like this, where each o is a document:

    ...
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ... // keep going for millions more o's
    

    Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

    ...
    ooooooooooooooooooooooooo
    ooooxoooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ...
    

    See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

    ...
    ooooooooooooooooooooooooo
    ooooxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ooooooooooooooooooooooooo
    ...
    

    Is your OS returning this with every document?

    Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood of the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increase by 512 bytes (1 doc).

    Our app will keep bouncing around the disk, reading docs from here and there and filing up memory with docs MongoDB never asked for until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.

    Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory, so 75% of the requests are going to go to disk. Say we’re doing 2 requests a sec. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 4800 of which are going to disk (.75 * 7200). If each request pulls back 50KB, that’s 240MB read from disk/hour. If we set readahead to 0, we’ll have 2MB read from disk/hour.

    Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.

    The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

    Please let me know what thought of this post, anonymously here or in the comments below:
    Funny (0) Useful (0) Too short (0) Too long (0)
  3. Night of the Living Dead Ops

    MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

    > db.currentOp()
    {
    	"inprog" : [
    		{
    			"opid" : 3062962,
    			"active" : true,
    			"lockType" : "write",
    			"waitingForLock" : false,
    			"secs_running" : 32267,
    			"op" : "update",
    			"ns" : "httpdb.servers",
    			"query" : {
    				"_id" : "150.237.88.189"
    			},
    			"client" : "127.0.0.1:50416",
    			"desc" : "conn",
    			"threadId" : "0x2900c400",
    			"connectionId" : 74,
    			"killed" : true,			"numYields" : 0
    		},
    		{
    			"opid" : 3063051,
    			"active" : false,
    			"lockType" : "read",
    			"waitingForLock" : true,
    			"op" : "query",
    			"ns" : "",
    			"query" : {
    				"count" : "servers",
    				"query" : {
    					"code" : {
    						"$gte" : 200
    					}
    				}
    			},
    			"client" : "127.0.0.1:30736",
    			"desc" : "conn",
    			"threadId" : "0x29113700",
    			"connectionId" : 191,
    			"killed" : true,			"numYields" : 0
    		}
            ]
    }

    The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

    Chainsaws: the kill -9 of living dead

    It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how deal with it appropriately.

    There are basically three ways MongoDB ops handle getting killed:

    • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
    • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
    • Ones cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

    There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.

    Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and make the db hold the writelock for 10 minutes:

    > db.eval("sleep(10*60*1000)")

    While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

    > db.foo.insert({x:1})

    Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

    > db.currentOp()
    ...
    > // get the opId of the insert from currentOp
    > db.killOp(opId)
    > // run currentOp again to see that killed:true
    > db.currentOp()
    ...

    If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

    • Wait for lock
    • Acquire lock!
    • Start running
    • Yield lock
    • Check for killed

    So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

    Please let me know what thought of this post, anonymously here or in the comments below:
    Funny (0) Useful (0) Too short (0) Too long (0)
  4. The Comments Conundrum

    One of the most common questions we get is:

    I have a collection of blog posts and each post has an array of comments. How do I get…
    …all comments by a given author
    …the most recent comments
    …the most popular commenters?

    And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

    The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.

    For example…

    Getting all comments by Serious Cat

    Serious Cat’s comments are scattered between post documents, so there wasn’t a good way of querying for just those embedded documents. Now there is.

    Let’s assume we want each comment by Serious Cat, along with the title and url of the post Serious Cat was commenting on. So, the steps we need to take are:

    1. Extract the fields we want (title, url, comments)
    2. Unwind the comments field: make each comment into a “real” document
    3. Query our new “comments collection” for “Serious Cat”

    Using the aggregation pipeline, this looks like:

    > db.runCommand({aggregate: "posts", pipeline: [
    {
       // extract the fields 
       $project: {
            title : 1,
            url : 1,
            comments : 1
        }
    },
    {
        // explode the "comments" array into separate documents
        $unwind: "$comments"
    },
    {
        // query like a boss
        $match: {comments.author : "Serious Cat"}
    }]})

    Now, this works well for something like a blog, where you have human-generated (small) data. If you’ve got gigs of comments to go through, you probably want to filter out as many as possible (e.g., with $match or $limit) before sending it to the “everything-in-memory” parts of the pipeline.

    Getting the most recent comments

    Let’s assume our site lists the 10 most recent comments across all posts, with links back to the posts they appeared on, e.g.,

    1. Great post! -Jerry (February 2nd, 2012) from This is a Great Post
    2. What does batrachophagous mean? -Fred (February 2nd, 2012) from Fun with Crosswords
    3. Where can I get discount Prada shoes? -Tom (February 1st, 2012) from Rant about Spam

    To extract these comments from a collection of posts, you could do something like:

    > db.runCommand({aggregate: "posts", pipeline: [
    {
       // extract the fields
       $project: {
            title : 1,
            url : 1,
            comments : 1
        }
    {
        // explode "comments" array into separate documents
        $unwind: "$comments"
    },
    {
        // sort newest first
        $sort: {
            "comments.date" : -1
        }
    },
    {
        // get the 10 newest
        $limit: 10
    }]})

    Let’s take a moment to look at what $unwind does to a sample document.

    Suppose you have a document that looks like this after the $project:

    {
        "url" : "/blog/spam",
        "title" : "Rant about Spam",
        "comments" : [
            {text : "Where can I get discount Prada shoes?", ...},
            {text : "First!", ...},
            {text : "I hate spam, too!", ...},
            {text : "I love spam.", ...}
        ]
    }

    Then, after unwinding the comments field, you’d have:

    {
        "url" : "/blog/spam",
        "title" : "Rant about Spam",
        "comments" : [
            {text : "Where can I get discount Prada shoes?", ...},
        ]
    }
    {
        "url" : "/blog/spam",
        "title" : "Rant about Spam",
        "comments" : [
            {text : "First!", ...}
        ]
    }
    {
        "url" : "/blog/spam",
        "title" : "Rant about Spam",
        "comments" : [
            {text : "I hate spam, too!", ...}
        ]
    },
    {
        "url" : "/blog/spam",
        "title" : "Rant about Spam",
        "comments" : [
            {text : "I love spam.", ...}
        ]
    }

    Then we $sort, $limit, and Bob’s your uncle.

    Rank commenters by popularity

    Suppose we allow users to upvote comments and we want to see who the most popular commenters are.

    The steps we want to take are:

    1. Project out the fields we need (similar to above)
    2. Unwind the comments array (similar to above)
    3. Group by author, taking a count of votes (this will sum up all of the votes for each comment)
    4. Sort authors to find the most popular commenters

    Using the pipeline, this would look like:

    > db.runCommand({aggregate: "posts", pipeline: [
    {
       // extract the fields we'll need
       $project: {
            title : 1,
            url : 1,
            comments : 1
        }
    },
    {
        // explode "comments" array into separate documents
        $unwind: "$comments"
    },
    {
        // count up votes by author
        $group : {
            _id : "$comments.author",
            popularity : {$sum : "$comments.votes"}
        }
    },
    {
        // sort by the new popular field
        $sort: {
            "popularity" : -1
        }
    }]})

    As I mentioned before, there are a couple downsides to using the aggregation pipeline: a lot of the pipeline is done in-memory and can be very CPU- and memory-intensive. However, used judiciously, it give you a lot more freedom to mush around your embedded documents.

    Please let me know what thought of this post, anonymously here or in the comments below:
    Funny (0) Useful (0) Too short (0) Too long (0)
  5. Hacking Chess: Data Munging

    This is a supplement to the Hacking Chess with the MongoDB Pipeline. This post has instructions for rolling your own data sets from chess games.

    Download a collection of chess games you like. I’m using 1132 wins in less than 10 moves, but any of them should work.

    These files are in a format called portable game notation (.PGN), which is a human-readable notation for chess games. For example, the first game in TEN.PGN (helloooo 80s filenames) looks like:

    [Event "?"]
    [Site "?"]
    [Date "????.??.??"]
    [Round "?"]
    [White "Gedult D"]
    [Black "Kohn V"]
    [Result "1-0"]
    [ECO "B33/09"]
    
    1.e4 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6
    5.Nc3 e5 6.Ndb5 d6 7.Nd5 Nxd5 8.exd5 Ne7
    9.c4 a6 10.Qa4  1-0
    

    This represents a 10-turn win at an unknown event. The “ECO” field shows which opening was used (a Sicilian in the game above).

    Unfortunately for us, MongoDB doesn’t import PGNs in their native format, so we’ll need to convert them to JSON. I found a PGN->JSON converter in PHP that did the job here. Scroll down to the “download” section to get the .zip.

    It’s one of those zips that vomits its contents into whatever directory you unzip it in, so create a new directory for it.

    So far, we have:

    $ mkdir chess
    $ cd chess
    $
    $ ftp ftp://ftp.pitt.edu/group/student-activities/chess/PGN/Collections/ten-pg.zip ./
    $ unzip ten-pg.zip
    $
    $ wget http://www.dhtmlgoodies.com/scripts/dhtml-chess/dhtml-chess.zip
    $ unzip dhtml-chess.zip

    Now, create a simple script, say parse.php, to run through the chess matches and output them in JSON, one per line:

    <?php
     
    require("PgnParser.class.php");
     
    $parser = new PgnParser("/path/to/chess/TEN.PGN");
     
    $total = $parser->getNumberOfGames();
    for ($i=0; $i<$total; $i++) {
        echo $parser->getGameDetailsAsJson($i)."\n";
    }
     
    ?>

    Run parse.php and dump the results into a file:

    $ php parse.php > games.json

    Now you’re ready to import games.json.

    Back to the original “hacking” post

    What do you think of this post?
    Funny (0) Useful (0) Too short (0) Too long (0)
  6. Hacking Chess with the MongoDB Pipeline

    MongoDB’s new aggegation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

    Make sure you have a the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

    First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

    You can use mongoimport to import games.json into MongoDB:

    $ mongoimport --db chess --collection quick_wins games.json
    connected to: 127.0.0.1
    imported 1132 objects

    We can take a look at our chess games in the Mongo shell:

    > use chess
    switched to db chess
    > db.fast_win.count()
    1132
    > db.fast_win.findOne()
    {
    	"_id" : ObjectId("4ed3965bf86479436d6f1cd7"),
    	"event" : "?",
    	"site" : "?",
    	"date" : "????.??.??",
    	"round" : "?",
    	"white" : "Gedult D",
    	"black" : "Kohn V",
    	"result" : "1-0",
    	"eco" : "B33/09",
    	"moves" : {
    		"1" : {
    			"white" : {
    				"move" : "e4"
    			},
    			"black" : {
    				"move" : "c5"
    			}
    		},
    		"2" : {
    			"white" : {
    				"move" : "Nf3"
    			},
    			"black" : {
    				"move" : "Nc6"
    			}
    		},
                    ...
    		"10" : {
    			"white" : {
    				"move" : "Qa4"
    			}
    		}
    	}
    }

    Not exactly the greatest schema, but that’s how the chess format exporter munged it. Regardless, now we can use aggregation pipelines to analyze these games.

    Experiment #1: First Mover Advantage

    White has a slight advantage in chess because you move first (Wikipedia says it’s a 52%-56% chance of winning). I’d hypothesize that, in a short game, going first matters even more.

    Let’s find out.

    The “result” field in these docs is “1-0″ if white wins and “0-1″ if black wins. So, we want to divide our docs into two groups based on the “result” field and count how many docs are in each group. Using the aggregation pipeline, this looks like:

    > db.runCommand({aggregate : "fast_win", pipeline : [
    ... {
    ...    $group : {
    ...        _id : "$result",      // group by 'result' field
    ...        numGames : {$sum : 1} // add 1 for every document in the group
    ...    }
    ... }]})
    {
    	"result" : [
    		{
    			"_id" : "0-1",
    			"numGames" : 435
    		},
    		{
    			"_id" : "1-0",
    			"numGames" : 697
    		}
    	],
    	"ok" : 1
    }

    That gives a 62% chance white will win (697 wins/1132 total games). Pretty good (although, of course, this isn’t a very large sample set).

    In case you're not familiar with it, a reference chessboard with 1-8, a-h marked.

    Experiment #2: Best Starting Move

    Given a starting move, what percent of the time will that move lead to victory? This probably depends on whether you’re playing white or black, so we’ll just focus on white’s opening move.

    First, we’ll just determine what starting moves white uses with this series of steps:

    • project all of white’s first moves (the moves.1.white.move field)
    • group all docs with the same starting move together
    • and count how many documents (games) used that move.

    These steps look like:

    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // '$project' is used to extract all of white's opening moves
    ... {
    ...     $project : {
    ...         // extract moves.1.white.move into a new field, firstMove
    ...         firstMove : "$moves.1.white.move"
    ...     }
    ... },
    ... // use '$group' to calculate the number of times each move occurred
    ... {
    ...     $group : { 
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1}
    ...     }
    ... }]})
    {
    	"result" : [
    		{
    			"_id" : "d3",
    			"numGames" : 2
    		},
    		{
    			"_id" : "e4",
    			"numGames" : 696
    		},
    		{
    			"_id" : "b4",
    			"numGames" : 17
    		},
    		{
    			"_id" : "g3",
    			"numGames" : 3
    		},
    		{
    			"_id" : "e3",
    			"numGames" : 2
    		},
    		{
    			"_id" : "c4",
    			"numGames" : 36
    		},
    		{
    			"_id" : "b3",
    			"numGames" : 4
    		},
    		{
    			"_id" : "g4",
    			"numGames" : 11
    		},
    		{
    			"_id" : "h4",
    			"numGames" : 1
    		},
    		{
    			"_id" : "Nf3",
    			"numGames" : 37
    		},
    		{
    			"_id" : "f3",
    			"numGames" : 1
    		},
    		{
    			"_id" : "f4",
    			"numGames" : 25
    		},
    		{
    			"_id" : "Nc3",
    			"numGames" : 14
    		},
    		{
    			"_id" : "d4",
    			"numGames" : 283
    		}
    	],
    	"ok" : 1
    }

    Now let’s compare those numbers with whether white won or lost.

    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // extract the first move
    ... {
    ...    $project : {
    ...        firstMove : "$moves.1.white.move",
    ...        // create a new field, "win", which is 1 if white won and 0 if black won
    ...        win : {$cond : [
    ...            {$eq : ["$result", "1-0"]}, 1, 0
    ...        ]}
    ...    }
    ... },
    ... // group by the move and count up how many winning games used it
    ... {
    ...     $group : {
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1},
    ...         numWins : {$sum : "$win"}
    ...     }
    ... },
    ... // calculate the percent of games won with this starting move
    ... {
    ...     $project : {
    ...         _id : 1,
    ...         numGames : 1,
    ...         percentWins : {
    ...             $multiply : [100, {
    ...                 $divide : ["$numWins","$numGames"]
    ...             }]
    ...         }
    ...     }
    ... },
    ... // discard moves that were used in less than 10 games (probably not representative) 
    ... {
    ...     $match : {
    ...         numGames : {$gte : 10}
    ...     }
    ... },
    ... // order from worst to best
    ... {
    ...     $sort : {
    ...         percentWins : 1
    ...     }
    ... }]})
    {
    	"result" : [
    		{
    			"_id" : "f4",
    			"numGames" : 25,
    			"percentWins" : 24
    		},
    		{
    			"_id" : "b4",
    			"numGames" : 17,
    			"percentWins" : 35.294117647058826
    		},
    		{
    			"_id" : "c4",
    			"numGames" : 36,
    			"percentWins" : 50
    		},
    		{
    			"_id" : "d4",
    			"numGames" : 283,
    			"percentWins" : 50.53003533568905
    		},
    		{
    			"_id" : "g4",
    			"numGames" : 11,
    			"percentWins" : 63.63636363636363
    		},
    		{
    			"_id" : "Nf3",
    			"numGames" : 37,
    			"percentWins" : 67.56756756756756
    		},
    		{
    			"_id" : "e4",
    			"numGames" : 696,
    			"percentWins" : 68.24712643678161
    		},
    		{
    			"_id" : "Nc3",
    			"numGames" : 14,
    			"percentWins" : 78.57142857142857
    		}
    	],
    	"ok" : 1
    }

    Pawn to e4 seems like the most dependable winner here. Knight to c3 also seems like a good choice (at a nearly 80% win rate), but it was only used in 14 winning games.

    Experiment #3: Best and Worst Moves for Black

    We basically want to do a similar pipeline to Experiment 2, but for black. At the end, we want to find the best and worst percent.

    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // extract the first move
    ... {
    ...    $project : {
    ...        firstMove : "$moves.1.black.move",
    ...        win : {$cond : [
    ...            {$eq : ["$result", "0-1"]}, 1, 0
    ...        ]}
    ...    }
    ... },
    ... // group by the move and count up how many winning games used it
    ... {
    ...     $group : {
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1},
    ...         numWins : {$sum : "$win"}
    ...     }
    ... },
    ... // calculate the percent of games won with this starting move
    ... {
    ...     $project : {
    ...         _id : 1,
    ...         numGames : 1,
    ...         percentWins : {
    ...             $multiply : [100, {
    ...                 $divide : ["$numWins","$numGames"]
    ...             }]
    ...         }
    ...     }
    ... },
    ... // discard moves that were used in less than 10 games (probably not representative) 
    ... {
    ...     $match : {
    ...         numGames : {$gte : 10}
    ...     }
    ... },
    ... // get the best and worst
    ... {
    ...     $group : {
    ...          _id : 1,
    ...          best : {$max : "$_id"},
    ...          worst : {$min : "$_id"}
    ...     }
    ... }]})
    {
    	"result" : [
    		{
    			"_id" : 1,
    			"best" : "g6",
    			"worst" : "Nc6"
    		}
    	],
    	"ok" : 1
    }

    “Nc6″ means “move the knight to c6.” Or, rather, don’t, because it doesn’t tend to work out that well.

    I like this new aggregation functionality because it’s feels simpler than MapReduce. You can start with a one-operation pipeline and build it up, step-by-step, seeing exactly what a given operation does to your output. (And no Javascript required, which is always a plus.)

    There’s lots more documentation on aggregation pipelines in the docs and I’ll be doing a couple more posts on it.

    What do you think of this post?
    Funny (0) Useful (0) Too short (0) Too long (0)
  7. Querying for Timestamps Using ObjectIds

    ObjectIds contain a timestamp, which tells you when the document was created. Because the _id field is always indexed, that means you have a “free” index on your “created at” time (unless you have persnickety requirements for creation times, like resolutions of less than a second, synchronization across app servers, etc.).

    Actually using this index can seem daunted (how do you use an ObjectId to query for a certain date?) so let’s run through an example.

    First, let’s insert 100 sample docs, 10 docs/second.

    > for (i=0; i<10; i++) { 
    ... print(i+": "+Date.now()); 
    ... for (j=0; j<10; j++) { 
    ...    db.foo.insert({x:i,y:j}); 
    ... } 
    ... sleep(1000); 
    ... }
    0: 1324417241111
    1: 1324417242112
    2: 1324417243112
    3: 1324417244113
    4: 1324417245114
    5: 1324417246115
    6: 1324417247115
    7: 1324417248116
    8: 1324417249117
    9: 1324417250117

    Let’s find all entries created after 1324417246115 (i=5).

    The time is currently in milliseconds (that’s how JavaScript does dates), so we’ll have to convert it to seconds:

    > secs = Math.floor(1324417246115/1000)
    1324417246

    (Your secs will be different than mine, of course.)

    ObjectIds can be constructed from a 24-character string, each two characters representing a byte (e.g., “ff” is 255). So, we need convert secs to hexidecimal, which luckily is super-easy in JavaScript:

    > hexSecs = secs.toString(16)
    4ef100de

    Now, we create an ObjectId from this:

    > id = ObjectId(hexSecs+"0000000000000000")
    ObjectId("4ef100de0000000000000000")

    If you get the wrong number of zeros here, you’ll get an error message that is, er, hard to miss.

    Now, we query for everything created after this timestamp:

    > db.foo.find({_id : {$gt : id}})
    { "_id" : ObjectId("4ef100de7d435c39c3016405"), "x" : 5, "y" : 0 }
    { "_id" : ObjectId("4ef100de7d435c39c3016406"), "x" : 5, "y" : 1 }
    { "_id" : ObjectId("4ef100de7d435c39c3016407"), "x" : 5, "y" : 2 }
    { "_id" : ObjectId("4ef100de7d435c39c3016408"), "x" : 5, "y" : 3 }
    { "_id" : ObjectId("4ef100de7d435c39c3016409"), "x" : 5, "y" : 4 }
    { "_id" : ObjectId("4ef100de7d435c39c301640a"), "x" : 5, "y" : 5 }
    { "_id" : ObjectId("4ef100de7d435c39c301640b"), "x" : 5, "y" : 6 }
    { "_id" : ObjectId("4ef100de7d435c39c301640c"), "x" : 5, "y" : 7 }
    { "_id" : ObjectId("4ef100de7d435c39c301640d"), "x" : 5, "y" : 8 }
    { "_id" : ObjectId("4ef100de7d435c39c301640e"), "x" : 5, "y" : 9 }
    { "_id" : ObjectId("4ef100df7d435c39c301640f"), "x" : 6, "y" : 0 }
    { "_id" : ObjectId("4ef100df7d435c39c3016410"), "x" : 6, "y" : 1 }
    { "_id" : ObjectId("4ef100df7d435c39c3016411"), "x" : 6, "y" : 2 }
    { "_id" : ObjectId("4ef100df7d435c39c3016412"), "x" : 6, "y" : 3 }
    { "_id" : ObjectId("4ef100df7d435c39c3016413"), "x" : 6, "y" : 4 }
    { "_id" : ObjectId("4ef100df7d435c39c3016414"), "x" : 6, "y" : 5 }
    { "_id" : ObjectId("4ef100df7d435c39c3016415"), "x" : 6, "y" : 6 }
    { "_id" : ObjectId("4ef100df7d435c39c3016416"), "x" : 6, "y" : 7 }
    { "_id" : ObjectId("4ef100df7d435c39c3016417"), "x" : 6, "y" : 8 }
    { "_id" : ObjectId("4ef100df7d435c39c3016418"), "x" : 6, "y" : 9 }
    Type "it" for more

    If we look at the explain for the query, you can see that it’s using the index:

    > db.foo.find({_id:{$gt:id}}).explain()
    {
    	"cursor" : "BtreeCursor _id_",
    	"nscanned" : 50,
    	"nscannedObjects" : 50,
    	"n" : 50,
    	"millis" : 0,
    	"nYields" : 0,
    	"nChunkSkips" : 0,
    	"isMultiKey" : false,
    	"indexOnly" : false,
    	"indexBounds" : {
    		"_id" : [
    			[
    				ObjectId("4ef100de0000000000000000"),
    				ObjectId("ffffffffffffffffffffffff")
    			]
    		]
    	}
    }

    We’re not quite done, because we’re actually not returning what we wanted: we’re getting all docs greater than or equal to the “created at” time, not just greater than. To fix this, we’d just need to add 1 to the secs before doing anything else. Or I can claim that we were querying for documents created after i=4 all along.

    What do you think of this post?
    Funny (0) Useful (0) Too short (0) Too long (0)
  8. SQL to MongoDB: An Updated Mapping

    Rick Osborne's original chart.

    The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.

    In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.

    Here is the original SQL:

    SELECT
      Dim1, Dim2,
      SUM(Measure1) AS MSum,
      COUNT(*) AS RecordCount,
      AVG(Measure2) AS MAvg,
      MIN(Measure1) AS MMin
      MAX(CASE
        WHEN Measure2 < 100
        THEN Measure2
      END) AS MMax
    FROM DenormAggTable
    WHERE (Filter1 IN (’A’,’B’))
      AND (Filter2 = ‘C’)
      AND (Filter3 > 123)
    GROUP BY Dim1, Dim2
    HAVING (MMin > 0)
    ORDER BY RecordCount DESC
    LIMIT 4, 8

    We can break up this statement and replace each piece of SQL with the new aggregation pipeline syntax:

    MongoDB Pipeline MySQL
    aggregate: "DenormAggTable"
    FROM DenormAggTable
    {
        $match : {
            Filter1 : {$in : ['A','B']},
            Filter2 : 'C',
            Filter3 : {$gt : 123}
        }
    }
    WHERE (Filter1 IN (’A’,’B’))
      AND (Filter2 = ‘C’)
      AND (Filter3 > 123)
    {
        $project : {
            Dim1 : 1,
            Dim2 : 1,
            Measure1 : 1,
            Measure2 : 1,
            lessThanAHundred : {
                $cond: [ 
                    {$lt: ["$Measure2", 100] },
                    "$Measure2", // if
                    0]           // else
            }
        }
    }
    CASE
      WHEN Measure2 < 100
      THEN Measure2
    END
    {
        $group : {
            _id : {Dim1 : 1, Dim2 : 1},
            MSum : {$sum : "$Measure1"},
            RecordCount : {$sum : 1},
            MAvg : {$avg : "$Measure2"},
            MMin : {$min : "$Measure1"},
            MMax : {$max : "$lessThanAHundred"}
        }
    }
    SELECT
      Dim1, Dim2,
      SUM(Measure1) AS MSum,
      COUNT(*) AS RecordCount,
      AVG(Measure2) AS MAvg,
      MIN(Measure1) AS MMin
      MAX(CASE
        WHEN Measure2 < 100
        THEN Measure2
      END) AS MMax
     
    GROUP BY Dim1, Dim2
    {
        $limit : 4
    },
    {
        $skip : 8
    }
    LIMIT 4, 8
    {
        $match : {MMin : {$gt : 0}}
    }
    HAVING (MMin > 0)
    {
        $sort : {RecordCount : -1}
    }
    ORDER BY RecordCount DESC

    Putting all of these together gives you your pipeline:

    > db.runCommand({aggregate: "DenormAggTable", pipeline: [
    {
        $match : {
            Filter1 : {$in : ['A','B']},
            Filter2 : 'C',
            Filter3 : {$gt : 123}
        }
    },
    {
        $project : {
            Dim1 : 1,
            Dim2 : 1,
            Measure1 : 1,
            Measure2 : 1,
            lessThanAHundred : {$cond: [{$lt: ["$Measure2", 100]}, {
                "$Measure2",
                0]
            }
        }
    },
    {
        $group : {
            _id : {Dim1 : 1, Dim2 : 1},
            MSum : {$sum : "$Measure1"},
            RecordCount : {$sum : 1},
            MAvg : {$avg : "$Measure2"},
            MMin : {$min : "$Measure1"},
            MMax : {$max : "$lessThanAHundred"}
        }
    },
    {
        $limit : 4
    },
    {
        $skip : 8
    },
    {
        $match : {MMin : {$gt : 0}}
    },
    {
        $sort : {RecordCount : -1}
    }
    ]})

    As you can see, the SQL matches the pipeline operations pretty clearly. If you want to play with it, it’ll be available soon to a the development nightly build.

    If you’re at MongoSV today (December 9th, 2011), check out Chris Westin’s talk on the new aggregation framework at 3:45 in room B4.

    What do you think of this post?
    Funny (0) Useful (0) Too short (0) Too long (0)
  9. On working at 10gen

    10gen is trying to hire a gazillion people, so I’m averaging two interviews a day (bleh). A lot of people have asked what it’s like to work on MongoDB, so I thought I’d write a bit about it.

    A Usual Day

    Coffee: the lynchpin of my day.

    • Get in around 10am.
    • Check if there are any commercial support questions that need to be answered right now.
    • Have a cup of coffee and code until lunch.
    • Eat lunch.
    • If nothing dire has happened, go out for coffee+writing. This refuels my brain and is a creative outlet: that’s where I am now. My coffee does not look nearly as awesome as the coffee on the right.
    • Go back to the office, code all afternoon.
    • Depending on the day, usually between 5:30 and 6:30 the programmers will naturally start discussing problems we had over the day, interviews, support, the latest geek news, etc. Often beers are broken out.
    • Wrap up, go home.

    There are some variations on this: as I mentioned, a lot of time lately is taken up by interviewing. Other coworkers spend a lot more time than I do at consults, trainings, speaking at conferences, etc.

    Other General Workday Stuff

    On Fridays, we have lunch as a team. After lunch, we have a tech talk where someone presents on what they’re working on (e.g., the inspiration for my geospatial post) or general info that’s good to know (e.g., the inspiration for my virtual memory post). This is a nice way to end the week, especially since Fridays often wrap up earlier than other days.

    A couple people use OS X or Windows for development, most people use Linux. You can use whatever you want. I’d like to encourage emacs users, in particular, to apply, as we’re falling slightly behind vi in numbers.

    We sit in an open office plan, everyone at tables in a big room (including the CEO and CTO, who are both programmers). The only people in separate rooms are the people who have to be on the phone all day (sales, marketers, basketweavers… I’m not really clear on what non-technical people do).

    And speaking of what people actually do, here are three examples of my job (that are more specific than “coding”):

    Fixing Other People’s Bugs

    Recently, a developer was using MongoDB and IBM’s DB2 with PHP. After he installed the MongoDB driver, PHP started segfaulting all over the place. I downloaded the ibm_db2 PHP extension to take a look.

    PHP keeps a “storage unit” for extensions’ long-term memory use. Every extension shares the space and can store things there.

    The DB2 extension was basically fire-bombing the storage unit.

    It went through the storage, object by object, casting the objects into DB2 types and then freeing them. This worked fine when DB2 was the only PHP extension being used, but broke down when anyone else tried to use that storage. I gave the user a small patch that stopped the DB2 extension from destroying objects it didn’t create, and everything worked fine for them, after that.

    The Game is Afoot

    A user reported that they couldn’t initialize their replica set: a member wasn’t coming online. The trick with this type of bug to get enough evidence before the user wants to beat you over the head with the 800th log you’ve requested.

    I asked them to send the first round of logs. It was weird, nothing was wrong from server1‘s point of view: it initialized properly and could connect to everyone in the set. I puzzled over the messages, figuring out that once server1 had created the set, server2 had accepted the connection from server1 but then somehow failed to connect back to server1 and so couldn’t pick up the set config. However, according to server1, it could connect fine to server2 and thought it was perfectly healthy!

    I finally realized what must be happening: “It looks like server2 couldn’t connect to any of the others, but all of them could connect to it. Could you check your firewall?”

    “Oh, that server was blocking all outgoing connections! Now its working fine.”

    Elementary, my dear Watson.

    You know you’re not at a big company when…

    At least it had "handles."

    Someone on Sparc complained that the Perl driver wasn’t working at all for them. My first thought was that Sparc is big-endian, so maybe the Perl driver wasn’t flipping memory correctly. I asked Eliot where our Power PC was, and he said we must have forgotten it when we moved: it was still in our old office around the corner.

    “Bring someone to help carry it,” he told me. “It’s heavy.”

    Pshaw, I thought. How heavy could an old desktop be?

    I went around the corner and the other company graciously let me walk into their server room, choose a server, and walk out with it. Unfortunately, it weighed about 50 pounds, and I have a traditional geek physique (no muscles). The trip back to our office involved me staggering a couple steps, putting it down, shaking out my arms, and repeat.

    When I got to our office, I just dragged it down the hallway to our server closet. Eliot saw me tugboating the thing down the hallway.

    “You didn’t bring someone to help?”

    “It’s *oof* fine!”

    Unfortunately, once it was all set up, the Perl driver worked perfectly on it. So it wasn’t big-endian specific.

    I was now pretty sure it was Sparc-specific (another person had reported the same problem on a Sparc), so I bought an elderly Sparc server for a couple hundred bucks off eBay. When it arrived a couple days later, Eliot showed me how to rack it and I spent a day fighting with the Solaris/Oracle package manager. However, it was all worth it: I tried running the Perl driver and it instantly failed (success!).

    After some debugging, I realized that Sparc was much more persnickety than Intel about byte alignment. The Perl driver was playing fast and loose with a byte buffer, casting pieces of it into other types (which Sparc didn’t like). I changed some casts to memcpys and the Perl driver started working beautifully.

    But every day is different

    The episodes above are a very small sample of what I do: there are hundreds of other things I’ve worked on over the last few years from speaking to working on the database to writing a freakin Facebook app.

    So, if this sounded interesting, please go to our jobs website and submit an application!

  10. Getting Started with MMS

    Telling someone “You should set up monitoring” is kind of like telling someone “You should exercise 20 minutes three times a week.” Yes, you know you should, but your chair is so comfortable and you haven’t keeled over dead yet.

    For years*, 10gen has been planning to do monitoring “right,” making it painless to monitor your database. Today, we released the MongoDB Monitoring Service: MMS.

    MMS is free hosted monitoring for MongoDB. I’ve been using it to help out paying customers for a while, so I thought I’d do a quick post on useful stuff I’ve discovered (documentation is… uh… a little light, so far).

    So, first: you sign up.

    There are two options: register a company and register another account for an existing company. For example, let’s say I wanted to monitor the servers for Snail in a Turtleneck Enterprises. I’ll create a new account and company group. Then Andrew, sys admin of my heart, can create an account with Snail in a Turtleneck Enterprises and have access to all the same monitoring info.

    Once you’re registered, you’ll see a page encouraging you to download the MMS agent. Click on the “download the agent” link.

    This is a little Python program that collects stats from MongoDB, so you need to have pymongo installed, too. Starting from scratch on Ubuntu, do:

    $ # prereqs
    $ sudo apt-get install python python-setuptools
    $ sudo easy_install pymongo
    $
    $ # set up agent
    $ unzip name-of-agent.zip
    $ cd name-of-agent
    $ mkdir logs
    $
    $ # start agent
    $ nohup python agent.py > logs/agent.log 2>&1 &

    Last step! Back to the website: see that “+” button next to the “Hosts” title?

    Designed by programmers, for Vulcans

    Click on that and type a hostname. If you have a sharded cluster, add a mongos. If you have a replica set, add any member.

    Now go have a nice cup of coffee. This is an important part of the process.

    When you get back, tada, you’ll have buttloads of graphs. They probably won’t have much on them, since MMS will have been monitoring them for all of a few minutes.

    Cool stuff to poke

    This is the top bar of buttons:

    Of immediate interest: click “Hosts” to see a list of hosts.

    You’ll see hostname, role, and the last time the MMS agent was able to reach this host. Hosts that it hasn’t reached recently will have a red ping time.

    Now click on a server’s name to see all of the info about it. Let’s look at a single graph.

    You can click & drag to see a smaller bit of time on the graph. See those icons in the top right? Those give you:

    +
    Add to dashboard: you can create a custom dashboard with any charts you’re interested in. Click on the “Dashboard” link next to “Hosts” to see your dashboard.
    Link
    Link to a private URL for this chart. You’ll have to be logged in to see it.
    Email
    Email a jpg of this chart to someone.
    i
    This is maybe the most important one: a description of what this chart represents.

    That’s the basics. Some other points of interest:

    • You can set up alerts by clicking on “Alerts” in the top bar
    • “Events” shows you when hosts went down or came up, because primary or secondary, or were upgraded.
    • Arbiters don’t have their own chart, since they don’t have data. However, there is an “Arbiters” tab that lists them if you have some.
    • The “Last Ping” tab contains all of the info sent by MMS on the last ping, which I find interesting.
    • If you are confused, there is an “FAQ” link in the top bar that answers some common questions.

    If you have any problems with MMS, there’s a little form at the bottom to let you complain:

    This will file a bug report for you. This is a “private” bug tracker, only 10gen and people in your group will be able to see the bugs you file.

    * If you ran mongod --help using MongoDB version 1.0.0 or higher, you might have noticed some options that started with --mms. In other words, we’ve been planning this for a little while.

  1. 1
  2. Next ›
  3. Last »