DBPedias

Your Database Knowledge Community

paperplanes

  1. The Virtues of Monitoring

    Over the last year I haven't only grown very fond of coffee, but also of infrastructure. Working on Scalarium has been a fun ride so far, for all kinds of reasons, one of them being that it involves dealing so much with infrastructure. Being an infrastructure platform provider, what can you do, right?

    Being responsible for deployment, performance tuning and monitoring, infrastructure has always been a part of many of my jobs, so I thought it'd be about time to sprinkle some of my thoughts and daily ops experiences over a couple of articles. The simple reason being that no matter how much you try, no matter how far away from dealing with servers you go (think Heroku), there will always be infrastructure, and it will always affect you and your application in some way.

    On today's menu: monitoring. People have all kinds of different meanings for monitoring, and they're all right, because there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no fewer than six levels of detail you can and probably should get. Note that these are my definitions. They don't necessarily have official names, they're solely based on my experiences. Let's start from the top, the outside view of your application.

    Availability Level

    Availability is a simple measure to the user, either your site is available or it's not. There is nothing in between. When it's slow, it's not available. It's a beautifully binary measure really. From your point of view, any component or layer in your infrastructure could be the problem. The art is to quickly find out which one it is.

    So how do you notice when your site is not available? Waiting for your users to tell you is an option, but generally a pretty embarrassing one. Instead you generally start polling some page of your site that's representative of it as a whole. When that particular page is not available, your whole application may as well not be.

    What that page should do is get a quick measure of the most important components of your site, check if they're available (maybe even with a timeout involved so you get an idea if a specific component is broken) and return the result. An external process can then monitor that page and notify you when it doesn't return the expected result. Make sure the page does a bit more than just return "OK". If it doesn't hit any of the major components in your stack, there's a chance you're not going to notice that e.g. your database is becoming unavailable.
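
    A minimal sketch of such a check, using only Ruby's standard library. The component names and lambdas are hypothetical placeholders; in a real application each one would run a cheap query or ping against the actual component, and the result would be rendered by the page your external monitor polls.

    ```ruby
    require 'timeout'

    # Runs every check with its own timeout and records a status per component.
    # `checks` maps a component name to a lambda returning true when healthy.
    def run_health_checks(checks, timeout_seconds = 2)
      checks.each_with_object({}) do |(name, check), results|
        results[name] =
          begin
            Timeout.timeout(timeout_seconds) { check.call ? 'ok' : 'failed' }
          rescue Timeout::Error
            'timeout'
          rescue StandardError => e
            "error: #{e.class}"
          end
      end
    end

    # 200 only if every component is fine, 503 otherwise.
    def overall_status(results)
      results.values.all? { |status| status == 'ok' } ? 200 : 503
    end

    # Hypothetical checks, standing in for real database/cache/queue probes.
    checks = {
      'database' => -> { true },
      'cache'    => -> { raise IOError, 'connection refused' },
      'queue'    => -> { sleep 1; true }
    }
    results = run_health_checks(checks, 0.1)
    puts results.inspect
    puts overall_status(results)
    ```

    Returning a non-200 status code on any failure keeps the external monitor simple, since it only has to compare the response against one expected result.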

    You should run this process from a different host, but what do you do if that host is not available? Even as an infrastructure provider I like outsourcing parts of my own infrastructure. Here's where Pingdom comes into play. They can monitor a specific URL, TCP ports and whatnot from some two dozen locations across the planet and they randomly go through all of them, notifying you when your site is unavailable or the result doesn't match the expectations.

    Pingdom

    Business Level

    These aren't necessarily metrics related to your application's or infrastructure's availability, they're more along the lines of what your users are doing right now, or have done over the last month. Think number of new users per day, number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google Analytics or click paths (using tools like Hummingbird, for example) in general also fall into this category.

    These kinds of metrics may be more important to your business than to your infrastructure, but they're important nonetheless, and they could e.g. be integrated with another metrics collection tool, some of which we'll get to in a minute. Depending on what kind of data you're gathering they're also useful to analyze spikes in your application's performance.

    This kind of data can be hard to track in a generic way. Usually it's up to your application to gather it and turn it into a format that a different tool can collect. It's also usually very specific to your application and its business model.

    Application Level

    Digging deeper from the outsider's view, you want to be able to track what's going on inside of your application right now. What are the main entry points, what are the database queries involved, where are the hot spots, which queries are slow, what kinds of errors are being caused by your application, to name a few.

    This will give you an overview of the innards of your code, and it's simply invaluable to have that kind of insight. You usually don't need much historical data in this area, just a couple of days' worth will usually be enough to analyze problems in retrospect. It can't hurt to keep it around though, because growth also shows trends in potential application code hot spots or database queries getting slower over time.

    To get an inside view of your application, services like New Relic exist. While their services aren't exactly cheap (most monitoring services aren't, no surprise here), they're invaluable. You can dig down from the Rails controller level to find the method calls and database queries that are slowest at a given moment in time (most likely you'll be wanting to check the data for the last hours to analyze an incident), digging deeper into other metrics from there. Here's an example of what it looks like.

    New Relic

    You can also use the Rails log file and tools like Request-log-analyzer. They can help you get started for free, but don't expect a similar, fine-grained level of detail like you get with New Relic. However, with Rails 3 it's become a lot easier to instrument code that's interesting to you and gather data on runtimes of specific methods yourself.

    Other means are e.g. JMX, one of the neat features you get when using a JVM-based language like JRuby. Your application can continuously collect and expose metrics through a defined interface to be inspected or gathered by other means. JMX can even be used to call into your application from the outside, without having to go through a web interface.

    Application level monitoring also includes exception reporting. Services like Exceptional or Hoptoad are probably the most well known in that area, though in higher price regions New Relic also includes exception reporting.

    Process Level

    Going deeper (closer to inception than you think) from the application level we reach the processes that serve your application. Application servers, databases, web servers, background processing, they all need a process to be available.

    But processes crash. It's a bitter and harsh truth, but they do, for whatever reason, maybe they consumed too many resources, causing the machine to swap or the process to simply crash because the machine doesn't have any memory left to allocate. Think of a memory leaking Rails application server process or the last time you used RMagick.

    Someone must ensure that the processes keep running or that they don't consume more resources than they're allowed to, to ensure availability on that level. These tools are called supervisors. Give them a pid file and a process, running or not, and they'll make sure that it is. Whether a process counts as running can depend on multiple metrics: availability over the network, a file size (think log files) or simply the existence of the process. Supervisors also allow you to set some sort of grace period, so they'll retry a number of times with a timeout before actually restarting the process or giving up monitoring it altogether.
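
    To illustrate the idea (this is not how Monit or any other supervisor is actually implemented), here's a rough Ruby sketch of an existence check combined with a grace period. The method names, retry count and delay are made up for the example.

    ```ruby
    # True if a process with this pid exists. Signal 0 only probes the process,
    # it doesn't actually deliver a signal.
    def process_alive?(pid)
      Process.kill(0, pid)
      true
    rescue Errno::ESRCH
      false
    rescue Errno::EPERM
      true # the process exists, but belongs to another user
    end

    # Runs the given check up to `retries` times, waiting `delay` seconds
    # between attempts. Returns true as soon as one attempt succeeds.
    def healthy_within_grace?(retries:, delay:)
      retries.times do |attempt|
        return true if yield
        sleep(delay) if attempt < retries - 1
      end
      false
    end

    # Probe our own process as a stand-in for a pid read from a pid file.
    pid = Process.pid
    if healthy_within_grace?(retries: 3, delay: 1) { process_alive?(pid) }
      puts "process #{pid} is running"
    else
      puts "process #{pid} is gone, time to restart it and alert someone"
    end
    ```

    A real supervisor would read the pid from the pid file, run the configured start command when the check fails, and give up after a number of failed restarts.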

    A good supervisor will also let you alert someone when the expected conditions move outside of their acceptable perimeter and a process had to be restarted. A classic in this area is Monit, but people also like God and Bluepill. On a lower level you have tools like runit or upstart, but their capabilities are usually built around a pid file and a process, and they don't allow higher-level checks on system resources.

    While I find the syntax of Monit's configuration to not be very aesthetically pleasing, it's proven to be reliable and has a very small footprint on the system, so it's our default on our own infrastructure, and we add it to most of our cookbooks for Scalarium, as it's installed on all managed instances anyway. It's a matter of preference.

    Infrastructure/Server Level

    Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, network traffic, are all traditional metrics collected on this level. The tools (both commercial and open source) in this area can't be counted. In the open source world, the main means to visualize these kinds of metrics is rrdtool. Many tools use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the data efficiently.

    This data is very important in several ways. For one, it will show you what your servers are doing right now, or in the last couple of minutes, which is usually enough to notice a problem. Second, the data collected is very useful to discover trends, e.g. memory usage increasing over time, swap usage increasing, or a partition running out of disk space. Any value constantly increasing over time is a good sign that you'll hit a wall at some point. Noticing trends will usually give you a good indication that something needs to be changed in your infrastructure or your application.
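
    As a rough illustration of what noticing a trend can mean in numbers, the following Ruby sketch fits a straight line through a handful of disk usage samples and estimates how many days are left until the partition is full. The sample values are invented, and the method is only a simple least-squares estimate, not what any of the tools mentioned here do internally.

    ```ruby
    # samples: array of [day, used_percent] pairs, oldest first.
    # Returns the estimated number of days until `capacity` is reached,
    # counted from the last sample, or nil if usage isn't growing.
    def days_until_full(samples, capacity = 100.0)
      n = samples.size.to_f
      mean_x = samples.sum { |day, _| day } / n
      mean_y = samples.sum { |_, used| used } / n
      covariance = samples.sum { |day, used| (day - mean_x) * (used - mean_y) }
      variance = samples.sum { |day, _| (day - mean_x)**2 }
      slope = covariance / variance # growth in percentage points per day
      return nil unless slope.positive?

      _last_day, last_used = samples.last
      (capacity - last_used) / slope
    end

    # Invented samples: usage grows by two percentage points per day.
    samples = [[0, 50.0], [1, 52.0], [2, 54.0], [3, 56.0]]
    puts days_until_full(samples) # => 22.0
    ```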

    Munin

    There's a countless number of tools in this area, Munin (see screenshot), Nagios, Ganglia, collectd on the open source end, and CloudKick, Circonus, Server Density and Scout on the paid service level, and an abundance of commercial tools on the very expensive end of server monitoring. I never really bother with the commercial ones, because I either resort to the open source tools or pay someone to take care of the monitoring and alerting for me on a service basis. Most of these tools will run some sort of agent on every system, collecting data in a predefined cycle and delivering it to a master process, or the master process picks up the data from the agents.

    Again, it's a matter of taste. Most of the open source tools available tend to look pretty ugly on the UI part, but if the data and the graphs are all that matters to you, they'll do just fine. We do our own server monitoring using Server Density, but on Scalarium we resort to using Ganglia as an integrated default, because it's much more cost effective for our users, and given the elastic nature of EC2 it's much easier for us to add and remove instances as they come and go. In general I'm also a fan of Munin.

    Most of them come with some sort of alerting that allows you to define thresholds which trigger the alerts. You'll never get the thresholds right the first time you configure them. Constantly keep an eye on them to get a picture of which values are normal, and which are indeed problem areas and require an alert to be triggered.

    The beauty about these tools is that you can throw any metric at them you can think of. They can even be used to collect business level data, utilizing the existing graphing and even alerting capabilities.

    Log Files

    The much dreaded log file won't go out of style for a long time, that's for sure. Your web server, your database, your Rails application, your application server, your mail server, all of them dump more or less useful information into log files. They're usually the most immediate and up-to-date view of what's going on in your application, if you choose to actually log something. Rails applications traditionally seem to be less of a candidate here, but your background services sure are, or any other service running on your servers. The log is the first to know when there are problems delivering email or your web server is returning an unexpected number of 500 errors.
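
    As a small example of pulling that kind of information out of a log, here's a Ruby sketch that counts 5xx responses in web server access log lines. It assumes the common/combined log format, where the status code follows the quoted request line. The sample lines are made up.

    ```ruby
    # Matches the status code right after the quoted request line
    # (assumes common/combined log format).
    STATUS_PATTERN = /" (\d{3}) /

    # Counts the lines whose status code starts with 5.
    def count_server_errors(lines)
      lines.count do |line|
        status = line[STATUS_PATTERN, 1]
        status && status.start_with?('5')
      end
    end

    # Made-up access log lines for illustration.
    lines = [
      '127.0.0.1 - - [29/Nov/2010:10:00:00 +0100] "GET / HTTP/1.1" 200 512',
      '127.0.0.1 - - [29/Nov/2010:10:00:01 +0100] "GET /instances HTTP/1.1" 500 120',
      '127.0.0.1 - - [29/Nov/2010:10:00:02 +0100] "POST /deploy HTTP/1.1" 503 0'
    ]
    puts count_server_errors(lines) # => 2
    ```

    In practice you'd feed it the lines of the real log file, e.g. via File.foreach, and compare the count per time window against what's normal for your site.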

    The biggest problem however is aggregating the log data, centralized logging if you will. syslog and all the alternative tools are traditionally sufficient, while on the larger scale end you have custom tools like Cloudera's Flume or Facebook's Scribe. There's also a bunch of paid services specializing in logging, the most noteworthy being Splunk and Loggly. Loggly relies on syslog to collect and transmit data from your servers, but they also have a custom API to transmit data. The data is indexed and can easily be searched, which is usually exactly what you want to do with logs. Think about the last time you grepped for something in multiple log files, trying to narrow down the data found to a specific time frame.

    There's a couple of open source tools available too, Graylog2 is a syslog server with a MongoDB backend and a Java server to act as a syslog endpoint, and a web UI allowing nicer access to the log data. A bit more kick-ass is logstash which uses RabbitMQ and ElasticSearch for indexing and searching log data. Almost like a self-hosted Loggly.

    When properly aggregated, log files can show trends too, but aggregating them gets much harder the more log data your infrastructure accumulates.

    ZOMG! So much monitoring, really?

    Infrastructure purists would start by saying that there's a difference between monitoring, metrics gathering and log files. To me, they're a similar means to a similar end. It doesn't exactly matter what you call it, the important thing is to collect and evaluate the data.

    I'm not suggesting you need every single kind of logging, monitoring and metrics gathering mentioned here. There is however one reason why eventually you'll want to have most if not all of them. At any incident in your application or infrastructure, you can correlate all the available data to find the real reason for a downtime, a spike or slow queries.

    For example, your site's performance is becoming sluggish in certain areas, users start complaining. Application level monitoring indicates specific actions taking longer than usual, pointing to a specific query. Server monitoring for your database master indicates an increased number of I/O waits, usually a sign that too much data is read from or written to disk. Simplest reason could be an index missing or that your data doesn't fit into memory anymore and too much of it is swapped out to disk. You'll finally be looking at MySQL's slow query log (or something similar for your favorite database) to find out what query is causing the trouble, eventually (and hopefully) fixing it.

    That's the power of monitoring, and you just can't put any price on a good setup that will give you all the data and information you need to assess incidents or predict trends. And while you can set up a lot of this yourself, it doesn't hurt to look into paid options. Managing monitoring yourself means managing more infrastructure. If you can afford to pay someone else to do it for you, look at some of the mentioned services, which I have no affiliation with, I just think they're incredibly useful.

    Even being an infrastructure enthusiast myself, I'm not shy of outsourcing where it makes sense. Added features like SMS alerts, iPhone push notifications should also be taken into account. Remember that it'd be up to you to implement all this. It's not without irony that I mention PagerDuty. They sit on top of all the other monitoring solutions you have implemented and just take care of the alerting, with the added benefit of on-call schedules, alert escalation and more.

  2. A Simple Redis Use Case for Sorted Sets

    Over at Scalarium we constantly find ourselves adding new statistics to track specific parts of the system. I thought it'd be a good idea to share some of them, and how we're using Redis to store them.

    Yesterday I was looking for a way to track the time it takes for an EC2 instance to boot up. Booting up in this case means, how long it takes for the instance to change from state "pending" to "running" on EC2. Depending on utilization and availability zone this can take anywhere from 30 seconds to even 30 minutes (us-east, I'm looking at you). I want to get a feel for how long it takes on average.

    We poll the APIs every so many seconds, so we'll never get an exact number, but that's fine. It actually makes the tracking easier, because the intervals are pretty fixed, and all I need to do is store the interval and increment a number.

    Sounds like a job for a sorted set. We could achieve similar results with a hash structure too, but let's look at the sorted set nonetheless, because it's pre-sorted, which suits me well in this case. For every instance that's been booted up I simply store the interval and increment the number of instances.

    In terms of a sorted set, my interval will be the member in the sorted set and the number of instances falling into that particular interval will be the score, the value determining the member's rank. Advantage here is that the set will automatically be sorted by the number of instances in that particular interval, so that e.g. the interval with the most instances always comes first.

    We don't need anything to get started, we just have to increment the score for the particular interval (or member), in this case 60 seconds. Redis will start from zero automatically. I'll use the Redis Ruby library for brevity.

    redis.zincrby('instance_startup_time', 1, 60)
    

    Another instance took 120 seconds to boot up, so we'll increment the score for that interval too.

    redis.zincrby('instance_startup_time', 1, 120)
    

    After some time we have added some good numbers to this sorted set, and we can start keeping an eye on the top five.

    redis.zrevrange('instance_startup_time', 0, 4, :with_scores => true)
    # => ["160", "22", "60", "21", "90", "10", "120", "10", "40", "5"]
    

    The default sort order is ascending in a sorted set, hence we'll get a reverse range (using the zrevrange command) of the five intervals with the highest score, i.e. where the most instances fall into.
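
    Since I wanted a feel for the average, here's a small plain-Ruby sketch of how the reply can be turned into that number. It assumes the flat member/score array shown above (newer versions of the Ruby library may return pairs instead), and the method names are made up for the example.

    ```ruby
    # Turns ["160", "22", "60", "21", ...] into [[160, 22], [60, 21], ...].
    def pair_up(flat_reply)
      flat_reply.each_slice(2).map { |member, score| [member.to_i, score.to_i] }
    end

    # Weighted average: every interval counts as often as instances fell into it.
    def average_startup_time(histogram)
      total = histogram.sum { |_, count| count }
      return nil if total.zero?

      histogram.sum { |interval, count| interval * count }.to_f / total
    end

    reply = ["160", "22", "60", "21", "90", "10", "120", "10", "40", "5"]
    histogram = pair_up(reply)
    puts histogram.inspect
    # => [[160, 22], [60, 21], [90, 10], [120, 10], [40, 5]]
    puts average_startup_time(histogram).round(1)
    # => 104.1
    ```

    Note that this only averages over the five intervals in the reply. For the real average you'd fetch the whole set, using 0 and -1 as the range.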

    To get the number of instances for a particular interval, we can use the zscore command.

    redis.zscore('instance_startup_time', 60)
    # => 21
    

    To find the rank in the sorted set for a particular interval, e.g. to find out if it falls into the top five intervals, use zrevrank.

    redis.zrevrank('instance_startup_time', 160)
    # => 0
    

    Now we want to find the intervals where a particular number of instances fall into, say everything from 10 to 20 instances. We can use zrangebyscore for this purpose.

    redis.zrangebyscore('instance_startup_time', 10, 20, :with_scores => true)
    # => ["120", "10", "90", "10"]
    

    Note that Redis has some nifty operators where you can e.g. ask for every interval that has 10 or more instances, using the +inf operator, useful when you don't know the highest score in the sorted set.

    redis.zrangebyscore('instance_startup_time', 10, '+inf', :with_scores => true)
    # => ["120", "10", "90", "10", "60", "21", "160", "22"]
    

    Now you want to sort the sorted set by the interval, e.g. to display the numbers in a table. You can use the sort command to sort the set by its elements, but unfortunately there doesn't seem to be a way to get the scores in the same call.

    redis.sort('instance_startup_time')
    # => ["20", "40", "60", "90", "120", "160"]
    

    To make up for this you could iterate over the sorted members and fetch their scores in one go using the multi command.

    members = redis.sort('instance_startup_time')
    redis.multi do
      members.each do |member|
        redis.zscore('instance_startup_time', member)
      end
    end
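
    The replies of the multi block come back in the same order as the members, so the two lists can be zipped together to build the table. A plain-Ruby sketch, with made-up score replies standing in for what Redis would return:

    ```ruby
    # Pairs each member with its score and converts both to integers.
    def interval_table(members, scores)
      members.zip(scores).map { |interval, count| [interval.to_i, count.to_i] }
    end

    members = ["20", "40", "60", "90", "120", "160"]
    # Hypothetical replies from the multi block; the score for "20" is invented.
    scores = ["3", "5", "21", "10", "10", "22"]

    interval_table(members, scores).each do |interval, count|
      puts format('%4d s  %3d instances', interval, count)
    end
    ```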
    

    So far we've stored all numbers in one big sorted set, which will grow over time, making the statistical numbers very broad and less informative. Suppose we want to store daily metrics and then run the numbers weekly and monthly. We just use a different key derived from the current date.

    today = Date.today.strftime("%Y%m%d")
    redis.zincrby("instance_startup_time:#{today}", 1, 60)
    

    Suppose we have collected data in the last two days. Thanks to zunionstore we can add the two sets together. Assume you have data from all days of the week, then you can use zunionstore to accumulate that data and store it with a different key.

    redis.zunionstore('instance_startup_time:week49',
                      ['instance_startup_time:20101129', 'instance_startup_time:20101130'])
    

    This will create a union of the sorted sets for the two consecutive days. The neat part is that it will aggregate the data of the elements in the sets. So if on the one day 12 instances took 60 seconds to start and on the second 15, Redis will create the sum of all the scores. Neat, huh? What you get is a weekly aggregate of the collected data, and of course it's easy to create monthly data as well.
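
    Typing out seven daily keys by hand gets old quickly. Here's a small sketch that builds the list of keys for a range of days, using the same %Y%m%d format as the daily counter. The helper name is made up, and the result is what you'd pass to zunionstore as the list of source keys.

    ```ruby
    require 'date'

    # Builds the daily keys for `days` consecutive days starting at `first_day`.
    def daily_keys(prefix, first_day, days)
      (0...days).map do |offset|
        "#{prefix}:#{(first_day + offset).strftime('%Y%m%d')}"
      end
    end

    puts daily_keys('instance_startup_time', Date.new(2010, 11, 29), 2).inspect
    # => ["instance_startup_time:20101129", "instance_startup_time:20101130"]
    ```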

    Instead of summing up the scores you could also store the maximum or minimum across all the sets.

    redis.zunionstore('instance_startup_time:week49',
                      ['instance_startup_time:20101129', 'instance_startup_time:20101130'],
                      :aggregate => 'max')
    

    Of course you could save the extra union and just create counters for days, weeks and months in one go, but that wouldn't give me much material to highlight the awesomeness of sorted set unions now, would it?
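
    To make the aggregation modes a bit more tangible, here's a plain-Ruby model of what the union does to the scores, with hashes standing in for the sorted sets. It only illustrates the semantics and doesn't talk to Redis. Apart from the 12 and 15 instances from the example above, the daily numbers are invented.

    ```ruby
    # Merges member => score hashes the way a sorted set union combines scores.
    def union_scores(*sets, aggregate: :sum)
      sets.reduce({}) do |result, set|
        result.merge(set) do |_member, left, right|
          case aggregate
          when :sum then left + right
          when :max then [left, right].max
          when :min then [left, right].min
          else raise ArgumentError, "unknown aggregate: #{aggregate}"
          end
        end
      end
    end

    # 12 and 15 instances took 60 seconds; the other entries are made up.
    first_day  = { 60 => 12, 120 => 3 }
    second_day = { 60 => 15, 90 => 4 }

    puts union_scores(first_day, second_day).inspect
    # => {60=>27, 120=>3, 90=>4} (Ruby 3.4 and later print 60 => 27)
    puts union_scores(first_day, second_day, aggregate: :max).inspect
    # => {60=>15, 120=>3, 90=>4}
    ```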

    You could achieve a similar data structure by using hashes, but you can do some neat things on sorted sets that you'd have to implement manually with hashes. Sorted sets are pretty neat when you need a weighted counter, e.g. download statistics, clicks, views, presorted by the number of hits (scores) for the particular element.

  3. Why Riak Search Matters...

    The awesome dudes at Basho released Riak 0.13 and with it their first version of Riak Search yesterday. This is all kinds of exciting, and I'll tell you why. Riak Search is (way down below) based on Lucene, both the library and the query interface. It mimics the Solr web API for querying and indexing. Just like you'd expect from something coming out of Basho, you can add and remove nodes at any time, scaling up and down as you go. I've seen an introduction on the basics back at Berlin Buzzwords, and it was already shaping up to be nothing but impressive. But enough with all the praise, why's this stuff exciting?

    • The key/value model is quite restrictive when it comes to fetching data by, well, anything other than a key. Keeping reverse lookup indexes was one way to do it, but the consistency model of Riak made it hard if not impossible to maintain a consistent list of interesting entries in an atomic way.

      Riak Search fills this gap (and not only for Riak, the key/value store, but for any key/value store if you will) by offering something that scales up and down in the same way as Riak, so you don't have to resort to e.g. Redis to maintain reverse lookup indexes.

      Run queries in any way you can think of, fetch ranges, groups, you name it, no need to do anything really. It even integrates directly with Riak through pre-commit hooks.

    • It's based on proven technology (Lucene, that is). It doesn't compete with something entirely new, it takes what's been worked on and constantly improved for quite a while now, and raises it onto a new foundation to make it scale much nicer, the foundation being Riak Core, Riak KV and Bitcask, and some new components developed at Basho.

    • It uses existing interfaces. Imagine just pointing your search indexing library to a new end point, and there you go. Just the thought of that makes me teary. Reindex data, reconfigure your clients to point to a new endpoint, boom, there's your nicely scalable search index.

    • Scaling Solr used to be awkward. Version 1.5 will include some heavy improvements, but I believe the word shard fell at some point. Imagine a Solr search index where you can add and remove nodes at any time, the index rebalancing without requiring manual intervention.

      Sound good? Yeah, Riak Search can do that too.

    Remember though, it's just a first release, which will be improved over time. I for one am just happy they finally released it, I almost crapped my pants, it's that exciting to have something like Riak Search around. And I say that with all honesty and no fanboyism whatsoever. Having used Solr quite a lot in the past I'm well aware of its strengths and weaknesses and the sweet spot Riak Search hits.

    I urge you to play with it. Installing it and feeding it with data could not be easier. Well done, Basho!

    Update: From reading all this you may get the impression that Riak Search builds heavily on a Lucene foundation. That's not the case. When I say that it builds on top of Lucene, I actually meant that it can and does reuse its analyzers and query parsing. Both can be replaced with custom (Erlang) implementations. That's the only part of Lucene that is actually used by Riak Search, because why reinvent the wheel?

  4. Be Humble, and Get Shit Done!

    I had the honor of speaking at JAOO, sorry GOTO, this year. Being part of so many great speakers, like James Gosling, Rich Hickey, Martin Fowler, Tim Bray, Michael Nygard, and Dan Ingalls (maker of several Smalltalk versions), made me feel nothing but humble, but not in a bad way. I talked about CouchDB, and if you care for it, check out my slides. This is my take away from the conference.

    Be Humble

    My point here is not to make myself look like someone who's unimportant, though I'm not important either. I'm humble, that's all. At the speaker dinner on Wednesday night I sat at a table with John Allspaw (Flickr/Etsy), Tom Preston-Werner (GitHub), Andy Gross (Basho), and Mike Malone (SimpleGeo). I knew some of these guys before, and talked in one way or the other, but this time was different. First of all, they're an incredibly smart bunch. Smarter than I'll probably ever be. Which is not a bad thing, because if anything it's a motivation to constantly improve myself, to never stop learning.

    They shared stories from all the places they've worked, not gossip stories, but more stories on problems they solved and how they solved them. That just fascinated me. I could've sat there for hours, just listening to stories from how they did and do operations, how they handled certain problems, and all that at a scale that's usually way out of my league. I'm usually not a quiet person, but it's times like these where I can just sit and listen.

    The problem I realized at some point, though, was that in Germany this culture of sharing simply doesn't exist. People don't talk much about operations, how they solve specific problems, the really interesting stuff. People talk about tools, languages, Amazon Web Services, all that stuff, but not how they go about solving real life problems, at any scale. It's sort of sad, and I'm trying to come up with ideas on how to change that. Maybe it even happens, but outside of my usual circles. Other people from around here agree with me though, so I guess I'm not the only one thinking this way.

    Because I just felt lucky being able to hear what they had to say. I love hearing these stories. There's a lot to gain from them, sometimes even more than just reading books (which you should still do of course). In a group I much prefer being the humblest in the band, and to just listen, observe and learn. I love getting new ideas, new motivation and energy out of them. The motivation, together with a very specific track, led to another realization.

    Get Shit Done!

    Every day there was one track at JAOO dealing with Scrum, Agile, Kanban, Devops, Lean, Continuous Something, you name it. I have a rather specific opinion on these topics, which I won't go into right here. I just find the amount of talk on the subjects ridiculous.

    Which brings me right to the subject. Instead of talking about agile processes, or whatever kind of process, just get shit done. The secret to being a great coder, operations guy, or even writer is not to talk about becoming one, it's to just start writing. Or, as Tom Preston-Werner put it: Innovate, Execute, Iterate.

    Talking about process won't get you anywhere. Pick what works for you and move on. If it doesn't work, reconsider specifically what doesn't, and improve. Don't blame the process. If shit doesn't get done, you have only yourself to blame. This realization is not exactly new, but it blows my mind how much time people spend talking about getting things done, instead of actually doing them. So here's the only tip I'll give you: get shit done. Working in a startup, which I just so happen to do, this is the only thing that matters.

    My personal take-away from JAOO/GOTO, even though it's not even directly related to the conference itself but the stuff I experienced around it: Be humble, and get shit done.

  5. Why I Love and Hate Distributed Systems

    Let me go ahead and say it: I love distributed systems. Why? Simply because they bend my brain. Yesterday I tweeted "Distributed databases are my happy place." One response I got was along the lines of: "then you're probably not running a distributed database in production." Busted! But does it matter? We all love distributed stuff, we love thinking about scaling. They seem like problems everyone wants to have and solve.

    But the truth is, I don't, and I can assure you, you don't want to either. Sometimes I doubt my brain is even capable of properly solving these problems, but that doesn't prevent me from trying. I prefer to work on as small a scale as possible, you could even say I hate distributed systems. Scaling and distribution is a problem most of us don't have, and are probably better off not having.

    Truth be told, I'm not highly interested in running highly distributed systems in production, quite the opposite. I prefer maxing out what I have as far up as possible. Sometimes I do take the plunge and just try something new in production, but I'm happily prepared to replace it with something different, even something simpler, if that seems like the better option in the end. Everyone should experiment at some point, but not all the time.

    But why then do I love distributed systems? Simply because they make me think about how they could be put to use, what the algorithms and problems involved are, and what implications they would have on a production system, both from an operations and developer perspective. That's where the value is for me, it allows me to simply make informed decisions when the time comes.

    Take Riak, for example, on which I gave a shortish talk at yesterday's meet-up of the local Ruby brigade. Riak's distribution model is based on Amazon's Dynamo implementation, with some neat features sprinkled on top. Riak is built by a bunch of really, really smart guys at Basho, whose work I have nothing but respect for, but who also are sane and open enough to tell people when their database may or may not be a good fit (something a certain other database is severely lacking).

    Riak is exciting for me because it was the first database that really made me dive into Amazon's Dynamo, and once I started grokking it, it blew my mind. If you haven't read it, please do. It blew my mind simply because it introduced me to a whole new thinking, to heavily distributed storage, with all the potential hot spots, downsides and business use cases for specific parts of it thrown in. The same is true for Google's BigTable. The technologies involved with both are true mind-benders.

    And there's my bottom line. Distributed systems aren't necessarily awesome just because they allow scaling to infinite heights (exaggeration intended), but because they broaden your personal horizon. It's like learning new programming languages. It's about getting new ideas in your head, ideas outside of your everyday working realm. Ideas you can maybe even take back to what you're working on and start applying them where it makes sense, and only if it makes sense. Learning about distributed systems is not just about learning how to use them, but when. Knowing is half the battle.

    While you're at it, check out Evan Weaver's "Distributed Systems Primer", a collection of papers on distributed systems, or the papers collection over at NoSQL Summer. Get ready to have your mind blown in whole new ways. Say what you will, that stuff is just fascinating. It appeals to the distributed database lover in me.

  6. An Inconvenient Caveat about MongoDB's Replica Sets (updated)

    Update: Read the comments and below. The issue is not as bad as it used to be in the documentation and the original design, thankfully.

    A lot has happened since I first wrote about MongoDB back in February. Replica Pairs are going to be deprecated and replaced by Replica Sets, and there's a working Auto-Sharding implementation, including rebalancing of shards, and lots more, all neatly wrapped into the 1.6 release.

    The initial draft on how they'd turn out sounded good, but something struck me as odd, and it is once again one of these things that tend to be overlooked in all the excitement about the new features. Before we dive any deeper, make sure you've read the documentation, or check out this rather short introduction on setting up a Replica Set. I won't go into much detail on Replica Sets in general, I just want to point out one major issue I've found with them. Part of the documentation sheds some light on the inner workings of Replica Sets. It's not exhaustive, but to me it's more interesting than the rest of the documentation.

    One part struck me as odd, the paragraph on resyncing data from a new primary (as in master). It's two parts actually, but they pretty much describe the same caveat:

    When a secondary connects to a new primary, it must resynchronize its position. It is possible the secondary has operations that were never committed at the primary. In this case, we roll those operations back.

    Also:

    When we become primary, we assume we have the latest data. Any data newer than the new primary's will be discarded.

    Did you notice something? MongoDB rolls operations back that were never committed to the primary, discarding the updated data, which is just a fancy term for silently deleting data without further notice. Imagine a situation where you just threw a bunch of new or updated data at your current master, and the data has not yet fully replicated to all slaves, when suddenly your master crashes. According to the protocol the node with the most recent oplog entries takes over the primary's role automatically.

    When the old master comes back up, it needs to resynchronize the changes from the current master before it can play any role in the set again, no matter if it becomes the new primary, or sticks to being a secondary, leaving the new master in place. During that resync it discards data that has not been synchronized to the new master yet. If the oplog on the new master was behind a couple of dozen entries before the old one went down, all that data is lost. I repeat: lost. Think about that.
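
    To make the scenario concrete, here's a minimal sketch of it as a toy model in Python. It is not MongoDB code and makes no claim about how MongoDB implements resynchronization; the Node class, the resync_from method and the write-N entries are all made up for illustration. It only shows the effect described above: whatever the old master had that the new master never saw is dropped during resync.

```python
"""Toy model of the failover scenario described above (not MongoDB code)."""


class Node:
    def __init__(self, name):
        self.name = name
        self.oplog = []  # ordered list of operations this node has applied

    def resync_from(self, primary):
        """Roll back everything the new primary never saw, then catch up."""
        # Length of the common prefix shared by both logs.
        common = 0
        for mine, theirs in zip(self.oplog, primary.oplog):
            if mine != theirs:
                break
            common += 1
        discarded = self.oplog[common:]
        self.oplog = list(primary.oplog)
        return discarded


old_primary = Node("a")
secondary = Node("b")

# Five writes hit the primary, but only the first three replicate in time.
for i in range(5):
    old_primary.oplog.append(f"write-{i}")
secondary.oplog = old_primary.oplog[:3]

# The primary crashes, the secondary takes over and accepts a new write.
new_primary = secondary
new_primary.oplog.append("write-5")

# The old primary comes back and resynchronizes as a secondary.
lost = old_primary.resync_from(new_primary)
print(lost)
```

    The two writes that never made it to the secondary are exactly the ones that end up in the discarded list.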

    There are ways to reduce the pain, and I appreciate that they're mentioned appropriately in the documentation. You can tell MongoDB to consider a write successful once it has replicated to a certain number of secondaries. But you have to wait until that has happened, polling getLastError() for the state of the last operation. Or you could set maxLag accordingly, so that the master will fail or block a write until the secondaries catch up with the replication, though I couldn't for the life of me figure out (using the Googles) where and how to set it.

    But I don't approve of this behavior as a default, nor of the fact that you need to go through the internals to find out about it. Everything else suggests that there's no point of failure in a MongoDB setup using sharding and Replica Sets, even comparing it to the Dynamo way of guaranteeing consistency, which it simply isn't when the client has to poll for a successful write.

    It's one of those things that make me reconsider my (already improved) opinions on MongoDB all over again, just when I started to warm up with it. Yes, it's wicked fast, but I simply disagree with their take on durability and consistency. The tradeoff (as in: losing data) is simply too big for me. You could argue that these situations will be quite rare, and I would not disagree with you, but I'm not fond of potentially losing data when they do happen. If this works for you, cool! Just thought you should know.

    Update: There have been some helpful comments by the MongoDB folks, and there's good news. Data is not silently discarded in 1.6 anymore. Apparently it's stored in some flat file, fixed with this issue, though it's hard for me to say from the commits what exactly happens. The documentation does not at all reflect these changes, but improvements are on the way. I'm still not happy about some of the design decisions, but they're rooted in the way MongoDB currently works, and changing that is unlikely to happen. At least losing data doesn't seem to be an option anymore. If making a bit of a fool out of myself helped to improve on the documentation front, so be it. I can live with that.

  7. 10 Annoying Things About CouchDB

    Hi, I'm Mathias, and I'm a CouchDB user. I've been using it for almost a year now, and we have a project using it in production, with a side of Redis. I think it's an awesome database, some of its features are simply unrivaled. Offline replication, CouchApps, to name a few. CouchDB just hit version 1.0. It's been a long time coming, with CouchDB having probably one of the longest histories in the non-relational database space. I first heard about it back in September 2008, when Jan Lehnardt talked about it at a local co-working space. I still blame him for getting me all excited about this whole NoSQL thing. Fun fact: I bookmarked the CouchDB website back in February 2008.

    The features being added to it with every release are nothing short of exciting. CouchDB 0.11 got filtered replication, support for URL rewriting and vhosts, amongst other things. But there's still some things that annoy me, that somewhat bug me in my daily work with it.

    The following things are not incredible pet-peeves I have with CouchDB. I think CouchDB is pretty awesome, and I really like using it. However, it doesn't come without the occasional oddity that will leave you scratching your head. These probably aren't the only things to be aware of, they're just the most annoying to me. Your mileage may vary. They may or may not be annoying to you, but they're things that are good to know when working with CouchDB. Whether CouchDB should or should not have what I'm listing here is a whole different story. It's my wishlist of improvements, if you will.

    It's also stuff you're buying into when you move off the beaten path of relational databases. As always, some of these are not hard to find out, some of them do only get really annoying once you're moving into production, or when you get a deeper knowledge of the tool at hand. Nothing specific to CouchDB here, but some of the issues listed below stem from actively using it. Take them with a grain of salt. While they may seem annoying at first, they're things you can live with. Believe me, you can.

    Views are updated on read access

    You can dump in as many documents as you want, and you can create as many map/reduce views as you want. The truth is, they'll only all come together to slow down your application when you're querying the view. Assume you have a good stash of documents in your database, and you decide you need a new view on your data. Throw in the JavaScript functions and go ahead and query the view. Calling it a slow-down may be a stretch at times though, it really depends on how often your data is updated.

    CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on the first read. Depending on how many documents you have in your database, that can take a while, putting a good work load on your database.

    On every subsequent read, CouchDB will check if documents have changed since the view was last updated, and throw the changed documents at the map and reduce functions. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read. A way around this would of course be to keep your views warm by reading them regularly, e.g. through a cron job.

    When you add new views, be sure to pre-warm them before you first access them in your application. One way would be to add the views at a time when your database isn't accessed as much. Building a view doesn't block all access to the documents, but it sure has a certain impact on your database's performance, and of course the first requests may time out because CouchDB is still building the requested views in the background.

    When it comes to just updating a view, and it might take too long, you can set the parameter stale=ok. That way, even if the view data needs to be updated, CouchDB won't update it and just return the last known state of the view's B-tree.
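
    The behavior described above can be sketched as a toy model. This is a minimal sketch in Python, assuming a single map function and ignoring reduce, deletions and the actual B-tree; the LazyView class and its stale_ok flag are hypothetical names that merely mimic the effect of update-on-read and of stale=ok.

```python
class LazyView:
    """Toy index that is only brought up to date when somebody queries it."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.index = {}        # doc id -> emitted key
        self.indexed_seq = 0   # number of changes already folded into the index

    def query(self, changes, stale_ok=False):
        """changes is the database's append-only list of (doc_id, doc)."""
        if not stale_ok:
            # Pay for every change made since the previous non-stale read.
            for doc_id, doc in changes[self.indexed_seq:]:
                self.index[doc_id] = self.map_fn(doc)
            self.indexed_seq = len(changes)
        return sorted(self.index.items(), key=lambda item: item[1])


changes = [("a", {"name": "zoe"}), ("b", {"name": "adam"})]
by_name = LazyView(lambda doc: doc["name"])

first = by_name.query(changes)                  # builds the index on first read
changes.append(("c", {"name": "mia"}))
stale = by_name.query(changes, stale_ok=True)   # skips the update entirely
fresh = by_name.query(changes)                  # catches up, and pays for it
```

    The longer the gap between two non-stale reads, the more changes the next read has to work through, which is why a regular warm-up read keeps the cost per read small.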

    That's all fun and giggles, but when on earth are you supposed to actually update your view? Always reading stale data isn't great either. I've gotten some odd suggestions when I complained about this elsewhere, but in the end I just want to tell the database that I'm okay with stale data, but that it should update the view in the background.

    No automatic compaction

    As your database grows and data gets updated, CouchDB leaves old and stale data untouched, appending new data (inserted and updated documents are considered new data) to the end of its database files, a fact that's also true for view files. That has the neat advantage that you can still access old revisions of your documents, but it will also leave your database files growing constantly. Now, depending on the number of documents and updates on them, that might not be a big deal, but it's a good idea to start regular compaction sooner rather than later.

    Riak's Bitcask file backend has a neat way of automatically compacting its files. It appends data in a similar manner as CouchDB, but can determine if a node in the cluster can run compaction on its data, and do so automatically, without much need for human intervention. It'd be nice to have something similar as part of CouchDB without having to run cron jobs to do that.

    The append-only mechanism makes CouchDB bullet-proof, no doubt. You'll always have consistent data files on your hard disk, and backups are as simple as copying the files elsewhere, or taking an EBS volume snapshot at any time. But that level of data consistency comes with a price, and that's an ever-growing data file.
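
    Here's a minimal sketch of why append-only files grow and what compaction buys you. It's a toy in-memory model in Python, not CouchDB's file format; AppendOnlyStore and its methods are made-up names, and the model assumes compaction keeps only the latest entry per document.

```python
class AppendOnlyStore:
    def __init__(self):
        self.log = []  # every write is appended, nothing is overwritten

    def put(self, doc_id, doc):
        self.log.append((doc_id, doc))

    def get(self, doc_id):
        # The newest entry for an id wins, so scan from the end.
        for entry_id, doc in reversed(self.log):
            if entry_id == doc_id:
                return doc
        return None

    def compact(self):
        """Rewrite the log so it keeps only the latest entry per document."""
        latest = {}
        for doc_id, doc in self.log:
            latest[doc_id] = doc
        self.log = list(latest.items())


store = AppendOnlyStore()
for count in range(1000):
    store.put("counter", {"value": count})  # 1000 updates to one document
store.put("other", {"value": "x"})

size_before = len(store.log)
store.compact()
size_after = len(store.log)
print(size_before, size_after)
```

    Two documents, but over a thousand log entries before compaction. The flip side is visible too: after compacting, the old revisions are gone.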

    No partial updates

    Whenever you update a document in CouchDB, you update it as a whole, there's nothing in between. That kind of makes sense with the way CouchDB works, but as a user it annoys me from time to time. It seems so pointless fetching and sending a whole document when I'm just updating one attribute. There's a neat RFC for the PATCH command in HTTP making the rounds, I'd love to see that end up in CouchDB at some point. No idea how likely that is, the makers of CouchDB have a weird aversion to using diffs to update data.

    Note that I'm not talking about the MongoDB way of setting attributes atomically. I don't need that, because it simply doesn't scale well, especially not with the CouchDB storage model, and you're not updating data in-place like MongoDB. It's more about just being able to send a diff or a minor update than a whole document.
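
    For illustration, this is what updating a single attribute boils down to today: fetch the whole document, change one field, send the whole thing back, revision included. A minimal sketch in Python against a toy in-memory store; DocStore, Conflict and update_attribute are hypothetical names, and revisions are plain integers here rather than CouchDB's real revision strings.

```python
class Conflict(Exception):
    """Raised when a write is based on an outdated revision."""


class DocStore:
    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return dict(self.docs[doc_id])  # a copy, like a fetched JSON body

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        current_rev = current["_rev"] if current else None
        if doc.get("_rev") != current_rev:
            raise Conflict(doc["_id"])
        stored = dict(doc, _rev=(current_rev or 0) + 1)
        self.docs[doc["_id"]] = stored
        return stored["_rev"]


def update_attribute(store, doc_id, name, value):
    """Change one attribute the only way available: rewrite the document."""
    doc = store.get(doc_id)   # fetch everything
    doc[name] = value         # touch one field
    return store.put(doc)     # send everything back


store = DocStore()
store.put({"_id": "user-1", "name": "Mathias", "email": "old@example.com"})
stale_copy = store.get("user-1")
new_rev = update_attribute(store, "user-1", "email", "new@example.com")
```

    A client still holding stale_copy would get a Conflict on its next put, which is the part of the model that makes whole-document updates safe, if not exactly economical.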

    You can somewhat fake this using update handlers (look at the view called "in-place") from CouchDB 0.10 on. It's pretty neat, but it's just not the same.

    No built-in way to scale up

    CouchDB's replication is unrivaled, no doubt. Being able to replicate any database with any other database at any point in time makes CouchDB unique, some say it's the killer feature, and I concur. There's a lot of arguing whether or not that defines CouchDB as being distributed. In the most traditional sense, at least to me, it sure does, but I'm not here to nitpick about that. It's easy to scale out by adding more nodes and setting them up to constantly replicate with each other, make anyone a master or slave as you like. But there's no way to distribute write and read access across a cluster of nodes.

    CouchDB-lounge has been the traditional way of approaching that, but I never really liked it, because it added more components to the infrastructure. Something like that should really be built in. The good news is that Cloudant is planning on open-sourcing their clustering solution Open Cloudant, which will then hopefully become part of CouchDB. A quorum based system for CouchDB would be neat, and it doesn't seem too far away.

    Pagination is awkward

    CouchDB's B-tree is a leaky abstraction, that's the conclusion I came to at some point. It has a pretty big impact on your application's code, and that's not necessarily a bad thing. Suddenly you deal with things like conflicts, or simply updating views on reads. But no other part of your web application will make that as obvious as pagination, a pretty common and natural part of a web application.

    The path of least resistance to get pagination is to use the skip and limit parameters, but it's not recommended, as CouchDB will still be walking the whole B-tree to determine the number of documents that must be skipped before it can collect the ones you're interested in.

    The recommended way to do pagination is a bit awkward if you ask me. There's a good explanation in the CouchDB book, so I'll spare you repeating it here. But be sure to read it, because understanding that takes you half way to understanding the B-tree. It may be awkward, and very different from what you're used to, but that's how the B-tree works. It's not always unicorns and rainbows, sometimes it kinda gets in your way. Trade-offs, meh.

    The simpler alternative would of course be to just use endless pagination, where you let the users just click a more button instead of clicking through the pages, because you know the last document displayed in your list, and the key that was used to fetch it. You simply use that key and the last document's id to step directly into the B-tree where you left off. You need to remember to fetch one additional document, as CouchDB will return the last document too, or you can just skip one document, which is acceptable, as skipping just one leaf in a tree is an operation of predictable performance.
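
    The idea behind the more button can be sketched like this. A minimal sketch in Python, with a sorted list standing in for the view's B-tree and bisect standing in for stepping into the tree; fetch_page and the sample rows are made up. In CouchDB itself you'd pass the remembered key and document id as startkey and startkey_docid.

```python
from bisect import bisect_right


def fetch_page(rows, page_size, last_seen=None):
    """rows is a sorted list of (key, doc_id); last_seen is the final row
    of the previous page, or None for the first page."""
    if last_seen is None:
        begin = 0
    else:
        # Step directly to the position just after the remembered row
        # instead of counting skipped rows from the start.
        begin = bisect_right(rows, last_seen)
    return rows[begin:begin + page_size]


rows = sorted((f"2010/07/{day:02d}", f"doc-{day}") for day in range(1, 8))

page1 = fetch_page(rows, 3)
page2 = fetch_page(rows, 3, last_seen=page1[-1])
page3 = fetch_page(rows, 3, last_seen=page2[-1])
```

    Finding the starting point is a lookup by key, so the cost of fetching page fifty is the same as fetching page two.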

    Range queries are awkward

    To do a range, you have to specify a start and an end key. That's the simple part. It starts getting awkward when your keys get slightly more complex, e.g. when your map function emits arrays. Assume you want to fetch all elements where the first part of the array matches a particular key, and the second part doesn't matter, e.g. when you emitted a timestamp as the second part to keep a natural (in terms of last update for example) order.

    Assume your keys look like this: ['123', '2010/07/21'], that's the key format SimplyStored uses to manage associations between documents. To get the range that only matches the first part of the key, your startkey has to look like this: ['123']. This will match all documents having the above key. If you don't specify an endkey, CouchDB will simply return all documents following that key, so you need to specify an endkey. The recommended way to do that is to use the following format: ['123', {}]. That way you'll get all documents matching the first part of the key, because {} is considered to be greater than any string you may have emitted. See the CouchDB wiki for more details on this technique, called view collation.

    Obviously it's not impossible to do range queries in CouchDB, but it's slightly awkward. It all goes downhill as soon as you want to fetch only a particular subrange of the original one, using startkey_docid or endkey_docid, say for pagination. With the above ranges, they simply don't work. Both need a startkey and endkey that is an exact match. The whole point of the above range query is not to care about the exact start and end key, isn't it?
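
    Why the {} trick works is easier to see with the ordering spelled out. The following is a minimal sketch in Python of a simplified collation order (null, false, true, numbers, strings, arrays, objects). The function names are made up, and strings are compared with Python's plain string comparison, which is a simplification of what CouchDB actually does for strings.

```python
def collation_key(value):
    """Map a JSON-like value to a tuple that sorts in a simplified view order:
    null < false < true < numbers < strings < arrays < objects."""
    if value is None:
        return (0, 0)
    if value is False:
        return (1, 0)
    if value is True:
        return (2, 0)
    if isinstance(value, (int, float)):
        return (3, value)
    if isinstance(value, str):
        return (4, value)
    if isinstance(value, list):
        # Arrays compare element by element; a shorter prefix sorts first.
        return (5, tuple(collation_key(item) for item in value))
    if isinstance(value, dict):
        return (6, tuple((k, collation_key(v)) for k, v in value.items()))
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def key_range(rows, startkey, endkey):
    """Return the (key, doc_id) rows whose key lies between both bounds."""
    low, high = collation_key(startkey), collation_key(endkey)
    return [row for row in rows if low <= collation_key(row[0]) <= high]


rows = [
    (["122", "2010/07/20"], "doc-a"),
    (["123", "2010/07/19"], "doc-b"),
    (["123", "2010/07/21"], "doc-c"),
    (["124", "2010/01/01"], "doc-d"),
]

matches = key_range(rows, ["123"], ["123", {}])
```

    ['123'] sorts before every longer array starting with '123', and ['123', {}] sorts after all of them as long as the second element is a string, so the two bounds bracket exactly the rows you're after.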

    No CommonJS available in MapReduce functions

    With CouchDB 0.11, CommonJS and all its awesomeness became available in view functions. I was pretty excited about it, and I still am. However, map and reduce functions were left out in the cold. Every time I have to write the same piece of JavaScript in a map or reduce function that I've used elsewhere already, I get bitter about that. Sometimes it's just very basic stuff that I could easily solve by throwing an existing library at it, but instead I'm cluttering my view code with it over and over again. And yes, there's the !code placeholder, but it's not about throwing an undebuggable mess of code into my view function, it's about not repeating myself. !code doesn't really solve that problem well enough for me.

    Word is that it's got something to do with determining whether files have updated or not, but hey CouchDB, why don't you let me worry about that and let me tell you when I think a file I've included through CommonJS has been updated? I would very much appreciate that.

    No link-walking between documents

    With CouchDB 0.11, map functions got a way to emit other documents using {_id: doc.other_id}, but that still doesn't allow full access to e.g. attributes of said documents. Sometimes that'd just be handy to have. Sure, you could use embedded documents, but in that case it'd just be a dumb workaround, where I could just have a way to fetch a document by its identifier and throw some of its attributes at the map function.

    Say what you will though, just being able to emit other documents is still pretty cool. Makes querying and fetching associated documents a bit easier.

    All reads go to disk

    CouchDB doesn't cache anything. It does delay commits if you want it to, so that it doesn't hit the disk on every document update, but it sure as heck doesn't cache anything in memory. This is both curse and blessing. It keeps the memory footprint of CouchDB incredibly small, no doubt. Considering they're targeting mobile devices it makes a lot of sense, plus, accessing flash-based storage is a lot cheaper than spinning disks.

    But, on the other hand, when I have the memory available, why not use it? I know caching is a hard problem to solve. CouchDB is also made for high concurrency, no doubt, but my disks aren't necessarily. Sure, I could buy faster disks, but if you really think about it, memory is the new disk, plus, tell Amazon to offer faster network storage for EC2, please do, maybe that'd already help. CouchDB somewhat relies on the file system cache doing its magic to speed up things, but I really don't want to rely on magic. You could put an HTTP-level reverse proxy like Varnish in front of CouchDB though, that'd be a feasible option, but that adds another layer to your infrastructure.

    In all seriousness, I'd love to see some caching introduced in CouchDB. I won't say it's an easy feature to implement, because it sure isn't, but it doesn't need to be something fancy, I just would like to see CouchDB use some of my memory for data that's read more often than it's written. But until then, Varnish to the rescue!

    Error messages are not helping

    I'm just gonna post the following snippet from my CouchDB log file, and leave you to it. You tell me how useful it is. Suffice it to say, I just wish CouchDB would not dump all that Erlang trace into my log, but maybe a useful error message for a change. It works in some cases, but a lot of times, when the problem usually is as simple as a permissions problem, you're left scratching your head.

    {<0.84.0>,supervisor_report,
     [{supervisor,{local,couch_secondary_services}},
      {errorContext,start_error},
      {reason,
          {'EXIT',
              {undef,
                  [{couch_auth_cache,start_link,[]},
                   {supervisor,do_start_child,2},
                   {supervisor,start_children,3},
                   {supervisor,init_children,2},
                   {gen_server,init_it,6},
                   {proc_lib,init_p_do_apply,3}]}}},
      {offender,
          [{pid,undefined},
           {name,auth_cache},
           {mfa,{couch_auth_cache,start_link,[]}},
           {restart_type,permanent},
           {shutdown,brutal_kill},
           {child_type,worker}]}]}}
    

    The End

    There you go, some annoying things about CouchDB. They're annoying, but I still like CouchDB a lot. It's stuff I can live with, it's stuff I can work around, it's stuff that doesn't have as big an effect in production as it may seem. The bottom line is, as always, evaluate your tools. The above list is not to be taken as a list of arguments purely against using CouchDB. Consider them a list of things you need to be aware of, that may or may not be acceptable compared to what you gain.

    In the end, and any way you look at it, CouchDB still kicks butt.

  8. Relational Data, Document Databases and Schema Design

    By now it should be obvious that I'm quite fond of alternative data stores (call them NoSQL if you must). I've given quite a few talks on the subject recently, and had the honor of being a guest on the (German) heise Developer Podcast on NoSQL.

    There are some comments and questions that pop up every time alternative databases are being talked about, especially by people deeply rooted in relational thinking. I've been there, and I know it requires some rethinking, and I'm also quite aware that there are some controversial things that basically are the exact opposite of everything you learned in university.

    I'd like to address a couple of those with some commentary and my personal experience (Disclaimer: my experience is not the universal truth, it's simply that: my experience, your mileage may vary). When I speak of things done in practice, I'm talking about how I witnessed things getting done in Real Life™, and how I've done them myself, both good and bad. I'm focussing on document databases, but in general everything below holds true for any other kind of non-relational database.

    It's easy to say that all the nice features document databases offer are just aiming for one thing, to scale up. While that may or may not be true, it just doesn't matter for a lot of people. Scaling is awesome, and it's a problem everyone wants to solve, but in reality it's not the main issue, at least not for most people. Also, it's not an impossible thing to do even with MySQL, I've had my fun doing so, and it sure was an experience, but it can be done.

    It's about getting stuff done. There's a lot more to alternative databases in general, and document databases in particular, that I like, not just the ability to scale up. They simply can make my life easier, if I let them. If I can gain productivity while still being aware of the potential risks and pitfalls, it's a big win in my book.

    What you'll find, when you really think about it, is that everything below holds true no matter what database you're using. Depending on your use case, it can even apply to relational databases.

    Relational Databases are all about the Data

    Yes, they are. They are about trying to fit your data into a constrained schema, constrained in length, type, and other things if you see fit. They're about building relationships between your data in a strongly coupled way, think foreign key constraints. Whenever you need to add data, you need to migrate your schema. That's what they do. They're good at enforcing a set of ground rules on your data.

    See where I'm going with this? Even though relational databases tried to be a perfect fit for data, they ended up being a pain once that data needed to evolve. If you haven't felt that pain yet, good for you. I certainly have. Tabular data sounds nice in theory, and is pretty easy to handle in Excel, but in practice, it causes some pain. A lot of that pain stemmed from people using MySQL, yes, but take that argument to the guy who wrote it and sold it to people as the nicest and simplest SQL database out there.

    It's easy to get your data into a schema once, but it gets a lot harder to change the schema and the data into a different schema at a later point in time. While data sticks around, the schema evolves constantly. Something relational databases aren't very good at supporting.

    Relational Databases Enforce Data Consistency

    They sure do, that's what they were built for. Constraints, foreign keys, all the magic tricks. Take Rails as a counter-example. It fostered the idea that all that stuff is supposed to be part of the application, not the database. Does it have trade-offs? Sure, but it's part of your application. In practice, that was correct, for the most part, although I can hear a thousand Postgres users scream. There's always an area that requires constraints on the database level, otherwise they wouldn't have been created in the first place.

    But most web applications can live fine without it, they benefit from being free about their data, to shape it in whichever way they like, adding consistency on the application level. The consistency suddenly lies in your hands, a responsibility not everyone is comfortable with. You're suddenly forced to think more about edge cases. But you sure as hell don't have to live without consistent data, quite the opposite. The difference is that you're taking care of the consistency yourself, in terms of your use case, not using a generic one-fits-all solution.

    Relationships between data aren't always strict. They can be loosely linked, what's the point of enforcing consistency when you don't care if a piece of data still exists or not? You handle it gracefully in your application code if you do.

    SQL is a Standard

    The basics of SQL are similar, if not the same, but under the hood, there's subtle differences. Why? Because under the hood, every relational database works differently. Which is exactly what document databases acknowledge. Every database is different, trying to put a common language on top will only get you so far. If you want to get the best out of it, you're going to specialize.

    Thinking in Map/Reduce as CouchDB or Riak force you to is no piece of cake. It takes a while to get used to the ideas around it and what implications it has for you and your data. It's worth it either way, but sometimes SQL is just a must, no question. Business reporting can be a big issue, if your company relies on supporting standard tools, you're out of luck.

    While standards are important, in the end it's important what you need to do with your data. If a standard gets in your way, how is that helpful? Don't expect a standard query language for document databases any time soon. They all solve different types of problems in different ways, and they don't intend to hide that from you with a standard query language. If on the other hand, all you need is a dynamic language for doing ad-hoc queries, check out MongoDB.

    Normalized Data is a Myth

    I learned a lot in uni about all the different kinds of normalization. It just sounded so nice in theory. Model your data upfront, then normalize the hell out of it, until it's as DRY as the desert.

    So far so good. I noticed one thing in practice: Normalized data almost never worked out. Why? Because you need to duplicate data, even in e-commerce applications, an area that's traditionally mentioned as an example where relational databases are going strong.

    Denormalizing data is simply a natural step. Going back to the e-commerce example, you need to store a lot of things separately when someone places an order: Shipping and billing address, payment data used, product price and taxes, and so on. Should you do it all over the place? Of course not, not even in a document database. Even they encourage storing similar data to a certain extent, and with some of them, it's simply a must. But you're free to make these decisions on your own. They're not implying you need to stop normalizing, it still makes sense, even in a document database.

    Schemaless is not Schemaless

    But there's one important thing denormalization is not about, something that's being brought up quite frequently and misunderstood easily. Denormalization doesn't mean you're not thinking about any kind of schema. While the word schemaless is brought up regularly, schemaless is simply not schemaless.

    Of course you'll end up with having documents of the same type, with a similar set of attributes. Some tools, for instance MongoDB, even encourage (if not force) you to store different types of documents in different collections. But here's the kicker, I deliberately used the word similar. They don't need to be all the same across all documents. One document can have a specific attribute, the other doesn't. If it doesn't, just assume it's empty, it's that easy. If it needs to be filled at some point, write data lazily, so that your schema eventually is complete again. It's evolving naturally, which does sound easy, but in practice requires more logic in your application to catch these corner cases.

    So instead of running migrations that add new tables and columns, and in the end pushing around your data, you migrate the data on the next access, whether that's a read or a write is up to your particular use case. In the end you simply migrate data, not your schema. The schema will evolve eventually, but first and foremost, it's about the data, not the constraints they live in. The funny thing: In larger projects, I ended up doing the same thing with a relational database. It's just easier to do and gentler on the load than running a huge batch job on a production database.
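
    Migrating on access can be as small as a function that sits between the database and the rest of your code. A minimal sketch in Python, assuming a made-up user document where a newsletter flag was added later and a single name attribute was split into first_name and last_name; none of these attribute names come from a real schema.

```python
def migrate_on_read(doc):
    """Return the document in its current shape, plus a flag telling the
    caller whether it changed and is worth writing back."""
    upgraded = dict(doc)
    changed = False

    # Attribute added later: older documents simply don't have it yet.
    if "newsletter" not in upgraded:
        upgraded["newsletter"] = False
        changed = True

    # Attribute that was split up: convert the old form when we see it.
    if "name" in upgraded and "first_name" not in upgraded:
        first, _, last = upgraded.pop("name").partition(" ")
        upgraded["first_name"] = first
        upgraded["last_name"] = last
        changed = True

    return upgraded, changed


old_doc = {"_id": "user-1", "name": "Ada Lovelace"}
new_doc, changed = migrate_on_read(old_doc)
same_doc, changed_again = migrate_on_read(new_doc)
```

    Whether you write the upgraded document back right away or wait for the next regular write is up to your use case. Either way the function needs to be safe to run on documents that are already current, which is what the second call checks.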

    No Joins, No Dice

    No document database supports joins, simple as that. If you need joins, you have two options: Use a database that supports joins, or adapt your documents so that they remove the need for joins.

    Documents have one powerful advantage: It's easy to embed other documents. If there's data you'd usually fetch using a join, and that'd be suitable for embedding (and therefore oftentimes: denormalizing), there's your second option. Going back to the e-commerce example: Whereas in a relational database you'd need a lot of extra tables to keep that data around (unless you're serializing it into a single column), in a document database you just add it as embedded data to the order document. You have all the important data in one place, and you're able to fetch it in one go. Someone said that relational databases are a perfect fit for e-commerce. Funny, I've worked on a market platform, and I've found that to be a ludicrous statement. I'd have benefited from a looser data storage several times, joins be damned.
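
    As an illustration of the embedding idea, here's a minimal sketch in Python of an order document that carries its own copy of the address and prices. The place_order function, the field names and the sample data are all made up; prices are integer cents to keep the arithmetic exact.

```python
import copy


def place_order(customer, items, catalog):
    """Build an order document that embeds everything it needs later on."""
    return {
        "type": "order",
        "customer_id": customer["_id"],
        # A snapshot, not a reference: the order keeps the address as it was.
        "shipping_address": copy.deepcopy(customer["address"]),
        "items": [
            {"sku": sku, "quantity": quantity, "unit_price": catalog[sku]["price"]}
            for sku, quantity in items
        ],
    }


catalog = {"book-1": {"price": 1999}, "pen-7": {"price": 250}}
customer = {"_id": "cust-1", "address": {"city": "Berlin", "zip": "10115"}}

order = place_order(customer, [("book-1", 1), ("pen-7", 4)], catalog)

# Later changes to the customer or the catalog don't touch the order.
catalog["book-1"]["price"] = 2499
customer["address"]["city"] = "Hamburg"

total = sum(item["quantity"] * item["unit_price"] for item in order["items"])
```

    Everything needed to display or invoice the order comes back in a single fetch, and the duplication is the point: the order has to remember what was true when it was placed.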

    It's not always viable, sure, and it'd be foolish to stick with a document database if joins are an important criterion for your particular use case. In that case, no dice. It's relational data storage or bust.

    Of course there's secret option number three, which is to just ignore the problem until it's a problem, just by going with a document database and seeing how you go, but obviously that doesn't come without risks. It's worth noticing though that Riak supports links between documents, and even fetching linked documents together with the parent in one request. In CouchDB on the other hand, you can emit linked documents in views. You can't be fully selective about the document data you're interested in, but if all you want is to fetch linked documents, there are one or two ways to do that. Also, graph databases have made it their main focus to make traversal of associated documents an incredibly cheap operation. Something your relational database is pretty bad at.

    Documents killed my Model

    There's this myth that you just stop thinking about how to model your data with document databases or key-value storage. That myth is downright wrong. Just because you're using schemaless storage doesn't mean you stop thinking about your data, quite the opposite, you think even more about it, and in different ways, because you simply have more options to model and store it. Embedding documents is a nice luxury to have, but isn't always the right way to go, just like normalizing the crap out of a schema isn't always the way to go.

    It's a matter of discipline, but so is relational modelling. You can make a mess of a document database just like you can make a mess of a relational database. When you migrate data on the fly in a document database, there's more responsibility in your hands, and it requires good care with regards to testing. The same is true for keeping track of data consistency. It's been moved from the database into your application's code. Is that a bad thing? No, it's a sign of the times. You're in charge of your data, it's not your database's task anymore to ensure it's correct and valid, it's yours. With great power comes great responsibility, but I sure like that fact about document databases. It's something I've been missing a lot when working with relational databases: The freedom to do whatever the heck I want with my data.
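
    As a rough sketch of what migrating on the fly can look like, here's an application-side upgrade step in Python. The schema_version field and the name split are hypothetical. The point is only that the application, not the database, brings old documents up to the current shape and checks them.

```python
CURRENT_VERSION = 2


def migrate(doc):
    """Return a copy of doc upgraded to the current (hypothetical) schema."""
    doc = dict(doc)  # shallow copy, the stored original stays untouched
    version = doc.get("schema_version", 1)
    if version < 2:
        # Version 1 stored a single "name" field, version 2 splits it in two.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"] = first
        doc["last_name"] = last
        doc["schema_version"] = 2
    return doc


def validate(doc):
    """Consistency checks the database won't do for you."""
    if doc.get("schema_version") != CURRENT_VERSION:
        raise ValueError("document was not migrated")
    if not doc.get("first_name"):
        raise ValueError("first_name is required")
    return doc


old_doc = {"_id": "user-1", "name": "Ada Lovelace"}
new_doc = validate(migrate(old_doc))
```

    Running every document through something like migrate when it's read means old and new shapes can live side by side, which is exactly why this kind of code deserves good tests.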

    Read vs. Write Patterns

    I just like including this simply because it always holds true, no matter what kind of database you're using. If you're not thinking about how you're going to access your data with both reads and writes, you should do something about that. In the end, your schema should reflect your business use case, but what good is that when it's awkward to access the data, when it takes joins across several tables to fetch the data you're interested in?

    If you need to denormalize to improve read access, go for it, but be aware of the consequences. A schema is easy to build up and to migrate on the go, but if document databases force you to do one thing, and one thing only, it's to think about how you're reading and writing your data. It's safe to say that you're not going to figure it all out upfront, but you're encouraged to put as much effort into it as you can. When you find out you're wrong down the line, you might be surprised to find that document databases make it even easier to change paths.
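
    Here's a small Python sketch of those consequences, using in-memory dicts as stand-ins for collections. All names are hypothetical. Copying the customer name into each order makes the read a single lookup, but every rename now has to touch every copy.

```python
# Hypothetical in-memory "collections"; a real database would sit behind these.
customers = {"c1": {"name": "Jane Doe"}}
orders = {
    "o1": {"customer_id": "c1", "customer_name": "Jane Doe", "total_cents": 3900},
    "o2": {"customer_id": "c1", "customer_name": "Jane Doe", "total_cents": 1200},
}


def order_summary(order_id):
    """Read path: one lookup, because the customer name is copied into the order."""
    order = orders[order_id]
    return f"{order['customer_name']}: {order['total_cents'] / 100:.2f}"


def rename_customer(customer_id, new_name):
    """Write path: the price of denormalizing is touching every copy."""
    customers[customer_id]["name"] = new_name
    updated = 0
    for order in orders.values():
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name
            updated += 1
    return updated


updated_count = rename_customer("c1", "Jane Smith")
summary = order_summary("o1")
```

    Whether that write cost is acceptable depends on how often you read compared to how often you write, which is the whole point of thinking about both patterns up front.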

    Do your Homework

    Someone recently wrote a blog post on why he went back to MySQL from MongoDB, and one of his reasons was that it doesn't support transactions. While this is a stupid argument to bring up in hindsight, it makes one thing clear: You need to do research yourself, no one's going to do it for you. If you don't want to live up to that, use the tools you're familiar with, no harm done.

    It should be pretty clear up front what your business use case requires, and what tools may or may not support you in fulfilling these requirements. Not all tool providers are upfront about all the downsides, but hey, neither was MySQL. Read up, try and learn. That's the only thing you can do, and no one will do it for you. Nothing has changed here, it's simply becoming more obvious, because you suddenly have a lot more options to work with.

    Polyglot Data Storage

    Which brings me to the most important part of them all: Document databases (and alternative, non-relational data stores in general) are not here to replace relational databases. They're living alongside of them, with both sides hopefully somewhat learning from each other. Your projects won't be about just one database any more, it's not unlikely you're going to end up using two or more, for different use cases.

    Polyglot persistence is the future. If there's one thing I'm certain of, this is it. Don't let anyone fool you into thinking that their database is the only one you'll need, they all have their place. The hard part is to figure out what place that is. Again, that's up to you to find out. People ask me for particular use cases for non-relational databases, but honestly, there is no real distinction. Without knowing the tools, you'll never find out what the use cases are. Other people can just give you ideas, or talk about how they're using the tools, they can't draw the line for you.

    Back to the Future

    You shouldn't think of it as something totally new, document databases just don't hide these things from you. Lots of the things I mentioned here are things you should be doing anyway, no matter if you're using a relational or a non-relational data store. They should be common sense really. We're not trying to repeat what went wrong in history, we're learning from it.

    If there's one thing you should do, it's to start playing with one of the new tools immediately. I shouldn't even be telling you this, since you should hone your craft all the time, and that includes playing the field and broadening your personal and professional horizon. Only then will you be able to judge what use case is a good fit for e.g. a document database. I'd highly suggest starting to play with e.g. CouchDB, MongoDB, Riak or Redis.

    Item Information

    Published
    Contributor
    mmeyer
    Tags
    mysql, hadoop, cassandra, simpledb, azure, hive hbase, nosql, mongodb
    Content Type
    Entry
  9. Vim - A Never-ending Love Story

    About eighteen months ago I wrote about going back to Vim as my daily text editor. It was a bust, and I went back to TextMate after about a week.

    Suddenly it's the year 2010, and I'm typing this in Vim. What happened? My itch was re-scratched if you will. I was weary of some of TextMate's perceived shortcomings, and honestly I missed having a command and insert mode. It may sound stupid, but I really prefer that way of working with text and code. TextMate is still a nice editor, but seeing its development coming to a perceived halt made me realize that Vim is simply forever, not being developed by just one guy, but a community.

    It's also worth mentioning that I simply started from scratch. Last time I built upon a configuration that grew over the years, and that included things about whose purpose I just had no idea. I watched the Smash Into Vim PeepCode too, and started with the clean slate configuration set that comes with it. If you're thinking of getting (back) into Vim, it's highly recommended, it's sure to whet your appetite. There's also a collection of screencasts and a free book on Vim 7 available on the interwebs. I have some useful links in my bookmark collection too.

    There have been a lot of developments around scripts for Vim that bring TextMate-like functionality, or that support things like Cucumber, smart quotes and auto-closing braces, or even the most awesome Git integration you'll find. But the nicest of them all is Pathogen, a script that allows you to keep all your other scripts in separate places, not losing overview of what's installed where, and in which version.

    Coming from TextMate, you're gonna miss the "Go To File" dialog, I'm sure. Check out Command-T, which does exactly that, only with path-matching sprinkled on top. It's not as fast unfortunately, but a lot faster to use than the annoying fuzzy thing I used the last time I tried to live on Vim. There's also PeepOpen, but it always opens files in new tabs, and that can get quite annoying, as new Vim tabs are quite different from Vim buffers. For project views I use NERDtree, though LustyExplorer also seems acceptable.

    As I said, I started from scratch, with a clean slate. So the decent thing to do was to put all my Vim configuration files on GitHub. They include all the scripts I'm using, and my configuration, all neatly separated into different bundles thanks to Pathogen. There's a couple of things that are still a bit wonky. Lusty Juggler doesn't work as advertised all the time, though it's a neat tool, allowing you to quickly select one of a list of the latest open buffers. RubyTest is quite weird, and I'm thinking of dumping it completely, and simply rolling my own commands to run tests based on it. The rails.vim script package does include some support to run tests too, but not to execute a single test case.

    In general, I haven't found anything that works in TextMate that you can't somehow get to work in Vim. Yes, I've used the word somehow. It's not easy as pie all of the time, and it can be different, heck it's a different editor. But I willingly accept that, because as a text editor, I find Vim to be a lot better than TextMate.

    I've been back on Vim for a month now, and I'm not looking back at all. It's like coming back to an old friend and learning what awesome things he's been up to. It's pretty much as exciting as playing with new technologies at the moment. Learning new things can be pretty exciting, even if it's just another text editor. But it's not all fun and giggles. I have some annoyances still, but no editor is perfect. I'm more willing to accept Vim's for the increased text surgeon skills than TextMate's, to be frank. TextMate is still a nice editor, don't get me wrong, my heart just always belonged to Vim.

    Honestly, I'm more willing to invest my learning time in an editor that I know I can use everywhere than one I can only use on the Mac with a running user interface. I'm using Vim on every server I'm managing, so why not on my local machine? Vim makes me think about how I can edit text in the most efficient way possible, and I like that very much. It even made me map my caps-lock key to control, finally!

    Update: Was just tipped off that PeepOpen can be made to behave properly and open files in the current MacVim tab. When you set your MacVim options like in the picture below (notice the part "Open files from applications"), it works a treat. Thanks, Mutwin!

    MacVim Options

  10. Presentation Fu

    I've attended my fair share of conferences this month alone, plus a Seedcamp, and I can safely say that I learned a lot about how to build slides, how to keep the audience engaged, and what one just shouldn't do in a talk or in slides. While I certainly don't claim to be an expert on the topic now, I just wanted to put all of my impressions and lessons learned into a post.

    I'm definitely not the first person to write about this kind of stuff, a year ago Geoffrey Grosenbach wrote on presenting, and just recently John Nunemaker wrote a post on improving your presentations for less than $50. Both are well worth reading, but they don't cover everything I find annoying in presentations, so there you go.

    Slides

    Keep them small

    Seven bullet points per slide is bullshit, that's way too much. One phrase per slide is a decent rule, though I'm not dogmatic about it. One phrase and a couple of short bullet points (not more than four) work from time to time, but not all the time. I usually go for a bigger slide set these days, with less content on each slide.

    I can run through 80 slides in 45 minutes. I know that sounds like a lot, and I certainly go through them fast, but I'd rather give people something to think about than bore them to death. Slides with too much text on them also have the negative effect of distracting the audience. They shouldn't read the slide text, they should be listening to what you have to say. Even if you do talk slowly, less text on slides is always a good idea. People should listen to you, not try to understand what your slides are saying.

    What I usually do is just crank out slides with any text that I'd like to say, and then I go through them one or two times to refine and shorten the phrases I used to be no more than four or five words for the most part. I also throw out slides when I realize they're disrupting the flow or contain things I'm likely to talk about when I'm on a different slide.

    Use a large font

    Just do it. Not only does it make your slides more readable for everyone in the audience, it forces you to keep the information on a single slide short. My headlines are usually 60pt, my subheadings and bullet points around 45pt. The bigger the better.

    While we're talking about fonts, avoid italic. It's a lot harder to read, especially when you mix it with a regular font. If you need to emphasize something, just make it bold. Italic fonts disrupt your slides' flow.

    Avoid full sentences

    Except when you're quoting someone. Short phrases or even just a single word are much easier to grasp for the audience, and they give you a better sense of flow.

    Dark text on a bright background

    A dark background only works for Steve Jobs, because his team does everything they can to adjust the lighting on location for his talk. You, on the other hand, have to assume the worst. If there's just a little too much light coming into the room, your slides will be unreadable when you use a dark background. I've even seen slides where people chose a dark background and just a slightly dark font.

    You have no influence on the lighting in the room, and you'll pretty much just embarrass yourself when your slides are unreadable. There's just no excuse why you shouldn't just use a light background and a dark font.

    Avoid dark photos

    Photos are at a similar risk. The less contrast you have in photos you're using in your preso, the less likely people will be able to see them. I tend to not use a lot of photos in my slides anyway, but I just hate having to say: "Geee, that's a bit hard to see, isn't it?"

    Slides are for the people attending the talk

    Your slide set should not be focussed on being fully understandable by people who have not attended your talk. Otherwise you end up with so-called slideuments, presentations that read like a document. You're talking for the people attending your talk, they probably paid to hear you speak, so focus your energy on giving them a good talk. If you want the rest of the world to know about details of your preso, write a blog post or put it into the presenter notes.

    Video killed the conference star

    I've seen video in presentations quite a few times, and honestly, it bores me to death, especially when there's a voiceover on the video. If you must include video, at least talk yourself, taking the audience through whatever happens on the screen, especially because you don't know how the audio is going to be at the venue. I'm well aware that live demos are a finicky thing, but so is video. You don't always have the luxury of using your own computer to do the presentation.

    Avoid long code snippets

    Code is simply hard to grasp within just a couple of seconds, and it's awkward trying to explain larger chunks of it. Use short snippets instead. If you must include some longer examples, split them up into smaller bits, explaining them one by one. I tend to avoid overly complex code snippets. Trying to explain them properly just takes too much time.

    Avoid flashy animations

    They simply take up valuable time and distract the audience. Even though they're nice to look at in theory, in practice they're the bane of a well-built presentation. This is true for both transitions between slides and elements of a single slide appearing later. Just make them appear, not sparkle or fade in.

    The Talk

    Practice, practice, practice

    I find practicing a talk by speaking to myself awkward. Not because it's embarrassing, but simply because, thanks to the butterflies in my stomach, I always end up saying different things in the actual talk. Now, that's not to say you shouldn't think about what you want to say. I tend to go through my slides several times, going through the things I associate with every single one of them, giving me a rough idea and a line of thought on what I want to say. This definitely is a lot easier to do when it's a topic you've talked about before, but in general the above has worked much better for me.

    Drink, drink, drink

    It's a simple fact that talking a lot makes your mouth run dry. I need about half a liter of water to get through a talk. Or at least I make sure I have that amount ready. Before you run dry and faint in the midst of your talk, drink. It's not a shameful thing to do, it simply keeps you going. Shame on conference organizers not thinking about having drinks ready for their speakers. When in doubt, scout the talks before you and make sure you have a bottle ready should it not be taken care of.

    Look at the audience, not the big screen

    It should be so obvious, yet I've just seen people do it again at Cloud Expo. One of the guy's slides had 14 bullet points on it, and the font probably was too small for him to be able to read it from the laptop screen. Another reason why I keep my slides short: their purpose is to keep me in a flow, to give me short reminders of what I want to talk about.

    Don't read your presenter notes

    If you need presenter notes to run your talk, you need to practice more. They're surely useful for people just looking at your slides, but if it takes full sentences to keep your talk running, you'll end up wasting a lot of time trying to read what your notes say. Talking freely is a challenge, but the earlier you take it on, the faster you'll get used to it. I've seen people use index cards with their presenter notes on them, handwritten, trying to decipher what they've written on them.

    If you know what you're talking about (at least the slightest bit), you'll be fine without them, trust me.

    Two's not a company

    Having more than one speaker is awkward, especially when one of them is just standing there for most of the time, waiting for his turn. Have one up in front at any one time, bring in the next person when it's his turn. Simple as that.

    Don't ask questions

    The audience simply won't answer. If you ask anything, make the audience raise their hands on a topic, but don't expect anyone to answer a specific question. That's your task. Involving the audience sounds like a good idea, but they're lazy, they want to learn something.

    Jokes, tiny bits and stories

    Stories and jokes can really lighten up a presentation. Sure, you shouldn't tell jokes all the time, but something sarcastic thrown in from time to time sure can help to wake up the audience. Stories are even better, people love benefitting from real life experiences in any way. If it has a happy ending, even better.

    Talking slowly is for wimps

    The rule of spending two minutes on a slide is bullshit. It would only mean you'd have seven bullet points on a particular slide. You shouldn't rush through anything, and I certainly try to avoid doing that, and it definitely depends on the topic you're talking about, but when I talk about technical things I expect the audience to be curious about it and try to keep up. If they can't, they can always come back to my slides or ask questions. But as always, it depends.

    Talking fast is for the impatient

    If it's on more generic things that involve higher level topics, or some sort of longer-running workshop, it's only appropriate to walk the people through it and take your time doing so. Usually in these situations it's a lot easier to focus on a single topic. It just depends on how broad your talk's topic is.

    Take tiny breaks

    Should you realize you're sort of losing track, simply bring yourself back on the rails. Take a tiny break or just stop talking. You don't need to apologize for that. It's easy to start blabbering on about a certain topic which you didn't even intend to cover in your talk. On the other hand, that's what makes every talk unique, and is exactly why shorter phrases on slides are so much better. They keep your brain engaged, making up associations with certain things as you go, and they help keep a talk interesting.

    Avoid longer breaks though as people end up being bored, and you're losing precious time. Longer breaks are usually a sign that you're not as prepared as you should be. If you need to switch in between e.g. slides and a live demo, make sure that everything is prepared before the talk.

    Talking in front of others is a challenge, no doubt about it, but there's really no point trying to avoid it, because the only way to improve your skills is to simply talk in front of people. This is my view of the talking world. I constantly try to improve on my slides and think about what I'm doing wrong during talks to improve on that. I'll never lose the excitement right before a talk, and that's a good thing. When it becomes routine, you tend to bore people instead of engaging them. It's about constantly improving yourself to simply become better at talking in front of others.

    This is my view of giving presentations. Feel free to throw in your ideas, or even to disagree. These guidelines probably aren't for everyone, and they might even change for me within just a couple of months, but most of them simply make sense to me. I do need to get me a good remote though, since with my larger slide sets, I find myself hitting the space bar a lot.
