DBPedias

Your Database Knowledge Community

Datastax

  1. What the Growth in Multi-Data Centers Means for Databases and Big Data

    A recent article in InfoWorld contained some interesting statistics about the growth of multi-data center deployments. In its latest poll of data center managers, the Uptime Institute found that 80 percent of respondents have built a new data center or upgraded an existing facility within the past five years. Another study of the North American data center market, done by Digital Realty Trust, found that 92 percent of respondents said their companies will definitely or probably expand their data center space in 2012, the highest percentage reported in six years. This news, coupled with the fact that data centers are primarily put in place to hold (gasp!) data, makes it easy to see that the need for databases that easily span and interact between multiple data centers is only going to escalate, and likely at a rapid clip.

    But what does a multi-data center database look like? Does it just equate to log shipping, mirroring between data centers, master-slave replication, or something else? To be sure, there are use cases where those options will work just fine, but increasingly I’m seeing the following short list of requirements:

    • The ability for a single, logical database to span 1-N data centers, not just two.
    • Multi-directional syncs between data centers, not just one-way. Put another way: the desire for truly location-independent, read-and-write-anywhere freedom.
    • Built-in network intelligence so that data is transferred smartly between data centers to minimize bandwidth overload and latency issues.
    • The ability to support all key types of data traffic across data centers (e.g. real-time, analytic, and search).
    The reasons a multi-data center database is needed vary. Some use cases involve just the simple desire for a good disaster recovery plan. But the majority revolve around keeping one logical database synced between 1-N physical data centers while delivering the fastest possible response times to the users each data center serves in its locale.

    Pulling this off isn’t easy unless you start with the right database architecture and feature set. For example, master-slave designs are often impractical because they can’t meet the read-write-anywhere requirement. Fortunately, Cassandra’s architecture is tailor-made for multiple data centers. Its peer-to-peer design, coupled with online scale-out and full redundancy with no single point of failure, provides continuous availability and makes it ideal for multi-data center environments. Further, DataStax Enterprise supports all of this not only for Cassandra, but also for Hadoop and Apache Solr enterprise search in database clusters. This covers both on-premises data centers and cloud deployments.

    So if you’re one of the many who are running and expanding your organization’s data center footprint, and you need a database built from the ground up to support multiple data centers, download DataStax Enterprise now and give it a try. We think you’ll be pleased with the end result, just like many of our customers who are happily running their database across multiple data centers.
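
As a rough illustration of the "single logical database spanning 1-N data centers" requirement, Cassandra lets a keyspace declare a replica count per data center through its NetworkTopologyStrategy replication class. The Python sketch below only builds the statement text. The keyspace name, the data center names, and the helper function are hypothetical, and the `WITH replication = {...}` form is the CQL 3 syntax of later Cassandra releases, so the exact syntax in any given release may differ.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Return a CQL 3 statement replicating one keyspace across data centers.

    replicas_per_dc maps a data center name to its replica count.
    All names passed in here are illustrative, not taken from the article.
    """
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # The strategy class comes first, followed by one entry per data center.
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replica_count in replicas_per_dc.items():
        parts.append(f"'{dc_name}': {int(replica_count)}")
    options = ", ".join(parts)
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{options}}};"


# Hypothetical example: three replicas in each of two data centers.
statement = create_keyspace_cql("app_data", {"us_east": 3, "eu_west": 3})
print(statement)
```

Adding a third or fourth data center in this sketch is just another entry in the mapping, which is the sense in which the database stays one logical unit while the physical footprint grows.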
  2. Cassandra vs DynamoDB TCO

    I’ve written on our technical blog about Cassandra vs DynamoDB features; TCO is also an important factor in infrastructure decisions. Amazon recently published their take on TCO for DynamoDB, contrasted with an unnamed “NoSQL” database. While it is a useful starting point, most of the assumptions on display there apply poorly to Cassandra, as I will explain.

    First, at a high level, I do agree with Amazon that when you have a workload that peaks at a fraction of a single machine’s throughput, SaaS pricing can make a lot of sense. However, Cassandra and DataStax Enterprise are targeted firmly at workloads requiring scale-out, either for volume of data or velocity (i.e. req/s).

    If we redo Amazon’s comparison with their server pricing but using Cassandra performance instead of an unnamed generic NoSQL competitor, the 5,000 req/s from the “high” usage scenario fit easily on a single machine. (One high-CPU XL node can handle about 25,000 1KB inserts or reads per second.) Multiplying by three for replication and leaving the rest of the numbers unchanged gets us to $2,432 for Cassandra, vs $2,560 for DynamoDB.

    But there’s one glaring error in Amazon’s numbers, which is requiring $1/GB for “redundant storage” (i.e. SAN or similar). Cassandra and all other scale-out NoSQL solutions *strongly* recommend using direct-attached storage for better reliability, performance, and cost effectiveness. (This applies even more so to avoiding EBS.) Fixing that takes us down to $1,082 for Cassandra.

    Now, if we go to 2x the load at 14k req/s peak (10,000 peak writes, 4,000 reads), that’s still not maxing out our tiny 3-node Cassandra cluster. The Cassandra cost remains $1,082/month, but DynamoDB is up to $5,120/month. Multiply this out by a factor of 3x or 10x and you see why DynamoDB isn’t showing up on our customers’ evaluation radar a whole lot.

    One last note: a reasonable person might ask, “But what if that 1.2TB of data is accessed in a highly random pattern? Clever storage engines can only go so far to reduce the IOPS required for random reads. How will you keep up with I/O demand?” We actually have several customers deploying Cassandra on SSDs for exactly this reason, where it works quite well. And if the rumor mill is to be believed, this will soon be an option for those deploying Cassandra on EC2 as well. Switching our Cassandra nodes to SSDs would surely add some cost, but nowhere near the roughly 5x required to bring it up to DynamoDB’s range.
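
The arithmetic behind the 2x scenario can be checked with a few lines of Python. This is a sketch that uses only the figures quoted in the post. The per-node load model (every write lands on all three replicas, each read is served by one node, and reads are spread evenly) is an assumption made here for illustration, not something the post states.

```python
# Figures quoted in the post.
NODE_CAPACITY_RPS = 25_000      # ~1KB inserts or reads per second per node
PEAK_WRITES_RPS = 10_000
PEAK_READS_RPS = 4_000
CASSANDRA_MONTHLY_USD = 1082
DYNAMODB_MONTHLY_USD = 5120

# Cluster shape from the post: 3 nodes, data replicated 3 ways.
NODES = 3
REPLICATION_FACTOR = 3


def per_node_load(writes, reads, nodes, replication_factor):
    """Estimate req/s per node (assumed model, see the paragraph above)."""
    write_load = writes * replication_factor / nodes  # each write hits every replica
    read_load = reads / nodes                         # each read hits one node
    return write_load + read_load


load = per_node_load(PEAK_WRITES_RPS, PEAK_READS_RPS, NODES, REPLICATION_FACTOR)
utilization = load / NODE_CAPACITY_RPS
cost_ratio = DYNAMODB_MONTHLY_USD / CASSANDRA_MONTHLY_USD

print(f"per-node load: {load:.0f} req/s ({utilization:.0%} of capacity)")
print(f"DynamoDB / Cassandra monthly cost: {cost_ratio:.2f}x")
```

Under these assumptions each node carries a bit under half of its quoted capacity at peak, and the DynamoDB bill comes out at a little under five times the Cassandra figure.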
  3. An Option for an Always-On, Continuously Available Hadoop

    A short while back I was sitting in a half-day Hadoop tutorial at a conference. After the first block of teaching material had been covered, the instructor paused and asked if there were any questions. A hand immediately shot up in the front row. “What happens to my Hadoop cluster if my data center has a major disaster?” The instructor talked about various possibilities that could be pieced together via home-grown methods, a hot-standby option that was coming soon, and some other ideas. But as the questioner pressed his use case, the instructor admitted, “Yeah, right now, you’re pretty much out of luck.”

    Actually, that’s not true. One of the benefits of running Hadoop in DataStax Enterprise is that you get complete protection for your Hadoop cluster, so you can easily set up an always-on, continuously available (not just highly available) Hadoop that ensures you’re protected against any failure. The DataStax Enterprise solution is easy to use, too, so there’s no complicated setup, configuration, or ongoing management needed to make it happen.

    What enables a continuously available Hadoop in DataStax Enterprise is that the HDFS component of Hadoop is swapped out for Cassandra, in the form of the Cassandra File System (CFS). CFS and Cassandra’s easy-to-use, powerful peer-to-peer replication create a very strong combination that does away with the various failure points in community Hadoop and other third-party Hadoop distributions.

    This approach is also quite different from hot standbys or mirroring. Those options, while certainly nice to have for certain use cases, are typically one-way only (the master to the standby, etc.) and can still involve downtime while a switchover to the standby or mirror is made. By contrast, the Hadoop solution in DataStax Enterprise is capable of spanning multiple data centers and the cloud, and is multi-directional in nature. This means you can make updates in multiple data centers and have everything synced between them so that no matter where a failure might occur, your Hadoop system is guaranteed to keep running.

    One last point: this continuous availability feature of DataStax Enterprise benefits not only Hadoop systems, but also real-time data applications using Cassandra and enterprise search systems using Solr. Everything is integrated so that real-time, batch analytic, and search workloads are handled seamlessly in one database cluster.

    For more technical information on CFS, see this blog post from one of our architects. You can try Hadoop and DataStax Enterprise for yourself by downloading it from our website; it’s completely free to use for development purposes. To learn how to set up a Cassandra and Hadoop cluster on Linux, refer to this article or our online docs. In the next two or so months, we’ll be releasing an updated version of DataStax Enterprise that supplies additional Hadoop availability, support, and performance benefits over what I’ve talked about above, so stay tuned for that.

  4. The Five-Minute Interview – Workware Systems

    This article is one in a series of quick-hit interviews with companies using Apache Cassandra for key parts of their business. For this interview, we talked to Don Ledford (Co-Founder & CTO) of Workware Systems Inc. in Seattle, WA.

    DataStax: So Don, what’s Workware Systems all about?

    Don: We produce intelligence collection and management systems, which sounds a little broad, but essentially we center on investigative case management and serve markets like government intelligence and law enforcement agencies, and other public organizations, as well as private enterprises.

    DataStax: What kind of infrastructure and development environment do you have?

    Don: Our applications are all web-based. We use Linux for our servers, and Java as the primary development language. We use Hector to connect to Cassandra.

    DataStax: What caused you to go with Cassandra for your data store?

    Don: We started working with relational databases, and began building things primarily with PostgreSQL at first. But for the kind of data that we deal with, the data model just wasn’t appropriate. We started with Cassandra to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMSs just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.

    DataStax: Why did you choose Cassandra over some of the other NoSQL options?

    Don: There were two primary reasons. First, the data model allowed us to have very effective large vector handling. And second, the Cassandra community is vast, vibrant and extremely helpful; everyone is very responsive, whereas other communities weren’t even close in that regard.

    DataStax: What’s the typical configuration and data volume that you guys deal with?

    Don: A normal setup for our customers is a couple of web servers, a 3-node Cassandra cluster with a replication factor of 3, and a total data volume of around 12 terabytes. That usually serves an organization like a police department with several thousand concurrent users very well. There are some larger law enforcement agencies that we’re beginning to work with that will necessitate a multi-data center setup for disaster recovery purposes.

    DataStax: What kind of performance are you seeing with Cassandra?

    Don: For a general million-plus set of objects and unstructured documents we see near-instantaneous response time for reads and searches, which is great.

    DataStax: Don, thanks for the time!

    Don: You bet.

    For more information on Workware Systems Inc., visit: http://workwaresystems.com
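
For readers sizing a similar setup, the availability implications of a 3-node cluster with a replication factor of 3 can be sketched in Python. The interview does not say which consistency level Workware uses; QUORUM is assumed here purely for illustration, and the helper names are hypothetical.

```python
def quorum(replication_factor):
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


def fraction_of_data_per_node(nodes, replication_factor):
    """Share of the data set each node stores, assuming even distribution."""
    return min(1.0, replication_factor / nodes)


# Configuration described in the interview: 3 nodes, replication factor 3.
rf, nodes = 3, 3
print(quorum(rf))
print(tolerated_replica_failures(rf))
print(fraction_of_data_per_node(nodes, rf))
```

With these numbers every node holds a full copy of the data, and under the assumed QUORUM level the cluster can keep serving reads and writes with one node down.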
  5. The Five-Minute Interview – AppScale

    This article is one in a series of quick-hit interviews with companies using Cassandra for key parts of their business. For this interview, we caught up with Raj Chohan, Ph.D. student at UCSB and one of the founders of AppScale.

    DataStax: Raj, tell us a little about AppScale’s history and what you guys do.

    Raj: We released our first version of AppScale in 2009. What we allow you to do is take your Google App Engine application and run it on your own hardware. We started with Eucalyptus for private clouds, which makes it easy to also support OpenStack for both public and private clouds since they’re a fork of Eucalyptus. We also have our AMI for EC2.

    DataStax: And you support different back-end databases?

    Raj: Yes. We support 12 datastores so the user can choose what datastore they want. We support Cassandra, MySQL, MongoDB, HBase, and others.

    DataStax: Tell us about your Cassandra usage and support.

    Raj: We’ve made Cassandra our default datastore because our benchmarks have shown it to be the highest-performing database of them all. We’ve done quite a bit of research in this area that shows this to be the case, and have written a number of research papers on the subject.

    DataStax: Does data volume play into that at all?

    Raj: Yes. For example, we saw that HBase did well up until a certain size, but then slowed down. With Cassandra, this wasn’t the case. And we haven’t moved to Cassandra 1.0 yet, which I understand is even faster.

    DataStax: What about ease-of-use and setup of Cassandra over the others?

    Raj: For us, Cassandra has been very simple to work with and use. The only knock I have on Cassandra is that when we do see errors, the messages sometimes aren’t as meaningful as I would like. Having an error message that links to the online docs or a troubleshooting guide would be nice.

    DataStax: What other things do you like about Cassandra over the other databases?

    Raj: Configuration is very simple compared to everybody else. For example, with HBase, you have to configure HDFS, Zookeeper, and then HBase, whereas Cassandra is just one package by itself. When we started out with both MongoDB and Cassandra, we found that Cassandra was a much easier starting point for the things we needed. Also, at the time, MongoDB wasn’t doing all the distributed sharding and whatnot. They kind of came late to the party with all those features. So that’s one reason we like and use Cassandra so much; it has all the core features that we want and has had them for some time. MongoDB can be fast in some cases, but its data persistence can be an issue, as can the global write lock. Overall we’ve found MongoDB to be slower than Cassandra.

    DataStax: What other functionality, database-wise, is important to you?

    Raj: Google App Engine doesn’t have great support for OLAP-type operations, so things like Hive and Pig support are meaningful to us. These are things, I know, that are in your Enterprise offering.

    DataStax: Raj, thanks for the time.

    Raj: Sure thing.

    For more information on AppScale, visit: http://appscale.cs.ucsb.edu/
  6. Top 5 Considerations for a Big Data Solution

    I gave a presentation at the GigaOM Structure event in New York last week that attracted a larger audience than I expected, so I thought I’d post the presentation for viewing. You can go to the Slideshare site to view the deck.

  7. Tips for Getting Started with DataStax Enterprise 2.0

    As we announced yesterday, our DataStax Enterprise 2.0 server is now ready for download! Here at DataStax we’re (naturally!) jazzed about this release because it provides a number of very exciting new features including:
    • Enterprise search with Apache Solr! This is definitely the headlining new feature in 2.0. We take all the goodness of Solr, and mix in full data durability, easy scale out, no single points of failure, automatic data sharding, multi-data center support, and real-time search queries via CQL additions. Sound good? We think so.
    • Elastic workload provisioning! Want to easily change some existing Cassandra nodes into Hadoop nodes whenever you’d like and double or quadruple the analytic processing power of your cluster? Well, now you can! And, of course, you can go in the opposite direction too if needed (Hadoop to Cassandra).
    • Easy RDBMS data migration! With 2.0, it’s a piece of cake to pump the data out of your legacy RDBMSs into a database cluster that’s ready for big data.
    • Web/Application log integration! Want to easily analyze and search your web logs and application log files? We make it pretty simple in DataStax Enterprise 2.0.
    • Hadoop upgrade! We’re now on 1.0.
    To get a better idea of each new feature, and a good understanding of DataStax Enterprise in general, check out our new “What’s New in DataStax Enterprise 2.0?” paper that’s now available for download. You can also watch a short video that goes over DataStax Enterprise 2.0. And if you want to move data right now from one or more RDBMSs into Cassandra/Hadoop/Solr, check out a new article we’ve posted that gives you a good tutorial on how it’s done.

    DataStax Enterprise 2.0 also ships with sample demos that showcase all the key features of the server in action. You can run our portfolio demo that shows off the power of Hadoop, our Solr demo that downloads, indexes, and searches Wikipedia with blazing speed, our RDBMS migration demo that migrates data from MySQL into our server, and our log integration demo that shows you how to stream, analyze, and search log data in 2.0. You can find all these in the /demos subdirectory of the server install. Of course, you can also check out our updated online and PDF docs that cover everything in 2.0 in detail.

    You can download DataStax Enterprise 2.0 now from our website’s download page. Remember, also, that DataStax Enterprise 2.0 is completely free for development use, so don’t worry about any trial periods or time bombs in the software you’re using. Production deployments do, however, require a subscription, so when you’re ready to go live with your new DataStax Enterprise applications, be sure to contact us so you’re all legal-eagle and fully supported. Let us know what you think of DataStax Enterprise 2.0 when you get a chance, and thanks for your support of DataStax and Apache Cassandra.
  8. Introducing DataStax Enterprise 2.0

    Today we announced DataStax Enterprise 2.0, the next step in our evolution of helping people solve their big data problems. In DataStax Enterprise 2.0, we have added:
    • Enterprise search via the extremely popular open-source project, Solr.
    • Elastic workload provisioning that allows your cluster to change compute power between Cassandra and Hadoop based on your application needs.
    • Snap-in log integration for application and weblogs so that they can be written, indexed, and searched, all in the same cluster with the rest of your data.
    • RDBMS data transfer via Sqoop to help quickly move your relational data into DataStax Enterprise.
    The question: what does this really mean for businesses? Big data applications affect three primary groups: the “Business”, developers, and IT operations/decision makers. If these groups are misaligned, it creates substantial challenges that slow everything down to a crawl. With DataStax Enterprise 2.0, we provide a platform that can deliver continuous availability without compromising performance, operational simplicity, or cost. The Business no longer has to wait on complex infrastructures to be built or managed. Developers have a single cluster for handling everything in their applications, from real-time transactions to Hadoop batch analytics to powerful search. And IT operations has an elegant architecture that will easily span data centers and/or the cloud, which means low operational costs even under the most extreme demands.

    This is what DataStax Enterprise is all about: helping customers move faster in what can otherwise be a very complicated world of systems that fall short on many levels and leave teams unable to react and build great things with big data applications. You can check out all the information on DataStax Enterprise 2.0 here. You may also find our white paper, Big Data – Beyond the Hype, helpful. I also want to publicly thank all our folks at DataStax who have made this such a great release! We have an amazing team dedicated to you, our customers, and I could not be any prouder of the work they do.