Choosing a Database for the Cloud

In today’s world of software engineering, traditional relational databases (RDBMS) like MySQL or PostgreSQL databases are no longer the ‘de facto’ choice for a database system. Since the increase in popularity of cloud computing, NoSQL databases have risen to play an important part in data architecture in the cloud. Cloud servers no longer guaranteed dedicated performance (disk io / cpu / memory) as well as a 99.99% uptime. The best approach of designing software in the cloud is by designing the application to expect failures. This proves to be an issue as databases are usually the living heart and soul of any data-driven application. Without data, the application cripples.

Because Cloud Computing provides the ability to scale server resources up/down easily and quickly, the database design will also need to respond to the change in traffic load and scale accordingly. Most databases will have replication features, which you can setup a master-slave network of databases to help ease the load of a high-read application. But what about a high-write application – would you need to consider sharding?

The key point to take away from this post is that although there are many databases available that you can choose from (MySQL, PostgreSQL, Redis, Riak, MongoDB, CouchDB, HBase, Cassandra, Neo4j, etc.), there is no such thing as the ‘best database for the cloud’. In order for you to choose the ‘best’ database, you must first identify the needs of your application. If you are familiar with the CAP Theorem (Consistency, Availability, Partition Tolerant), different databases are designed for different combinations of CAP. Although there are many blog posts on the interweb comparing the different variations of databases, you should not base your choice solely from these results. For example, although big corporations like Twitter and Facebook use HBase for some of their products, doesn’t mean that you should implement HBase on your design. Perhaps a less complex setup is key and therefore a database like CouchDB is more suitable. So below are a few of the key questions you should try and answer which should help you narrow down the choices so that you can then focus on the details of the databases and then ultimately choose the ‘best’ database for your application.

Is your application read or write heavy? or both?

  • Read-Heavy – [DBs with Replication feature i.e. almost all] most databases provide master-slave (or even master-master) replication. Replication will have handle the load of read-heavy applications.
  • Write-Heavy – [DBs with Sharding feature i.e. MongoDB, HBase, Cassandra, Riak] whilst you can have a master-slave setup, all writes (i.e. inserts / updates / deletes) will be directed to the master. In order to take some of the load off the master, you need sharding
  • Both – [DBs with Replication and Sharding feature i.e. MongoDB, HBase, Cassandra, Riak]

Do you have an ops team to help with the complex setup / management of db clusters?

  • Databases like MySQL, CouchDB are very easy to get started with on a single server. They provide easy to use GUI / admin tools that you can experiment around with. Others like HBase, Cassandra and MongoDB will require more planning and architecture design to get an optimized setup.

Does your data need guaranteed durability?

  • Databases like MongoDB and Redis are known for their blazing speed because they first store values onto memory which then gets flushed to disk periodically. However as a trade-off to speed, they are a threat for data not being persisted given a DB failure event.

How big is your data?

  • Databases like Cassandra and HBase are designed for ‘Big Data’ ground up. However the ability to handle huge data comes at a cost: complexity

What is your primary goal and what does your application dataset resemble?

  • Are you building a write-log type system, or a read-cache reference type system, or a write-analyse analytics type system? Does your application natural fall under a key-value (Redis, Riak) / document orientated (MongoDB, CouchDB) / relational (MySQL, PostgreSQL) / columnar (Cassandra / HBase) or graph (Neo4J) type data model?

Do you need features like map-reduce / secondary indices / REST interface / views or stored procedures?

  • Some databases provide a subset of features where others don’t. For example, if you need a feature like secondary indices, you would choose MongoDB over CouchDB.

There are many other questions you should ask yourself but the bottom line is: there is no ‘right or wrong’ database. Instead there are ‘suitable and less suitable’ ones. In fact, you don’t even have to choose just one – a combination of multiple databases could as well, yield the best result.

@munwaikong

Infrastructure