pinterest

Less than a year ago, I wrote a post on the different DB solutions available that are ‘cloud-ready’. For those who are still struggling with the decision of which database solution to use and want some real-life examples, I recommend reading this recent post on Pinterest’s infrastructure architecture.

Key notes from the article:

Choice of DB: MySQL + Redis over MongoDB or Cassandra
Reason for DB: Sharding over Clustering – because of maturity and ease of hire
Number of DBs: 88 Master + 88 Slaves (cc2.8xlarge = $2.700 per Hour = $1944 per Month per Server = $342,144 per month for just MySQL)

Interesting points:

  • Algorithm for placing data is very simple. The main reason. Has a SPOF, but it’s half a page of code rather than a very complicated Cluster Manager. After the first day it will work or won’t work.
  • Can’t perform most joins.
  • Lost all transaction capabilities. A write to one database may fail when a write to another succeeds.
  • Reports require running queries on all shards and then perform all the aggregation yourself.
  • Joins are performed in the application layer.
  • When the Pin table went to a billion rows the indexes ran out of memory and they were swapping to disk.
  • If your project will have a few TBs of data then you should shard as soon as possible.
  • Architecture is doing the right thing when growth can be handled by adding more of the same stuff. You want to be able to scale by throwing money at the problem by throwing more boxes at the problem as you need them. If you are architecture can do that, then you’re golden.

My thoughts:

Firstly, thank you Pinterest for showing us your infrastructure architecture. With so many choices of DBs available (NoSQL vs SQL vs NewSQL), it is very insightful and helpful for others in a position to choose which DB strategy path to follow. After analysing the post, I feel that for a small team / startup and for future proofing, a fully managed DB is a better option. Small teams don’t usually have the necessary man power to maintain and branch out new MySQL shards. What they should do instead, is to concentrate on delivering quality software and if traffic increases, to have the flexibility to push a button and let a DaaS take care of the scaling ala DynamoDB or Instaclustr or MongoHQ.

Their decision to shard MySQL meant that:

a) they can horizontally scale
b) they can run flexible (but limited) queries

however, sharding MySQL had drawbacks too:

a) you can no longer have JOINS – one of the main strong points of having SQL
b) no longer fully transactional – again, another strong point for choosing SQL over NoSQL
c) operational burden for manual sharding – your team needs to be 100% on top of data size as well as when the need to re-shard

It does feel to me that there is quite a lot to work with (though they have the man power to do so 44 engineers). But then again, they’ve done it and proved it to work – so surely it’s can be considered a winning strategy?

Thoughts?

@munwaikong

Infrastructure

In today’s world of software engineering, traditional relational databases (RDBMS) like MySQL or PostgreSQL databases are no longer the ‘de facto’ choice for a database system. Since the increase in popularity of cloud computing, NoSQL databases have risen to play an important part in data architecture in the cloud. Cloud servers no longer guaranteed dedicated performance (disk io / cpu / memory) as well as a 99.99% uptime. The best approach of designing software in the cloud is by designing the application to expect failures. This proves to be an issue as databases are usually the living heart and soul of any data-driven application. Without data, the application cripples.

Because Cloud Computing provides the ability to scale server resources up/down easily and quickly, the database design will also need to respond to the change in traffic load and scale accordingly. Most databases will have replication features, which you can setup a master-slave network of databases to help ease the load of a high-read application. But what about a high-write application – would you need to consider sharding?

The key point to take away from this post is that although there are many databases available that you can choose from (MySQL, PostgreSQL, Redis, Riak, MongoDB, CouchDB, HBase, Cassandra, Neo4j, etc.), there is no such thing as the ‘best database for the cloud’. In order for you to choose the ‘best’ database, you must first identify the needs of your application. If you are familiar with the CAP Theorem (Consistency, Availability, Partition Tolerant), different databases are designed for different combinations of CAP. Although there are many blog posts on the interweb comparing the different variations of databases, you should not base your choice solely from these results. For example, although big corporations like Twitter and Facebook use HBase for some of their products, doesn’t mean that you should implement HBase on your design. Perhaps a less complex setup is key and therefore a database like CouchDB is more suitable. So below are a few of the key questions you should try and answer which should help you narrow down the choices so that you can then focus on the details of the databases and then ultimately choose the ‘best’ database for your application.

Is your application read or write heavy? or both?

  • Read-Heavy – [DBs with Replication feature i.e. almost all] most databases provide master-slave (or even master-master) replication. Replication will have handle the load of read-heavy applications.
  • Write-Heavy – [DBs with Sharding feature i.e. MongoDB, HBase, Cassandra, Riak] whilst you can have a master-slave setup, all writes (i.e. inserts / updates / deletes) will be directed to the master. In order to take some of the load off the master, you need sharding
  • Both – [DBs with Replication and Sharding feature i.e. MongoDB, HBase, Cassandra, Riak]

Do you have an ops team to help with the complex setup / management of db clusters?

  • Databases like MySQL, CouchDB are very easy to get started with on a single server. They provide easy to use GUI / admin tools that you can experiment around with. Others like HBase, Cassandra and MongoDB will require more planning and architecture design to get an optimized setup.

Does your data need guaranteed durability?

  • Databases like MongoDB and Redis are known for their blazing speed because they first store values onto memory which then gets flushed to disk periodically. However as a trade-off to speed, they are a threat for data not being persisted given a DB failure event.

How big is your data?

  • Databases like Cassandra and HBase are designed for ‘Big Data’ ground up. However the ability to handle huge data comes at a cost: complexity

What is your primary goal and what does your application dataset resemble?

  • Are you building a write-log type system, or a read-cache reference type system, or a write-analyse analytics type system? Does your application natural fall under a key-value (Redis, Riak) / document orientated (MongoDB, CouchDB) / relational (MySQL, PostgreSQL) / columnar (Cassandra / HBase) or graph (Neo4J) type data model?

Do you need features like map-reduce / secondary indices / REST interface / views or stored procedures?

  • Some databases provide a subset of features where others don’t. For example, if you need a feature like secondary indices, you would choose MongoDB over CouchDB.

There are many other questions you should ask yourself but the bottom line is: there is no ‘right or wrong’ database. Instead there are ‘suitable and less suitable’ ones. In fact, you don’t even have to choose just one – a combination of multiple databases could as well, yield the best result.

@munwaikong

Infrastructure