Transactional cloud databases come in all shapes and sizes, from simple key-value stores to planet-scale distributed relational databases. Here’s how to choose the right cloud database for your application.
Databases have come a long way since the early 1980s when they only ran on mainframes and your choice was between CODASYL databases and relational databases. CODASYL databases tended to be twice as fast as relational databases, but eventually the improvements in computer hardware and the convenience of SQL queries (versus writing navigational database code) led to relational databases dominating the market.
Now there are databases to run anywhere, from your smartphone, to servers in your data center, to geographically distributed cloud databases. In addition to relational tabular databases, there are databases for time series, graph, spatial, text, online analytical processing (OLAP), XML, and JSON data. Some databases specialize in a single kind of data; some offer a choice of data types, but only one at a time; some allow multiple data types to coexist in the same database instance. Some databases specialize in online transaction processing (OLTP), some in analysis (OLAP), and some work well with combined transactional and analysis workloads.
For this article, we will concentrate on transactional workloads running in the cloud, and leave cloud data warehouses for another article. Some of the databases we’ll discuss can also run on-premises; some have specialized cloud support but are compatible with on-prem databases; and some are “cloud-native,” meaning they are available only from a cloud provider.
Determining database requirements
A database is almost never a thing unto itself. Instead, a database is generally the back end or storage layer of an application.
If the application is a worldwide multiplayer video game, then both read and write latency matter, and the database probably needs to be distributed, though not necessarily relational and not necessarily strongly consistent; a key-value database might be ideal. If the application records and monitors sensor outputs from valves, the database should be able to write large amounts of time series data quickly.
How much data will you generate? And how fast?
Small amounts of data, measured in gigabytes or less, can be handled by almost any cloud database, and some can keep that data entirely in memory. Many cloud databases can handle terabytes (thousands of gigabytes); only a few can accommodate petabytes (millions of gigabytes). Note that most cloud databases charge for storage by the month, and charge more for SSD storage than for hard disk storage.
The speed at which data arrives can stress other parts of the system, such as database write throughput and network capacity. If a lot of data arrives in a burst, the database or a front-end program may need to buffer it in RAM while writing it to permanent storage, to avoid data loss. Redis is often used as a buffer for large data bursts in front of other databases.
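To make that concrete, here is a minimal Python sketch of the pattern, assuming a local Redis instance and the redis-py client; the "sensor_buffer" key and the save_to_database() helper are hypothetical names for this illustration. The fast path appends incoming readings to a Redis list, and a background worker drains the list to the database of record at a pace it can sustain.

```python
# Minimal sketch: buffer a burst of sensor readings in Redis, then drain
# them to permanent storage at a sustainable rate. Assumes a local Redis
# instance and the redis-py client; "sensor_buffer" and save_to_database()
# are hypothetical names used only for this example.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_to_database(batch: list) -> None:
    # Stand-in for a bulk insert into the database of record.
    print(f"persisted {len(batch)} readings")

def buffer_reading(reading: dict) -> None:
    # Fast path: append the reading to a Redis list (an O(1) operation).
    r.rpush("sensor_buffer", json.dumps(reading))

def drain_buffer(batch_size: int = 100) -> None:
    # Slow path: pop buffered readings and write them out in batches.
    batch = []
    for _ in range(batch_size):
        item = r.lpop("sensor_buffer")
        if item is None:
            break
        batch.append(json.loads(item))
    if batch:
        save_to_database(batch)
```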
Do you know your schema ahead of time?
If your database schema (the structure of the data) is predetermined and unlikely to change significantly over time, and you want most fields to have consistent types from record to record, then SQL databases would be a good choice for you. Otherwise, NoSQL databases, some of which don’t even support schemas, might be better for your application.
There are exceptions, however. For example, Rockset, an operational database, allows for SQL queries without imposing a fixed schema or consistent types on the data it imports.
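To illustrate the underlying difference, here is a small Python sketch; sqlite3 stands in for any SQL database, and the table and documents are invented for this example. The fixed schema makes every row uniform, while the document list shows the kind of record-to-record variation a schemaless store tolerates.

```python
# Minimal sketch of the difference between a fixed relational schema and
# schemaless documents. sqlite3 stands in for any SQL database here.
import json
import sqlite3

# Fixed schema: every row has the same columns with consistent types.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer TEXT NOT NULL,
           total REAL NOT NULL,
           placed_at TEXT NOT NULL
       )"""
)
conn.execute(
    "INSERT INTO orders (customer, total, placed_at) VALUES (?, ?, ?)",
    ("Acme", 99.50, "2021-06-01T12:00:00Z"),
)

# Schemaless: each document can carry different fields; the database
# (or your application) must cope with the variation at read time.
documents = [
    {"customer": "Acme", "total": 99.50, "coupon": "SPRING10"},
    {"customer": "Globex", "items": [{"sku": "A1", "qty": 3}]},
]
print(json.dumps(documents, indent=2))
```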
What kind of shape is a natural fit for your data?
Relational SQL databases, such as Microsoft SQL Server, PostgreSQL, and MySQL, store strongly typed data in rectangular tables with rows and columns. They rely on defined relations between tables, use indexes to speed up selected queries, and use JOINs to query multiple tables at once. Many modern relational databases, including Oracle Database, also support other data shapes.
Document databases, such as MongoDB and Couchbase, typically store weakly typed JSON (either text or binary) that may include arrays and nested documents. Graph databases store either vertices and edges with properties, for example Neo4j, or RDF triples, for example AllegroGraph. No matter the implementation, graph databases emphasize the connections between entities. Other NoSQL database categories include key-value stores, such as RocksDB, and wide-column stores, such as Cassandra.
Sometimes the data is captured in a shape that will also work for analysis; sometimes it isn’t, and a transformation will be necessary. Sometimes one kind of database is built on another. For example, key-value stores can underlie almost any kind of database.
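As a rough illustration of that last point, here is a toy Python sketch of how both a relational row and a JSON document can be reduced to key-value pairs; the key layout is invented for this example and does not reflect any particular database's on-disk format.

```python
# Toy sketch of how a key-value store can underlie other data models:
# a relational row and a JSON document both reduce to key-value pairs.
# The key layout here is purely illustrative.
import json

kv_store = {}  # stands in for a persistent key-value engine

# A "relational" row keyed by table name and primary key.
kv_store[b"orders/42"] = json.dumps(
    {"customer": "Acme", "total": 99.50}
).encode()

# A secondary index is just more keys that point back to the primary key.
kv_store[b"orders_by_customer/Acme/42"] = b"orders/42"

# A "document" keyed by collection and document id.
kv_store[b"profiles/u123"] = json.dumps(
    {"name": "Kim", "tags": ["admin", "beta"]}
).encode()

# A point lookup is a single key read; a range scan over a key prefix
# (simulated here with a dict comprehension) supports simple queries.
prefix = b"orders_by_customer/Acme/"
matches = {k: v for k, v in kv_store.items() if k.startswith(prefix)}
print(matches)
```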
What are your latency requirements?
Latency refers both to the response time of the database and to the end-to-end response time of the application. Ideally every user action will have a sub-second response time; that often translates to needing the database to respond in under 100 milliseconds for each simple transaction. Analytic queries can often take seconds or minutes. Applications can preserve response time by running complicated queries in the background.
Running the database in the cloud complicates the measurement of latency. There are multiple factors at work here. The simplest consideration is that the latency between the client and the database adds to the latency due to the database query response. A more complicated consideration is that committing transactions in a distributed database may involve waiting for writes in geographically distributed regions, especially if the database is maintaining strong consistency.
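If you want to know what your users will actually experience, measure end-to-end latency from the client's point of view rather than relying on the database's own query timings. Here is a minimal Python sketch; run_query() is a hypothetical stand-in for a real call through your database client.

```python
# Minimal sketch: measure end-to-end latency as the application sees it,
# which includes network round trips as well as the database's own query
# time. run_query() is a hypothetical stand-in for a real client call.
import statistics
import time

def run_query():
    time.sleep(0.02)  # placeholder for a real query over the network

samples = []
for _ in range(50):
    start = time.perf_counter()
    run_query()
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds

# Report the median and the 95th percentile; tail latency is usually what
# users notice, especially when one page issues several queries.
p50 = statistics.median(samples)
p95 = statistics.quantiles(samples, n=20)[-1]
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms")
```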
Do you need a clustered database?
Clustered databases can offer several advantages over single-node databases, at higher expense and complexity. Among other benefits, clusters can exhibit higher availability, higher throughput, and, in some cases, lower latency.
For the kind of cluster where each node has a copy of the entire database, you get significant redundancy and higher availability. Depending on the policy, only one node may be needed for a read, or else a quorum of the nodes in the cluster might have to agree on the value to return.
Using a cluster with multiple nodes makes more CPU available to the database, which increases throughput and may increase the transaction rate. With a read policy that allows the nearest node to return a value, read latency typically goes down. On the other hand, a write or transaction policy that must wait for all nodes to commit can sometimes increase write latency.
The use of consensus groups helps reduce latency. If you have a three-node cluster and one node is loaded down, the other two nodes can approve a consensus transaction, and update the third node when it is available.
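Here is a toy Python sketch of a majority quorum read across three replicas; it is deliberately simplified (no versioning, no leader election, no real consensus protocol such as Raft or Paxos), but it shows why two healthy nodes out of three are enough to answer.

```python
# Toy sketch of a majority quorum read across three replicas. Real systems
# use versioned values and consensus protocols; this only illustrates the
# "two out of three" idea.
from collections import Counter

def quorum_read(replicas, key):
    responses = []
    for replica in replicas:
        try:
            responses.append(replica[key])  # stand-in for a network read
        except KeyError:
            continue  # treat a missing key as an unavailable or lagging node
    if not responses:
        raise RuntimeError("no replica answered")
    quorum = len(replicas) // 2 + 1
    value, votes = Counter(responses).most_common(1)[0]
    if votes >= quorum:
        return value
    raise RuntimeError("no quorum: fewer than a majority agree")

replicas = [
    {"balance": 100},   # up to date
    {"balance": 100},   # up to date
    {},                 # lagging replica that hasn't applied the write yet
]
print(quorum_read(replicas, "balance"))  # 100, despite one lagging node
```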
Sharding is a way of handling more data by dividing up the database. While manual sharding can be a time-consuming pain in the neck, many databases are capable of automatic sharding.
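Here is a minimal Python sketch of the routing idea behind hash-based sharding; the shard names and key format are invented for this example, and real automatic sharding (range splits, rebalancing, consistent hashing) is considerably more sophisticated.

```python
# Minimal sketch of hash-based shard routing: a stable hash of the shard
# key maps each record to one of N shards.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # Use a stable hash (not Python's randomized hash()) so every
    # application server routes the same key to the same shard.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:42"))    # always maps to the same shard
print(shard_for("customer:1337"))
```

One design note: a plain modulo scheme forces wholesale rehashing whenever you add shards, which is one reason production systems tend to prefer range-based or consistent-hashing schemes instead.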
Do you need a distributed database?
Clustering is not the ultimate way to expand a database; it's only a first step. The next step is a distributed database, which usually means clusters in multiple regions. Some databases allow for distributed read replicas and a single read-write master instance or cluster. Others allow for distributed read-write instances or clusters and provide synchronization mechanisms.
Distributed databases can often deliver lower latency and higher throughput for remote users. Users in Tokyo might see 260 millisecond latency to a server in Barcelona, but if there were a database copy in Japan their average read latency could be as little as 10 milliseconds. The implications of this for writes and transactions depend on the consistency requirements of the database and how the remote clusters are configured.
Early distributed databases were NoSQL databases with eventual consistency. Eventual consistency means that a read in a remote location immediately after a write isn't guaranteed to return the current value, but the replicas converge on it over time. Eventual consistency relaxes the requirements for writes and transactions to complete, resulting in lower latency.
Recently a few distributed databases have implemented strong consistency aided by data fabrics, consensus groups, and time synchronization. Examples of this include Google Cloud Spanner and CockroachDB.
What’s your database budget?
While most databases have a “community” or “develop/test” version that is free, those may lack support other than online community forums. Community and open source versions may also lack some of the performance optimizations offered in commercial builds. If your company will depend on the database, you should invest in a license and support.
If you’re running a database in the cloud, you’ll at minimum need to pay for your cloud resources. For a commercial database, you’ll also need a database license, which can either be a long-term license from the vendor, or a pay-as-you-go license purchased through the cloud provider.
Key cloud database products
There were 373 systems in the DB-Engines ranking the last time I looked, although many of these products are not explicitly cloud databases. I’ve chosen 12 cloud service and cloud database vendors for this list as examples, and listed them alphabetically. Note that inclusion in this list is not a recommendation, and exclusion is not a condemnation.
Amazon Web Services offers at least 15 databases in its cloud, although several of those are data warehouses, and a few others have been deprecated. Aurora is its high-performance, high-availability relational database service, which supports both MySQL and PostgreSQL. RDS is its standard-performance relational database service, which supports five engines: MariaDB, MySQL, Oracle Database, PostgreSQL, and Microsoft SQL Server. DynamoDB is its high-traffic key-value database service. ElastiCache is its in-memory service, with Memcached and Redis compatibility. DocumentDB is a document database service with MongoDB compatibility. Keyspaces is a wide-column database service with Cassandra compatibility. Neptune is a graph database service that supports both property graph and RDF models. Timestream is a time-series database service. QLDB is a ledger database service.
CockroachDB is a distributed, horizontally scalable, dynamically sharded, relational, multi-model database that implements PostgreSQL on top of a key-value store; it has strong consistency and incredible survivability. CockroachDB Core is free and open source; CockroachDB Enterprise is a commercial version with additional features; CockroachCloud is a vendor-managed multi-cloud database as a service based on CockroachDB Enterprise and Kubernetes; CockroachCloud Free is an always-free version of CockroachCloud with reduced functionality and a limit of 1 vCPU and 5 GB storage per free cluster. CockroachDB added spatial data storage and indexing at the end of 2020.
Couchbase Server is a memory-first, distributed, flexible, JSON document database that is strongly consistent within a local cluster. Couchbase Lite is a mobile version that can run locally and can also sync to the server when connected. Couchbase Cloud is a fully managed NoSQL database as a service for mission-critical applications that automates the deployment and management of Couchbase Server in your cloud environment on either AWS or Microsoft Azure.
DataStax Enterprise is an enhanced, cloud-native version of the open-source, wide-column database Apache Cassandra. DataStax Astra is a cloud-native multi-cloud, serverless, scalable, multi-region DBaaS built on Apache Cassandra/DataStax Enterprise. Storage-attached indexing gives Astra query capabilities on non-primary keys, which isn’t yet available on any other version of Cassandra.
Google Cloud hosts more than a dozen kinds of database. Relational databases include Bare Metal Solution for Oracle Database; Cloud SQL for MySQL, PostgreSQL, and Microsoft SQL Server; and Google Cloud Spanner, which is cloud-native with unlimited scale, strong consistency, and up to 99.999% availability. Google Cloud Bigtable is a wide-column store similar to Cassandra or HBase. Firestore and Firebase Realtime Database are document databases. Memorystore supports Redis and Memcached APIs. Google Cloud Partner Services support managed offerings from MongoDB, DataStax, Redis Labs, and Neo4j.
IBM offers about 10 kinds of database in its cloud. Relational database services include PostgreSQL, EnterpriseDB (a commercial extension of PostgreSQL), and IBM Db2. NoSQL database services include IBM Cloudant (a document database), MongoDB (also a document database), DataStax (a commercial extension of wide-column Cassandra), and Redis (an in-memory data structure store, used as a database, cache, and message broker). IBM hosts both PostgreSQL and MongoDB in hyper-protect environments, which are end-to-end encrypted.
Microsoft Azure supports eight transactional cloud databases. Azure SQL is the cloud-native version of SQL Server, a relational multi-model database; Azure SQL Managed Instance is similar, but offers maximum compatibility with the latest SQL Server engine. You can also run SQL Server in virtual machines. Azure Database supports MariaDB, MySQL, and PostgreSQL. Cosmos DB is a high-availability, multi-model, multi-region database service that offers document, wide-column, key-value, and graph models, albeit one model per instance. Azure Cache for Redis is a Redis-compatible in-memory service. Azure Managed Instance for Apache Cassandra is a managed wide-column database that can synchronize with on-prem Cassandra clusters.
MongoDB Atlas is a multi-cloud, document database service available on AWS, Google Cloud, and Microsoft Azure. MongoDB itself is available as a managed service or in virtual machines on essentially every cloud service provider.
MySQL, MariaDB, Vitess, PlanetScale, and SkySQL are MySQL-derived databases available as a cloud service. MySQL is an open source multi-model relational database, available as a managed service in AWS, Google Cloud, Microsoft Azure, and Oracle Cloud, and in virtual machines on essentially every cloud service provider. MariaDB is a fork of MySQL by the original developers. Vitess is a database clustering system for horizontal scaling of MySQL, with automatic sharding. PlanetScale is a MySQL-compatible, serverless database platform powered by Vitess. SkySQL is a MariaDB service available in AWS and Google Cloud.
Neo4j is an ACID-compliant property graph database with many clustering features. Neo4j Aura is a fast, reliable, scalable, and completely automated Neo4j graph database, provided as a cloud service. The free and professional levels of Aura are only available on Google Cloud. The enterprise level is available on both AWS and Google Cloud.
Oracle Database was the first commercial relational database, and it’s still a leading relational multi-model database. It is available in Oracle Cloud as a service in multiple shapes and sizes; MySQL is also available in the Oracle Cloud as a service. Oracle Database is also available for on-premises deployment and in the AWS and Google clouds.
Redis is a NoSQL, in-memory data structure store that can persist on disk. It can function as a database, cache, and message broker. It provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster. Redis Enterprise adds features for additional speed, reliability, and flexibility, and it’s available as a cloud database as a service. Redis on Flash is a Redis Enterprise feature that can dramatically reduce the cost of hardware for Redis. Redis Enterprise Cloud instances are available on AWS, Google Cloud, and Microsoft Azure; you can choose your own region or regions. You can also run Redis in cloud VMs, Kubernetes, or containers.
Whatever database or databases you pick for your application, don’t forget to run a proof of concept before you commit yourself, and a load test before you go into production. Many cloud databases can scale up and out as needed, but not all can do so without transferring data to a new instance and then shutting down the old instance.
Once your database is in production, set up continuous monitoring with alerts for out-of-range conditions, and be prepared to deal with emergencies. Note that some databases will require tuning and changes to their indexes as their load changes; others will tune themselves automatically.
This article originally appeared on InfoWorld.