The Rise of the Kubernetes Native Database

Originally posted on The New Stack.

A look at two databases that have made claims to the Kubernetes native label: TiDB and DataStax Astra DB.

The cloud computing revolution has inspired and benefitted from multiple interrelated trends. The availability of self-service, public cloud infrastructure has helped to drive the adoption of microservice architectures and DevOps practices, including automation and observability.

The drive toward containerization and then container orchestration has led to the widespread adoption of Kubernetes as an environment for managing cloud native applications.

But one of the lagging areas in this revolution has been data and data infrastructure. For too long, data has been something that has lived outside of Kubernetes, and this has led to a lot of extra effort and complexity for developers in deploying cloud native applications.

One oft-repeated axiom in the early years of Kubernetes was that it was not yet ready for stateful workloads. Thankfully, a major shift has been quietly under way and has in fact reached a point of maturity.

The transformation happened slowly at first, beginning with efforts to containerize existing databases. This worked relatively well in cases of small databases that ran on a single compute node, or databases that had been designed in a cloud native world, like Apache Cassandra and DynamoDB, but challenges remained.

Over the past two to three years, a new generation of databases has emerged. These “Kubernetes native” databases have been designed from the ground up to run on this open source orchestration system.

Here, we’ll define the qualities that make a database Kubernetes native and the benefits that result from adopting a Kubernetes native database. To do that, we’ll look at two databases that have made claims to the Kubernetes native label: TiDB and DataStax Astra DB.

Kubernetes Native MySQL with TiDB

First, let’s examine a database with a relational emphasis: TiDB (short for Titanium Database). TiDB is an open source system built by PingCAP that provides a MySQL-compatible database as well as a columnar database to support hybrid transactional and analytic processing (known as HTAP for short).

As shown in Figure 1 below, TiDB has a microservice design, where the TiDB query layer, TiKV key-value stores, TiFlash columnar stores, Spark nodes and metadata management are each deployed as scalable microservices in their own clusters. This design separates compute-intensive work from storage-intensive work, as the query layer and database layer are independently scalable.

Figure 1: TiDB Architecture (Adapted from Source: PingCAP Documentation Site)

One critical commitment that the TiDB creators made was that the database only runs on Kubernetes. Is that enough to make it Kubernetes native? Let’s dig a bit deeper. First, TiDB is deployed and managed by a Kubernetes operator using custom resources, which are described by custom resource definitions (CRDs). The TiDB CRDs include the TidbCluster, which enables you to specify the scaling and configuration of each microservice and how the database layer components use storage through Kubernetes Persistent Volumes. Additional CRDs are used to deploy monitoring tools and manage operational tasks like backup and restore.
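To make the idea of a cluster custom resource concrete, here is a minimal sketch that builds one as a plain Python dictionary and prints it as JSON (which kubectl accepts alongside YAML). The apiVersion, the kind spelling `TidbCluster` and the field names follow the TiDB Operator examples as best we recall them, so check them against the operator documentation before use. The cluster name, version, replica counts and storage sizes are placeholder assumptions.

```python
import json


def tidb_cluster_manifest(name, version, tikv_replicas, tikv_storage):
    """Build a TidbCluster custom resource as a plain dict.

    Field names are assumptions based on published TiDB Operator examples.
    """
    return {
        "apiVersion": "pingcap.com/v1alpha1",
        "kind": "TidbCluster",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            # Metadata management (Placement Driver) microservice.
            "pd": {
                "baseImage": "pingcap/pd",
                "replicas": 3,
                "requests": {"storage": "1Gi"},
                "config": {},
            },
            # Storage layer: each replica claims a Persistent Volume.
            "tikv": {
                "baseImage": "pingcap/tikv",
                "replicas": tikv_replicas,
                "requests": {"storage": tikv_storage},
                "config": {},
            },
            # Stateless query layer, scaled independently of storage.
            "tidb": {
                "baseImage": "pingcap/tidb",
                "replicas": 2,
                "service": {"type": "ClusterIP"},
                "config": {},
            },
        },
    }


# Placeholder values for illustration only.
manifest = tidb_cluster_manifest("demo", "v6.5.0", tikv_replicas=3, tikv_storage="100Gi")
print(json.dumps(manifest, indent=2))
```

Scaling the storage layer then becomes a matter of changing one number in the declared state and letting the operator do the rest.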

TiDB also has an optional scheduler extension that interfaces with the default K8s scheduler to make more application-aware scheduling decisions. This emphasis on using existing Kubernetes capabilities where available is the mark of a Kubernetes native database.
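As a rough illustration of what “application-aware” scheduling can mean, the sketch below mimics the filter step that a scheduler extension might perform: given the nodes the default scheduler considers feasible, it removes those that already host too many pods of the same database component. The rule, the function name and the data shapes are hypothetical and are not taken from TiDB’s scheduler.

```python
from collections import Counter


def filter_nodes(candidate_nodes, placements, component, max_per_node=1):
    """Return the candidate nodes that can still accept a pod of `component`.

    placements is a list of (node, component) pairs for pods already scheduled.
    The max_per_node rule is a hypothetical policy chosen for illustration.
    """
    per_node = Counter(node for node, comp in placements if comp == component)
    # Counter returns 0 for nodes that host no pods of this component.
    return [node for node in candidate_nodes if per_node[node] < max_per_node]


placements = [("node-a", "tikv"), ("node-b", "pd")]
print(filter_nodes(["node-a", "node-b", "node-c"], placements, "tikv"))
# prints ['node-b', 'node-c']
```

Spreading storage replicas across nodes this way keeps a single node failure from taking out several copies of the same data.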

Kubernetes Native Cassandra with DataStax Astra DB

Now let’s look at another Kubernetes native database and note some similarities and differences. Cassandra is a highly scalable NoSQL database that was one of the first to claim to be cloud native, but what does it look like to deploy Cassandra in Kubernetes? DataStax Astra DB is a version of Cassandra that has been factored into microservices, as shown in Figure 2.

Similar to TiDB, the database includes microservices that are concerned with query processing and data storage, as well as services for identity and access control, data repair and backup/restore. The data services are particularly interesting in their use of storage, with Kubernetes Persistent Volumes used only for caching and object storage used for longer-term persistence. Separating compaction into its own service enables this compute-intensive processing to happen in the background without affecting performance of data services that are serving read and write traffic.

Figure 2: DataStax Astra DB architecture (Source: DataStax Whitepaper)
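The storage split described above, a local volume acting as a cache in front of durable object storage, is a general pattern that can be sketched in a few lines. This is not Astra DB’s implementation: the class name, the least-recently-used eviction policy and the use of in-memory dictionaries to stand in for a Persistent Volume and an object store are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Write-through, read-through store with a bounded local cache."""

    def __init__(self, object_store, cache_capacity=2):
        self.object_store = object_store  # stands in for durable object storage
        self.cache = OrderedDict()        # stands in for a Persistent Volume cache
        self.cache_capacity = cache_capacity

    def put(self, key, value):
        self.object_store[key] = value    # durable copy is written first
        self._remember(key, value)

    def get(self, key):
        if key in self.cache:             # cache hit: served from local storage
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.object_store[key]    # cache miss: read through to object storage
        self._remember(key, value)
        return value

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry


durable = {}
store = TieredStore(durable, cache_capacity=2)
for key in ("a", "b", "c"):
    store.put(key, key.upper())

print(list(store.cache))  # prints ['b', 'c']
print(store.get("a"))     # prints A (read through from object storage)
print(list(store.cache))  # prints ['c', 'a']
```

Because the cache holds nothing that cannot be rebuilt from object storage, losing a volume costs performance rather than data.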

Astra DB is offered as a managed service available in multiple cloud regions. Each region contains a data plane consisting of the services mentioned above, managed by a Kubernetes operator, as well as infrastructure services including the kube-prometheus stack for observability and etcd for metadata management. The data planes are managed by a control plane that can run in one or more clouds to perform tasks such as managing customer accounts and databases and provisioning Kubernetes clusters in new regions.

One novel aspect of Astra DB is its multitenant architecture in which multiple user databases can share the same microservices and supporting infrastructure, lowering unit costs for smaller-scale users. As users grow their applications, they have the option of moving to dedicated resources to achieve optimal performance at scale, all on a “pay-as-you-go” basis.

Kubernetes Native Database Principles

Based on our observations of TiDB and Astra DB, we can derive some ideas of what makes a database Kubernetes native. Many of these correspond to a list of principles for cloud native data, which I described in a previous article:

  • Composable microservice architecture: First, a database that is broken into constituent microservices enables each service to be scaled independently. Some types of compute-intensive processing may even be scaled to zero for a true serverless solution, especially when combined with a multitenant design.
  • Treat compute, network and storage as commodities: The microservices that make up a Kubernetes native database should make maximum use of Kubernetes APIs for managing the fundamental resources of cloud native applications: StatefulSets and Deployments for compute workloads, the Persistent Volume subsystem for storage, Kubernetes Ingress and Services for exposing network access to data, and more. This includes leveraging capabilities already present in Kubernetes, such as etcd for metadata management, instead of bringing along components with duplicative functionality.
  • Leverage Kubernetes best practices: Following common patterns for Kubernetes applications yields multiple operational benefits, for example, exposing liveness and readiness checks on each microservice to improve availability and exposing metrics in a format that Prometheus can scrape for observability. Kubernetes itself sets a great example for how to be secure by default, which databases should follow: using Kubernetes Secrets to distribute security credentials, exposing ports only as needed and so on.
  • Declarative management via operators: A Kubernetes native database should embody the Kubernetes principles of declarative management via operators and custom resources, rather than relying on legacy database management UIs and CLIs. When necessary, Kubernetes extension points, such as scheduler extensions, can be used to add application-specific behavior. The goal is a clean separation of data plane functionality (managing data) from control plane functionality (managing the database).
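The declarative management principle in the list above boils down to a reconciliation loop: compare the state a user declared with the state actually observed, and compute the actions that close the gap. The sketch below shows a single pass of such a loop as a pure function. The service names, action names and replica counts are hypothetical, and a real operator would watch the Kubernetes API rather than take two dictionaries.

```python
def reconcile(desired, observed):
    """Return the actions needed to move observed replica counts to desired ones."""
    actions = []
    for service, want in sorted(desired.items()):
        have = observed.get(service, 0)
        if have < want:
            actions.append(("scale_up", service, want - have))
        elif have > want:
            actions.append(("scale_down", service, have - want))
    # Anything running that is no longer declared gets removed.
    for service in sorted(set(observed) - set(desired)):
        actions.append(("delete", service, observed[service]))
    return actions


# Declared state: compaction is scaled to zero while idle.
desired = {"query": 3, "storage": 5, "compaction": 0}
observed = {"query": 2, "storage": 5, "compaction": 1, "legacy-ui": 1}

for action in reconcile(desired, observed):
    print(action)
# prints:
# ('scale_down', 'compaction', 1)
# ('scale_up', 'query', 1)
# ('delete', 'legacy-ui', 1)
```

Running the function again once the actions have been applied returns an empty list, which is the property that makes repeated reconciliation safe.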

Databases and other data infrastructure that faithfully adopt these principles will yield benefits including high performance for optimal cost at all scales, lower operational complexity resulting in faster time to market and standards-compliant solutions meeting today’s high availability and security demands.

The Future of Kubernetes Native Data Infrastructure

There is still much progress to be made, and it’s not limited to databases alone. Kubernetes native principles can be applied to other types of data infrastructure, including streaming, analytics and machine learning.

Kubernetes native solutions will continue to make strides in multicluster and multicloud deployments in order to scale globally, and will adopt multitenancy and serverless principles for better cost optimization. Kubernetes itself has room for improvement in adding more flexibility to StatefulSets and support for multicluster federation.

The key to continued progress is open collaboration. The Data on Kubernetes Community is a highly active group of data geeks bringing together builders of data-intensive applications and the infrastructure that supports them.

Join us to talk about ideas like developing reusable operators that can manage multiple databases or defining a common set of CRDs for concepts like backup/restore and data loading. Together we’ll continue to push the horizon of cloud computing for the benefit of all.

This article is based on Chapter 7, “The Kubernetes Native Database,” from the O’Reilly book “Managing Cloud Native Data on Kubernetes” by Jeff Carpenter and Patrick McFadin.