Originally posted on cacm.acm
Most software applications require persistence programming of some kind—but what exactly is it, and more importantly, are we doing it right?
A few years ago, my team was working on a commercial Java development project for Enhanced 911 (E911) emergency call centers. We were frustrated by trying to meet the data-storage requirements of this project using the traditional model of Java over an SQL database. After some reflection about the particular requirements (and non-requirements) of the project, we took a deep breath and decided to create our own custom persistence layer from scratch. This ended up being a lot of work, but it also gave us a chance to rethink persistence programming in Java.
Along the way we uncovered some new, and possibly better, ways to do things. In summary, by reducing the “database” to its core functions and reimplementing everything else in Java, we found that managing persistence became more natural and more powerful. Although this was a Java project, the lessons learned are not specific to Java.
First of all, what exactly is persistence programming?
One simple definition is that your program stores data outside the context of the running program itself. Simply saying int foo = 42 doesn’t qualify, but saving data to a file or writing data into a database does. Data is persistent if it can be read by a completely different invocation of your program, or by a completely different program.
By this definition, you are doing persistence programming when you create an XML (Extensible Markup Language) document and send it over the network. In some sense, an XML document is a little database sent to the receiver, who then opens and reads from it. In fact, one insight from our project is that network communication and database storage involve several common issues, and these issues can be addressed with common tools (more about that later).
Although most software requires some form of persistence programming, programming languages typically provide limited support for it (for example, serialization of basic data types). In the Java programming language, accessing an int is idealized: it’s atomic, it’s instantaneous, it never fails, and it never requires a schema migration. If you want to access an int that persists across program invocations, however, you face a host of new issues, with zero guidance from the language itself. It’s then your job to assemble (or homebrew) the required additional components.
This may be the correct choice from a language design perspective, but it has caused persistence programming to evolve unnaturally—from the outside in, so to speak. First, databases were created; then query languages such as SQL were designed to communicate with them (originally for use by humans); and finally, libraries were written to allow programs to send SQL queries over the network and retrieve results.
It was left to programmers to figure out how to bridge between the pure, idealized world of programming languages and the practical world of SQL tables, query design, performance, transactions, and network failures. The net result is that programmers have ended up catering to the demands and requirements of the database technology, instead of the other way around.
This can be seen in the many problems with JPA (Java Persistence API), the current state-of-the-art tool for persistence programming in Java. (In 2019, JPA was renamed Jakarta Persistence: https://en.wikipedia.org/wiki/Jakarta_Persistence.) JPA goes to heroic lengths to make querying and retrieving data from an SQL database as painless as possible, but the result is hardly elegant or intuitive.
Several problems with JPA will be discussed, but to be clear, these problems are not JPA’s fault. They are inherent in trying to bridge between the abstract and idealized world of high-level, object-oriented programming languages such as Java and the very different world of SQL databases. This is often referred to as the “object relational impedance mismatch.” As one example of this mismatch, consider that JPA defines more than 100 Java enum and annotation classes to cover all the different ways of mapping between the two domains.
This provokes a basic question: Are we doing this right? To answer, let’s imagine what persistence programming might look like if we could start over and design it to address the needs of programmers first.
What Is a Database?
The need to persist data is not going away, so let’s consider what is essential about using a database. First, there must be a way to encode programming language data values into raw bits. Ideally these encodings should sort in the same order as the corresponding values in the programming language, which allows the database to respect that ordering.
Once those bits are put into a database, there must be a way to retrieve them. Since you often want to retrieve them in a different order from the way they were put in, the data must be keyed somehow. Then you can give the database an arbitrary key and get the corresponding value back. You may also want the database to keep the keys sorted, so you can iterate them efficiently and query by upper or lower bound. For general-purpose use, the ability to sort is expected, so we’ll assume that.
Obviously, key lookup must be efficient, which implies some kind of lookup-optimized data structure. Practically speaking, virtually all databases store data using one of two data structures: hash tables (for unsorted data) or balanced trees (for sorted data).
Finally, any nontrivial database must allow concurrent access, which means that it must define notions of transactions and corresponding semantics for atomicity, isolation, consistency, and durability, among others (see https://jepsen.io/consistency for a thorough discussion of consistency levels). There is a wide and interesting variety of models here, so we won’t make any specific assumptions.
A modern SQL database provides all of this, plus a bunch of “other stuff.” If you are starting from scratch, however, then a transactional, sorted binary key/value store is a reasonable lowest common denominator to assume. The simplicity of this definition also makes it a great place to define an API, allowing you to easily port existing databases and add new ones.
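The transactional, sorted, binary key/value store described above can be sketched in a few lines of Java using a TreeMap. The class name KVStore and its API here are illustrative assumptions, not any particular database's interface, and transaction handling is omitted for brevity:

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the "lowest common denominator" database: a sorted, binary
// key/value store. Keys sort in unsigned lexicographic order, the same
// order the raw bits would have on disk.
class KVStore {
    private final NavigableMap<byte[], byte[]> map =
        new TreeMap<>(Arrays::compareUnsigned);

    public byte[] get(byte[] key) { return map.get(key); }
    public void put(byte[] key, byte[] value) { map.put(key, value); }
    public void remove(byte[] key) { map.remove(key); }

    // Range query: all entries with key in [min, max), in sorted key order,
    // supporting efficient iteration and query by upper or lower bound
    public NavigableMap<byte[], byte[]> getRange(byte[] min, byte[] max) {
        return map.subMap(min, true, max, false);
    }
}
```

The simplicity of this surface is the point: anything offering get, put, remove, sorted range iteration, and transactions can sit underneath it.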
Everything else that databases typically provide—schemas, indexes, foreign key constraints, and command line interfaces—can be considered “other stuff” because nothing requires them to be implemented by the database itself. Omitting them from the definition of database leaves room to reimplement these features in a new way that better serves the programmer.
Bringing Persistence into the Language
Now that we have a notion of what a database provides underneath, let’s jump to the top layer and look at what persistence programming might look like from the programming-language level. One way to understand what we really want is to revisit our frustrations with JPA and for each one ask: What would be a better way to do things?
Basic types. In JPA, the first frustration is that the Java basic types don’t match their corresponding SQL types. For example, floating-point NaN values are often completely unsupported, SQL’s DATETIME is not the same as java.util.Date, and SQL databases sometimes silently truncate or space-pad character string values. In short, if you’re not careful, you won’t reliably read back what you write.
Moreover, the set of supported types is limited and fixed. For example, the only supported array type is byte[]. If you need a LAT-LONG type but your database doesn’t natively support it, you’re out of luck.
Instead, we want exact support for every primitive, wrapper, and array type, as well as common types such as Date. Equally important, adding any new Java type should be possible simply by providing its bit encoding, and these custom types should be first class, in that they are just as sortable and indexable as the built-in types.
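As an example of such a bit encoding, a Java int can be made order-preserving simply by flipping its sign bit, so that unsigned byte-wise comparison of the encodings matches signed numeric comparison of the values. This is a standard trick, shown here as a sketch:

```java
// Order-preserving bit encoding for a Java int: flipping the sign bit maps
// Integer.MIN_VALUE..Integer.MAX_VALUE onto 0x00000000..0xFFFFFFFF, so the
// unsigned lexicographic order of the bytes equals the signed numeric order.
class IntKeyEncoding {
    public static byte[] encode(int value) {
        int u = value ^ 0x80000000;          // flip the sign bit
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16),
            (byte) (u >>> 8),  (byte) u
        };
    }

    public static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
        return u ^ 0x80000000;               // flip the sign bit back
    }
}
```

Any type whose values can be encoded this way is immediately as sortable and indexable as the built-ins.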
Types, classes, and interfaces. JPA must map Java class hierarchies onto rectangular tables, resulting in unused columns or extra joins. Properties may be inherited only from superclasses, not from interfaces (that is, JPA properties must be concrete), and JPA is incompatible with certain uses of Java generics. Instead, we want efficient storage of class hierarchies and full compatibility with Java’s interface inheritance and generics.
Change notifications. JPA supports basic entity life-cycle notifications via @PrePersist, @PreRemove, @PreUpdate, and so on. These apply with per-object granularity, however (that is, multiple individual property changes coalesce into a single notification), and notifications are generated asynchronously on cache flush, not immediately when the changes occur.
Instead, precise, synchronous notifications for both object life cycle and individual property change events are desired, as is the ability to detect nonlocal changes in objects reachable through an arbitrary number of references (ideally, either forward or inverse). For example, nodes in a tree with a parent property may want to get notified when the color property in any child (or grandchild) node changes, or vice versa.
Aside from being useful in their own right, having notifications with per-property granularity that can be synchronous and nonlocal is a key enabling technology for other useful new features.
Indexes. When you define an SQL table USER with columns ID and USERNAME, the database creates an internal balanced tree with ID keys and USERNAME values. If you then index the USERNAME column, the database creates a secondary balanced tree under the covers with USERNAME keys and ID values. The database automatically keeps the two trees consistent by updating the secondary tree whenever any USERNAME in the primary tree is added, modified, or removed.
That’s a traditional index, which JPA supports. Thinking more generally, however, an index can be any combination of (a) a secondary data structure that is entirely derived from the primary data; (b) change notifications that report changes in the primary data; and (c) an update algorithm that updates the secondary data structure when notified. Why shouldn’t you be able to create any kind of index simply by defining (a), (b), and (c)?
Suppose you have a table of home values and you frequently want to access the median home value. This is not something you can index (or even directly query) in SQL. If the database could notify you whenever any home value is added, removed, or changed, however, a simple secondary data structure and update algorithm could complete the picture: Create a traditional index on home values so you can quickly find neighboring values, and then store the current median, the number of lower values, and the number of higher values in a new secondary data structure.
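A sketch of the update algorithm for such a median index follows, using the classic two-heap technique. The onHomeValueAdded method stands in for a hypothetical change-notification callback; removals and modifications are omitted for brevity:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of a custom "median index": the secondary data structure is a pair
// of heaps, and the update algorithm runs inside a change-notification
// callback. Only additions are handled in this sketch.
class MedianIndex {
    // Values at or below the median (max-heap) and above it (min-heap)
    private final PriorityQueue<Long> lower =
        new PriorityQueue<>(Comparator.reverseOrder());
    private final PriorityQueue<Long> upper = new PriorityQueue<>();

    // Would be registered as a change notification on the home-value property
    public void onHomeValueAdded(long value) {
        if (lower.isEmpty() || value <= lower.peek())
            lower.add(value);
        else
            upper.add(value);
        // Rebalance so the heap sizes differ by at most one
        if (lower.size() > upper.size() + 1)
            upper.add(lower.poll());
        else if (upper.size() > lower.size() + 1)
            lower.add(upper.poll());
    }

    // Constant-time median lookup; assumes at least one value has been added
    public double median() {
        if (lower.size() == upper.size())
            return (lower.peek() + upper.peek()) / 2.0;
        return lower.size() > upper.size() ? lower.peek() : upper.peek();
    }
}
```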
This example shows why change updates must be synchronous and per-property rather than per-object. Support for nonlocal change notifications is also important, because sometimes the information you want to index is not limited to a single object. Suppose you want nodes in a tree to index how many “blue” child nodes they have. Nodes could store that number in a private field numBlueChildNodes and register for change notifications on the color properties of their child nodes to keep that field up to date.
Of course, any such index could be implemented manually by the programmer with more work, but it’s less messy and more robust when the database provides the change notifications. After all, the database is in the perfect position to do so because it sees every change to every data value. In summary, we want the database to provide tools that make it easy to implement arbitrary custom indexes. Synchronous, nonlocal, object-, and property-based change notifications are such a tool.
Validation and invariants. Key to any software that manages data is validation—that is, the maintenance and verification of required invariants (aka validation constraints). When a transaction starts, your code assumes the invariants hold; your goal is then to do whatever needs to be done, possibly violating some invariants temporarily, while ensuring the invariants are reestablished by the time the transaction commits. This is the same principle that applies to shared state within a Java synchronized block.
Enforcing invariants efficiently requires two things: code that can check the constraint, and a notification that fires when the constraint needs to be rechecked as a result of a change in the associated data. Again, precise change notifications are a key ingredient.
In JPA, the database itself provides a few validation constraints, such as foreign-key integrity and column uniqueness. At the Java level, JSR 303 (Java Specification Request) validation provides additional per-object validation. Both these implementations are imperfect: JPA can trigger phantom foreign-key violations caused by cache flush ordering (for example, when persisting two new objects that indirectly refer to each other); and both types of constraints apply on cache flush, not on transaction commit, which means they can happen in the middle of a transaction, before the invariants have been reestablished.
Instead, validation checks should be deferred until transaction commit time, unless explicitly requested earlier. The ability to enqueue objects for validation manually is important, as this makes implementing arbitrary custom (and precise) validation constraints easy: Simply register for change notifications on the fields involved and then enqueue for validation when notified.
With nonlocal change notifications, validation constraints can span multiple objects. For example, imagine implementing a constraint that no two child nodes may both be blue. You would register a change notification on child-node color properties, and when notified, enqueue the parent node for validation.
In practice, indexing and validation often work together. In the previous example, you could index the number of blue child nodes using a private numBlueChildNodes property, and then simply enqueue the parent for validation whenever that property changes.
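The blue-children example might be sketched as follows. The notification plumbing is hand-rolled here, and the names (Node, numBlueChildNodes, validationQueue) are illustrative: in the scheme proposed, the database itself would deliver the notification and run the validation queue at commit time.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of a custom index plus a deferred validation constraint, both
// driven by the same (here simulated) change notification.
class Node {
    private final List<Node> children = new ArrayList<>();
    private Node parent;
    private String color = "red";
    private int numBlueChildNodes;        // the custom index

    // Stands in for the transaction's commit-time validation queue
    static final Set<Node> validationQueue = new LinkedHashSet<>();

    public void addChild(Node child) {
        children.add(child);
        child.parent = this;
        if (child.color.equals("blue"))
            this.onChildColorChange(null, "blue");
    }

    public void setColor(String newColor) {
        String oldColor = this.color;
        this.color = newColor;
        // The database would deliver this notification automatically
        if (parent != null)
            parent.onChildColorChange(oldColor, newColor);
    }

    // Change-notification callback: update the index, enqueue for validation
    private void onChildColorChange(String oldColor, String newColor) {
        if ("blue".equals(oldColor)) numBlueChildNodes--;
        if ("blue".equals(newColor)) numBlueChildNodes++;
        validationQueue.add(this);
    }

    // Run for each queued object at transaction commit
    public void validate() {
        if (numBlueChildNodes > 1)
            throw new IllegalStateException("more than one blue child");
    }
}
```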
How do you query? Is the following SQL query efficient?
SELECT * FROM User WHERE lastName = 'Smith'
It’s a trick question, because the answer depends on whether lastName is indexed, and that can’t be determined by looking at the query. This violates the basic principle that software should be understandable by looking at it.
In JPA, programmers are required to learn and use a new query language, but even when they do, there’s a lack of performance transparency. (JPA has three ways to query: SQL, JPQL, and Criteria.) To understand whether your queries are efficient, your skill set has to contain the union of computer programmer and database administrator.
In an ideal world, inefficient queries shouldn’t be able to hide like this. If you’re about to iterate through every User in the database, that should be obvious from looking at the code.
What would be better is if queries were written in normal Java, using existing concepts such as Set, List, and Map. Then a query’s efficiency would always be obvious, or at least visible. The sorted key/value pairs that the database provides can be modeled in Java as a NavigableMap, and extracted data can be represented by a Stream. An indexed property is then just a Map from each property value to the set of objects (for example, key prefixes) having that value in that property. In fact, the Java language already has all the tools you need to query, using existing concepts that programmers already understand.
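A sketch of this idea in plain Java follows, with illustrative names (User, usersById, idsByLastName). Because the index lookup is explicit in the code, the cost of the query is visible at a glance:

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Stream;

// Querying with ordinary Java collections: the database's sorted key/value
// pairs are modeled as a NavigableMap, and an index is just another Map
// from property value to the set of matching object ids.
class UserQuery {
    record User(long id, String lastName) {}

    final NavigableMap<Long, User> usersById = new TreeMap<>();
    final Map<String, NavigableSet<Long>> idsByLastName = new TreeMap<>();

    void add(User u) {
        usersById.put(u.id(), u);
        idsByLastName.computeIfAbsent(u.lastName(), k -> new TreeSet<>())
                     .add(u.id());
    }

    // Equivalent of SELECT * FROM User WHERE lastName = 'Smith', via the
    // index: visibly O(number of matches), not a hidden table scan
    Stream<User> byLastName(String lastName) {
        return idsByLastName
            .getOrDefault(lastName, Collections.emptyNavigableSet())
            .stream()
            .map(usersById::get);
    }
}
```

An unindexed query would instead have to iterate usersById.values(), and that full scan would be equally visible in the code.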
Schemas and migrations. Code evolves over time, and that means database schemas do as well. Therefore, the database structure must sometimes be updated via schema migrations. These migrations have two aspects: structural changes to the actual data format or layout; and semantic changes that are the corresponding “fixups” to the data.
For example, if you replace the firstName and lastName columns with a single fullName column, the structural change is the ALTER TABLE stuff to add and remove columns, while the semantic change is initializing each row’s new fullName column to be the concatenation of firstName and lastName. Note that performing the semantic change requires access to both the old (firstName, lastName) and new (fullName) columns.
JPA provides no tools for schema migration, nor does it verify that the schema being used is correct. Helper libraries exist for tracking schema migrations, but they require “stop the world” operations such as ALTER TABLE, which are incompatible with rolling (zero-downtime) updates, where multiple schemas can exist in the database at the same time. Moreover, these tools typically require both structural and semantic changes to be written manually, in SQL rather than Java.
We would like to improve this in several ways. First, the database should be able to support more than one schema at a time, allowing rolling schema migrations in which objects are upgraded over time (for example, on demand), so you never need “stop the world” operations. Second, the schema(s) in use should be tracked automatically in the database itself and verified against what the code expects at the start of each transaction (obviously, this check should be efficient). Third, there’s no reason structural changes can’t be fully automated, which would eliminate bugs caused by inconsistent migrations. Finally, semantic changes should be written naturally in Java rather than in SQL.
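A rolling, on-demand semantic migration might look something like the following sketch. The version handling and field names here are hypothetical, not any particular library's API: each stored object records its schema version, and older objects are fixed up the first time they are read under the new schema.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an on-demand semantic migration written in Java: objects are
// upgraded lazily as they are read, so no "stop the world" ALTER TABLE
// pass is ever needed.
class PersonRecord {
    int schemaVersion;
    final Map<String, String> fields = new HashMap<>();

    // Called when an object with an older schema version is first accessed
    void upgradeTo(int newVersion) {
        if (schemaVersion < 2 && newVersion >= 2) {
            // v1 -> v2: replace firstName/lastName with fullName; the fixup
            // sees both the old and the new fields, as the migration requires
            String full = fields.remove("firstName") + " "
                        + fields.remove("lastName");
            fields.put("fullName", full);
            schemaVersion = 2;
        }
    }
}
```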
Offline data. JPA has a somewhat awkward model for offline data, that is, data read from a transaction but used after that transaction has closed. For example, touching a collection that was not loaded during the transaction throws an exception, but touching a collection during the transaction loads the entire thing, which can get unwieldy. In any case, it’s not always clear what offline data is available, because JPA conflates offline data with its online cache: Offline data is whatever happened to be in the online cache when the transaction closed. Moreover, once the transaction closes, you lose the ability to query into your offline data, and even if you could, the query would be slow because no associated index information is retained.
JPA includes support for fetch joins and load graphs, but these can only partially address the problem, because you don’t always know what data you’ll need next until you’ve seen some of the data first. The core problem is that SQL is often not expressive or precise enough to define the exact data you need, even if you happen to know it ahead of time, without omitting some data or causing a “join explosion.”
It would be nice to have a more precise way to define what data to copy and retain as offline data after the transaction has closed. In addition, you should then be able to query offline data in all the usual ways, including index queries. This implies that secondary index data be retained as well. In short, it should be possible to define a subset of the whole database precisely, query it and pull it into memory, and then treat that like a normal in-memory mini-database even after the original transaction has closed.
For database designs that support snapshots efficiently (for example, log-structured databases), this process can even be a zero-copy operation, where the in-memory mini-database is really just a read-only, memory-mapped view into a snapshot.
Network communication. As mentioned earlier, network communication and persistence programming have many issues in common. Assuming a database is just a sorted key/value store, then network communication can be redefined as a simple form of persistence programming: First, create and initialize an in-memory database, including the usual schema-tracking information; second, populate that database with Java objects to transmit; finally, serialize the database (just a bunch of key/value pairs) and send them over the network. On the receiving end, open and read the database just as normal. Because the database also contains schema information, any required schema migrations happen automatically.
Many of the database issues that normally reappear with network programming are thus taken care of automatically: how to serialize an arbitrary graph of objects, define and document the data format (that is, schema), migrate automatically when different versions of the code are running on either end of the connection (for example, during rolling upgrades), and query into the data efficiently on the receiving end.
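A minimal sketch of this idea follows, assuming string keys and values for simplicity and omitting the schema-tracking information: the "little database" is just its key/value pairs, serialized on one side and rebuilt into the same sorted map on the other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Network communication as persistence: serialize the key/value pairs of an
// in-memory sorted map, ship the bytes, and rebuild the map on the receiver.
class KVWire {
    static byte[] serialize(NavigableMap<String, String> db) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(db.size());
            for (Map.Entry<String, String> e : db.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
        return buf.toByteArray();
    }

    static NavigableMap<String, String> deserialize(byte[] bytes) throws IOException {
        NavigableMap<String, String> db = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++)
                db.put(in.readUTF(), in.readUTF());
        }
        return db;
    }
}
```

In a full version of this scheme, the serialized pairs would also carry the schema-tracking keys, so the receiver could run any needed migrations automatically on open.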
Bringing Code and Data Together
We developed Permazen, an open source project (https://github.com/permazen/permazen) to investigate and prototype the concepts described in this article, and now use it in our commercial solution. All of these ideas were implemented and deployed in some form, and programming with our new persistence layer was a truly refreshing experience. It also enabled us to implement some required custom functionality that would otherwise have been difficult or impossible—for example, a clustered key/value database based on the Raft Consensus Algorithm (https://raft.github.io/) with support for “standalone” mode. The project demonstrates that these ideas are actually feasible and perhaps worth further exploration.
However, a key caveat allowed this project to succeed: Each node was required to keep a complete, up-to-date local copy of the database (in case it needed to revert to standalone mode). As a result, individual database accesses within each transaction were low latency, because most of them required no network traffic. This was crucial to many of the new features mentioned here, such as nonlocal change notifications and custom indexes, which require frequent—but low-volume—access to the data in each transaction.
Put another way, the usual method of persistence programming using SQL over a network connection is kind of like grocery shopping via walkie-talkie with an intermediary who speaks Latin. If you can cut out the intermediary and go there yourself, the experience can be much more productive. In other words, distance itself (that is, latency) is a barrier to innovation in persistence programming. Sometimes distance is unavoidable, but in situations where you can give the code low-latency access to the data, new possibilities arise. In situations where the database and the application are separated by a network, this argues for sending the code to the data instead of the data to the code.
Historically, persistence programming has been driven from the database side, and this has limited programmers’ options. Redefining the database as just a sorted key/value store creates more room for innovation from the programming-language side. Our experience shows that, at least in some scenarios, this allows reimagining persistence programming to make it more natural and less frustrating, so we can spend less time wondering: Are we doing this right?