Originally posted on infoworld.
Rather than punt database design decisions to a cloud service or third-party provider, understand what you want to achieve and how best to deliver on that goal.
Software developers today have more options open to them. They have tools and services that can help them build new applications quickly, then launch those services to customers globally, and then scale them up to meet growing demand. Microservices architectures and agile development put the emphasis on moving faster and spinning up new services whenever customer needs and business needs have to be met.
This also applies to data. Developers must support the data that their applications create, and this means implementing a database. Choosing the right design can make all the difference to the application; it helps ensure that the application will be available, performant, and scalable over time. However, developers don’t want to have to implement and manage databases themselves. That’s why the majority of companies—90 percent, according to IDC—are in the midst of moving their databases and data workloads to the cloud.
For these companies, there are multiple different options available. These include managed services, cloud-based database installs, and database-as-a-service (DBaaS) offerings. These services all promise that they will ease the data management burden and help developers meet their goals of shipping new applications and application updates faster. Phrases like “schemaless” and ”fully managed” can make it seem that databases can be handed over, lulling developers into a sense of complacency.
In reality, developers are just as responsible for cloud infrastructure as they’ve been for traditional on-prem systems, particularly when it comes to design choices and how to implement the database. This includes not simply trusting that the default settings of DBaaS products are right for their applications.
Choosing the right database
Developers and application architects therefore have to look at the long-term future for their application projects, and make sure they understand the basic requirements that these projects will have. The first question is which database design to use for the project.
There are so many database options available today, the choices quickly become overwhelming. The DB-Engines Ranking lists 359 different databases, for example, so there is plenty of temptation to use a database that you already know, or one that makes extensive promises on what it will deliver. If you have implemented MongoDB, say, then why not use that same database for your next project?
However, there is no guarantee that what worked for one application will work for another. There are databases and data management approaches that are more suitable for specific use cases, such as graph and time-series databases, and there are others that may be better fits depending on the programming language or software development resources that will be used. While it is possible to force an unsuitable database deployment to fit a use case, the wrong choice can seriously curtail performance and increase costs.
To choose the right database involves understanding how an application workload will perform over time, how it will grow, and how access patterns might change. As any database implementation grows, it will have to handle more queries and more stored data. Putting the right approach in place at the start can make it easier to process more queries against that data. Ignoring this and relying on the database service to manage it on your behalf might work fine at the start but it could affect performance and cost dramatically down the road. Spending time on planning up front can therefore lead to significant cost reductions in the longer term.
How to think about database design
Taking a schemaless approach appeals to many developers. After all, if you let the database service take care of organizing the data, then you don’t have to. However, this is not really the case. All database providers—even those that offer “schemaless” approaches using JSON or the ability to add objects—encourage some form of schema validation. Schemaless databases store information as unstructured data, and this has a significant impact in terms of performance and cost as the implementation grows.
Even the smallest decisions can have a big impact as databases scale up. Take data formats, for example. Imagine you have a form in your application that will accept data inputs, such as which country someone lives in. What format should you use?
Country names will vary in length, so let’s assume an average of 12 characters for the entry. Storing that data in a variable character (
varchar) format with a UTF character set will take up three bytes per character, or 39 bytes in total for each entry. This does not sound huge, but let’s compare that with using
enum for that same field: An
int requires only four bytes in total for each entry, while an
enum takes only one byte. Scale this up to 100 million data points, and the
varchar option would take 3700 megabytes (MB) of space, while the
enum option would require 95MB, a reduction of 97.5%.
The amount of data that you store has a bigger impact than increasing the disk space you use. When you have more data to work with, you will normally scale up the machine image that you use to process that data in memory. If you take a less efficient approach to the data, then you will have to increase the CPU and memory resources for processing the data. While the cost to store terabytes of data on disk is relatively cheap, the cost of CPU and compute time is expensive, so you should try to take the most efficient approach possible.
Alongside this, it’s important to consider data access patterns. How you plan to search for data will affect how you design your database. If you expect to have common search requests for your application, then you can create indexes that can improve performance. Similarly, you may find that your users’ behavior changes over time, and certain queries grow more popular. To manage this, you should review those patterns as the queries and indexes that you have in place will not be what you need in the future.
One important element here is that database design is potentially challenging to think through. However, you can make this much easier for yourself if you keep your deployment as simple as possible rather than trying to accommodate potential edge cases or future requirements. It is always possible to expand your database schema or extend your deployment in the future, rather than concentrating on future needs right now.
Think before you build
What you decide before you start coding will have the biggest impact on your scalability and stability, compared to any decision you make during the life of a project. It’s therefore important to give your data—and what you choose to use for managing that data—the proper respect.
Rather than handing all responsibility over to a cloud service or a third-party provider, understand what you want to achieve and how best to deliver on that goal. However, you don’t give up the responsibility for that decision by choosing a service, and you do trade flexibility for performance and cost. Simply adding more cloud resources is not an efficient approach to scaling up. The database and design choices you choose will have an impact on how successful your new application or service will be over time.