ML, AI, HDFS, Caffe2, TensorFlow, deep learning. These are just some of the buzzwords and technologies meant to capture the rapid pace of technological change and the explosive growth of applications in our brave new world. The application technology road map is changing so quickly that you need Waze to keep track of it.
The underlying platforms, the way data is ingested (emitted from sensors, in the case of IoT) and the real-time requirement for leveraging data have driven this change. We've seen transformations in the software industry before, from tools to applications, but those shifts were nothing like the changes occurring today.
It seems like every day a new evolutionary or disruptive technology emerges, and as its adoption grows, another comes along to disrupt it. Look at Kubernetes. Remember when Mesos led the market just a few years ago? Now technologies like Kubeflow are extending the Kubernetes ecosystem, and they will continue to fuel its already dramatic adoption.
However, as industry thought leaders, we must always keep our focus: It’s the journey that’s important, and speed is simply one parameter along that journey.
Historically, speed was driven by application changes and functional capabilities. The various data persistence stores and data formats were created to improve application performance, enable general reporting and provide access to the raw data, but they were never purpose-built for the next generation of technologies now finding rapid adoption: machine learning, artificial intelligence and deep learning.
Throughout the digital transformation playing out around us, there has been one constant: the data. Data formats may iterate, but the asset the data represents remains constant and, in most enterprises, remains largely untapped. The journey from reporting-tool advancement to data warehouses to data marts to data lakes is transforming again. Much as DevOps reshaped software delivery and operations, the data and database community is seeing a shift to DataOps. Initially, the trend was driven by embedding data scientists in the functional areas that needed data access; it is now normalizing further with the emergence of data engineering. At MapR, we believe a new term, "dataware" (alongside hardware, software and middleware), will emerge as the primary way of describing this evolution.
The question that I’m most frequently asked is: “You’ve been a CIO many times before — what would you do?” My goal is to provide guidance around the challenges, pitfalls and opportunities this transformation represents.
First, do not overlook the organization and process changes that are required to embrace digital and data transformation. Don’t throw the baby out with the bathwater. It’s an old expression, but it rings true — particularly when it comes to your data. So much critical information is locked up in what are termed legacy systems, ranging from antiquated ERP systems to older versions of software as a service.
The convergence of this data is driven by what I’ve referenced above as data revitalization. Data revitalization requires understanding the data. To gain the greatest insights and value for your business, this data needs to be shared or, in some way, logically and physically linked to perform the critical analytics that are required.
Not all data is equal. That's why data scientists spend more than 80% of their time preparing data sets rather than doing the analysis they were hired to do. I've advised many Global 2000 CIOs that this dynamic change has both organizational and technological aspects. Organizations should create a newly defined position: the data engineer. In many cases, I've found that existing senior DBA resources can be teamed with data scientists to fulfill this responsibility.
Their primary role? Identify and connect the many critical data sources, both inside and outside of the enterprise. The data engineer should focus on data locale and context, identifying relationships between data sets. The role is vital and strategic to the deeper analytics that machine learning and deep learning make possible.
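As a rough illustration of where that 80% of preparation time goes, here is a minimal Python sketch. The raw export and its defects are hypothetical; it shows the unglamorous work of normalizing inconsistent casing, stripping stray whitespace and dropping records with missing values before any real analysis can begin.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace and
# a missing value -- the kind of cleanup that dominates preparation time.
raw = """region, revenue
 east ,1200
WEST,
East,300
"""

rows = []
for rec in csv.DictReader(io.StringIO(raw), skipinitialspace=True):
    region = rec["region"].strip().title()   # " east " -> "East"
    if not rec["revenue"]:                   # drop records missing revenue
        continue
    rows.append({"region": region, "revenue": int(rec["revenue"])})

# Only after cleaning can the actual analysis (here, a simple total) run.
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + row["revenue"]

print(totals)  # {'East': 1500}
```

Multiply that kind of cleanup across dozens of sources and formats, and the 80% figure stops being surprising.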
On the technology front, why move data when it isn't absolutely necessary? The ability to access data through both legacy protocols (like NFS) and newer approaches (via containerization) is crucial to preserving the full context that deep learning and artificial intelligence require. Too often, legacy architectures force data to be pulled together into data warehouses or data marts, leaving copies floating around the enterprise that quickly become stale or are used out of context.
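A minimal sketch of that in-place pattern, under stated assumptions: in production, the shared location might be an NFS export mounted into each container at the same path; here a temporary directory stands in so the example is self-contained, and all file and column names are illustrative.

```python
import csv
import os
import tempfile

# Stand-in for a shared mount point; in production this might be an
# NFS export (e.g. mounted at /mnt/shared) visible to every consumer.
MOUNT = tempfile.mkdtemp()
path = os.path.join(MOUNT, "events.csv")

# One producer writes the data set once, in place on the share...
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["user", "action"], ["u1", "click"]])

# ...and any number of consumers read the same file where it lives,
# so there is no per-team copy to drift out of date or lose context.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['user', 'action'], ['u1', 'click']]
```

The design point is the single source of truth: every consumer sees the same bytes at the same path, instead of a warehouse extract made last quarter.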
As digital transformation becomes more widespread, the way in which data is collected, ingested and managed will continue to improve.